Architecture¶

This page explains the technical architecture of the character network database and processing pipeline.

System Overview¶

The Naruto character network system consists of three main layers:

graph TB
    A[Subtitle Files<br/>.ass format] -->|Parse| B[Processing Pipeline<br/>naruto_net package]
    B -->|Extract| C[Intermediate Data<br/>Events, Scenes, Mentions]
    C -->|Build| D[Network Edges<br/>CSV format]
    D -->|Import| E[Neo4j Graph DB<br/>:Character :CONNECTED]
    E -->|Query| F[Network Analysis<br/>Centrality, Communities]
    F -->|Visualize| G[Interactive Viz<br/>D3.js]

    style A fill:#FFF3E0
    style B fill:#E8F5E9
    style C fill:#F3E5F5
    style D fill:#E3F2FD
    style E fill:#FFE0B2
    style F fill:#F1F8E9
    style G fill:#FCE4EC

Neo4j Graph Database Schema¶

The character network is stored in a Neo4j graph database using a multi-label pattern for arc-specific analysis.

Complete Schema Diagram¶

Figure: Neo4j schema. Full schema showing node structure, relationship properties, and an example connection.

Node Structure: `:Character`¶

Every character in the network has the base label :Character plus arc-specific labels.

Multi-Label Pattern¶

(:Character:ChūninExams:SasukeRetrieval:PainsAssault)

Why multi-labels?

Simple queries: MATCH (c:ChūninExams:SasukeRetrieval) finds characters in both arcs
Built-in counting: size(labels(c)) - 1 = number of arcs
More intuitive than separate Arc nodes with relationships
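
The counting rule above can also be mirrored on the client side once labels have been fetched. The following is a minimal sketch, assuming labels arrive as a plain list of strings; the helper names `arc_labels`, `arc_count`, and `shares_arcs` are hypothetical and not part of the naruto_net package.

```python
BASE_LABEL = "Character"  # the label every node carries


def arc_labels(labels):
    """Return the arc-specific labels, i.e. everything except the base label."""
    return sorted(label for label in labels if label != BASE_LABEL)


def arc_count(labels):
    """Client-side equivalent of size(labels(c)) - 1 for a :Character node."""
    return len(arc_labels(labels))


def shares_arcs(labels, required):
    """True if the node carries every arc label listed in `required`."""
    return set(required) <= set(labels)


naruto = ["Character", "ChūninExams", "SasukeRetrieval", "PainsAssault"]
print(arc_count(naruto))
print(shares_arcs(naruto, ["ChūninExams", "SasukeRetrieval"]))
```

Note that the `- 1` in the Cypher expression only gives the arc count if every node really carries the base label, which is what the schema above states.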

Character Properties¶

Property	Type	Description	Indexed
`character_id`	INTEGER	Unique identifier	✓ (unique)
`name`	STRING	Canonical name	✓ (unique)
`aliases`	LIST\<STRING>	Known nicknames/titles
`affiliation_primary`	STRING	Main village/organization	✓
`affiliation_detail`	STRING	Squad/team membership
`affiliation_changes`	STRING	Defection tracking (e.g., Sasuke)
`first_appearance_arc`	STRING	Debut arc
`role_type`	STRING	protagonist/antagonist/supporting	✓
`estimated_importance`	INTEGER	1-10 centrality estimate	✓

Example node:

(:Character:ChūninExams:SasukeRetrieval {
  character_id: 1,
  name: "Naruto Uzumaki",
  aliases: ["Naruto", "Nine-Tails", "Hokage"],
  affiliation_primary: "Konoha",
  affiliation_detail: "Team 7",
  role_type: "protagonist",
  estimated_importance: 10
})

Relationship Structure: `:CONNECTED`¶

Characters are connected with a single relationship type that stores semantic information in properties.

Why Single Relationship Type?¶

Simplicity: Easier to query all connections without UNION clauses
Flexibility: Relationship semantics stored in the relationship_types array
Data lineage: The source property distinguishes manual from extracted edges

Relationship Properties¶

Property	Type	Description
`arc`	STRING	"Chunin Exams" \| "Sasuke Retrieval" \| "Pain's Assault"
`weight`	INTEGER	Scene co-appearance count (1 for manual seeds)
`source`	STRING	"manual_seed" \| "subtitle_extraction"
`relationship_types`	LIST\<STRING>	["team", "rivals", "enemies", "mentors", "siblings", "parents", "co_appearance"]
`episodes`	LIST\<INTEGER>	Episode numbers (empty for manual seeds)
`confidence`	FLOAT	1.0 for manual, 0.0-1.0 for extracted

Controlled Vocabulary for `relationship_types`¶

team — Same squad/organization
rivals — Competitive relationship
enemies — Active antagonism
mentors — Teacher/student dynamic
siblings — Family relationship (blood or found family)
parents — Parent/child relationship
co_appearance — Generic co-occurrence (for extracted data)

Example relationship:

(naruto:Character)-[:CONNECTED {
  arc: "Chunin Exams",
  weight: 1,
  source: "manual_seed",
  relationship_types: ["team", "rivals"],
  episodes: [],
  confidence: 1.0
}]->(sasuke:Character)
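
The property table and controlled vocabulary above lend themselves to a pre-import check. The sketch below is one possible way to validate a relationship's properties before generating Cypher; `validate_edge` is a hypothetical helper, and the specific rules it enforces are read off the tables in this section rather than taken from the package.

```python
ALLOWED_TYPES = {
    "team", "rivals", "enemies", "mentors",
    "siblings", "parents", "co_appearance",
}
ALLOWED_SOURCES = {"manual_seed", "subtitle_extraction"}


def validate_edge(props):
    """Return a list of problems found in a :CONNECTED property dict."""
    problems = []

    unknown = set(props.get("relationship_types", [])) - ALLOWED_TYPES
    if unknown:
        problems.append(f"unknown relationship_types: {sorted(unknown)}")

    source = props.get("source")
    if source not in ALLOWED_SOURCES:
        problems.append(f"unknown source: {source!r}")

    confidence = props.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number between 0.0 and 1.0")

    weight = props.get("weight")
    if not isinstance(weight, int) or weight < 1:
        problems.append("weight must be an integer >= 1")

    # Manual seeds are documented as weight 1, confidence 1.0, no episodes.
    if source == "manual_seed":
        if confidence != 1.0 or weight != 1 or props.get("episodes"):
            problems.append("manual seeds need weight 1, confidence 1.0, empty episodes")

    return problems


example = {
    "arc": "Chunin Exams", "weight": 1, "source": "manual_seed",
    "relationship_types": ["team", "rivals"], "episodes": [], "confidence": 1.0,
}
print(validate_edge(example))
```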

Bidirectional Creation¶

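For symmetric co-appearance relationships, we create both (A→B) and (B→A) edges. This ensures that directed patterns such as "MATCH (n:Character)-[:CONNECTED]->(m)" return every neighbor of n, whichever character was listed first when the edge was created. Because each pair is stored twice, counts taken over all :CONNECTED relationships are double the number of distinct pairs.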
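
As an illustration of the mirroring step, here is a minimal sketch that takes edges as `(source, target, props)` tuples and emits both directions, skipping any reverse edge that is already present for the same arc. The function name `mirror_edges` and the tuple layout are assumptions made for this example.

```python
def mirror_edges(edges):
    """Return each (source, target, props) edge plus its reverse, without duplicates.

    Two edges count as duplicates when they share direction and arc.
    """
    seen = set()
    result = []
    for source, target, props in edges:
        for a, b in ((source, target), (target, source)):
            key = (a, b, props.get("arc"))
            if key not in seen:
                seen.add(key)
                result.append((a, b, dict(props)))  # copy so edges stay independent
    return result


seed = [("Naruto Uzumaki", "Sasuke Uchiha", {"arc": "Chunin Exams", "weight": 1})]
for edge in mirror_edges(seed):
    print(edge[0], "->", edge[1])
```

One consequence of storing both directions is that an undirected pattern such as `(n)-[:CONNECTED]-(m)` matches each pair once per stored relationship, so directed patterns are the safer default with this layout.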

Processing Pipeline Architecture¶

The naruto_net Python package implements a modular pipeline:

Pipeline Stages¶

graph LR
    A[1. IO<br/>Parse .ass] --> B[2. Normalize<br/>Clean text]
    B --> C[3. Segment<br/>Detect scenes]
    C --> D[4. Detect<br/>Find mentions]
    D --> E[5. Build<br/>Create edges]
    E --> F[6. QC<br/>Validate]

    style A fill:#E3F2FD
    style B fill:#F3E5F5
    style C fill:#FFF3E0
    style D fill:#E8F5E9
    style E fill:#FFE0B2
    style F fill:#FFEBEE

Module Breakdown¶

1. IO (`src/naruto_net/io/`)¶

Purpose: Parse .ass subtitle files into structured events

Key classes:

AssReader — Loads .ass files, extracts dialogue lines
SubtitleEvent — Dataclass for (start, end, text) tuples

Output: data/intermediate/events_<episode>.csv
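
To make the parsing step concrete, the sketch below turns a single `Dialogue:` line into a `(start, end, text)` record. It is not the package's `AssReader`; the `SubtitleEvent` here is a stand-in dataclass, and the code assumes the common field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text), whereas a robust reader would take the order from the file's `Format:` line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtitleEvent:
    """Stand-in for the package's dataclass: times in seconds plus raw text."""
    start: float
    end: float
    text: str


def parse_timestamp(value):
    """Convert an ASS timestamp such as '0:01:23.45' to seconds."""
    hours, minutes, seconds = value.strip().split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_dialogue_line(line):
    """Parse one 'Dialogue:' line; return None for any other line."""
    if not line.startswith("Dialogue:"):
        return None
    # Split at most 9 times so commas inside the text field are preserved.
    fields = line[len("Dialogue:"):].split(",", 9)
    if len(fields) < 10:
        return None
    return SubtitleEvent(
        start=parse_timestamp(fields[1]),
        end=parse_timestamp(fields[2]),
        text=fields[9].strip(),
    )


sample = "Dialogue: 0,0:00:01.50,0:00:03.00,Default,,0,0,0,,Hello, Sasuke!"
print(parse_dialogue_line(sample))
```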

2. Normalize (`src/naruto_net/normalize/`)¶

Purpose: Clean ASS formatting tags and split multi-speaker lines

Functions:

strip_ass_tags() — Remove {\i1}, {\b1}, etc.
normalize_newlines() — Convert \N to space
split_multi_speaker() — Handle "Speaker A: ... Speaker B: ..." lines

Output: data/intermediate/utterances_<episode>.csv
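
The two simpler cleaning functions can be sketched with regular expressions. This is an assumed implementation that reuses the function names listed above, not the package's actual code; treating the soft break `\n` like `\N` and collapsing repeated whitespace are additions made for this example. `split_multi_speaker()` is left out because its input convention is not specified here.

```python
import re

# ASS override blocks are enclosed in curly braces, e.g. {\i1} or {\b1}.
ASS_TAG = re.compile(r"\{[^}]*\}")


def strip_ass_tags(text):
    """Remove override blocks such as {\\i1} from a subtitle line."""
    return ASS_TAG.sub("", text)


def normalize_newlines(text):
    """Replace ASS line breaks with spaces and collapse repeated whitespace."""
    # "\\N" is the two-character sequence backslash + N as it appears in the file.
    text = text.replace("\\N", " ").replace("\\n", " ")
    return re.sub(r"\s+", " ", text).strip()


raw = "{\\i1}Naruto!{\\i0}\\NWait up!"
print(normalize_newlines(strip_ass_tags(raw)))
```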

3. Segment (`src/naruto_net/segment/`)¶

Purpose: Group utterances into scenes using timing gaps

Logic: If the gap between consecutive lines is > 3 seconds → scene break

Output: data/intermediate/scenes_<episode>.csv
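
The gap rule can be sketched as a single pass over time-ordered events. In this example the gap is measured from the latest end time seen so far to the next line's start, which is an assumption; the function name `segment_scenes` and the `(start, end, text)` tuple layout are also illustrative.

```python
def segment_scenes(events, max_gap=3.0):
    """Group (start, end, text) events, sorted by start time, into scenes.

    A new scene begins whenever the silence before a line exceeds max_gap seconds.
    """
    scenes = []
    latest_end = None
    for event in events:
        start, end, _text = event
        if latest_end is None or start - latest_end > max_gap:
            scenes.append([])  # open a new scene
        scenes[-1].append(event)
        # Track the furthest end time so overlapping lines do not create false gaps.
        latest_end = end if latest_end is None else max(latest_end, end)
    return scenes


events = [(0.0, 2.0, "a"), (2.5, 4.0, "b"), (8.0, 9.0, "c")]
for number, scene in enumerate(segment_scenes(events), start=1):
    print(number, [text for _, _, text in scene])
```

With the strict comparison used here, a gap of exactly 3 seconds does not start a new scene.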

4. Detect (`src/naruto_net/detect/`)¶

Purpose: Find character mentions using alias matching

Method: Word-boundary regex against character_aliases.json

Confidence heuristic:

Direct name match → 1.0
Alias match (e.g., "Pervy Sage" → Jiraiya) → 0.8
Honorific only (e.g., "Hokage") → 0.5

Output: data/intermediate/mentions_<episode>.csv
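
A possible shape for the matching step is sketched below. The in-memory alias table is illustrative only and does not reflect the real contents or layout of character_aliases.json; the function name `find_mentions` is hypothetical. Longer aliases are tried first so that a full name is not shadowed by a shorter alias it contains.

```python
import re

# Illustrative alias table: (surface form, canonical name, confidence).
ALIASES = [
    ("Naruto Uzumaki", "Naruto Uzumaki", 1.0),  # direct name match
    ("Jiraiya", "Jiraiya", 1.0),                # direct name match
    ("Naruto", "Naruto Uzumaki", 0.8),          # alias match
    ("Pervy Sage", "Jiraiya", 0.8),             # alias match
    ("Hokage", "Naruto Uzumaki", 0.5),          # honorific only
]


def find_mentions(text, aliases=ALIASES):
    """Return (canonical name, confidence) pairs in order of appearance."""
    mentions = []
    taken = []  # character spans already claimed by a longer alias
    for surface, name, confidence in sorted(aliases, key=lambda a: -len(a[0])):
        pattern = re.compile(r"\b" + re.escape(surface) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            start, end = match.span()
            if any(start < t_end and t_start < end for t_start, t_end in taken):
                continue  # overlaps a longer match, skip the shorter alias
            taken.append((start, end))
            mentions.append((start, name, confidence))
    return [(name, confidence) for _, name, confidence in sorted(mentions)]


print(find_mentions("Pervy Sage, where is Naruto Uzumaki?"))
```

The word-boundary anchors keep substrings inside longer words from matching, so a token such as "Narutomaki" yields no mention.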

5. Build (`src/naruto_net/build/`)¶

Purpose: Construct co-appearance edges from scene presence

Logic:

For each scene:

Collect all characters mentioned
Create edge between every pair (with weight = 1)
If pair appears in multiple scenes, increment weight

Output: data/processed/edges.csv
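
The edge-building logic above amounts to counting unordered pairs per scene. A minimal sketch, assuming each scene has already been reduced to a collection of canonical character names (the function name `build_edges` is hypothetical):

```python
from collections import Counter
from itertools import combinations


def build_edges(scenes):
    """Count co-appearances: one increment per scene for every pair of characters."""
    weights = Counter()
    for characters in scenes:
        # Deduplicate and sort so each pair has one canonical key, e.g. (A, B) not (B, A).
        for pair in combinations(sorted(set(characters)), 2):
            weights[pair] += 1
    return weights


scenes = [
    ["Naruto", "Sasuke", "Sakura"],
    ["Sasuke", "Naruto", "Naruto"],  # repeated mentions count once per scene
    ["Kakashi"],                     # a single character produces no edge
]
for (a, b), weight in sorted(build_edges(scenes).items()):
    print(a, b, weight)
```

Keying each pair in sorted order keeps the counts undirected; mirrored (A→B) and (B→A) rows can then be generated at import time.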

6. QC (`src/naruto_net/qc/`)¶

Purpose: Generate quality control reports

Reports:

Episode QC: Events parsed, scenes created, characters detected per episode
Alias QC: Which aliases matched most frequently, any shadowing issues

Output: data/reports/episode_qc.csv, data/reports/alias_qc.csv

Data Flow Example¶

See the How It Works page for a detailed walkthrough of Episode 154 processing.

Directory Structure¶

naruto-network-graph/
├── src/naruto_net/          # Python package (pip install -e .)
│   ├── io/                  # .ass parsing
│   ├── normalize/           # Text cleaning
│   ├── segment/             # Scene detection
│   ├── detect/              # Character mention matching
│   ├── build/               # Edge construction
│   └── qc/                  # Quality control reports
├── data/
│   ├── intermediate/        # Pipeline outputs (gitignored)
│   ├── processed/           # Final edges.csv, scene_character.csv
│   ├── reports/             # QC reports (gitignored)
│   └── *.csv                # Character metadata, canonical edges
├── outputs/
│   ├── cypher_import_characters.cypher  # Neo4j import (87 chars)
│   ├── cypher_import_edges.cypher       # Neo4j import (36 edges)
│   └── analytical_queries.cypher        # 7 validation queries
├── scripts/
│   ├── 00_ass_ingest_subset.py          # End-to-end pipeline runner
│   ├── import_characters_to_neo4j.py    # Generate Cypher for chars
│   └── import_test_edges.py             # Generate Cypher for edges
└── tests/                   # 4 test suites with fixtures

Design Decisions¶

Why Neo4j?¶

Graph-native storage: Network queries (shortest path, centrality, community detection) are first-class operations

Cypher expressiveness: Queries like "find all paths between Pain and Konoha characters" are concise

Visualization: Built-in Neo4j Browser for quick exploration

Flexibility: Can mix structured properties with graph traversals

Why Subtitle Files?¶

Precision: Subtitle timestamps (centisecond resolution in .ass) give precise dialogue timing, from which scene boundaries are inferred

Availability: Fansub .ass files widely available for popular anime

Speaker proxy: Character mentions in dialogue ≈ character presence in scene

Reproducibility: Subtitle files are the source of truth, not manual annotation

Why Multi-Label Pattern?¶

Query simplicity: MATCH (c:ChūninExams) is cleaner than MATCH (c)-[:APPEARS_IN]->(:Arc {name: "Chunin Exams"})

Storage efficiency: No need for separate Arc nodes

Built-in arc counting: labels(c) gives immediate arc participation

Tradeoff: Less flexible as the number of arcs grows (labels would become unwieldy at 100+ arcs), but well suited to our 3-arc scope

Performance Considerations¶

Current Scale¶

87 unique characters
150 character-arc records (87 chars × avg 1.7 arcs each)
36 manual edges (72 bidirectional)
~500 subtitle files across 3 arcs

Expected Scale (Full Pipeline)¶

~1,200 scenes per arc (30 episodes × 40 scenes/episode)
~10,000 edges per arc (estimated)
Query time: <100ms for typical centrality queries

Optimization Notes¶

Indexes on: character_id, name, affiliation_primary, role_type, estimated_importance
Cypher hints: Use USING INDEX to force an index-backed lookup of a query's starting node when the planner picks a poor plan
Batch imports: Use UNWIND for bulk edge creation (see cypher_import_edges.cypher)
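
As a rough illustration of the batching idea, the sketch below splits edge rows into fixed-size chunks, each of which could be passed as the `$rows` parameter of an UNWIND query. The query text, the row keys (`src_name`, `dst_name`, `origin`), and the batch size are assumptions for this example and are not the contents of cypher_import_edges.cypher.

```python
# Hypothetical query: one round trip creates or updates a whole batch of edges.
UNWIND_EDGES = """
UNWIND $rows AS row
MATCH (a:Character {name: row.src_name})
MATCH (b:Character {name: row.dst_name})
MERGE (a)-[r:CONNECTED {arc: row.arc}]->(b)
SET r.weight = row.weight, r.source = row.origin
"""


def batches(rows, size=500):
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


rows = [
    {"src_name": "Naruto Uzumaki", "dst_name": "Sasuke Uchiha",
     "arc": "Chunin Exams", "weight": 1, "origin": "manual_seed"},
    {"src_name": "Sasuke Uchiha", "dst_name": "Naruto Uzumaki",
     "arc": "Chunin Exams", "weight": 1, "origin": "manual_seed"},
]
for batch in batches(rows, size=1):
    # With a database driver, each batch would be sent as the $rows parameter.
    print(len(batch), "row(s) ready for UNWIND")
```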

Next Steps¶

How It Works — See the pipeline in action
API Reference — Explore the naruto_net package
Troubleshooting — Common issues and solutions