Architecture¶
This page explains the technical architecture of the character network database and processing pipeline.
System Overview¶
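The Naruto character network system consists of three main layers: processing (subtitle files to edge CSVs), storage (the Neo4j graph database), and analysis/visualization: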
graph TB
A[Subtitle Files<br/>.ass format] -->|Parse| B[Processing Pipeline<br/>naruto_net package]
B -->|Extract| C[Intermediate Data<br/>Events, Scenes, Mentions]
C -->|Build| D[Network Edges<br/>CSV format]
D -->|Import| E[Neo4j Graph DB<br/>:Character :CONNECTED]
E -->|Query| F[Network Analysis<br/>Centrality, Communities]
F -->|Visualize| G[Interactive Viz<br/>D3.js]
style A fill:#FFF3E0
style B fill:#E8F5E9
style C fill:#F3E5F5
style D fill:#E3F2FD
style E fill:#FFE0B2
style F fill:#F1F8E9
style G fill:#FCE4EC
Neo4j Graph Database Schema¶
The character network is stored in a Neo4j graph database using a multi-label pattern for arc-specific analysis.
Complete Schema Diagram¶
Full schema showing node structure, relationship properties, and example connection
Node Structure: :Character¶
Every character in the network has the base label :Character plus arc-specific labels.
Multi-Label Pattern¶
Why multi-labels?
- Simple queries: MATCH (c:ChūninExams:SasukeRetrieval) finds characters in both arcs
- Built-in counting: size(labels(c)) - 1 gives the number of arcs
- More intuitive than separate Arc nodes with relationships
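The counting rule can be mirrored on the client side. The following is a minimal sketch, assuming labels arrive as a plain list of strings (as the result of labels(c) would); the helper names arc_labels and shared_arcs are illustrative and not part of the naruto_net package.

```python
BASE_LABEL = "Character"  # every node carries this label in addition to its arc labels


def arc_labels(labels):
    """Return the arc-specific labels, i.e. everything except the base label."""
    return [label for label in labels if label != BASE_LABEL]


def shared_arcs(labels_a, labels_b):
    """Return the arcs in which both characters appear, sorted by name."""
    return sorted(set(arc_labels(labels_a)) & set(arc_labels(labels_b)))


naruto = ["Character", "ChūninExams", "SasukeRetrieval"]
# len(arc_labels(naruto)) is the same quantity as size(labels(c)) - 1 in Cypher
arc_count = len(arc_labels(naruto))
```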
Character Properties¶
| Property | Type | Description | Indexed |
|---|---|---|---|
| character_id | INTEGER | Unique identifier | ✓ (unique) |
| name | STRING | Canonical name | ✓ (unique) |
| aliases | LIST\<STRING> | Known nicknames/titles | |
| affiliation_primary | STRING | Main village/organization | ✓ |
| affiliation_detail | STRING | Squad/team membership | |
| affiliation_changes | STRING | Defection tracking (e.g., Sasuke) | |
| first_appearance_arc | STRING | Debut arc | |
| role_type | STRING | protagonist/antagonist/supporting | ✓ |
| estimated_importance | INTEGER | 1-10 centrality estimate | ✓ |
Example node:
(:Character:ChūninExams:SasukeRetrieval {
character_id: 1,
name: "Naruto Uzumaki",
aliases: ["Naruto", "Nine-Tails", "Hokage"],
affiliation_primary: "Konoha",
affiliation_detail: "Team 7",
role_type: "protagonist",
estimated_importance: 10
})
Relationship Structure: :CONNECTED¶
Characters are connected with a single relationship type that stores semantic information in properties.
Why Single Relationship Type?¶
Simplicity: Easier to query all connections without UNION clauses
Flexibility: Relationship semantics stored in relationship_types array
Data lineage: source property distinguishes manual vs extracted edges
Relationship Properties¶
| Property | Type | Description |
|---|---|---|
| arc | STRING | "Chunin Exams", "Sasuke Retrieval", or "Pain's Assault" |
| weight | INTEGER | Scene co-appearance count (1 for manual seeds) |
| source | STRING | "manual_seed" or "subtitle_extraction" |
| relationship_types | LIST\<STRING> | ["team", "rivals", "enemies", "mentors", "siblings", "parents"] |
| episodes | LIST\<INTEGER> | Episode numbers (empty for manual seeds) |
| confidence | FLOAT | 1.0 for manual, 0.0-1.0 for extracted |
Controlled Vocabulary for relationship_types¶
- team: Same squad/organization
- rivals: Competitive relationship
- enemies: Active antagonism
- mentors: Teacher/student dynamic
- siblings: Family relationship (blood or found family)
- parents: Parent/child relationship
- co_appearance: Generic co-occurrence (for extracted data)
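A controlled vocabulary is easy to enforce before import. This is a minimal sketch, assuming relationship_types is available as a Python list; the constant and function names are hypothetical and do not come from the package.

```python
# The seven values listed in the controlled vocabulary above.
ALLOWED_RELATIONSHIP_TYPES = frozenset({
    "team", "rivals", "enemies", "mentors",
    "siblings", "parents", "co_appearance",
})


def invalid_relationship_types(relationship_types):
    """Return the values that are not in the vocabulary, in input order."""
    return [t for t in relationship_types if t not in ALLOWED_RELATIONSHIP_TYPES]
```

An empty return value means the edge passes; anything else can be reported by a QC step before the Cypher import is generated.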
Example relationship:
(naruto:Character)-[:CONNECTED {
arc: "Chunin Exams",
weight: 1,
source: "manual_seed",
relationship_types: ["team", "rivals"],
episodes: [],
confidence: 1.0
}]->(sasuke:Character)
Bidirectional Creation¶
For symmetric co-appearance relationships, we create both (A→B) and (B→A) edges. This ensures that a directed pattern such as "MATCH (n:Character)-[:CONNECTED]->(m)" returns the connection whichever of the two characters is bound to n.
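The mirroring step could look like the sketch below. It assumes each edge is a dict with src, dst and arc keys; those key names are hypothetical (the real CSV columns may differ), and src/dst are used deliberately to avoid confusion with the source lineage property.

```python
def mirror_edges(edges):
    """Return every edge plus its reverse, without duplicating existing pairs."""
    seen = set()
    mirrored = []
    for edge in edges:
        # Copy all properties and swap only the endpoints.
        reverse = {**edge, "src": edge["dst"], "dst": edge["src"]}
        for candidate in (edge, reverse):
            key = (candidate["src"], candidate["dst"], candidate["arc"])
            if key not in seen:
                seen.add(key)
                mirrored.append(candidate)
    return mirrored


seed = [{"src": "Naruto Uzumaki", "dst": "Sasuke Uchiha",
         "arc": "Chunin Exams", "weight": 1}]
both_directions = mirror_edges(seed)
```

Because already-present pairs are skipped, running the function on its own output leaves the edge list unchanged.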
Processing Pipeline Architecture¶
The naruto_net Python package implements a modular pipeline:
Pipeline Stages¶
graph LR
A[1. IO<br/>Parse .ass] --> B[2. Normalize<br/>Clean text]
B --> C[3. Segment<br/>Detect scenes]
C --> D[4. Detect<br/>Find mentions]
D --> E[5. Build<br/>Create edges]
E --> F[6. QC<br/>Validate]
style A fill:#E3F2FD
style B fill:#F3E5F5
style C fill:#FFF3E0
style D fill:#E8F5E9
style E fill:#FFE0B2
style F fill:#FFEBEE
Module Breakdown¶
1. IO (src/naruto_net/io/)¶
Purpose: Parse .ass subtitle files into structured events
Key classes:
- AssReader: Loads .ass files, extracts dialogue lines
- SubtitleEvent: Dataclass for (start, end, text) tuples
Output: data/intermediate/events_<episode>.csv
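To make the parsing step concrete, here is a minimal sketch of turning one Dialogue line into an event. It is not the package's AssReader: the SubtitleEvent below is a stand-in with assumed field types, parse_timestamp and parse_dialogue_line are hypothetical helpers, and the code assumes the standard field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text) rather than reading the file's Format line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtitleEvent:
    start: float  # seconds from the start of the episode
    end: float    # seconds from the start of the episode
    text: str


def parse_timestamp(value):
    """Convert an ASS timestamp such as '0:01:02.50' to seconds."""
    hours, minutes, seconds = value.strip().split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_dialogue_line(line):
    """Parse one 'Dialogue:' line; return None for any other line."""
    if not line.startswith("Dialogue:"):
        return None
    # The text field may itself contain commas, so split at most 9 times.
    fields = line[len("Dialogue:"):].split(",", 9)
    if len(fields) < 10:
        return None
    return SubtitleEvent(
        start=parse_timestamp(fields[1]),
        end=parse_timestamp(fields[2]),
        text=fields[9].strip(),
    )
```

Limiting the split to nine commas is the important detail: dialogue text routinely contains commas, and an unbounded split would truncate it.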
2. Normalize (src/naruto_net/normalize/)¶
Purpose: Clean ASS formatting tags and split multi-speaker lines
Functions:
- strip_ass_tags(): Remove {\i1}, {\b1}, etc.
- normalize_newlines(): Convert \N to space
- split_multi_speaker(): Handle "Speaker A: ... Speaker B: ..." lines
Output: data/intermediate/utterances_<episode>.csv
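The first two functions can be sketched with regular expressions. The function names come from the list above, but the bodies below are an assumption about how they might work, not the package's actual code; split_multi_speaker is omitted because its input conventions are not described here.

```python
import re

# ASS override blocks are wrapped in braces, e.g. {\i1} or {\b1}.
ASS_TAG = re.compile(r"\{[^}]*\}")


def strip_ass_tags(text):
    """Remove override blocks such as {\\i1} from a dialogue line."""
    return ASS_TAG.sub("", text)


def normalize_newlines(text):
    """Replace the ASS line break marker \\N with a space and tidy whitespace."""
    return re.sub(r"\s+", " ", text.replace(r"\N", " ")).strip()


raw = r"{\i1}Naruto,\Nwait!{\i0}"
clean = normalize_newlines(strip_ass_tags(raw))
```

Raw strings matter here: in a regular Python string literal, a backslash followed by N starts a named Unicode escape and would not compile.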
3. Segment (src/naruto_net/segment/)¶
Purpose: Group utterances into scenes using timing gaps
Logic: If gap between consecutive lines > 3 seconds → scene break
Output: data/intermediate/scenes_<episode>.csv
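The gap rule can be sketched as follows. The sketch assumes events are (start, end, text) tuples with times in seconds and that the gap is measured from the latest end time seen so far to the next start time; the function name and the max_gap parameter are illustrative.

```python
def segment_scenes(events, max_gap=3.0):
    """Group events into scenes; a gap longer than max_gap starts a new scene."""
    scenes = []
    latest_end = None
    for event in sorted(events, key=lambda e: e[0]):
        start, end, _text = event
        if latest_end is None or start - latest_end > max_gap:
            scenes.append([])  # open a new scene
        scenes[-1].append(event)
        # Track the furthest end time so overlapping lines do not shrink the gap.
        latest_end = end if latest_end is None else max(latest_end, end)
    return scenes
```

A gap of exactly 3 seconds stays in the same scene, matching the strict "> 3 seconds" condition above.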
4. Detect (src/naruto_net/detect/)¶
Purpose: Find character mentions using alias matching
Method: Word-boundary regex against character_aliases.json
Confidence heuristic:
- Direct name match → 1.0
- Alias match (e.g., "Pervy Sage" → Jiraiya) → 0.8
- Honorific only (e.g., "Hokage") → 0.5
Output: data/intermediate/mentions_<episode>.csv
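A word-boundary matcher with per-alias confidence might look like the sketch below. The structure of character_aliases.json is not shown on this page, so the mapping used here (canonical name to a list of alias/confidence pairs) is an assumption, as are the function names and the case-insensitive matching.

```python
import re


def build_patterns(aliases):
    """Compile one word-boundary regex per alias.

    aliases maps a canonical name to a list of (alias, confidence) pairs.
    """
    patterns = []
    for canonical, entries in aliases.items():
        for alias, confidence in entries:
            regex = re.compile(r"\b" + re.escape(alias) + r"\b", re.IGNORECASE)
            patterns.append((regex, canonical, confidence))
    return patterns


def detect_mentions(text, patterns):
    """Return {canonical name: highest confidence among its matching aliases}."""
    found = {}
    for regex, canonical, confidence in patterns:
        if regex.search(text):
            found[canonical] = max(confidence, found.get(canonical, 0.0))
    return found
```

The word boundaries prevent short names from matching inside ordinary words (for example, a three-letter name inside "said"), and re.escape keeps punctuation in aliases from being read as regex syntax.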
5. Build (src/naruto_net/build/)¶
Purpose: Construct co-appearance edges from scene presence
Logic:
For each scene:
- Collect all characters mentioned
- Create edge between every pair (with weight = 1)
- If pair appears in multiple scenes, increment weight
Output: data/processed/edges.csv
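The logic above can be sketched with the standard library. The input shape (one collection of character names per scene) and the function name build_edges are assumptions for illustration; the package's actual interface may differ.

```python
from collections import Counter
from itertools import combinations


def build_edges(scene_characters):
    """Count, for each unordered character pair, the scenes they share.

    scene_characters is an iterable with one collection of names per scene.
    """
    weights = Counter()
    for characters in scene_characters:
        # De-duplicate and sort so each pair has a single canonical key
        # and is counted at most once per scene.
        for pair in combinations(sorted(set(characters)), 2):
            weights[pair] += 1
    return weights


scenes = [
    ["Naruto", "Sasuke", "Sakura"],
    ["Sasuke", "Naruto"],
    ["Kakashi"],  # a single character produces no edge
]
edge_weights = build_edges(scenes)
```

Sorting the names gives each pair one key regardless of mention order, so weights accumulate on a single undirected edge that can later be mirrored for import.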
6. QC (src/naruto_net/qc/)¶
Purpose: Generate quality control reports
Reports:
- Episode QC: Events parsed, scenes created, characters detected per episode
- Alias QC: Which aliases matched most frequently, any shadowing issues
Output: data/reports/episode_qc.csv, data/reports/alias_qc.csv
Data Flow Example¶
See the How It Works page for a detailed walkthrough of Episode 154 processing.
Directory Structure¶
naruto-network-graph/
├── src/naruto_net/ # Python package (pip install -e .)
│ ├── io/ # .ass parsing
│ ├── normalize/ # Text cleaning
│ ├── segment/ # Scene detection
│ ├── detect/ # Character mention matching
│ ├── build/ # Edge construction
│ └── qc/ # Quality control reports
├── data/
│ ├── intermediate/ # Pipeline outputs (gitignored)
│ ├── processed/ # Final edges.csv, scene_character.csv
│ ├── reports/ # QC reports (gitignored)
│ └── *.csv # Character metadata, canonical edges
├── outputs/
│ ├── cypher_import_characters.cypher # Neo4j import (87 chars)
│ ├── cypher_import_edges.cypher # Neo4j import (36 edges)
│ └── analytical_queries.cypher # 7 validation queries
├── scripts/
│ ├── 00_ass_ingest_subset.py # End-to-end pipeline runner
│ ├── import_characters_to_neo4j.py # Generate Cypher for chars
│ └── import_test_edges.py # Generate Cypher for edges
└── tests/ # 4 test suites with fixtures
Design Decisions¶
Why Neo4j?¶
Graph-native storage: Network queries (shortest path, centrality, community detection) are first-class operations
Cypher expressiveness: Queries like "find all paths between Pain and Konoha characters" are concise
Visualization: Built-in Neo4j Browser for quick exploration
Flexibility: Can mix structured properties with graph traversals
Why Subtitle Files?¶
Precision: Subtitle timestamps (.ass uses centisecond resolution) give precise dialogue timing, from which scene boundaries are inferred
Availability: Fansub .ass files widely available for popular anime
Speaker proxy: Character mentions in dialogue ≈ character presence in scene
Reproducibility: Subtitle files are the source of truth, not manual annotation
Why Multi-Label Pattern?¶
Query simplicity: MATCH (c:ChūninExams) is cleaner than MATCH (c)-[:APPEARS_IN]->(:Arc {name: "Chunin Exams"})
Storage efficiency: No need for separate Arc nodes
Built-in arc counting: labels(c) gives immediate arc participation
Tradeoff: Scales poorly with the number of arcs (100+ arc labels would be unwieldy), but well suited to our 3-arc scope
Performance Considerations¶
Current Scale¶
- 87 unique characters
- 150 character-arc records (87 chars × avg ~1.7 arcs each)
- 36 manual edges (72 bidirectional)
- ~500 subtitle files across 3 arcs
Expected Scale (Full Pipeline)¶
- ~1,200 scenes per arc (30 episodes × 40 scenes/episode)
- ~10,000 edges per arc (estimated)
- Query time: <100ms for typical centrality queries
Optimization Notes¶
- Indexes on: character_id, name, affiliation_primary, role_type, estimated_importance
- Cypher hints: Use USING INDEX for large traversals
- Batch imports: Use UNWIND for bulk edge creation (see cypher_import_edges.cypher)
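As an illustration of the batch-import note, the sketch below pairs an UNWIND query with a small chunking helper. The query text is an assumption and is not copied from cypher_import_edges.cypher; the row keys src, dst, arc and weight are hypothetical, as is the batch size.

```python
# Hypothetical UNWIND query: one round trip creates a whole batch of edges.
UNWIND_EDGES = """
UNWIND $rows AS row
MATCH (a:Character {name: row.src})
MATCH (b:Character {name: row.dst})
MERGE (a)-[r:CONNECTED {arc: row.arc}]->(b)
SET r.weight = row.weight
"""


def batches(rows, size=1000):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```

With the official neo4j Python driver, each batch would be passed as the rows parameter, for example session.run(UNWIND_EDGES, rows=batch), so the server processes many edges per request instead of one statement per edge.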
Next Steps¶
- How It Works: See the pipeline in action
- API Reference: Explore the naruto_net package
- Troubleshooting: Common issues and solutions