How It Works¶
This page traces a real scene from Naruto Shippuden Episode 154 through the entire pipeline—from raw subtitle text to character co-appearance edges.
Overview: From Subtitles to Network¶
The pipeline transforms subtitle files into character networks in 5 stages:
- Parse raw subtitle lines with timestamps
- Segment dialogue into scenes using timing gaps
- Detect character mentions via alias matching
- Map character presence per scene
- Build co-appearance edges
Let's follow Episode 154, Scene 11:16 - 14:39 step by step.
Stage 1: Raw Subtitles¶
Each .ass subtitle file contains timed dialogue lines. There are no speaker labels—just timestamps and text.
Sample from Episode 154:
0:11:16.77 So what should we do, Lady Hokage?
0:11:20.27 Shizune is heading up the autopsy...
0:11:29.03 Naruto! You get in Master Shizune's way,
0:11:35.53 I will avenge Pervy Sage!
0:11:43.23 I can't just sit around doing nothing!
0:12:48.17 Yes. Lord Jiraiya was the only one capable of using it in this village.
0:12:52.37 Pain defeated Jiraiya who had summoned Lord Fukasaku and Lady Shima.
0:13:12.93 I believe that the only one who can do that is you, Naruto-boy.
0:13:42.03 Is that fine with you, Tsunade?
0:13:45.03 Naruto, go.
0:13:52.33 Pervy Sage did it, right? Then I can't lose to him!
0:14:18.50 I'm counting on you with the code, Shikamaru.
File Format
.ass files (Advanced SubStation Alpha) include styling tags like {\i1} (italic) and \N (newline). The parser strips these before analysis.
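As a rough illustration of the parsing and tag-stripping step, the sketch below reads one `Dialogue:` line in the standard ASS field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text). The names `SubtitleEvent`, `parse_dialogue`, and `clean_text` are hypothetical and are not taken from the `naruto_net` package, and the end time in the sample line is invented for the example.

```python
import re
from typing import NamedTuple, Optional


class SubtitleEvent(NamedTuple):
    start: float  # seconds from the start of the episode
    end: float    # seconds from the start of the episode
    text: str     # dialogue with styling removed


OVERRIDE_TAG = re.compile(r"\{[^}]*\}")  # override blocks such as {\i1}
LINE_BREAK = re.compile(r"\\[Nn]")       # \N (hard) and \n (soft) breaks


def to_seconds(stamp: str) -> float:
    """Convert an ASS timestamp such as '0:11:16.77' to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def clean_text(raw: str) -> str:
    """Drop override tags, turn line breaks into spaces, collapse whitespace."""
    text = OVERRIDE_TAG.sub("", raw)
    text = LINE_BREAK.sub(" ", text)
    return " ".join(text.split())


def parse_dialogue(line: str) -> Optional[SubtitleEvent]:
    """Parse one 'Dialogue:' line; return None for any other line."""
    if not line.startswith("Dialogue:"):
        return None
    # Ten fields; the text comes last and may itself contain commas.
    fields = line[len("Dialogue:"):].strip().split(",", 9)
    if len(fields) < 10:
        return None
    return SubtitleEvent(to_seconds(fields[1]), to_seconds(fields[2]),
                         clean_text(fields[9]))


# Sample line: the end time (0:11:32.10) is made up for illustration.
sample = (r"Dialogue: 0,0:11:29.03,0:11:32.10,Default,,0,0,0,,"
          r"{\i1}Naruto!{\i0}\NYou get in Master Shizune's way,")
event = parse_dialogue(sample)
print(event.text)  # Naruto! You get in Master Shizune's way,
```

Splitting with a maximum of nine cuts keeps commas inside the dialogue text intact, which matters for lines like the one above.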
Stage 2: Scene Segmentation¶
When consecutive subtitle lines are separated by >3 seconds of silence, we split them into separate scenes.
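A minimal sketch of gap-based segmentation follows, assuming each event is a `(start, end, text)` tuple in seconds and that silence is measured from the end of the current scene to the start of the next line. The function name `segment_scenes` and the end times in the example are assumptions for illustration, not the package's actual code or data.

```python
from typing import List, Tuple

Event = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def segment_scenes(events: List[Event],
                   gap_threshold: float = 3.0) -> List[List[Event]]:
    """Group events into scenes, splitting where silence exceeds the threshold."""
    scenes: List[List[Event]] = []
    scene_end = 0.0  # latest end time seen in the current scene
    for event in sorted(events, key=lambda e: e[0]):
        start, end, _ = event
        if scenes and start - scene_end <= gap_threshold:
            scenes[-1].append(event)
            scene_end = max(scene_end, end)  # tolerate overlapping lines
        else:
            scenes.append([event])
            scene_end = end
    return scenes


# Start times come from the sample above; end times are assumed.
events = [
    (695.53, 697.80, "I will avenge Pervy Sage!"),
    (703.23, 706.00, "I can't just sit around doing nothing!"),
]
scenes = segment_scenes(events)
print(len(scenes))  # 2
```

Tracking the latest end time rather than the last line's end keeps overlapping subtitle lines from creating spurious breaks.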
Detected Gap¶
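Only start times are listed above, but the gap is measured from the end of one line to the start of the next. The line that starts at 0:11:35.53 ends at about 0:11:37, and the next line starts at 0:11:43.23, leaving a 5.43 second gap → scene break.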
Result: 2 Scenes¶
Scene A — 11:16 to 11:37 (4 lines)
0:11:16.77 So what should we do, Lady Hokage?
0:11:20.27 Shizune is heading up the autopsy...
0:11:29.03 Naruto! You get in Master Shizune's way,
0:11:35.53 I will avenge Pervy Sage!
Scene B — 11:43 to 14:39 (8 lines)
0:11:43.23 I can't just sit around doing nothing!
0:12:48.17 Yes. Lord Jiraiya was the only one capable of using it...
0:12:52.37 Pain defeated Jiraiya who had summoned Lord Fukasaku and Lady Shima.
0:13:12.93 I believe that the only one who can do that is you, Naruto-boy.
0:13:42.03 Is that fine with you, Tsunade?
0:13:45.03 Naruto, go.
0:13:52.33 Pervy Sage did it, right? Then I can't lose to him!
0:14:18.50 I'm counting on you with the code, Shikamaru.
Why 3 Seconds?
Testing on 20 episodes found that 3 seconds balances precision (not splitting mid-conversation) with recall (not lumping distant scenes together). The threshold is adjustable via config.
Stage 3: Character Detection¶
We scan each line for known names and aliases using word-boundary regex matching against character_aliases.json.
Alias Resolution Examples¶
| Text Found | → | Resolves To |
|---|---|---|
| Lady Hokage | → | Tsunade |
| Shizune | → | Shizune |
| Naruto, Naruto-boy | → | Naruto Uzumaki |
| Pervy Sage, Jiraiya | → | Jiraiya |
| Pain | → | Pain (Nagato) |
| Fukasaku | → | Fukasaku |
| Shima | → | Shima |
| Shikamaru | → | Shikamaru Nara |
Annotated Lines (Scene A)¶
So what should we do, [Lady Hokage]?
[Shizune] is heading up the autopsy...
[Naruto]! You get in Master [Shizune]'s way,
I will avenge [Pervy Sage]!
Annotated Lines (Scene B)¶
I can't just sit around doing nothing! ← No character mentions
Yes. Lord [Jiraiya] was the only one capable of using it...
[Pain] defeated [Jiraiya] who had summoned Lord [Fukasaku] and Lady [Shima].
I believe that the only one who can do that is you, [Naruto-boy].
Is that fine with you, [Tsunade]?
[Naruto], go.
[Pervy Sage] did it, right? Then I can't lose to him!
I'm counting on you with the code, [Shikamaru].
Longest-First Matching
Aliases are matched longest-first to avoid shadowing. "Lady Hokage" matches before "Hokage" alone, preventing false positives.
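The idea can be sketched with a single compiled pattern whose alternatives are sorted longest-first. The `ALIASES` dictionary here is a small hypothetical stand-in for character_aliases.json (whose real structure is not shown on this page), and case-sensitive matching is an assumption of this sketch.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical alias table: surface form -> canonical character name.
ALIASES: Dict[str, str] = {
    "Lady Hokage": "Tsunade",
    "Tsunade": "Tsunade",
    "Naruto-boy": "Naruto Uzumaki",
    "Naruto": "Naruto Uzumaki",
    "Pervy Sage": "Jiraiya",
    "Jiraiya": "Jiraiya",
    "Shizune": "Shizune",
}


def build_matcher(aliases: Dict[str, str]) -> "re.Pattern[str]":
    """Compile one regex with the longest aliases tried first."""
    ordered = sorted(aliases, key=len, reverse=True)
    alternatives = "|".join(re.escape(alias) for alias in ordered)
    return re.compile(rf"\b(?:{alternatives})\b")


def find_mentions(text: str, aliases: Dict[str, str],
                  matcher: "re.Pattern[str]") -> List[Tuple[str, str]]:
    """Return (matched alias, canonical name) pairs in order of appearance."""
    return [(m.group(0), aliases[m.group(0)]) for m in matcher.finditer(text)]


matcher = build_matcher(ALIASES)
line = "I believe that the only one who can do that is you, Naruto-boy."
print(find_mentions(line, ALIASES, matcher))
# [('Naruto-boy', 'Naruto Uzumaki')]
```

Because regex alternation takes the first alternative that matches at a position, sorting by length is what makes "Naruto-boy" win over "Naruto". Keeping the match case-sensitive also avoids treating the common noun "pain" as the character Pain.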
Stage 4: Scene-Character Presence¶
For each scene, we collect which characters were mentioned. This is our proxy for "who was in this scene."
Presence Matrix¶
| | Naruto | Jiraiya | Tsunade | Shizune | Pain | Fukasaku | Shima | Shikamaru |
|---|---|---|---|---|---|---|---|---|
| Scene A | ✓ | ✓ | ✓ | ✓ | — | — | — | — |
| Scene B | ✓ | ✓ | ✓ | — | ✓ | ✓ | ✓ | ✓ |
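One way to derive this matrix is to collapse the per-line mentions into a set of characters per scene. The sketch below assumes mentions arrive as `(scene_id, character)` pairs, listed here in the order they occur in the annotated lines; `build_presence` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_presence(mentions: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each scene to the set of characters mentioned in it."""
    presence: Dict[str, Set[str]] = defaultdict(set)
    for scene_id, character in mentions:
        presence[scene_id].add(character)  # repeated mentions collapse
    return dict(presence)


mentions = [
    ("A", "Tsunade"), ("A", "Shizune"), ("A", "Naruto"),
    ("A", "Shizune"), ("A", "Jiraiya"),
    ("B", "Jiraiya"), ("B", "Pain"), ("B", "Jiraiya"), ("B", "Fukasaku"),
    ("B", "Shima"), ("B", "Naruto"), ("B", "Tsunade"), ("B", "Naruto"),
    ("B", "Jiraiya"), ("B", "Shikamaru"),
]
presence = build_presence(mentions)
print(sorted(presence["A"]))  # ['Jiraiya', 'Naruto', 'Shizune', 'Tsunade']
print(len(presence["B"]))     # 7
```

Using a set means a character mentioned several times in one scene still counts as present once.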
Key Insight¶
Naruto, Jiraiya, and Tsunade appear in both scenes.
- Their pairwise edges will each get weight = 2
- Characters appearing in only one scene get weight = 1 edges with their co-present characters
Stage 5: Co-Appearance Network¶
For every pair of characters present in the same scene, we create (or increment) an edge. This is repeated across all episodes.
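The create-or-increment step can be sketched with `itertools.combinations` and a `Counter`, assuming scene presence is available as a mapping from scene to a set of characters. `build_edges` is a hypothetical name, and the presence data is copied from the matrix in Stage 4.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple


def build_edges(presence: Dict[str, Set[str]]) -> "Counter[Tuple[str, str]]":
    """Count, for every unordered pair, the number of scenes they share."""
    edges: "Counter[Tuple[str, str]]" = Counter()
    for characters in presence.values():
        # Sorting gives each pair one canonical order, so (a, b) and
        # (b, a) never end up as separate edges.
        for a, b in combinations(sorted(characters), 2):
            edges[(a, b)] += 1
    return edges


presence = {
    "A": {"Naruto", "Jiraiya", "Tsunade", "Shizune"},
    "B": {"Naruto", "Jiraiya", "Tsunade", "Pain",
          "Fukasaku", "Shima", "Shikamaru"},
}
edges = build_edges(presence)
print(sorted(pair for pair, weight in edges.items() if weight == 2))
# [('Jiraiya', 'Naruto'), ('Jiraiya', 'Tsunade'), ('Naruto', 'Tsunade')]
```

Running the same function over the scenes of every episode accumulates weights across the whole arc, since the counter only ever increments.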
Resulting Edges (Partial List)¶
Weight 2 edges (both scenes):
- Naruto ↔ Jiraiya (weight: 2)
- Naruto ↔ Tsunade (weight: 2)
- Jiraiya ↔ Tsunade (weight: 2)
Weight 1 edges (Scene A only):
- Naruto ↔ Shizune (weight: 1)
- Jiraiya ↔ Shizune (weight: 1)
- Tsunade ↔ Shizune (weight: 1)
Weight 1 edges (Scene B only):
- Naruto ↔ Pain (weight: 1)
- Naruto ↔ Fukasaku (weight: 1)
- Naruto ↔ Shima (weight: 1)
- Naruto ↔ Shikamaru (weight: 1)
- Jiraiya ↔ Pain (weight: 1)
- Jiraiya ↔ Fukasaku (weight: 1)
- ... (12 more edges)
Total from these 2 scenes: 8 characters, 24 distinct edges (3 with weight 2, 21 with weight 1), for a combined edge weight of 27
Visual Network¶
graph TD
Naruto((Naruto))
Jiraiya((Jiraiya))
Tsunade((Tsunade))
Shizune((Shizune))
Pain((Pain))
Fukasaku((Fukasaku))
Shima((Shima))
Shikamaru((Shikamaru))
Naruto ===|w=2| Jiraiya
Naruto ===|w=2| Tsunade
Jiraiya ===|w=2| Tsunade
Naruto ---|w=1| Shizune
Jiraiya ---|w=1| Shizune
Tsunade ---|w=1| Shizune
Naruto ---|w=1| Pain
Naruto ---|w=1| Fukasaku
Naruto ---|w=1| Shima
Naruto ---|w=1| Shikamaru
Jiraiya ---|w=1| Pain
Jiraiya ---|w=1| Fukasaku
Jiraiya ---|w=1| Shima
Jiraiya ---|w=1| Shikamaru
Tsunade ---|w=1| Pain
Tsunade ---|w=1| Fukasaku
Tsunade ---|w=1| Shima
Tsunade ---|w=1| Shikamaru
Pain ---|w=1| Fukasaku
Pain ---|w=1| Shima
Pain ---|w=1| Shikamaru
Fukasaku ---|w=1| Shima
Fukasaku ---|w=1| Shikamaru
Shima ---|w=1| Shikamaru
style Naruto fill:#F66C2D
style Jiraiya fill:#C92A2A
style Tsunade fill:#D4873F
style Pain fill:#7B2D8E
style Fukasaku fill:#2D6A4F
style Shima fill:#0077B6
style Shikamaru fill:#364FC7
style Shizune fill:#868E96
Scale This Up
This is just 2 scenes from 1 episode. Across ~30 episodes with ~40 scenes each, we process ~1,200 scenes per arc—building a comprehensive network of 87 characters.
Pipeline Output Files¶
Intermediate Datasets¶
Generated in data/intermediate/ (one per episode):
- events_<episode>.csv: Raw subtitle events (start, end, text)
- utterances_<episode>.csv: Normalized, split dialogue
- scenes_<episode>.csv: Scene boundaries and gap durations
- mentions_<episode>.csv: Character detections with confidence scores
Final Outputs¶
Generated in data/processed/:
- edges.csv: Character co-appearance edges
character_a,character_b,weight,arc,episodes,source
Naruto Uzumaki,Tsunade,2,Pain's Assault,[154],subtitle_extraction
Naruto Uzumaki,Jiraiya,2,Pain's Assault,[154],subtitle_extraction
...
- scene_character.csv: Scene-level character presence
Quality Control Reports¶
Generated in data/reports/:
- episode_qc.csv: Events parsed, scenes created, characters detected per episode
- alias_qc.csv: Alias match frequency, potential shadowing issues
Validation and Confidence¶
How We Ensure Accuracy¶
- Fixture testing: Parser validated against known-good Episode 109 excerpt
- Manual spot-checks: Compare detected edges to episode summaries
- Sanity queries: "Is Naruto the most connected character?" (should be true)
- Alias auditing: QC reports flag unexpectedly rare/common aliases
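A sanity query such as "is Naruto the most connected character?" could be checked with weighted degree, that is, the sum of the weights of all edges touching a character. This is only a sketch: the edge weights below are toy values chosen for the example, not pipeline output, and `weighted_degree` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Tuple


def weighted_degree(edges: Dict[Tuple[str, str], int]) -> "Counter[str]":
    """Sum edge weights per character."""
    degree: "Counter[str]" = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Toy edge weights for illustration only.
toy_edges = {
    ("Jiraiya", "Naruto"): 2,
    ("Naruto", "Tsunade"): 2,
    ("Naruto", "Shizune"): 1,
    ("Jiraiya", "Tsunade"): 1,
}
degree = weighted_degree(toy_edges)
top_character, top_weight = degree.most_common(1)[0]
print(top_character, top_weight)  # Naruto 5
```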
Confidence Scoring¶
Each character mention gets a confidence score:
- 1.0: Direct canonical name match ("Naruto Uzumaki")
- 0.8: Common alias match ("Pervy Sage" → Jiraiya)
- 0.5: Title/honorific only ("Lord Hokage", where the intended character is ambiguous)
Low-confidence mentions can be filtered in downstream analysis.
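Downstream filtering might look like the sketch below, assuming each mention is a dictionary of strings as `csv.DictReader` would yield. The column names `character` and `confidence` are assumptions, since the actual layout of the mentions files is not shown here.

```python
from typing import Dict, Iterable, List


def filter_mentions(mentions: Iterable[Dict[str, str]],
                    min_confidence: float = 0.8) -> List[Dict[str, str]]:
    """Keep mentions whose confidence is at or above the cutoff."""
    return [m for m in mentions if float(m["confidence"]) >= min_confidence]


# Rows shaped like csv.DictReader output; column names are hypothetical.
rows = [
    {"character": "Naruto Uzumaki", "confidence": "1.0"},
    {"character": "Jiraiya", "confidence": "0.8"},
    {"character": "Tsunade", "confidence": "0.5"},
]
kept = filter_mentions(rows)
print([m["character"] for m in kept])  # ['Naruto Uzumaki', 'Jiraiya']
```

Raising `min_confidence` to 1.0 would keep only canonical-name matches, trading recall for precision.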
Known Limitations¶
❌ Speaker identification: We detect mentions, not speakers. "Naruto, you idiot!" marks Naruto as present, but we don't know who said it.
❌ Narrative mentions: "Naruto is in the village" counts as presence even if he's not in the scene visually.
❌ Flashbacks: Not distinguished from current timeline (future work: episode context).
See FAQ for more details.
Running the Pipeline Yourself¶
Single Episode¶
Edit the script to specify episode numbers:
Full Arc¶
Custom Configuration¶
Adjust scene detection threshold:
# In src/naruto_net/segment/scenes.py
SCENE_GAP_THRESHOLD = 4.0  # Seconds of silence that trigger a scene break; raise it for fewer, longer scenes
Next Steps¶
- Review the Architecture: Understand the Neo4j schema and database structure.
- Common Questions: Find answers to frequently asked questions.
- API Reference: Dive into the naruto_net package internals.
Data Source: Naruto Shippuden Episode 154, "Decryption" (Pain's Assault Arc)
Subtitle File: [HorribleSubs] Naruto Shippuuden - 154 [720p].ass