How It Works

This page traces a real scene from Naruto Shippuden Episode 154 through the entire pipeline—from raw subtitle text to character co-appearance edges.


Overview: From Subtitles to Network

The pipeline transforms subtitle files into character networks in 5 stages:

  1. Parse raw subtitle lines with timestamps
  2. Segment dialogue into scenes using timing gaps
  3. Detect character mentions via alias matching
  4. Map character presence per scene
  5. Build co-appearance edges

Let's follow the 11:16 to 14:39 stretch of Episode 154 step by step.


Stage 1: Raw Subtitles

Each .ass subtitle file contains timed dialogue lines. There are no speaker labels—just timestamps and text.

Sample from Episode 154 (excerpt; only each line's start time is shown):

0:11:16.77  So what should we do, Lady Hokage?
0:11:20.27  Shizune is heading up the autopsy...
0:11:29.03  Naruto! You get in Master Shizune's way,
0:11:35.53  I will avenge Pervy Sage!
0:11:43.23  I can't just sit around doing nothing!
0:12:48.17  Yes. Lord Jiraiya was the only one capable of using it in this village.
0:12:52.37  Pain defeated Jiraiya who had summoned Lord Fukasaku and Lady Shima.
0:13:12.93  I believe that the only one who can do that is you, Naruto-boy.
0:13:42.03  Is that fine with you, Tsunade?
0:13:45.03  Naruto, go.
0:13:52.33  Pervy Sage did it, right? Then I can't lose to him!
0:14:18.50  I'm counting on you with the code, Shikamaru.

File Format

.ass files (Advanced SubStation Alpha) include styling tags like {\i1} (italic) and \N (newline). The parser strips these before analysis.
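
A minimal sketch of this cleanup step, assuming the standard `Dialogue:` event layout of ten comma-separated fields with the text last. The names `strip_ass_markup` and `parse_dialogue` are illustrative and are not taken from the naruto_net package.

```python
import re

# Override blocks look like {\i1} or {\pos(10,20)}.
OVERRIDE_BLOCK = re.compile(r"\{[^}]*\}")
# \N and \n are line breaks, \h is a hard space.
LINE_BREAK = re.compile(r"\\[Nnh]")


def strip_ass_markup(text: str) -> str:
    """Remove styling tags and turn ASS line breaks into single spaces."""
    text = OVERRIDE_BLOCK.sub("", text)
    text = LINE_BREAK.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def parse_dialogue(line: str):
    """Return (start, end, clean_text) for a Dialogue line, else None."""
    if not line.startswith("Dialogue:"):
        return None
    # Fields: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text.
    # The text may itself contain commas, so split at most 9 times.
    fields = line[len("Dialogue:"):].split(",", 9)
    if len(fields) < 10:
        return None
    return fields[1].strip(), fields[2].strip(), strip_ass_markup(fields[9])
```

For example, `strip_ass_markup(r"{\i1}Naruto, go.{\i0}")` returns the bare text `Naruto, go.`.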


Stage 2: Scene Segmentation

When consecutive subtitle lines are separated by >3 seconds of silence, we split them into separate scenes.

Detected Gap

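Gaps are measured from the end of one line to the start of the next. The line that starts at 0:11:35.53 ends at about 0:11:37.80 (end times are not shown in the sample), and the next line starts at 0:11:43.23: a 5.43 second gap → scene break.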

Result: 2 Scenes

Scene A — 11:16 to 11:37 (4 lines)

0:11:16.77  So what should we do, Lady Hokage?
0:11:20.27  Shizune is heading up the autopsy...
0:11:29.03  Naruto! You get in Master Shizune's way,
0:11:35.53  I will avenge Pervy Sage!

Scene B — 11:43 to 14:39 (8 lines)

0:11:43.23  I can't just sit around doing nothing!
0:12:48.17  Yes. Lord Jiraiya was the only one capable of using it...
0:12:52.37  Pain defeated Jiraiya who had summoned Lord Fukasaku and Lady Shima.
0:13:12.93  I believe that the only one who can do that is you, Naruto-boy.
0:13:42.03  Is that fine with you, Tsunade?
0:13:45.03  Naruto, go.
0:13:52.33  Pervy Sage did it, right? Then I can't lose to him!
0:14:18.50  I'm counting on you with the code, Shikamaru.

Why 3 Seconds?

Testing on 20 episodes found that a 3-second threshold balances precision (not splitting mid-conversation) with recall (not lumping distant scenes together). The threshold is adjustable; see Custom Configuration.
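
The gap rule can be sketched as follows. This is an illustration under assumptions, not the package's actual code: events are taken to be `(start, end, text)` tuples in seconds, and `to_seconds` and `segment_scenes` are hypothetical helper names.

```python
from typing import List, Tuple

Event = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def to_seconds(stamp: str) -> float:
    """Convert an H:MM:SS.cc timestamp such as '0:11:43.23' to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def segment_scenes(events: List[Event], gap_threshold: float = 3.0) -> List[List[Event]]:
    """Start a new scene whenever the silence before a line exceeds the threshold."""
    scenes: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        # Silence = this line's start minus the previous line's end.
        if scenes and event[0] - scenes[-1][-1][1] <= gap_threshold:
            scenes[-1].append(event)
        else:
            scenes.append([event])
    return scenes
```

Raising `gap_threshold` merges more lines into each scene; lowering it splits dialogue more aggressively.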


Stage 3: Character Detection

We scan each line for known names and aliases using word-boundary regex matching against character_aliases.json.

Alias Resolution Examples

Text Found            Resolves To
Lady Hokage           Tsunade
Shizune               Shizune
Naruto, Naruto-boy    Naruto Uzumaki
Pervy Sage, Jiraiya   Jiraiya
Pain                  Pain (Nagato)
Fukasaku              Fukasaku
Shima                 Shima
Shikamaru             Shikamaru Nara

Annotated Lines (Scene A)

So what should we do, [Lady Hokage]?
[Shizune] is heading up the autopsy...
[Naruto]! You get in Master [Shizune]'s way,
I will avenge [Pervy Sage]!

Annotated Lines (Scene B)

I can't just sit around doing nothing!  ← No character mentions
Yes. Lord [Jiraiya] was the only one capable of using it...
[Pain] defeated [Jiraiya] who had summoned Lord [Fukasaku] and Lady [Shima].
I believe that the only one who can do that is you, [Naruto-boy].
Is that fine with you, [Tsunade]?
[Naruto], go.
[Pervy Sage] did it, right? Then I can't lose to him!
I'm counting on you with the code, [Shikamaru].

Longest-First Matching

Aliases are matched longest-first to avoid shadowing. "Lady Hokage" matches before "Hokage" alone, preventing false positives.
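
A minimal sketch of word-boundary, longest-first matching, using only the aliases that appear in this walkthrough. The in-code `ALIASES` dictionary stands in for character_aliases.json, whose real structure is not shown here. The function names and the case-sensitive matching are also assumptions.

```python
import re
from typing import List, Tuple

# Alias -> canonical name, taken from the examples on this page.
ALIASES = {
    "Lady Hokage": "Tsunade",
    "Tsunade": "Tsunade",
    "Shizune": "Shizune",
    "Naruto": "Naruto Uzumaki",
    "Naruto-boy": "Naruto Uzumaki",
    "Pervy Sage": "Jiraiya",
    "Jiraiya": "Jiraiya",
    "Pain": "Pain (Nagato)",
    "Fukasaku": "Fukasaku",
    "Shima": "Shima",
    "Shikamaru": "Shikamaru Nara",
}


def build_matcher(aliases):
    """Compile one alternation with the longest aliases listed first."""
    ordered = sorted(aliases, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b")


MATCHER = build_matcher(ALIASES)


def detect_mentions(text: str) -> List[Tuple[str, str]]:
    """Return (matched alias, canonical name) pairs in reading order."""
    return [(m.group(0), ALIASES[m.group(0)]) for m in MATCHER.finditer(text)]
```

Because the alternation tries longer aliases first, "Naruto-boy" is consumed as one alias rather than as "Naruto" followed by stray text. The word boundaries keep "Pain" from matching inside a word such as "Painful".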


Stage 4: Scene-Character Presence

For each scene, we collect which characters were mentioned. This is our proxy for "who was in this scene."

Presence Matrix

          Naruto  Jiraiya  Tsunade  Shizune  Pain  Fukasaku  Shima  Shikamaru
Scene A   ✓       ✓        ✓        ✓        -     -         -      -
Scene B   ✓       ✓        ✓        -        ✓     ✓         ✓      ✓

Key Insight

Naruto, Jiraiya, and Tsunade appear in both scenes.

  • Each pair among these three gets an edge with weight = 2
  • Every other edge involves a character seen in only one scene, so it gets weight = 1
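
The presence step can be sketched like this. The per-line mention lists are copied from the annotated lines in Stage 3 using the short names from the matrix. `presence_by_scene` is a hypothetical helper name, not necessarily how the package organizes this step.

```python
from collections import Counter

# Canonical names detected on each line, one inner list per subtitle line.
SCENE_MENTIONS = {
    "Scene A": [["Tsunade"], ["Shizune"], ["Naruto", "Shizune"], ["Jiraiya"]],
    "Scene B": [
        [],
        ["Jiraiya"],
        ["Pain", "Jiraiya", "Fukasaku", "Shima"],
        ["Naruto"],
        ["Tsunade"],
        ["Naruto"],
        ["Jiraiya"],
        ["Shikamaru"],
    ],
}


def presence_by_scene(scene_mentions):
    """Collapse per-line mentions into one set of characters per scene."""
    return {
        scene: {name for line in lines for name in line}
        for scene, lines in scene_mentions.items()
    }


presence = presence_by_scene(SCENE_MENTIONS)
# How many scenes each character appears in.
scene_counts = Counter(name for chars in presence.values() for name in chars)
in_both = sorted(name for name, n in scene_counts.items() if n == 2)
```

Using a set per scene means repeated mentions within one scene (Jiraiya is named three times in Scene B) count as a single presence.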

Stage 5: Co-Appearance Network

For every pair of characters present in the same scene, we create (or increment) an edge. This is repeated across all episodes.
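
A minimal sketch of the edge-building rule, assuming presence is stored as one set of characters per scene. `build_edges` is an illustrative name.

```python
from collections import Counter
from itertools import combinations

PRESENCE = {
    "Scene A": {"Naruto", "Jiraiya", "Tsunade", "Shizune"},
    "Scene B": {"Naruto", "Jiraiya", "Tsunade", "Pain", "Fukasaku", "Shima", "Shikamaru"},
}


def build_edges(presence):
    """Count, for every unordered pair, the number of scenes the pair shares."""
    weights = Counter()
    for characters in presence.values():
        # Sorting first makes (a, b) and (b, a) the same key.
        for pair in combinations(sorted(characters), 2):
            weights[pair] += 1
    return weights


edges = build_edges(PRESENCE)
```

A scene with n characters contributes n(n-1)/2 pairs, so Scene A's four characters yield 6 pairs and Scene B's seven yield 21. Processing more episodes amounts to feeding more scenes into the same counter.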

Resulting Edges (Partial List)

Weight 2 edges (both scenes):

  • Naruto ↔ Jiraiya (weight: 2)
  • Naruto ↔ Tsunade (weight: 2)
  • Jiraiya ↔ Tsunade (weight: 2)

Weight 1 edges (Scene A only):

  • Naruto ↔ Shizune (weight: 1)
  • Jiraiya ↔ Shizune (weight: 1)
  • Tsunade ↔ Shizune (weight: 1)

Weight 1 edges (Scene B only):

  • Naruto ↔ Pain (weight: 1)
  • Naruto ↔ Fukasaku (weight: 1)
  • Naruto ↔ Shima (weight: 1)
  • Naruto ↔ Shikamaru (weight: 1)
  • Jiraiya ↔ Pain (weight: 1)
  • Jiraiya ↔ Fukasaku (weight: 1)
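  • ... (12 more edges)

Total from these 2 scenes: 8 characters, 24 edges (3 with weight 2, 21 with weight 1). Scene A contributes 6 pairs and Scene B contributes 21; the 3 pairs shared by both scenes are counted once, with weight 2.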

Visual Network

graph TD
    Naruto((Naruto))
    Jiraiya((Jiraiya))
    Tsunade((Tsunade))
    Shizune((Shizune))
    Pain((Pain))
    Fukasaku((Fukasaku))
    Shima((Shima))
    Shikamaru((Shikamaru))

    Naruto ===|w=2| Jiraiya
    Naruto ===|w=2| Tsunade
    Jiraiya ===|w=2| Tsunade

    Naruto ---|w=1| Shizune
    Jiraiya ---|w=1| Shizune
    Tsunade ---|w=1| Shizune

    Naruto ---|w=1| Pain
    Naruto ---|w=1| Fukasaku
    Naruto ---|w=1| Shima
    Naruto ---|w=1| Shikamaru

    Jiraiya ---|w=1| Pain
    Jiraiya ---|w=1| Fukasaku
    Jiraiya ---|w=1| Shima
    Jiraiya ---|w=1| Shikamaru

    Tsunade ---|w=1| Pain
    Tsunade ---|w=1| Fukasaku
    Tsunade ---|w=1| Shima
    Tsunade ---|w=1| Shikamaru

    Pain ---|w=1| Fukasaku
    Pain ---|w=1| Shima
    Pain ---|w=1| Shikamaru

    Fukasaku ---|w=1| Shima
    Fukasaku ---|w=1| Shikamaru

    Shima ---|w=1| Shikamaru

    style Naruto fill:#F66C2D
    style Jiraiya fill:#C92A2A
    style Tsunade fill:#D4873F
    style Pain fill:#7B2D8E
    style Fukasaku fill:#2D6A4F
    style Shima fill:#0077B6
    style Shikamaru fill:#364FC7
    style Shizune fill:#868E96

Scale This Up

This is just 2 scenes from 1 episode. Across ~30 episodes with ~40 scenes each, we process ~1,200 scenes per arc—building a comprehensive network of 87 characters.


Pipeline Output Files

Intermediate Datasets

Generated in data/intermediate/ (one per episode):

  1. events_<episode>.csv — Raw subtitle events (start, end, text)
  2. utterances_<episode>.csv — Normalized, split dialogue
  3. scenes_<episode>.csv — Scene boundaries and gap durations
  4. mentions_<episode>.csv — Character detections with confidence scores

Final Outputs

Generated in data/processed/:

  1. edges.csv: Character co-appearance edges

     character_a,character_b,weight,arc,episodes,source
     Naruto Uzumaki,Tsunade,2,Pain's Assault,[154],subtitle_extraction
     Naruto Uzumaki,Jiraiya,2,Pain's Assault,[154],subtitle_extraction
     ...

  2. scene_character.csv: Scene-level character presence

     scene_id,character,episode,confidence
     154_001,Naruto Uzumaki,154,1.0
     154_001,Tsunade,154,0.8
     ...

Quality Control Reports

Generated in data/reports/:

  1. episode_qc.csv — Events parsed, scenes created, characters detected per episode
  2. alias_qc.csv — Alias match frequency, potential shadowing issues

Validation and Confidence

How We Ensure Accuracy

  1. Fixture testing: Parser validated against known-good Episode 109 excerpt
  2. Manual spot-checks: Compare detected edges to episode summaries
  3. Sanity queries: "Is Naruto the most connected character?" (should be true)
  4. Alias auditing: QC reports flag unexpectedly rare/common aliases
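
The "most connected character" sanity query could look like the sketch below. It assumes the `character_a`, `character_b`, and `weight` columns shown for edges.csv and uses weighted degree as the measure of connectedness. The function names are hypothetical, and the real check may use a different metric.

```python
import csv
from collections import defaultdict


def load_edges(path):
    """Read edges.csv into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def weighted_degree(edge_rows):
    """Sum edge weights per character; each edge counts for both endpoints."""
    degree = defaultdict(int)
    for row in edge_rows:
        weight = int(row["weight"])
        degree[row["character_a"]] += weight
        degree[row["character_b"]] += weight
    return dict(degree)


def most_connected(edge_rows):
    """Return the character with the highest weighted degree."""
    degree = weighted_degree(edge_rows)
    return max(degree, key=degree.get)
```

A check such as `most_connected(load_edges("data/processed/edges.csv")) == "Naruto Uzumaki"` can then run after each pipeline build.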

Confidence Scoring

Each character mention gets a confidence score:

  • 1.0 — Direct canonical name match ("Naruto Uzumaki")
  • 0.8 — Common alias match ("Pervy Sage" → Jiraiya)
  • 0.5 — Title/honorific only ("Lord Hokage", speaker unclear)

Low-confidence mentions can be filtered in downstream analysis.
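
As a sketch of such a downstream filter, the snippet below keeps only rows at or above a cutoff. It assumes the `confidence` column shown for scene_character.csv. The tier labels in `CONFIDENCE`, the function name, and the sample rows are made up for illustration.

```python
import csv
import io

# The three tiers described above; the key names are hypothetical labels.
CONFIDENCE = {"canonical": 1.0, "alias": 0.8, "title": 0.5}


def filter_mentions(rows, min_confidence=CONFIDENCE["alias"]):
    """Keep scene_character rows whose confidence meets the cutoff."""
    return [row for row in rows if float(row["confidence"]) >= min_confidence]


# Illustrative rows only, not real pipeline output.
SAMPLE = """scene_id,character,episode,confidence
s1,Naruto Uzumaki,154,1.0
s1,Jiraiya,154,0.8
s1,Tsunade,154,0.5
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
kept = filter_mentions(rows)
```

With the default cutoff of 0.8, the title-only Tsunade row is dropped while the canonical and alias matches remain.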

Known Limitations

Speaker identification: We detect mentions, not speakers. "Naruto, you idiot!" marks Naruto as present, but we don't know who said it.

Narrative mentions: "Naruto is in the village" counts as presence even if he's not in the scene visually.

Flashbacks: Not distinguished from current timeline (future work: episode context).

See FAQ for more details.


Running the Pipeline Yourself

Single Episode

python scripts/00_ass_ingest_subset.py

Edit the script to specify episode numbers:

episodes_to_process = [14, 15, 16]  # Customize

Full Arc

# Pain's Assault: Episodes 152-175
episodes_to_process = list(range(152, 176))

Custom Configuration

Adjust scene detection threshold:

# In src/naruto_net/segment/scenes.py
SCENE_GAP_THRESHOLD = 4.0  # Seconds of silence that trigger a scene break; values above the 3-second default yield fewer, longer scenes

Next Steps

  • Review the Architecture: Understand the Neo4j schema and database structure.
  • Common Questions (FAQ): Find answers to frequently asked questions.
  • API Reference (API Docs): Dive into the naruto_net package internals.


Data Source: Naruto Shippuden Episode 154, "Decryption" (Pain's Assault Arc)
Subtitle File: [HorribleSubs] Naruto Shippuuden - 154 [720p].ass