
Architecture

This page explains the technical architecture of the character network database and processing pipeline.


System Overview

The Naruto character network system consists of three main layers:

graph TB
    A[Subtitle Files<br/>.ass format] -->|Parse| B[Processing Pipeline<br/>naruto_net package]
    B -->|Extract| C[Intermediate Data<br/>Events, Scenes, Mentions]
    C -->|Build| D[Network Edges<br/>CSV format]
    D -->|Import| E[Neo4j Graph DB<br/>:Character :CONNECTED]
    E -->|Query| F[Network Analysis<br/>Centrality, Communities]
    F -->|Visualize| G[Interactive Viz<br/>D3.js]

    style A fill:#FFF3E0
    style B fill:#E8F5E9
    style C fill:#F3E5F5
    style D fill:#E3F2FD
    style E fill:#FFE0B2
    style F fill:#F1F8E9
    style G fill:#FCE4EC

Neo4j Graph Database Schema

The character network is stored in a Neo4j graph database using a multi-label pattern for arc-specific analysis.

Complete Schema Diagram

[Figure: Neo4j schema. Full schema showing node structure, relationship properties, and an example connection.]

Node Structure: :Character

Every character in the network has the base label :Character plus arc-specific labels.

Multi-Label Pattern

(:Character:ChūninExams:SasukeRetrieval:PainsAssault)

Why multi-labels?

  • Simple queries: MATCH (c:ChūninExams:SasukeRetrieval) finds characters in both arcs
  • Built-in counting: size(labels(c)) - 1 = number of arcs
  • More intuitive than separate Arc nodes with relationships
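
As an illustration of the pattern above, the following sketch builds the label string for a character and mirrors the size(labels(c)) - 1 arc count in Python. The ARC_LABELS mapping and both helper names are hypothetical; only the label names and arc names come from this page.

```python
# Hypothetical mapping from arc display names to the Neo4j labels used on this page.
ARC_LABELS = {
    "Chunin Exams": "ChūninExams",
    "Sasuke Retrieval": "SasukeRetrieval",
    "Pain's Assault": "PainsAssault",
}


def label_pattern(arcs):
    """Return a label string such as ':Character:ChūninExams' for the given arcs."""
    labels = ["Character"] + [ARC_LABELS[arc] for arc in arcs]
    return "".join(f":{label}" for label in labels)


def arc_count(labels):
    """Mirror size(labels(c)) - 1: every label except the base Character is an arc."""
    return len(labels) - 1


pattern = label_pattern(["Chunin Exams", "Sasuke Retrieval"])
count = arc_count(["Character", "ChūninExams", "SasukeRetrieval"])
```

The subtraction only holds while every non-base label is an arc label; adding any other kind of label to a node would break the count.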

Character Properties

| Property | Type | Description | Indexed |
|---|---|---|---|
| character_id | INTEGER | Unique identifier | ✓ (unique) |
| name | STRING | Canonical name | ✓ (unique) |
| aliases | LIST<STRING> | Known nicknames/titles | |
| affiliation_primary | STRING | Main village/organization | |
| affiliation_detail | STRING | Squad/team membership | |
| affiliation_changes | STRING | Defection tracking (e.g., Sasuke) | |
| first_appearance_arc | STRING | Debut arc | |
| role_type | STRING | protagonist/antagonist/supporting | |
| estimated_importance | INTEGER | 1-10 centrality estimate | |

Example node:

(:Character:ChūninExams:SasukeRetrieval {
  character_id: 1,
  name: "Naruto Uzumaki",
  aliases: ["Naruto", "Nine-Tails", "Hokage"],
  affiliation_primary: "Konoha",
  affiliation_detail: "Team 7",
  role_type: "protagonist",
  estimated_importance: 10
})

Relationship Structure: :CONNECTED

Characters are connected with a single relationship type that stores semantic information in properties.

Why Single Relationship Type?

  • Simplicity: Easier to query all connections without UNION clauses
  • Flexibility: Relationship semantics stored in relationship_types array
  • Data lineage: source property distinguishes manual vs extracted edges

Relationship Properties

| Property | Type | Description |
|---|---|---|
| arc | STRING | "Chunin Exams" \| "Sasuke Retrieval" \| "Pain's Assault" |
| weight | INTEGER | Scene co-appearance count (1 for manual seeds) |
| source | STRING | "manual_seed" \| "subtitle_extraction" |
| relationship_types | LIST<STRING> | ["team", "rivals", "enemies", "mentors", "siblings", "parents"] |
| episodes | LIST<INTEGER> | Episode numbers (empty for manual seeds) |
| confidence | FLOAT | 1.0 for manual, 0.0-1.0 for extracted |

Controlled Vocabulary for relationship_types

  • team — Same squad/organization
  • rivals — Competitive relationship
  • enemies — Active antagonism
  • mentors — Teacher/student dynamic
  • siblings — Family relationship (blood or found family)
  • parents — Parent/child relationship
  • co_appearance — Generic co-occurrence (for extracted data)

Example relationship:

(naruto:Character)-[:CONNECTED {
  arc: "Chunin Exams",
  weight: 1,
  source: "manual_seed",
  relationship_types: ["team", "rivals"],
  episodes: [],
  confidence: 1.0
}]->(sasuke:Character)

Bidirectional Creation

For symmetric co-appearance relationships, we create both (A→B) and (B→A) edges. This ensures queries like "MATCH (n:Character)-[:CONNECTED]->(m)" work regardless of edge direction.
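
A minimal sketch of the mirroring step, assuming edges are held as (source, target, properties) tuples before import. The function name mirror_edges and the tuple layout are illustrative, not taken from the project's scripts.

```python
def mirror_edges(edges):
    """Yield each (source, target, props) edge followed by its reverse."""
    for src, dst, props in edges:
        yield (src, dst, props)
        # Copy the properties so the two directions do not share one dict.
        yield (dst, src, dict(props))


seed = [("A", "B", {"arc": "Chunin Exams", "weight": 1, "source": "manual_seed"})]
both_directions = list(mirror_edges(seed))
```

Because every edge is stored twice, relationship counts are double the number of logical connections, which is why 36 manual edges appear as 72 relationships.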


Processing Pipeline Architecture

The naruto_net Python package implements a modular pipeline:

Pipeline Stages

graph LR
    A[1. IO<br/>Parse .ass] --> B[2. Normalize<br/>Clean text]
    B --> C[3. Segment<br/>Detect scenes]
    C --> D[4. Detect<br/>Find mentions]
    D --> E[5. Build<br/>Create edges]
    E --> F[6. QC<br/>Validate]

    style A fill:#E3F2FD
    style B fill:#F3E5F5
    style C fill:#FFF3E0
    style D fill:#E8F5E9
    style E fill:#FFE0B2
    style F fill:#FFEBEE

Module Breakdown

1. IO (src/naruto_net/io/)

Purpose: Parse .ass subtitle files into structured events

Key classes:

  • AssReader — Loads .ass files, extracts dialogue lines
  • SubtitleEvent — Dataclass for (start, end, text) tuples

Output: data/intermediate/events_<episode>.csv
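
The parsing step can be sketched as follows. This is not the package's AssReader; it is a standalone illustration that assumes the standard [Events] field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text) and reuses the SubtitleEvent name only for readability. A file whose Format line declares a different order would need that line parsed first.

```python
from dataclasses import dataclass


@dataclass
class SubtitleEvent:
    start: float  # seconds from the beginning of the episode
    end: float    # seconds from the beginning of the episode
    text: str


def parse_timestamp(ts: str) -> float:
    """Convert an ASS timestamp (H:MM:SS.cc) to seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_dialogue_line(line: str):
    """Return a SubtitleEvent for a 'Dialogue:' line, or None for any other line."""
    if not line.startswith("Dialogue:"):
        return None
    # The Text field may itself contain commas, so split at most 9 times.
    fields = line[len("Dialogue:"):].strip().split(",", 9)
    if len(fields) < 10:
        return None
    return SubtitleEvent(
        start=parse_timestamp(fields[1]),
        end=parse_timestamp(fields[2]),
        text=fields[9],
    )


event = parse_dialogue_line("Dialogue: 0,0:01:02.50,0:01:04.00,Default,,0,0,0,,Hello, world")
```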


2. Normalize (src/naruto_net/normalize/)

Purpose: Clean ASS formatting tags and split multi-speaker lines

Functions:

  • strip_ass_tags() — Remove {\i1}, {\b1}, etc.
  • normalize_newlines() — Convert \N to space
  • split_multi_speaker() — Handle "Speaker A: ... Speaker B: ..." lines

Output: data/intermediate/utterances_<episode>.csv
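
For illustration, the first two functions might look like the sketch below. The function names come from the list above, but the bodies are assumptions about their behavior rather than the package's actual code; split_multi_speaker is omitted because its input format is not specified here.

```python
import re

# Matches any ASS override block: an opening brace, anything but a closing brace, a closing brace.
_TAG_RE = re.compile(r"\{[^}]*\}")


def strip_ass_tags(text: str) -> str:
    r"""Remove override blocks such as {\i1} or {\b1}."""
    return _TAG_RE.sub("", text)


def normalize_newlines(text: str) -> str:
    r"""Replace the ASS hard line break \N with a space and collapse repeated whitespace."""
    text = text.replace("\\N", " ")
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize_newlines(strip_ass_tags(r"{\i1}Line one\NLine two{\i0}"))
```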


3. Segment (src/naruto_net/segment/)

Purpose: Group utterances into scenes using timing gaps

Logic: If gap between consecutive lines > 3 seconds → scene break

Output: data/intermediate/scenes_<episode>.csv
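
The gap rule above can be sketched as a single pass over time-ordered events. This version assumes the gap is measured from the end of one line to the start of the next, and that events are (start, end, text) tuples in seconds; the function name and the gap_threshold parameter are illustrative.

```python
def segment_scenes(events, gap_threshold=3.0):
    """Group (start, end, text) tuples, sorted by start time, into scenes."""
    scenes = []
    for event in events:
        start = event[0]
        # Open a new scene for the first event, or when the silence since the
        # previous line ended is longer than the threshold.
        if not scenes or start - scenes[-1][-1][1] > gap_threshold:
            scenes.append([event])
        else:
            scenes[-1].append(event)
    return scenes


events = [(0.0, 2.0, "a"), (2.5, 4.0, "b"), (8.0, 9.0, "c")]
scenes = segment_scenes(events)
```

In this example the 0.5-second gap keeps the first two lines together, while the 4-second gap before the third line starts a second scene. A gap of exactly 3 seconds does not trigger a break because the comparison is strict.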


4. Detect (src/naruto_net/detect/)

Purpose: Find character mentions using alias matching

Method: Word-boundary regex against character_aliases.json

Confidence heuristic:

  • Direct name match → 1.0
  • Alias match (e.g., "Pervy Sage" → Jiraiya) → 0.8
  • Honorific only (e.g., "Hokage") → 0.5

Output: data/intermediate/mentions_<episode>.csv
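
A rough sketch of word-boundary matching with the confidence tiers above. The layout of ALIASES is hypothetical (the real character_aliases.json may be organized differently), as is the way names, aliases, and honorifics are grouped per character.

```python
import re

# Hypothetical alias table; the grouping into names/aliases/honorifics is an assumption.
ALIASES = {
    "Jiraiya": {"names": ["Jiraiya"], "aliases": ["Pervy Sage"], "honorifics": []},
    "Naruto Uzumaki": {
        "names": ["Naruto Uzumaki", "Naruto"],
        "aliases": ["Nine-Tails"],
        "honorifics": ["Hokage"],
    },
}

# Confidence per match kind, following the heuristic listed above.
CONFIDENCE = {"names": 1.0, "aliases": 0.8, "honorifics": 0.5}


def find_mentions(text, alias_table=ALIASES):
    """Return (character, matched_alias, confidence) tuples found in text."""
    mentions = []
    for character, groups in alias_table.items():
        for kind, confidence in CONFIDENCE.items():
            for alias in groups.get(kind, []):
                # \b on both sides prevents matches inside longer words.
                pattern = r"\b" + re.escape(alias) + r"\b"
                if re.search(pattern, text, flags=re.IGNORECASE):
                    mentions.append((character, alias, confidence))
    return mentions


mentions = find_mentions("Hey, Pervy Sage!")
```

The word boundaries are what keep a string like "Narutomaki" from being counted as a mention of Naruto.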


5. Build (src/naruto_net/build/)

Purpose: Construct co-appearance edges from scene presence

Logic:

For each scene:

  • Collect all characters mentioned
  • Create edge between every pair (with weight = 1)
  • If pair appears in multiple scenes, increment weight

Output: data/processed/edges.csv
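
The edge-building logic above amounts to counting unordered pairs per scene. A minimal sketch, assuming each scene is given as a list of canonical character names (the function name and input shape are illustrative):

```python
from collections import Counter
from itertools import combinations


def build_edges(scenes):
    """Count scene co-appearances for every unordered character pair."""
    weights = Counter()
    for characters in scenes:
        # set() removes repeat mentions within a scene; sorted() gives each
        # pair a stable key so (A, B) and (B, A) are counted together.
        for pair in combinations(sorted(set(characters)), 2):
            weights[pair] += 1
    return weights


scenes = [
    ["Naruto", "Sasuke", "Jiraiya"],
    ["Sasuke", "Naruto", "Naruto"],
    ["Naruto"],
]
edges = build_edges(scenes)
```

Here the Naruto-Sasuke pair ends with weight 2, the pairs involving Jiraiya with weight 1, and the single-character scene contributes no edges.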


6. QC (src/naruto_net/qc/)

Purpose: Generate quality control reports

Reports:

  • Episode QC: Events parsed, scenes created, characters detected per episode
  • Alias QC: Which aliases matched most frequently, any shadowing issues

Output: data/reports/episode_qc.csv, data/reports/alias_qc.csv
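
One way the alias report could be computed is sketched below. It assumes mentions are (character, alias, confidence) tuples and treats an alias claimed by more than one character as a possible shadowing issue; that reading of "shadowing" is an interpretation, and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def alias_frequency(mentions):
    """Count how often each (character, alias) pair matched."""
    return Counter((character, alias) for character, alias, _confidence in mentions)


def ambiguous_aliases(alias_table):
    """Return lowercased aliases that map to more than one character."""
    owners = defaultdict(set)
    for character, aliases in alias_table.items():
        for alias in aliases:
            owners[alias.lower()].add(character)
    return {alias: sorted(chars) for alias, chars in owners.items() if len(chars) > 1}


sample_mentions = [
    ("Jiraiya", "Pervy Sage", 0.8),
    ("Jiraiya", "Pervy Sage", 0.8),
    ("Jiraiya", "Jiraiya", 1.0),
]
top_alias = alias_frequency(sample_mentions).most_common(1)
```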


Data Flow Example

See the How It Works page for a detailed walkthrough of Episode 154 processing.


Directory Structure

naruto-network-graph/
├── src/naruto_net/          # Python package (pip install -e .)
│   ├── io/                  # .ass parsing
│   ├── normalize/           # Text cleaning
│   ├── segment/             # Scene detection
│   ├── detect/              # Character mention matching
│   ├── build/               # Edge construction
│   └── qc/                  # Quality control reports
├── data/
│   ├── intermediate/        # Pipeline outputs (gitignored)
│   ├── processed/           # Final edges.csv, scene_character.csv
│   ├── reports/             # QC reports (gitignored)
│   └── *.csv                # Character metadata, canonical edges
├── outputs/
│   ├── cypher_import_characters.cypher  # Neo4j import (87 chars)
│   ├── cypher_import_edges.cypher       # Neo4j import (36 edges)
│   └── analytical_queries.cypher        # 7 validation queries
├── scripts/
│   ├── 00_ass_ingest_subset.py          # End-to-end pipeline runner
│   ├── import_characters_to_neo4j.py    # Generate Cypher for chars
│   └── import_test_edges.py             # Generate Cypher for edges
└── tests/                   # 4 test suites with fixtures

Design Decisions

Why Neo4j?

Graph-native storage: Network queries (shortest path, centrality, community detection) are first-class operations

Cypher expressiveness: Queries like "find all paths between Pain and Konoha characters" are concise

Visualization: Built-in Neo4j Browser for quick exploration

Flexibility: Can mix structured properties with graph traversals

Why Subtitle Files?

Precision: Subtitle timestamps (centisecond resolution in .ass files) give precise dialogue timing, from which scene boundaries are inferred

Availability: Fansub .ass files widely available for popular anime

Speaker proxy: Character mentions in dialogue ≈ character presence in scene

Reproducibility: Subtitle files are the source of truth, not manual annotation

Why Multi-Label Pattern?

Query simplicity: MATCH (c:ChūninExams) is cleaner than MATCH (c)-[:APPEARS_IN]->(:Arc {name: "Chunin Exams"})

Storage efficiency: No need for separate Arc nodes

Built-in arc counting: labels(c) gives immediate arc participation

Tradeoff: Less flexible for many arcs (would become unwieldy for 100+ arcs), but perfect for our 3-arc scope


Performance Considerations

Current Scale

  • 87 unique characters
  • 150 character-arc records (87 chars × avg 1.7 arcs each)
  • 36 manual edges (72 bidirectional)
  • ~500 subtitle files across 3 arcs

Expected Scale (Full Pipeline)

  • ~1,200 scenes per arc (30 episodes × 40 scenes/episode)
  • ~10,000 edges per arc (estimated)
  • Query time: <100ms for typical centrality queries

Optimization Notes

  • Indexes on: character_id, name, affiliation_primary, role_type, estimated_importance
  • Cypher hints: Use USING INDEX to force an index-backed lookup of the starting nodes when the planner does not choose one
  • Batch imports: Use UNWIND for bulk edge creation (see cypher_import_edges.cypher)
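
As a sketch of the batching idea, the snippet below pairs an UNWIND query with a helper that slices edge rows into fixed-size batches. The query text, the row keys (from_name, to_name), and the batch size are assumptions for illustration; the actual contents of cypher_import_edges.cypher may differ.

```python
# Illustrative Cypher: one round trip creates every edge in the $rows parameter.
UNWIND_EDGES = """
UNWIND $rows AS row
MATCH (a:Character {name: row.from_name})
MATCH (b:Character {name: row.to_name})
MERGE (a)-[r:CONNECTED {arc: row.arc}]->(b)
SET r.weight = row.weight, r.source = row.source
"""


def batched(rows, batch_size=1000):
    """Yield consecutive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = [
    {"from_name": "A", "to_name": "B", "arc": "Chunin Exams", "weight": 1, "source": "manual_seed"},
    {"from_name": "B", "to_name": "A", "arc": "Chunin Exams", "weight": 1, "source": "manual_seed"},
    {"from_name": "A", "to_name": "C", "arc": "Chunin Exams", "weight": 1, "source": "manual_seed"},
]
batches = list(batched(rows, batch_size=2))
```

Each batch would then be passed as the rows parameter of one query execution, so the number of round trips grows with the number of batches rather than the number of edges.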

Next Steps