Frequently Asked Questions

General Questions

What is this project about?

This project analyzes character relationships in Naruto using network science. By tracking when characters appear together in scenes (detected via subtitle data), we build weighted networks that answer questions about character centrality, ensemble balance, and narrative structure.

Why network analysis for anime?

Network science provides mathematical tools to quantify fan intuitions. When we say "Sakura becomes irrelevant" or "Naruto dominates the story," network metrics like degree centrality and Shannon entropy let us test those claims with data.

This approach has been applied successfully to Game of Thrones, Harry Potter, and Les Misérables, but systematic application to anime is still rare.

What arcs are covered?

Version 1.0 focuses on three S-tier arcs:

Chunin Exams (~30 episodes, Naruto original series)
Sasuke Retrieval Mission (~22 episodes, Naruto original series)
Pain's Assault (~18 episodes, Naruto Shippuden)

These arcs were chosen for narrative density, fan consensus quality, and clear story boundaries.

Can I use this for other anime?

Yes! The pipeline is designed to be anime-agnostic. You'll need:

.ass subtitle files for your target episodes
A character alias JSON file (map nicknames to canonical names)
Episode-to-arc mapping (define story arc boundaries)

The core pipeline (naruto_net package) should work with minimal modification.

Technical Questions

Where can I get subtitle files?

Subtitle files are not included in this repository due to copyright. You can:

Extract subtitles from your legally obtained anime files (MKV containers often include .ass tracks)
Use fansub archives (HorribleSubs, Commie, etc.) where legally available in your region
Generate subtitles using speech-to-text (lower quality, requires manual cleanup)

Respect Copyright

Only use subtitle files from sources you legally own or have permission to use.

What subtitle format is supported?

The pipeline currently supports .ass (Advanced SubStation Alpha) format, commonly used by fansub groups. .srt (SubRip) support is not implemented but could be added.
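For readers unfamiliar with the format, here is a minimal sketch of reading one .ass event line. It assumes the standard event field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text); real files declare their own order in a Format line, and the names below (SubtitleLine, parse_dialogue) are illustrative, not the naruto_net API.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Inline override tags such as {\i1} or {\pos(10,20)} carry styling, not dialogue.
OVERRIDE_TAG = re.compile(r"\{[^}]*\}")


@dataclass
class SubtitleLine:
    start: float  # seconds from the start of the episode
    end: float    # seconds from the start of the episode
    text: str


def parse_timestamp(stamp: str) -> float:
    """Convert an .ass timestamp (H:MM:SS.cc) to seconds."""
    hours, minutes, seconds = stamp.strip().split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def clean_text(raw: str) -> str:
    """Drop override tags and turn .ass line breaks (\\N, \\n) into spaces."""
    text = OVERRIDE_TAG.sub("", raw)
    return text.replace("\\N", " ").replace("\\n", " ").strip()


def parse_dialogue(line: str) -> Optional[SubtitleLine]:
    """Parse one 'Dialogue:' event; return None for any other line."""
    if not line.startswith("Dialogue:"):
        return None
    # The Text field may itself contain commas, so split at most 9 times.
    fields = line[len("Dialogue:"):].split(",", 9)
    if len(fields) < 10:
        return None
    return SubtitleLine(
        start=parse_timestamp(fields[1]),
        end=parse_timestamp(fields[2]),
        text=clean_text(fields[9]),
    )
```

A more robust parser would read the Format line first and map fields by name instead of by position.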

Why Neo4j instead of NetworkX only?

NetworkX is excellent for analysis (centrality calculations, community detection) but stores graphs in memory.

Neo4j provides:

Persistent storage (survives crashes, can be shared across projects)
Complex queries (Cypher is more expressive than NetworkX for multi-hop traversals)
Built-in visualization (Neo4j Browser for quick exploration)
Scalability (handles millions of nodes/edges efficiently)

The pipeline outputs CSV files, so you can use NetworkX if you prefer.

How accurate is scene detection?

Scene segmentation uses a simple heuristic: gaps >3 seconds between subtitle lines suggest scene breaks.

Accuracy (estimated from manual validation):

Precision: ~85% (most detected boundaries are real scene breaks)
Recall: ~75% (some quick cuts within 3s are missed)

This should be sufficient for co-appearance networks, where small boundary errors tend to average out across thousands of scenes.
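The gap heuristic can be sketched in a few lines. This is an illustration, not the pipeline's actual code: it assumes each subtitle line is a (start, end, text) tuple in seconds and that the gap is measured from the latest end time seen so far to the next start time.

```python
def split_into_scenes(lines, gap_threshold=3.0):
    """Group (start, end, text) tuples into scenes.

    A silence longer than gap_threshold seconds starts a new scene.
    """
    scenes = []
    current = []
    latest_end = None
    for start, end, text in sorted(lines, key=lambda item: item[0]):
        if latest_end is not None and start - latest_end > gap_threshold:
            scenes.append(current)
            current = []
        current.append((start, end, text))
        # Track the latest end time so overlapping lines do not create false gaps.
        latest_end = end if latest_end is None else max(latest_end, end)
    if current:
        scenes.append(current)
    return scenes
```

Lowering gap_threshold catches more quick cuts at the cost of splitting slow conversations, which is the precision/recall trade-off described above.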

How does character detection work?

We use word-boundary regex matching against a curated alias JSON file.

Example:

{
  "Naruto Uzumaki": ["Naruto", "Naruto-kun", "Nine-Tails", "Hokage"]
}

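The regex \b(Naruto-kun|Nine-Tails|Naruto|Hokage)\b finds matches. Longer aliases come first in the alternation, because a regex takes the first alternative that matches; this lets the longest alias win and avoids false positives (e.g., "Naruto-kun" before "Naruto", "Lady Hokage" before "Hokage").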

Confidence scoring:

Canonical name match → 1.0
Common alias → 0.8
Title/honorific only → 0.5
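A minimal sketch of the matching and scoring steps, under several assumptions: the Tsunade entry, the TITLE_ALIASES set (which aliases count as "title/honorific only"), case-insensitive matching, and all function names are illustrative and not taken from the project's alias file or code.

```python
import re

# Illustrative alias table; the Tsunade entry is an assumption for the example.
ALIASES = {
    "Naruto Uzumaki": ["Naruto", "Naruto-kun", "Nine-Tails", "Hokage"],
    "Tsunade": ["Lady Hokage"],
}
# Assumption: aliases treated as "title/honorific only" for scoring.
TITLE_ALIASES = {"hokage", "lady hokage"}


def build_matcher(aliases):
    """Return a compiled pattern and a lowercase alias -> canonical name lookup."""
    lookup = {}
    for canonical, names in aliases.items():
        lookup[canonical.lower()] = canonical
        for name in names:
            lookup[name.lower()] = canonical
    # Longest aliases first, so "Naruto-kun" is tried before "Naruto".
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(alias) for alias in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def confidence(alias, canonical):
    """Score a match: canonical name 1.0, title only 0.5, any other alias 0.8."""
    alias = alias.lower()
    if alias == canonical.lower():
        return 1.0
    if alias in TITLE_ALIASES:
        return 0.5
    return 0.8


def detect_characters(text, pattern, lookup):
    """Map each detected canonical name to its best confidence in this text."""
    found = {}
    for match in pattern.finditer(text):
        alias = match.group(1)
        canonical = lookup[alias.lower()]
        score = confidence(alias, canonical)
        found[canonical] = max(found.get(canonical, 0.0), score)
    return found
```

If two characters share an alias, this sketch silently keeps the last one loaded; a real alias file would need a rule for such collisions.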

What are the limitations?

Known limitations:

Speaker attribution: We detect mentions, not speakers. "Naruto, you're an idiot!" counts Naruto as present, but we don't know who spoke.
Narrative mentions: "Naruto is in the village" counts as presence even if he's not visually on-screen.
Flashbacks: Not distinguished from current timeline (requires external episode context).
Subtitle quality: Errors in subtitles (typos, missing lines) propagate to the network.
Alias shadowing: Common words that are also character names (e.g., "Sensei") can cause false positives.

Despite these, the method produces networks that align well with fan knowledge (validated via sanity checks like "Naruto should be most central").

Data Questions

How many characters are in the dataset?

87 unique characters across all three arcs.

24 characters appear in all 3 arcs (e.g., Naruto, Sasuke, Sakura)
50 characters per arc are tracked in detail (see data/*_characters.csv)

What's the difference between manual and extracted edges?

Manual edges (source: "manual_seed"):

Hand-coded canonical relationships (36 total)
Used for validation (e.g., "Rock Lee must connect to Gaara")
Weight = 1, confidence = 1.0

Extracted edges (source: "subtitle_extraction"):

Automatically detected from subtitle co-appearances
Weight = number of scenes shared
Confidence = 0.5 to 1.0 (based on alias matching)

Both are stored in Neo4j with the source property distinguishing them.
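To make "weight = number of scenes shared" concrete, here is a small sketch of how extracted edge weights could be counted. It assumes each scene has already been reduced to a set of canonical character names; the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_coappearances(scenes):
    """Count, for every character pair, the number of scenes they share.

    scenes: iterable of collections of canonical character names.
    Returns a Counter keyed by (character_a, character_b) with a < b.
    """
    weights = Counter()
    for characters in scenes:
        # Sorting makes (a, b) and (b, a) land on the same undirected key.
        for a, b in combinations(sorted(set(characters)), 2):
            weights[(a, b)] += 1
    return weights


# Made-up scenes for illustration only.
scenes = [
    {"Naruto Uzumaki", "Sasuke Uchiha", "Sakura Haruno"},
    {"Naruto Uzumaki", "Sasuke Uchiha"},
    {"Gaara"},
]
for (a, b), weight in sorted(count_coappearances(scenes).items()):
    print(a, b, weight)
```

Each (a, b, weight) triple corresponds to one row with character_a, character_b, and weight columns in an edge table.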

Can I access the Neo4j database directly?

The repository provides Cypher import scripts (outputs/*.cypher) to build the database yourself.

Steps:

Install Neo4j Desktop
Create a new database (version 5.x)
Paste outputs/cypher_import_characters.cypher into Neo4j Browser
Paste outputs/cypher_import_edges.cypher into Neo4j Browser
Run validation queries from outputs/analytical_queries.cypher

What's the license for the data?

Code: MIT License (free to use, modify, distribute)

Data (character networks, alias mappings): CC BY 4.0 (free to use with attribution)

Subtitle files: Not included (you must source separately, respecting copyright)

Analysis Questions

What questions does the analysis answer?

Version 1.0 focuses on three core questions:

Character balance entropy: "At what point did side characters become irrelevant?"
Community detection: "Do arcs form natural communities matching geography/allegiances?"
Naruto's centralization: "When did Naruto become too central to the story?"

See Home for full details.

What network metrics are calculated?

Planned metrics (not all implemented yet):

Degree centrality: Number of connections (who interacts with most characters?)
Betweenness centrality: Bridge characters connecting disparate groups
Eigenvector centrality: Connected to other important characters
PageRank: Google's algorithm applied to character networks
Shannon entropy: Character balance (how evenly distributed is screen time?)
Community detection: Louvain algorithm to find natural clusters

How do I run my own analysis?

Using NetworkX (no Neo4j required):

import pandas as pd
import networkx as nx

# Load edges
edges = pd.read_csv('data/processed/edges.csv')

# Build an undirected, weighted graph directly from the edge table
G = nx.from_pandas_edgelist(
    edges, source='character_a', target='character_b', edge_attr='weight'
)

# Calculate centrality
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)

# Top 10 most connected
sorted_degree = sorted(degree.items(), key=lambda x: x[1], reverse=True)[:10]
print(sorted_degree)

Using Neo4j (for complex queries):

// Top 10 most connected characters in Chunin Exams
// Neo4j 5.x no longer accepts size() on a pattern; use a COUNT subquery instead
MATCH (c:ChūninExams)
RETURN c.name, COUNT { (c)-[:CONNECTED]->() } AS degree
ORDER BY degree DESC
LIMIT 10;

Is there a visualization?

Current status: Visualization is in the design phase. The subtitle pipeline and database are complete.

Planned: Interactive D3.js force-directed graph with:

Arc comparison slider
Sketchnote-inspired aesthetic (hand-drawn annotations)
Character filtering by village/affiliation
Temporal evolution animations

Contribution Questions

Can I contribute?

Yes! Contributions welcome in several areas:

Data validation: Spot-check extracted edges against episode summaries
Alias expansion: Add missing character nicknames to character_aliases.json
Testing: Validate pipeline on additional episodes
Visualization: Help design/implement the D3.js interface
Documentation: Improve guides, add examples

See GitHub Issues for current tasks.

How do I report bugs?

Open an issue on GitHub with:

Python version (python --version)
Error message (full traceback)
Episode number (if processing subtitles)
Expected vs actual behavior

Can I request features?

Yes! Feature requests are tracked on GitHub. Please describe:

Use case: What are you trying to accomplish?
Current limitation: What prevents you from doing it now?
Proposed solution: How should it work?

How do I cite this project?

@software{naruto_network_2025,
  author = {Hidalgo-Sotelo, Barbara},
  title = {Naruto Character Network Analysis},
  year = {2025},
  url = {https://github.com/dagny099/naruto-network-graph},
  note = {Applying network science to anime storytelling}
}

Methodology Questions

Why these three arcs specifically?

Selection criteria:

S-tier quality: Fan consensus masterpieces
Narrative density: Kai versions remove filler padding
Clear boundaries: Arcs have definitive start/end points
Character diversity: Mix of protagonist-focused and ensemble arcs
Comparative potential: Different narrative structures to contrast

How is this different from "Network of Thrones"?

Similarities:

Co-occurrence methodology (characters in same scene/chapter)
Multiple centrality measures
Community detection validation
Question-driven analysis

Differences:

Medium: Visual (anime) vs text (novels)
Arc comparison: Multiple discrete units vs single book
Temporal tracking: Evolution across series vs snapshot
Character detection: Subtitle mentions vs text proximity
Dataset release: Full network data published vs paper-only

What's the validation strategy?

Three-tier validation:

Fixture tests: Parser verified against known-good subtitle excerpt
Sanity checks: "Is Naruto most central?" (should be true)
Canonical edges: 36 hand-coded relationships to compare against

Example sanity queries:

Naruto should have highest degree centrality in both arcs ✓
Rock Lee and Gaara should connect in Chunin Exams ✓
Pain should connect to most Konoha characters ✓
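Checks like these can be automated. The sketch below is one possible shape, assuming edges are available as (character_a, character_b) pairs; the function names and the toy edge list are made up for illustration and are not the project's test suite or data.

```python
from collections import defaultdict


def degree_by_character(edges):
    """Count distinct neighbours per character in an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {name: len(linked) for name, linked in neighbours.items()}


def run_sanity_checks(edges, expected_hub, required_pairs):
    """Return {check description: passed?} for hub and canonical-edge checks."""
    degrees = degree_by_character(edges)
    present = {frozenset(pair) for pair in edges}
    top_degree = max(degrees.values(), default=0)
    results = {
        f"{expected_hub} has the highest degree": (
            top_degree > 0 and degrees.get(expected_hub, 0) == top_degree
        )
    }
    for a, b in required_pairs:
        # frozenset makes the lookup independent of edge direction.
        results[f"{a} connects to {b}"] = frozenset((a, b)) in present
    return results


# Toy edge list for illustration only.
toy_edges = [
    ("Naruto", "Sasuke"),
    ("Naruto", "Sakura"),
    ("Naruto", "Rock Lee"),
    ("Gaara", "Rock Lee"),
]
for check, passed in run_sanity_checks(toy_edges, "Naruto", [("Rock Lee", "Gaara")]).items():
    print("PASS" if passed else "FAIL", check)
```

The same pattern extends to the 36 canonical edges: pass them as required_pairs and report any that the extraction missed.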

Troubleshooting

For technical issues, see the Troubleshooting page.

For questions not covered here, ask on GitHub Discussions or email barbs@balex.com.