Build Module¶
The build module constructs co-appearance edges from scene-level character presence.
Overview¶
This module handles:
- Identifying characters present in each scene
- Creating pairwise edges for co-present characters
- Aggregating edge weights across scenes
- Formatting edges for network analysis
Functions¶
build_coappearance_edges(scene_to_characters)¶
Tier-1 co-appearance: increment edge weight by 1 per scene co-presence.
Source code in src/naruto_net/build/edges.py
Usage Examples¶
Building Edges from Scenes¶
from naruto_net.build.edges import build_edges_from_scenes
# Assuming scenes already have detected characters
# (from detect module)
edges = build_edges_from_scenes(scenes)
print(f"Created {len(edges)} co-appearance edges")
# Inspect first edge
print(edges[0])
# {'character_a': 'Naruto Uzumaki', 'character_b': 'Tsunade', 'weight': 2, 'episodes': [14]}
Aggregating Across Episodes¶
from collections import defaultdict
# Combine edges from multiple episodes
all_edges = defaultdict(lambda: {'weight': 0, 'episodes': []})
for episode in [14, 15, 16]:
    episode_edges = process_episode(episode)  # Returns edge list
    for edge in episode_edges:
        key = tuple(sorted([edge['character_a'], edge['character_b']]))
        all_edges[key]['weight'] += edge['weight']
        all_edges[key]['episodes'].extend(edge['episodes'])
# Convert to list
final_edges = [
    {
        'character_a': key[0],
        'character_b': key[1],
        'weight': data['weight'],
        'episodes': data['episodes']
    }
    for key, data in all_edges.items()
]
Exporting to CSV¶
import pandas as pd
# Convert to DataFrame
df = pd.DataFrame(edges)
# Add metadata
df['arc'] = 'Pain\'s Assault'
df['source'] = 'subtitle_extraction'
# Export
df.to_csv('data/processed/edges.csv', index=False)
Exporting to NetworkX¶
import networkx as nx
# Build graph
G = nx.Graph()
for edge in edges:
    G.add_edge(
        edge['character_a'],
        edge['character_b'],
        weight=edge['weight'],
        episodes=edge['episodes']
    )
# Calculate centrality
degree = nx.degree_centrality(G)
sorted_degree = sorted(degree.items(), key=lambda x: x[1], reverse=True)
print("Top 5 most connected characters:")
for char, centrality in sorted_degree[:5]:
    print(f"  {char}: {centrality:.3f}")
Edge Construction Logic¶
Co-Presence Definition¶
Two characters have an edge if they appear in the same scene:
# Scene contains: [Naruto, Tsunade, Jiraiya]
# Create 3 edges (all pairwise combinations)
Naruto ↔ Tsunade (weight: 1)
Naruto ↔ Jiraiya (weight: 1)
Tsunade ↔ Jiraiya (weight: 1)
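The pairwise expansion above can be sketched with `itertools.combinations`. This is a minimal illustration, not the module's actual code; the helper name `scene_pairs` is hypothetical.

```python
from itertools import combinations


def scene_pairs(characters):
    """Return every unordered pair of distinct characters in one scene."""
    # Deduplicate and sort so each pair comes out in a stable, alphabetical order.
    unique = sorted(set(characters))
    return list(combinations(unique, 2))


pairs = scene_pairs(["Naruto", "Tsunade", "Jiraiya"])
for a, b in pairs:
    print(f"{a} ↔ {b} (weight: 1)")
```

A scene with k distinct characters yields k(k-1)/2 pairs, so a scene with a single character contributes no edges.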
Weight Aggregation¶
If the same pair appears in multiple scenes, weights sum:
# Scene 1: [Naruto, Tsunade] → Naruto-Tsunade (weight: 1)
# Scene 2: [Naruto, Tsunade] → Naruto-Tsunade (weight: 1)
# Scene 3: [Naruto, Sasuke] → Naruto-Sasuke (weight: 1)
# Final edges:
Naruto ↔ Tsunade (weight: 2) # Sum of Scene 1 + Scene 2
Naruto ↔ Sasuke (weight: 1)
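One way to reproduce the summation above is a `collections.Counter` keyed by sorted pairs. The function name `aggregate_weights` and the list-of-lists input are assumptions made for this sketch; the real module may structure its input differently.

```python
from collections import Counter
from itertools import combinations


def aggregate_weights(scenes):
    """Sum co-presence counts over a list of per-scene character lists."""
    weights = Counter()
    for characters in scenes:
        # One increment per unordered pair per scene.
        for pair in combinations(sorted(set(characters)), 2):
            weights[pair] += 1
    return weights


scenes = [
    ["Naruto", "Tsunade"],  # Scene 1
    ["Naruto", "Tsunade"],  # Scene 2
    ["Naruto", "Sasuke"],   # Scene 3
]
weights = aggregate_weights(scenes)
for (a, b), weight in weights.items():
    print(f"{a} ↔ {b} (weight: {weight})")
```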
Bidirectional Creation¶
For symmetric relationships stored as directed edges (for example, in a graph database), create both directions:
# Scene: [A, B, C]
# Create both directions for each pair
A → B (weight: 1)
B → A (weight: 1)
A → C (weight: 1)
C → A (weight: 1)
B → C (weight: 1)
C → B (weight: 1)
Why bidirectional?
It ensures that directed pattern queries such as MATCH (a)-[:CONNECTED]->(b) match a pair whichever character is bound to a.
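A minimal sketch of that expansion, assuming edges arrive as the undirected dictionaries used elsewhere on this page. The helper `to_directed` and the `start`/`end` keys are hypothetical names chosen for the example.

```python
def to_directed(edges):
    """Expand each undirected edge dict into two directed rows."""
    directed = []
    for edge in edges:
        a, b = edge["character_a"], edge["character_b"]
        # Emit the edge once in each direction with the same weight.
        directed.append({"start": a, "end": b, "weight": edge["weight"]})
        directed.append({"start": b, "end": a, "weight": edge["weight"]})
    return directed


undirected = [{"character_a": "A", "character_b": "B", "weight": 1}]
rows = to_directed(undirected)
for row in rows:
    print(f"{row['start']} → {row['end']} (weight: {row['weight']})")
```

Note that doubling the rows also doubles the total weight if the directed rows are later summed, so any aggregate computed on them should account for that.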
Edge Attributes¶
Each edge dictionary contains:
{
    'character_a': str,      # First character (alphabetically sorted)
    'character_b': str,      # Second character
    'weight': int,           # Number of scenes shared
    'episodes': List[int],   # Episode numbers
    'arc': str,              # "Chunin Exams", etc. (added manually)
    'source': str            # "subtitle_extraction" vs "manual_seed"
}
Alphabetical sorting: Ensures (Naruto, Sasuke) and (Sasuke, Naruto) are treated as the same edge.
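As a small illustration of why the sort matters, the sketch below merges edges whose characters were recorded in different orders. The helper names `edge_key` and `merge_edges` are hypothetical.

```python
def edge_key(name_1, name_2):
    """Canonical, order-independent key for a character pair."""
    return tuple(sorted((name_1, name_2)))


def merge_edges(raw_edges):
    """Sum weights of (name, name, weight) triples that describe the same pair."""
    merged = {}
    for name_1, name_2, weight in raw_edges:
        key = edge_key(name_1, name_2)
        merged[key] = merged.get(key, 0) + weight
    return merged


merged = merge_edges([("Naruto", "Sasuke", 1), ("Sasuke", "Naruto", 2)])
print(merged)
```

Without the sort, the two triples would land under different keys and the pair would be counted as two separate edges.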
Validation¶
Check for Orphan Characters¶
Characters mentioned but with no edges:
# All characters detected
all_characters = set()
for scene in scenes:
    all_characters.update(scene.characters)
# Characters in edges
edge_characters = set()
for edge in edges:
    edge_characters.add(edge['character_a'])
    edge_characters.add(edge['character_b'])
# Orphans
orphans = all_characters - edge_characters
if orphans:
    print(f"Warning: {len(orphans)} characters have no connections:")
    print(orphans)
Cause: The character only ever appeared alone in a scene, so it never shared a scene with another detected character (no co-presence).
Sanity Check: Expected Edges¶
# Rock Lee and Gaara should connect in Chunin Exams
expected_pair = ('Gaara', 'Rock Lee')
found = any(
    {e['character_a'], e['character_b']} == set(expected_pair)
    for e in edges
)
assert found, "Expected edge missing!"
Integration with Pipeline¶
Edge building is the final step before export:
# 1. Parse & normalize
events = AssReader('episode.ass').read_events()
for event in events:
    event.text = strip_ass_tags(event.text)
# 2. Segment
scenes = segment_scenes(events)
# 3. Detect characters
alias_dict = load_alias_dict('character_aliases.json')
for scene in scenes:
    scene_chars = set()
    for event in scene.events:
        mentions = detect_characters(event.text, alias_dict)
        scene_chars.update(m['character'] for m in mentions)
    scene.characters = list(scene_chars)
# 4. Build edges
edges = build_edges_from_scenes(scenes)
# 5. Export
pd.DataFrame(edges).to_csv('edges.csv', index=False)
Performance Notes¶
- Complexity: O(n × k²), where n is the number of scenes and k is the average number of characters per scene (a scene with k characters yields k(k−1)/2 pairs)
- Speed: Building edges for 50 scenes with avg 5 characters takes <20ms
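To make the complexity figure concrete, the amount of work can be estimated by counting pair increments. The sketch below uses a hypothetical helper, `pair_updates`, and the 50-scene, 5-character figures quoted above.

```python
from math import comb


def pair_updates(scene_sizes):
    """Count the edge-weight increments needed for the given scene sizes."""
    # A scene with k characters produces C(k, 2) = k(k-1)/2 unordered pairs.
    return sum(comb(k, 2) for k in scene_sizes)


total = pair_updates([5] * 50)
print(total)  # 50 scenes × C(5, 2) = 50 × 10 = 500
```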
Advanced: Multi-Arc Aggregation¶
Combine edges across all three arcs:
from collections import defaultdict
arc_edges = {
    'Chunin Exams': process_arc('chunin_exams', episodes=range(1, 31)),
    'Sasuke Retrieval': process_arc('sasuke_retrieval', episodes=range(107, 135)),
    'Pain\'s Assault': process_arc('pains_assault', episodes=range(152, 176))
}
# Global edge aggregation (a set records each arc at most once per pair)
global_edges = defaultdict(lambda: {'weight': 0, 'arcs': set()})
for arc, arc_edge_list in arc_edges.items():
    for edge in arc_edge_list:
        key = tuple(sorted([edge['character_a'], edge['character_b']]))
        global_edges[key]['weight'] += edge['weight']
        global_edges[key]['arcs'].add(arc)
# Characters belonging to at least one pair that co-appears in all 3 arcs
cross_arc_chars = {
    char
    for key, data in global_edges.items()
    if len(data['arcs']) == 3
    for char in key
}
print(f"{len(cross_arc_chars)} characters belong to a pair seen in all 3 arcs")
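The comprehension above selects characters from pairs that are present in every arc. If the goal is instead to find characters that appear in every arc with any partner, arc membership can be tracked per character. This is a sketch under that assumption; `characters_in_all_arcs` and the toy edge lists are hypothetical and not project data.

```python
from collections import defaultdict


def characters_in_all_arcs(arc_edges):
    """Return characters that have at least one edge in every arc."""
    arcs_by_character = defaultdict(set)
    for arc, edges in arc_edges.items():
        for edge in edges:
            arcs_by_character[edge["character_a"]].add(arc)
            arcs_by_character[edge["character_b"]].add(arc)
    return {
        character
        for character, arcs in arcs_by_character.items()
        if len(arcs) == len(arc_edges)
    }


# Toy input: "A" appears in all three arcs, but never with the same partner.
toy_arc_edges = {
    "arc_1": [{"character_a": "A", "character_b": "B", "weight": 1}],
    "arc_2": [{"character_a": "A", "character_b": "C", "weight": 1}],
    "arc_3": [{"character_a": "A", "character_b": "D", "weight": 1}],
}
result = characters_in_all_arcs(toy_arc_edges)
print(result)
```

In the toy input no single pair spans all three arcs, so a pair-based filter would return nothing, while the per-character version still reports "A".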