Detect Module¶
The detect module provides character mention detection using alias-based regex matching.
Overview¶
This module handles:
- Loading character alias dictionaries
- Building word-boundary regex patterns
- Detecting character mentions in dialogue text
- Assigning confidence scores to matches
Functions¶
load_characters_csv(path)¶
Expected columns (exact): character_id, character_name, aliases, affiliation_primary, affiliation_detail, affiliation_changes, first_appearance_arc, role_type, estimated_importance
Source code in src/naruto_net/detect/mentions.py
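The exact-column requirement can be illustrated with a minimal sketch. This is not the module's own code: `read_character_rows` is a hypothetical helper, the sample row is made up for illustration, and both the column order and the pipe separator inside `aliases` are assumptions.

```python
import csv
import io

# Column names taken from the docstring above; the order is assumed to match.
EXPECTED_COLUMNS = [
    "character_id", "character_name", "aliases",
    "affiliation_primary", "affiliation_detail", "affiliation_changes",
    "first_appearance_arc", "role_type", "estimated_importance",
]

def read_character_rows(handle):
    """Hypothetical helper: parse rows, rejecting a header that is not exact."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)

# Illustrative in-memory CSV; the values are invented for this example.
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "C001,Jiraiya,Pervy Sage|Toad Sage,Konoha,,,Chunin Exams,mentor,high\n"
)
rows = read_character_rows(sample)
print(rows[0]["character_name"], rows[0]["aliases"])
```

Failing fast on a wrong header keeps a renamed or missing column from silently producing empty alias lists later on.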
build_alias_index(characters)¶
Return (alias, character_id) pairs sorted longest-first (prevents shadowing).
Source code in src/naruto_net/detect/mentions.py
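One way such an index could be built is sketched below. The name `build_alias_index_sketch` is hypothetical, and two details are assumptions rather than documented behaviour: that `aliases` is a pipe-separated string, and that the character's own name is also treated as an alias.

```python
def build_alias_index_sketch(characters):
    """Return (alias, character_id) pairs, longest alias first."""
    pairs = set()
    for row in characters:
        # Assumption: aliases are pipe-separated inside the `aliases` column.
        names = [row["character_name"]] + row["aliases"].split("|")
        for name in names:
            name = name.strip()
            if name:
                pairs.add((name, row["character_id"]))
    # Descending length first; the remaining keys only make ties deterministic.
    return sorted(pairs, key=lambda p: (-len(p[0]), p[0].lower(), p[1]))

characters = [
    {"character_id": "jiraiya", "character_name": "Jiraiya",
     "aliases": "Sage|Pervy Sage"},
]
for alias, character_id in build_alias_index_sketch(characters):
    print(alias, "->", character_id)
```

Sorting once at index-build time means the matcher can simply iterate in order and trust that a longer alias is always tried before any shorter alias it contains.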
detect_mentions(text, utterance_id, alias_index)¶
Deterministic alias matching with word boundaries + simple confidence heuristic.
Source code in src/naruto_net/detect/mentions.py
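As a rough illustration of deterministic, word-bounded alias matching, the sketch below scans the text once per alias and lets longer aliases claim their span first. It is an assumption about the general shape, not the actual implementation: the function name, the record fields, and the `generic_sage` id are hypothetical, and the confidence heuristic is left out.

```python
import re

def detect_mentions_sketch(text, utterance_id, alias_index):
    """Scan text for each alias; longer aliases claim their span first."""
    mentions = []
    claimed = []  # (start, end) spans already matched by a longer alias
    for alias, character_id in alias_index:  # assumed sorted longest-first
        pattern = re.compile(rf"\b{re.escape(alias)}\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            start, end = match.span()
            if any(start < c_end and c_start < end for c_start, c_end in claimed):
                continue  # overlaps a span that a longer alias already took
            claimed.append((start, end))
            mentions.append({
                "utterance_id": utterance_id,
                "character_id": character_id,
                "alias_matched": match.group(0),
                "start": start,
            })
    # Sort by position so the output order does not depend on alias order.
    return sorted(mentions, key=lambda m: m["start"])

index = [("Pervy Sage", "jiraiya"), ("Sage", "generic_sage")]
found = detect_mentions_sketch("I will avenge Pervy Sage!", "u1", index)
print([m["alias_matched"] for m in found])  # ['Pervy Sage']
```

Because nothing in the function depends on randomness or on dictionary ordering, the same text and index always give the same mentions.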
Usage Examples¶
Basic Character Detection¶
from naruto_net.detect.mentions import detect_characters, load_alias_dict
# Load alias mapping
alias_dict = load_alias_dict('data/character_aliases.json')
# Detect characters in dialogue
text = "I will avenge Pervy Sage!"
mentions = detect_characters(text, alias_dict)
for mention in mentions:
    print(f"Found: {mention['character']} (alias: '{mention['alias_matched']}', confidence: {mention['confidence']})")
# Output: Found: Jiraiya (alias: 'Pervy Sage', confidence: 0.8)
Processing Multiple Events¶
from naruto_net.io.subtitles import AssReader
reader = AssReader('episode_014.ass')
events = reader.read_events()
all_mentions = []
for event in events:
    mentions = detect_characters(event.text, alias_dict)
    for mention in mentions:
        mention['start_time'] = event.start_time
        mention['end_time'] = event.end_time
        all_mentions.append(mention)
print(f"Total mentions: {len(all_mentions)}")
Filtering by Confidence¶
# Only high-confidence matches
high_conf = [m for m in all_mentions if m['confidence'] >= 0.8]
print(f"High-confidence mentions: {len(high_conf)} / {len(all_mentions)}")
Alias Dictionary Format¶
The alias dictionary is a JSON file mapping canonical names to lists of aliases:
{
  "Naruto Uzumaki": [
    "Naruto",
    "Naruto-kun",
    "Naruto-boy",
    "Nine-Tails",
    "Seventh Hokage"
  ],
  "Jiraiya": [
    "Jiraiya",
    "Pervy Sage",
    "Ero-sennin",
    "Toad Sage",
    "Lord Jiraiya"
  ],
  "Tsunade": [
    "Tsunade",
    "Lady Hokage",
    "Fifth Hokage",
    "Lady Tsunade",
    "Granny Tsunade"
  ]
}
Key requirements:
- Canonical name is the key (used in network nodes)
- Aliases are matched case-insensitively
- Longer aliases should be listed first (prevents shadowing); aliases are also sorted longest-first before matching, so this mainly keeps the file readable
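The requirements above can be checked while turning the dictionary into a case-insensitive lookup. The following is only a sketch: `build_lookup` is a hypothetical helper, and the inline JSON string stands in for the real file.

```python
import json

# Inline stand-in for the alias file, using entries from the format above.
raw = '{"Jiraiya": ["Jiraiya", "Pervy Sage"], "Tsunade": ["Tsunade", "Lady Hokage"]}'
alias_dict = json.loads(raw)

def build_lookup(alias_dict):
    """Hypothetical helper: map each lower-cased alias to its canonical name."""
    lookup = {}
    for canonical, aliases in alias_dict.items():
        # Each value must be a list of strings, as in the format shown above.
        if not isinstance(aliases, list) or not all(isinstance(a, str) for a in aliases):
            raise ValueError(f"aliases for {canonical!r} must be a list of strings")
        for alias in aliases:
            lookup[alias.casefold()] = canonical
    return lookup

lookup = build_lookup(alias_dict)
print(lookup["pervy sage"])  # Jiraiya
```

Using `casefold()` on both the stored alias and the queried text is one straightforward way to get case-insensitive behaviour without touching the canonical keys.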
Matching Logic¶
Word-Boundary Regex¶
Each alias is wrapped with \b word boundaries:
# For alias "Pervy Sage"
pattern = r"\b(Pervy Sage)\b"
# Matches
"I will avenge Pervy Sage!" # ✓
"Where is Pervy Sage?" # ✓
# Doesn't match
"Pervysage" # ✗ (no word boundary)
"Pervy Sagebrush" # ✗ (boundary at start only)
Why word boundaries?
Prevents false positives from partial matches:
# Without word boundaries (case-insensitive matching),
# "Sage" would also match inside "Sagebrush" and "Message"
# With word boundaries,
# "Sage" matches only as a standalone word (so it still matches in "Sage Mode")
Longest-First Matching¶
Aliases are sorted by length (longest first) to prevent shadowing:
# Alias list for Tsunade
["Lady Hokage", "Hokage", "Lady Tsunade", "Tsunade"]
# Sorted for matching (longest first)
["Lady Tsunade", "Lady Hokage", "Tsunade", "Hokage"]
# Text: "So what should we do, Lady Hokage?"
# Matches "Lady Hokage" (not just "Hokage")
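The effect of ordering is easiest to see when one alias is a prefix of another. The sketch below assumes the aliases are joined into a single alternation, which may not be how the module builds its patterns; `compile_aliases` is a hypothetical helper.

```python
import re

def compile_aliases(aliases):
    """Join aliases into one word-bounded alternation, in the given order."""
    body = "|".join(re.escape(a) for a in aliases)
    return re.compile(rf"\b({body})\b", re.IGNORECASE)

text = "Wait for me, Naruto-kun!"
unsorted_aliases = ["Naruto", "Naruto-kun"]
longest_first = sorted(unsorted_aliases, key=len, reverse=True)

# Alternation tries branches left to right, so order decides which one wins.
print(compile_aliases(unsorted_aliases).search(text).group(1))  # Naruto
print(compile_aliases(longest_first).search(text).group(1))     # Naruto-kun
```

The hyphen counts as a word boundary, so the shorter "Naruto" is a valid match on its own; only the ordering makes the more specific alias win.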
Confidence Scoring¶
Each match gets a confidence score:
| Match Type | Confidence | Example |
|---|---|---|
| Canonical name | 1.0 | "Naruto Uzumaki" |
| Common alias | 0.8 | "Pervy Sage" → Jiraiya |
| Title/honorific only | 0.5 | "Lady Hokage" (could be Tsunade or another) |
Heuristic rules:
- If alias contains full first/last name → 1.0
- If alias is character-specific nickname → 0.8
- If alias is generic title → 0.5
Future improvement: Learn confidence from training data.
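The three heuristic rules could be expressed roughly as follows. This is a sketch under stated assumptions: `heuristic_confidence` and the `GENERIC_TITLES` set are hypothetical, and "contains full first/last name" is interpreted as sharing at least one whole word with the canonical name.

```python
import re

# Hypothetical set of titles that more than one character can hold.
GENERIC_TITLES = {"hokage", "lady hokage"}

def heuristic_confidence(alias, canonical_name):
    """Score an alias as 1.0 (name), 0.8 (nickname) or 0.5 (generic title)."""
    alias_tokens = set(re.findall(r"\w+", alias.lower()))
    name_tokens = set(re.findall(r"\w+", canonical_name.lower()))
    if alias_tokens & name_tokens:
        return 1.0  # alias shares a first or last name with the canonical name
    if alias.lower() in GENERIC_TITLES:
        return 0.5  # title that does not identify a single character
    return 0.8      # anything else is treated as a character-specific nickname

print(heuristic_confidence("Naruto Uzumaki", "Naruto Uzumaki"))  # 1.0
print(heuristic_confidence("Pervy Sage", "Jiraiya"))             # 0.8
print(heuristic_confidence("Lady Hokage", "Tsunade"))            # 0.5
```

Keeping the generic titles in an explicit set makes the lowest tier easy to audit, at the cost of having to maintain that list by hand.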
Validation¶
Check for Shadowing¶
Shadowing occurs when a shorter alias prevents a longer, more specific alias from matching:
# Bad alias order
aliases = ["Sage", "Pervy Sage"] # ❌ "Sage" matches first!
# Good alias order
aliases = ["Pervy Sage", "Sage"] # ✓ "Pervy Sage" matches first
The build_alias_index() function sorts aliases longest-first automatically, so this ordering does not have to be maintained by hand.
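A small check along these lines could flag a badly ordered alias list before it is used. `find_shadowed` is a hypothetical helper, not part of the module.

```python
def find_shadowed(aliases):
    """Return (shorter, longer) pairs where the shorter alias is listed first."""
    problems = []
    for i, early in enumerate(aliases):
        for late in aliases[i + 1:]:
            # A shorter alias contained in a later, longer one can shadow it.
            if len(early) < len(late) and early.lower() in late.lower():
                problems.append((early, late))
    return problems

print(find_shadowed(["Sage", "Pervy Sage"]))  # [('Sage', 'Pervy Sage')]
print(find_shadowed(["Pervy Sage", "Sage"]))  # []
```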
Inspect Match Frequency¶
from collections import Counter
# Count how often each alias is matched
alias_counts = Counter(m['alias_matched'] for m in all_mentions)
print("Top 10 most common aliases:")
for alias, count in alias_counts.most_common(10):
    print(f"  {alias}: {count}")
Known Limitations¶
Speaker Disambiguation¶
We detect mentions, not speakers:
text = "Naruto, you're an idiot!"
# Detects: a mention of Naruto
# Does NOT detect: Who said it (could be Sakura, Sasuke, anyone)
Narrative vs Visual Presence¶
text = "Naruto is in the village right now."
# Detects: Naruto mention
# Issue: Speaker is talking ABOUT Naruto, not TO him
# We don't distinguish whether Naruto is visually on-screen
Ambiguous Titles¶
text = "The Hokage will decide."
# Could be: Tsunade (Fifth), Kakashi (Sixth), or historical reference
# Current: Matched with lower confidence (0.5)
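Ambiguous titles can be surfaced ahead of time by looking for aliases that appear under more than one canonical name. The sketch below uses a made-up dictionary in which "Hokage" is listed for two characters; `ambiguous_aliases` is a hypothetical helper.

```python
from collections import defaultdict

def ambiguous_aliases(alias_dict):
    """Return aliases (lower-cased) claimed by more than one canonical name."""
    owners = defaultdict(set)
    for canonical, aliases in alias_dict.items():
        for alias in aliases:
            owners[alias.lower()].add(canonical)
    return {alias: sorted(names) for alias, names in owners.items() if len(names) > 1}

# Illustrative input only; the real alias file may not list "Hokage" at all.
sample = {
    "Tsunade": ["Tsunade", "Hokage", "Fifth Hokage"],
    "Kakashi Hatake": ["Kakashi", "Hokage", "Sixth Hokage"],
}
print(ambiguous_aliases(sample))  # {'hokage': ['Kakashi Hatake', 'Tsunade']}
```

Aliases reported here are natural candidates for the lowest confidence tier, or for removal if they cause too many false links.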
Integration with Pipeline¶
Detection happens after segmentation:
# 1. Parse & normalize
events = AssReader('episode.ass').read_events()
for event in events:
    event.text = strip_ass_tags(event.text)
# 2. Segment
scenes = segment_scenes(events)
# 3. Detect characters in each scene
alias_dict = load_alias_dict('character_aliases.json')
for scene in scenes:
    scene_characters = set()
    for event in scene.events:
        mentions = detect_characters(event.text, alias_dict)
        for mention in mentions:
            scene_characters.add(mention['character'])
    scene.characters = list(scene_characters)
# 4. Build edges from scene co-presence
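Step 4 is not spelled out here, so the following is only one plausible reading: every pair of characters sharing a scene gets an edge, weighted by the number of scenes they share. `cooccurrence_edges` is a hypothetical helper and the weighting scheme is an assumption.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(scene_character_lists):
    """Count, for each unordered character pair, the scenes they share."""
    edges = Counter()
    for characters in scene_character_lists:
        # Sorting gives each pair a canonical order, so (a, b) == (b, a).
        for a, b in combinations(sorted(set(characters)), 2):
            edges[(a, b)] += 1
    return edges

scene_lists = [
    ["Naruto Uzumaki", "Jiraiya"],
    ["Jiraiya", "Tsunade", "Naruto Uzumaki"],
]
edges = cooccurrence_edges(scene_lists)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

Deduplicating with `set()` before pairing means a character mentioned several times in one scene still contributes a single co-presence per scene.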
Performance Notes¶
- Regex compilation: Patterns compiled once, reused
- Speed: Matching 1000 dialogue lines against 87 character aliases takes ~50ms
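A timing figure like the one above can be reproduced locally with a small harness. The sketch below compiles one pattern up front and reuses it for every line; the alias list and dialogue are placeholders, and the measured time will vary by machine.

```python
import re
import time

aliases = ["Pervy Sage", "Lady Hokage", "Naruto"]
ordered = sorted(aliases, key=len, reverse=True)

# Compile once, outside the loop, so every line reuses the same pattern object.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in ordered) + r")\b",
    re.IGNORECASE,
)

lines = ["I will avenge Pervy Sage!"] * 1000

start = time.perf_counter()
hits = sum(1 for line in lines for _ in pattern.finditer(line))
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{hits} matches in {elapsed_ms:.1f} ms")
```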
Customization¶
Adding New Characters¶
Edit data/character_aliases.json:
{
  "Kakashi Hatake": [
    "Kakashi",
    "Kakashi-sensei",
    "Copy Ninja",
    "Sixth Hokage",
    "Kakashi of the Sharingan"
  ]
}
Then reload the alias dictionary:
alias_dict = load_alias_dict('data/character_aliases.json')
Custom Confidence Scoring¶
Override the default heuristic:
def custom_confidence(alias: str, canonical: str) -> float:
    """Custom confidence scoring logic."""
    if canonical.lower() in alias.lower():
        return 1.0  # Alias contains name
    elif len(alias.split()) > 1:
        return 0.8  # Multi-word nickname
    else:
        return 0.5  # Single-word title
# Apply when building mentions
mention['confidence'] = custom_confidence(alias, character)