Detect Module¶

The detect module provides character mention detection using alias-based regex matching.

Overview¶

This module handles:

Loading character alias dictionaries
Building word-boundary regex patterns
Detecting character mentions in dialogue text
Assigning confidence scores to matches

Functions¶

`load_characters_csv(path)` ¶

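Loads character records from a CSV file. The file is read as UTF-8 (a leading byte-order mark is tolerated), the aliases column is split on commas, and missing or empty cells are read as empty strings.

Expected columns (exact): character_id, character_name, aliases, affiliation_primary, affiliation_detail, affiliation_changes, first_appearance_arc, role_type, estimated_importance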

Source code in src/naruto_net/detect/mentions.py

def load_characters_csv(path: Path) -> list[Character]:
    """
    Expected columns (exact):
    character_id, character_name, aliases, affiliation_primary, affiliation_detail,
    affiliation_changes, first_appearance_arc, role_type, estimated_importance
    """
    out: list[Character] = []
    with path.open("r", encoding="utf-8-sig", newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            aliases_raw = (row.get("aliases") or "").strip()
            aliases = [a.strip() for a in aliases_raw.split(",") if a.strip()]

            out.append(Character(
                character_id=(row.get("character_id") or "").strip(),
                character_name=(row.get("character_name") or "").strip(),
                aliases=aliases,
                affiliation_primary=(row.get("affiliation_primary") or "").strip(),
                affiliation_detail=(row.get("affiliation_detail") or "").strip(),
                affiliation_changes=(row.get("affiliation_changes") or "").strip(),
                first_appearance_arc=(row.get("first_appearance_arc") or "").strip(),
                role_type=(row.get("role_type") or "").strip(),
                estimated_importance=(row.get("estimated_importance") or "").strip(),
            ))
    return out
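
As a rough illustration of how the aliases column is handled in the listing above, the sketch below reads an in-memory CSV and applies the same comma split. The sample rows, the character IDs, and the `parse_aliases` helper are assumptions made for this example; they are not part of the module.

```python
import csv
import io


def parse_aliases(raw):
    """Split a comma-separated alias cell, dropping blanks and surrounding spaces."""
    return [a.strip() for a in (raw or "").split(",") if a.strip()]


# Hypothetical sample data; only three of the nine expected columns are shown.
SAMPLE = (
    "character_id,character_name,aliases\n"
    'jiraiya,Jiraiya,"Pervy Sage, Toad Sage,, Lord Jiraiya "\n'
    "tsunade,Tsunade,\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
aliases_by_id = {row["character_id"]: parse_aliases(row.get("aliases")) for row in rows}
print(aliases_by_id)
```

Because the cell is split on commas, an alias cannot itself contain a comma, and a cell holding several aliases has to be quoted in the CSV file.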

`build_alias_index(characters)` ¶

Return (alias, character_id) pairs sorted longest-first (prevents shadowing).

Source code in src/naruto_net/detect/mentions.py

def build_alias_index(characters: list[Character]) -> list[tuple[str, str]]:
    """Return (alias, character_id) pairs sorted longest-first (prevents shadowing)."""
    pairs: list[tuple[str, str]] = []
    for ch in characters:
        if ch.character_name:
            pairs.append((ch.character_name, ch.character_id))
        for a in ch.aliases:
            pairs.append((a, ch.character_id))
    pairs = [(a, cid) for a, cid in pairs if a and cid]
    pairs.sort(key=lambda x: len(x[0]), reverse=True)
    return pairs
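
The sketch below runs the same logic on two in-memory records to show the resulting order. The `Character` class here is a minimal stand-in with only the fields the function reads (its real definition is not shown on this page), and the character IDs are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    """Minimal stand-in with only the fields build_alias_index reads (assumed shape)."""
    character_id: str
    character_name: str
    aliases: list = field(default_factory=list)


def build_alias_index(characters):
    """Mirror of the listing above: name plus aliases, longest alias first."""
    pairs = []
    for ch in characters:
        if ch.character_name:
            pairs.append((ch.character_name, ch.character_id))
        for a in ch.aliases:
            pairs.append((a, ch.character_id))
    pairs = [(a, cid) for a, cid in pairs if a and cid]
    pairs.sort(key=lambda x: len(x[0]), reverse=True)
    return pairs


index = build_alias_index([
    Character("jiraiya", "Jiraiya", ["Pervy Sage", "Toad Sage"]),
    Character("", "Unknown", ["Nobody"]),  # empty ID: every pair for this record is dropped
])
print(index)
```

Note that the function does not deduplicate: if a canonical name is also listed among its own aliases, the same pair appears twice in the index.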

`detect_mentions(text, utterance_id, alias_index)` ¶

Deterministic alias matching with word boundaries + simple confidence heuristic.

Source code in src/naruto_net/detect/mentions.py

def detect_mentions(text: str, utterance_id: str, alias_index: list[tuple[str, str]]) -> list[Mention]:
    """Deterministic alias matching with word boundaries + simple confidence heuristic."""
    mentions: list[Mention] = []
    for alias, cid in alias_index:
        pat = r"\b" + re.escape(alias) + r"\b"
        for m in re.finditer(pat, text, flags=re.IGNORECASE):
            L = len(alias)
            confidence = 1.0 if L >= 6 else 0.85 if L >= 4 else 0.7
            mentions.append(Mention(
                utterance_id=utterance_id,
                character_id=cid,
                alias_matched=alias,
                start_char=m.start(),
                end_char=m.end(),
                confidence=confidence,
            ))
    return mentions
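
The listing stores the alias from the index in `alias_matched`, not the text as it was written, so the character offsets are what recover the surface form. A minimal sketch of the same pattern construction, using a hypothetical helper `find_alias_spans` that returns only the offsets:

```python
import re


def find_alias_spans(text, alias):
    """Return (start, end) offsets of whole-word, case-insensitive alias matches."""
    pattern = r"\b" + re.escape(alias) + r"\b"
    return [(m.start(), m.end()) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


text = "Pervy Sage! Where did that pervy sage go?"
spans = find_alias_spans(text, "Pervy Sage")
surfaces = [text[start:end] for start, end in spans]
print(spans)     # offsets into the original text
print(surfaces)  # the text as it was actually written
```

Slicing the original text with `start_char` and `end_char` in this way is useful when the casing of the dialogue matters downstream.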

Usage Examples¶

Basic Character Detection¶

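from pathlib import Path

from naruto_net.detect.mentions import build_alias_index, detect_mentions, load_characters_csv

# Load characters and build the longest-first alias index
# (the CSV path is illustrative)
characters = load_characters_csv(Path('data/characters.csv'))
alias_index = build_alias_index(characters)

# Detect characters in dialogue (the utterance ID is illustrative)
text = "I will avenge Pervy Sage!"
mentions = detect_mentions(text, utterance_id='ep014_0001', alias_index=alias_index)

for mention in mentions:
    print(f"Found: {mention.character_id} (alias: '{mention.alias_matched}', confidence: {mention.confidence})")
# A 10-character alias such as 'Pervy Sage' is scored 1.0 by detect_mentions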

Processing Multiple Events¶

from naruto_net.io.subtitles import AssReader

reader = AssReader('episode_014.ass')
events = reader.read_events()

# alias_index is the output of build_alias_index(...)
all_mentions = []
timing = {}  # utterance_id -> (start_time, end_time)
for i, event in enumerate(events):
    utterance_id = f"ep014_{i:04d}"  # illustrative ID scheme
    timing[utterance_id] = (event.start_time, event.end_time)
    all_mentions.extend(detect_mentions(event.text, utterance_id, alias_index))

print(f"Total mentions: {len(all_mentions)}")

Filtering by Confidence¶

# Only high-confidence matches (with the length heuristic: aliases of 4+ characters)
high_conf = [m for m in all_mentions if m.confidence >= 0.8]

print(f"High-confidence mentions: {len(high_conf)} / {len(all_mentions)}")

Alias Dictionary Format¶

The alias dictionary is a JSON file mapping canonical names to lists of aliases:

{
  "Naruto Uzumaki": [
    "Naruto",
    "Naruto-kun",
    "Naruto-boy",
    "Nine-Tails",
    "Seventh Hokage"
  ],
  "Jiraiya": [
    "Jiraiya",
    "Pervy Sage",
    "Ero-sennin",
    "Toad Sage",
    "Lord Jiraiya"
  ],
  "Tsunade": [
    "Tsunade",
    "Lady Hokage",
    "Fifth Hokage",
    "Lady Tsunade",
    "Granny Tsunade"
  ]
}

Key requirements:

Canonical name is the key (used in network nodes)
Aliases are matched case-insensitively
Alias order within a list does not matter: the alias index is sorted longest-first when it is built
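
One way to work with this format is to flatten it into (alias, canonical name) pairs sorted longest-first, the same shape that `build_alias_index` produces. The helper name `alias_pairs_from_mapping` is hypothetical and the JSON is inlined for the example; this is a sketch of the idea, not the module's loader.

```python
import json


def alias_pairs_from_mapping(mapping):
    """Flatten {canonical: [aliases]} into unique (alias, canonical) pairs, longest first."""
    pairs = []
    for canonical, aliases in mapping.items():
        # The canonical name is treated as an alias of itself.
        for alias in [canonical, *aliases]:
            pair = (alias, canonical)
            if alias and pair not in pairs:
                pairs.append(pair)
    pairs.sort(key=lambda p: len(p[0]), reverse=True)
    return pairs


raw = '{"Jiraiya": ["Jiraiya", "Pervy Sage", "Toad Sage"]}'
pairs = alias_pairs_from_mapping(json.loads(raw))
print(pairs)
```

Skipping repeated pairs means a canonical name that is also listed among its own aliases, as in the sample above, is indexed only once.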

Matching Logic¶

Word-Boundary Regex¶

Each alias is wrapped with \b word boundaries:

# For alias "Pervy Sage"
pattern = r"\b" + re.escape("Pervy Sage") + r"\b"

# Matches
"I will avenge Pervy Sage!"  # ✓
"Where is Pervy Sage?"       # ✓

# Doesn't match
"Pervysage"                  # ✗ (no space between the words)
"Pervy Sagebrush"            # ✗ (no word boundary after "Sage")

Why word boundaries?

Prevents false positives from partial matches:

# Without word boundaries
"Sage" would also match inside "Sagebrush" and "Message"

# With word boundaries
"Sage" only matches as a standalone word (including the "Sage" in "Sage Mode")

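The difference can be checked directly with Python's `re` module. The sketch below uses a hypothetical helper `matches` and a handful of made-up sample strings to compare a bare substring search against the word-boundary pattern.

```python
import re


def matches(alias, text, word_boundaries=True):
    """Case-insensitive search for alias, optionally requiring word boundaries."""
    core = re.escape(alias)
    pattern = rf"\b{core}\b" if word_boundaries else core
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


samples = ["Message received", "Sagebrush", "Sage Mode", "the Sage!"]
loose = [t for t in samples if matches("Sage", t, word_boundaries=False)]
strict = [t for t in samples if matches("Sage", t)]
print(loose)   # every sample contains the letters "sage"
print(strict)  # only samples where "Sage" stands alone as a word
```

Punctuation and spaces both count as boundaries, which is why "the Sage!" and "Sage Mode" still match under the stricter pattern.
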
Longest-First Matching¶

Aliases are sorted by length (longest first) to prevent shadowing:

# Alias list for Tsunade
["Lady Hokage", "Hokage", "Lady Tsunade", "Tsunade"]

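# Sorted for matching (12, 11, 7, and 6 characters)
["Lady Tsunade", "Lady Hokage", "Tsunade", "Hokage"]

# Text: "So what should we do, Lady Hokage?"
# "Lady Hokage" is matched first. detect_mentions, as listed above, still scans
# every remaining alias, so "Hokage" is also reported for the overlapping span.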
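
The `detect_mentions` listing earlier on this page scans every alias independently, so sorting alone does not stop a shorter alias from matching inside a span that a longer alias already covers. If only the longest match per span is wanted, one possible approach is to track claimed spans. The function below is a sketch of that idea under the assumption that the index is sorted longest-first; it is not part of the module, and the character ID is made up.

```python
import re


def detect_non_overlapping(text, alias_index):
    """Keep a match only if it does not overlap a span claimed by an earlier alias."""
    claimed = []  # (start, end) spans already accepted
    results = []
    for alias, cid in alias_index:  # assumed to be sorted longest-first
        pattern = r"\b" + re.escape(alias) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            if any(m.start() < end and start < m.end() for start, end in claimed):
                continue  # overlaps a longer alias that was matched earlier
            claimed.append((m.start(), m.end()))
            results.append((alias, cid, m.start(), m.end()))
    return results


index = [("Lady Hokage", "tsunade"), ("Hokage", "tsunade")]
text = "So what should we do, Lady Hokage? The Hokage decides."
results = detect_non_overlapping(text, index)
print(results)
```

Here the "Hokage" inside "Lady Hokage" is skipped, while the standalone "Hokage" later in the sentence is still reported.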

Confidence Scoring¶

Each match gets a confidence score based on the length of the matched alias, following the heuristic in detect_mentions:

Alias length	Confidence	Example
6 or more characters	1.0	"Pervy Sage", "Hokage"
4 or 5 characters	0.85	"Sage"
3 or fewer characters	0.7	(any very short alias)

Heuristic rules:

If the alias has 6 or more characters → 1.0
If the alias has 4 or 5 characters → 0.85
Otherwise → 0.7

The score reflects alias length only; it does not distinguish canonical names, nicknames, or generic titles.

Future improvement: Learn confidence from training data.
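
The `detect_mentions` listing earlier on this page derives the score from the length of the alias alone. A minimal sketch of that rule in isolation follows; `length_confidence` is a name chosen for this example, and "Lee" is a made-up short alias.

```python
def length_confidence(alias):
    """Length-based score: 6+ characters, 4 to 5 characters, or shorter."""
    n = len(alias)
    if n >= 6:
        return 1.0
    if n >= 4:
        return 0.85
    return 0.7


scores = {alias: length_confidence(alias) for alias in ["Pervy Sage", "Hokage", "Sage", "Lee"]}
print(scores)
```

Spaces count toward the length, so multi-word aliases almost always land in the top tier.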

Validation¶

Check for Shadowing¶

Shadowing occurs when a shorter alias prevents a longer, more specific alias from matching:

# Bad alias order
aliases = ["Sage", "Pervy Sage"]  # ❌ "Sage" matches first!

# Good alias order
aliases = ["Pervy Sage", "Sage"]  # ✓ "Pervy Sage" matches first

The build_alias_index() function automatically sorts aliases longest-first, so the order in the source data does not matter.
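
To see which aliases can shadow or overlap one another in the first place, it may help to list every alias that occurs as a whole word inside a longer alias. The helper below is a hypothetical validation aid, not a function of the module.

```python
import re


def find_shadowing_pairs(aliases):
    """Return (shorter, longer) pairs where `shorter` occurs as a whole word inside `longer`."""
    pairs = []
    for shorter in aliases:
        pattern = r"\b" + re.escape(shorter) + r"\b"
        for longer in aliases:
            if len(longer) > len(shorter) and re.search(pattern, longer, flags=re.IGNORECASE):
                pairs.append((shorter, longer))
    return pairs


shadowing = find_shadowing_pairs(["Sage", "Pervy Sage", "Toad Sage", "Tsunade"])
print(shadowing)
```

Each reported pair marks a place where a single stretch of dialogue can trigger two aliases at once.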

Inspect Match Frequency¶

from collections import Counter

# Count how often each alias is matched
alias_counts = Counter(m.alias_matched for m in all_mentions)

print("Top 10 most common aliases:")
for alias, count in alias_counts.most_common(10):
    print(f"  {alias}: {count}")

Known Limitations¶

Speaker Disambiguation¶

We detect mentions, not speakers:

text = "Naruto, you're an idiot!"
# Detects: Naruto is present
# Does NOT detect: Who said it (could be Sakura, Sasuke, anyone)

Narrative vs Visual Presence¶

text = "Naruto is in the village right now."
# Detects: Naruto mention
# Issue: Speaker is talking ABOUT Naruto, not TO him
# We don't distinguish whether Naruto is visually on-screen

Ambiguous Titles¶

text = "The Hokage will decide."
# Could be: Tsunade (Fifth), Kakashi (Sixth), or historical reference
# Current: Scored by alias length like any other match (1.0 for "Hokage"),
# so the ambiguity is not reflected in the confidence
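
Ambiguous titles can at least be surfaced before matching by checking whether the same alias is registered for more than one character. The sketch below assumes an alias index of (alias, character_id) pairs; the helper name and the sample IDs are made up for illustration.

```python
from collections import defaultdict


def ambiguous_aliases(alias_index):
    """Map each alias (lowercased) to its owners when more than one character claims it."""
    owners = defaultdict(set)
    for alias, cid in alias_index:
        # Matching is case-insensitive, so compare aliases in lowercase.
        owners[alias.lower()].add(cid)
    return {alias: sorted(cids) for alias, cids in owners.items() if len(cids) > 1}


index = [("Hokage", "tsunade"), ("hokage", "kakashi"), ("Copy Ninja", "kakashi")]
ambiguous = ambiguous_aliases(index)
print(ambiguous)
```

Aliases reported here are candidates for removal from the index or for a manually lowered confidence.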

Integration with Pipeline¶

Detection happens after segmentation:

# 1. Parse & normalize
events = AssReader('episode.ass').read_events()
for event in events:
    event.text = strip_ass_tags(event.text)

# 2. Segment
scenes = segment_scenes(events)

# 3. Detect characters in each scene
alias_index = build_alias_index(load_characters_csv(Path('data/characters.csv')))  # illustrative path

for scene in scenes:
    scene_characters = set()
    for i, event in enumerate(scene.events):
        mentions = detect_mentions(event.text, f"{i:04d}", alias_index)  # illustrative utterance ID
        for mention in mentions:
            scene_characters.add(mention.character_id)

    scene.characters = list(scene_characters)

# 4. Build edges from scene co-presence
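
Step 4 belongs to the Build module, whose actual weighting scheme is not described on this page. As an assumption-laden sketch, co-presence edges could be counted by pairing every two characters that share a scene:

```python
from collections import Counter
from itertools import combinations


def copresence_edges(scene_character_sets):
    """Count, for each unordered character pair, the scenes in which both appear."""
    edges = Counter()
    for chars in scene_character_sets:
        # Sorting gives each pair one canonical orientation.
        for a, b in combinations(sorted(chars), 2):
            edges[(a, b)] += 1
    return edges


scenes = [{"Naruto", "Jiraiya"}, {"Naruto", "Jiraiya", "Tsunade"}, {"Tsunade"}]
edges = copresence_edges(scenes)
print(edges)
```

A scene with a single character contributes no edges, which is why the last scene in the sample leaves the counts unchanged.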

Performance Notes¶

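Regex compilation: detect_mentions builds each pattern string on every call; Python's re module caches recently compiled patterns internally, so repeated calls can reuse them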
Speed: Matching 1000 dialogue lines against 87 character aliases takes ~50ms
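
If relying on the `re` module's internal cache is not enough, for example with a very large alias list, the patterns can be compiled explicitly once and reused for every line. The helpers below are a sketch of that approach with made-up character IDs; they are not part of the module and no timing claim is attached to them.

```python
import re


def compile_alias_index(alias_index):
    """Compile one word-boundary pattern per (alias, character_id) pair."""
    return [
        (re.compile(r"\b" + re.escape(alias) + r"\b", flags=re.IGNORECASE), alias, cid)
        for alias, cid in alias_index
    ]


def find_all(text, compiled):
    """Return (alias, character_id, start, end) for every match in text."""
    return [
        (alias, cid, m.start(), m.end())
        for pattern, alias, cid in compiled
        for m in pattern.finditer(text)
    ]


compiled = compile_alias_index([("Pervy Sage", "jiraiya"), ("Naruto", "naruto")])
lines = ["Naruto, wait!", "Pervy Sage taught Naruto."]
hits = [find_all(line, compiled) for line in lines]
print(hits)
```

The compiled list is built once per alias index and can be shared across all dialogue lines of an episode.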

Customization¶

Adding New Characters¶

Edit data/character_aliases.json:

{
  "Kakashi Hatake": [
    "Kakashi",
    "Kakashi-sensei",
    "Copy Ninja",
    "Sixth Hokage",
    "Kakashi of the Sharingan"
  ]
}

Then reload:

alias_dict = load_alias_dict('data/character_aliases.json')

Custom Confidence Scoring¶

Override the default heuristic:

def custom_confidence(alias: str, canonical: str) -> float:
    """Custom confidence scoring logic."""
    if canonical.lower() in alias.lower():
        return 1.0  # Alias contains name
    elif len(alias.split()) > 1:
        return 0.8  # Multi-word nickname
    else:
        return 0.5  # Single-word title

# Apply when building mentions
mention['confidence'] = custom_confidence(alias, character)

Related Modules¶

Normalize: provides clean text for detection
Segment: provides scene context for mentions
Build: uses detected characters to construct edges
