Detect Module

The detect module provides character mention detection using alias-based regex matching.


Overview

This module handles:

  • Loading character alias dictionaries
  • Building word-boundary regex patterns
  • Detecting character mentions in dialogue text
  • Assigning confidence scores to matches

Functions

load_characters_csv(path)

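Loads characters from a CSV file into a list of Character objects. The file is read as UTF-8 (a leading BOM is tolerated), every field is stripped of surrounding whitespace, and the aliases field is split on commas into a list.

Expected columns (exact): character_id, character_name, aliases, affiliation_primary, affiliation_detail, affiliation_changes, first_appearance_arc, role_type, estimated_importance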

Source code in src/naruto_net/detect/mentions.py
def load_characters_csv(path: Path) -> list[Character]:
    """
    Expected columns (exact):
    character_id, character_name, aliases, affiliation_primary, affiliation_detail,
    affiliation_changes, first_appearance_arc, role_type, estimated_importance
    """
    out: list[Character] = []
    with path.open("r", encoding="utf-8-sig", newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            aliases_raw = (row.get("aliases") or "").strip()
            aliases = [a.strip() for a in aliases_raw.split(",") if a.strip()]

            out.append(Character(
                character_id=(row.get("character_id") or "").strip(),
                character_name=(row.get("character_name") or "").strip(),
                aliases=aliases,
                affiliation_primary=(row.get("affiliation_primary") or "").strip(),
                affiliation_detail=(row.get("affiliation_detail") or "").strip(),
                affiliation_changes=(row.get("affiliation_changes") or "").strip(),
                first_appearance_arc=(row.get("first_appearance_arc") or "").strip(),
                role_type=(row.get("role_type") or "").strip(),
                estimated_importance=(row.get("estimated_importance") or "").strip(),
            ))
    return out
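
Because the aliases field is itself comma-separated, it has to be quoted in the CSV. The sketch below illustrates the splitting rule with the standard library only. The sample rows, the ids, and the helper name `parse_aliases` are illustrative assumptions, and only three of the nine expected columns are shown.

```python
import csv
import io

# Illustrative sample data (not from the project); the aliases field is quoted
# because it contains commas.
SAMPLE = (
    "character_id,character_name,aliases\n"
    'jiraiya,Jiraiya,"Pervy Sage, Toad Sage, ,Ero-sennin"\n'
    "tsunade,Tsunade,\n"
)

def parse_aliases(raw):
    """Split a comma-separated alias field and drop blank entries (hypothetical helper)."""
    return [a.strip() for a in (raw or "").strip().split(",") if a.strip()]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
aliases_by_id = {row["character_id"]: parse_aliases(row["aliases"]) for row in rows}
print(aliases_by_id)
```

Blank entries and stray whitespace are discarded, and an empty aliases field yields an empty list.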

build_alias_index(characters)

Return (alias, character_id) pairs sorted longest-first (prevents shadowing).

Source code in src/naruto_net/detect/mentions.py
def build_alias_index(characters: list[Character]) -> list[tuple[str, str]]:
    """Return (alias, character_id) pairs sorted longest-first (prevents shadowing)."""
    pairs: list[tuple[str, str]] = []
    for ch in characters:
        if ch.character_name:
            pairs.append((ch.character_name, ch.character_id))
        for a in ch.aliases:
            pairs.append((a, ch.character_id))
    pairs = [(a, cid) for a, cid in pairs if a and cid]
    pairs.sort(key=lambda x: len(x[0]), reverse=True)
    return pairs
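
A worked example of the resulting order may help. This sketch uses a stand-in Character dataclass with only the fields the index needs and a copy of the logic listed above. The ids ("tsunade", "jiraiya") are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Character:
    """Stand-in with only the fields used by the index (assumption, not the real class)."""
    character_id: str
    character_name: str
    aliases: list = field(default_factory=list)

def build_alias_index(characters):
    """Copy of the listed logic: name and aliases become (alias, id) pairs, longest first."""
    pairs = []
    for ch in characters:
        if ch.character_name:
            pairs.append((ch.character_name, ch.character_id))
        pairs.extend((a, ch.character_id) for a in ch.aliases)
    pairs = [(a, cid) for a, cid in pairs if a and cid]
    pairs.sort(key=lambda p: len(p[0]), reverse=True)  # stable: ties keep input order
    return pairs

index = build_alias_index([
    Character("tsunade", "Tsunade", ["Lady Hokage", "Hokage"]),
    Character("jiraiya", "Jiraiya", ["Pervy Sage"]),
])
print([alias for alias, _ in index])
```

The canonical name is indexed alongside the aliases, and aliases of equal length keep the order in which they were read.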

detect_mentions(text, utterance_id, alias_index)

Deterministic alias matching with word boundaries + simple confidence heuristic.

Source code in src/naruto_net/detect/mentions.py
def detect_mentions(text: str, utterance_id: str, alias_index: list[tuple[str, str]]) -> list[Mention]:
    """Deterministic alias matching with word boundaries + simple confidence heuristic."""
    mentions: list[Mention] = []
    for alias, cid in alias_index:
        pat = r"\b" + re.escape(alias) + r"\b"
        for m in re.finditer(pat, text, flags=re.IGNORECASE):
            L = len(alias)
            confidence = 1.0 if L >= 6 else 0.85 if L >= 4 else 0.7
            mentions.append(Mention(
                utterance_id=utterance_id,
                character_id=cid,
                alias_matched=alias,
                start_char=m.start(),
                end_char=m.end(),
                confidence=confidence,
            ))
    return mentions
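
To see how the \b-wrapped, case-insensitive pattern behaves, the following sketch rebuilds it outside the module. The helper names and the alias "Jr." are hypothetical. It also shows an edge case: \b needs a word character on one side, so an alias that ends in punctuation does not match before a space. The lookaround variant is one possible workaround, not something the module is known to do.

```python
import re

def alias_pattern(alias: str) -> re.Pattern:
    """Same construction as in detect_mentions: escaped alias between word boundaries."""
    return re.compile(r"\b" + re.escape(alias) + r"\b", re.IGNORECASE)

def lookaround_pattern(alias: str) -> re.Pattern:
    """Hypothetical alternative: forbid word characters on either side of the alias."""
    return re.compile(r"(?<!\w)" + re.escape(alias) + r"(?!\w)", re.IGNORECASE)

sage = alias_pattern("Sage")
samples = ["the sage spoke", "Sage Mode", "Sagebrush", "Message"]
results = {s: bool(sage.search(s)) for s in samples}
print(results)

# An alias ending in punctuation: "." followed by a space is not a \b position.
text = "Ask Jr. about it"
print(alias_pattern("Jr.").search(text))       # no match
print(lookaround_pattern("Jr.").search(text))  # matches "Jr."
```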

Usage Examples

Basic Character Detection

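import json

from naruto_net.detect.mentions import detect_mentions

# Load alias mapping (canonical name -> list of aliases)
with open('data/character_aliases.json', encoding='utf-8') as f:
    alias_dict = json.load(f)

# detect_mentions expects (alias, character_id) pairs, longest alias first.
# The canonical name serves as the character_id here; when starting from the
# characters CSV, use build_alias_index(load_characters_csv(path)) instead.
alias_index = sorted(
    ((alias, name) for name, aliases in alias_dict.items() for alias in aliases),
    key=lambda pair: len(pair[0]),
    reverse=True,
)

# Detect characters in dialogue
text = "I will avenge Pervy Sage!"
mentions = detect_mentions(text, utterance_id="demo-1", alias_index=alias_index)

for mention in mentions:
    print(f"Found: {mention.character_id} (alias: '{mention.alias_matched}', confidence: {mention.confidence})")
# Output with the alias dictionary shown under "Alias Dictionary Format":
# Found: Jiraiya (alias: 'Pervy Sage', confidence: 1.0)

Processing Multiple Events

from naruto_net.io.subtitles import AssReader

reader = AssReader('episode_014.ass')
events = reader.read_events()

all_mentions = []
for i, event in enumerate(events):
    # The utterance_id format is a placeholder; use whatever ids your pipeline assigns
    for mention in detect_mentions(event.text, utterance_id=f"ep014-{i}", alias_index=alias_index):
        # Keep the fields used later, plus the event timing
        all_mentions.append({
            'character': mention.character_id,
            'alias_matched': mention.alias_matched,
            'confidence': mention.confidence,
            'start_time': event.start_time,
            'end_time': event.end_time,
        })

print(f"Total mentions: {len(all_mentions)}")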

Filtering by Confidence

# Only high-confidence matches
high_conf = [m for m in all_mentions if m['confidence'] >= 0.8]

print(f"High-confidence mentions: {len(high_conf)} / {len(all_mentions)}")

Alias Dictionary Format

The alias dictionary is a JSON file mapping canonical names to lists of aliases:

{
  "Naruto Uzumaki": [
    "Naruto",
    "Naruto-kun",
    "Naruto-boy",
    "Nine-Tails",
    "Seventh Hokage"
  ],
  "Jiraiya": [
    "Jiraiya",
    "Pervy Sage",
    "Ero-sennin",
    "Toad Sage",
    "Lord Jiraiya"
  ],
  "Tsunade": [
    "Tsunade",
    "Lady Hokage",
    "Fifth Hokage",
    "Lady Tsunade",
    "Granny Tsunade"
  ]
}

Key requirements:

  • Canonical name is the key (used in network nodes)
  • Aliases are matched case-insensitively
  • Alias order in the file does not matter: aliases are sorted longest-first before matching

Matching Logic

Word-Boundary Regex

Each alias is escaped with re.escape and wrapped in \b word boundaries; matching is case-insensitive:

# For alias "Pervy Sage"
pattern = r"\b" + re.escape("Pervy Sage") + r"\b"

# Matches
"I will avenge Pervy Sage!"  # ✓
"Where is Pervy Sage?"       # ✓

# Doesn't match
"Pervysage"                  # ✗ (no space between the words)
"Pervy Sagebrush"            # ✗ (boundary at start only)

Why word boundaries?

Prevents false positives from partial matches:

# Without word boundaries
"Sage" would also match inside "Sagebrush" and "Message"

# With word boundaries
"Sage" only matches as a standalone word (this still includes "Sage Mode")

Longest-First Matching

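Aliases are sorted by length (longest first), so the most specific alias is tried first:

# Alias list for Tsunade
["Lady Hokage", "Hokage", "Lady Tsunade", "Tsunade"]

# Sorted for matching (by character count; ties keep their original order)
["Lady Tsunade", "Lady Hokage", "Tsunade", "Hokage"]

# Text: "So what should we do, Lady Hokage?"
# "Lady Hokage" is reported first. As listed above, detect_mentions scans every
# alias independently, so the shorter "Hokage" is also reported for the
# overlapping span; filter overlaps downstream if only the longest match is wanted.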
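
If only the longest match per span is wanted, one option is to drop any match whose span lies entirely inside a match that was kept earlier. The sketch below is an assumption about how such a filter could look, not part of the module. `find_spans` mimics the matching loop with plain tuples, and `drop_contained` is a hypothetical helper.

```python
import re

def find_spans(text, aliases):
    """Collect (start, end, alias) for every alias, longest alias first."""
    spans = []
    for alias in sorted(aliases, key=len, reverse=True):
        pat = r"\b" + re.escape(alias) + r"\b"
        spans += [(m.start(), m.end(), alias)
                  for m in re.finditer(pat, text, flags=re.IGNORECASE)]
    return spans

def drop_contained(spans):
    """Keep a span only if it is not fully inside a span kept before it."""
    kept = []
    for start, end, alias in spans:  # longer aliases come first
        if not any(ks <= start and end <= ke for ks, ke, _ in kept):
            kept.append((start, end, alias))
    return kept

text = "So what should we do, Lady Hokage?"
spans = find_spans(text, ["Hokage", "Lady Hokage"])
kept = drop_contained(spans)
print([alias for _, _, alias in spans])  # both aliases are found
print([alias for _, _, alias in kept])   # only the longer one survives
```

A separate, non-overlapping occurrence of "Hokage" elsewhere in the text would still be kept.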

Confidence Scoring

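Each match gets a confidence score based only on the length of the matched alias, following the detect_mentions source listed above:

Alias length (characters)   Confidence   Example
6 or more                   1.0          "Pervy Sage", "Naruto"
4 or 5                      0.85         "Sage"
3 or fewer                  0.7          "Lee" (hypothetical short alias)

Heuristic rules:

  • len(alias) >= 6 → 1.0
  • len(alias) >= 4 → 0.85
  • otherwise → 0.7

The score does not consider whether the alias is a canonical name, a nickname, or a generic title: "Lady Hokage" (11 characters) scores 1.0 even though it could refer to more than one character.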

Future improvement: Learn confidence from training data.
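
For reference, the length-based rule in the detect_mentions source listed earlier can be isolated as a small function. The name `length_confidence` is hypothetical; the thresholds are the ones in that listing.

```python
def length_confidence(alias: str) -> float:
    """Confidence from alias length alone, mirroring the listed detect_mentions source."""
    n = len(alias)
    if n >= 6:
        return 1.0
    if n >= 4:
        return 0.85
    return 0.7

# "Lee" is a hypothetical short alias used only to exercise the lowest tier
for alias in ["Pervy Sage", "Lady Hokage", "Sage", "Lee"]:
    print(alias, length_confidence(alias))
```

Spaces and hyphens count toward the length, so multi-word titles land in the top tier regardless of how specific they are.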


Validation

Check for Shadowing

Shadowing occurs when a shorter alias prevents a longer, more specific alias from matching:

# Bad alias order
aliases = ["Sage", "Pervy Sage"]  # ❌ "Sage" matches first!

# Good alias order
aliases = ["Pervy Sage", "Sage"]  # ✓ "Pervy Sage" matches first

The build_alias_index() function sorts aliases longest-first automatically, so the order in the input does not matter.

Inspect Match Frequency

from collections import Counter

# Count how often each alias is matched
alias_counts = Counter(m['alias_matched'] for m in all_mentions)

print("Top 10 most common aliases:")
for alias, count in alias_counts.most_common(10):
    print(f"  {alias}: {count}")

Known Limitations

Speaker Disambiguation

We detect mentions, not speakers:

text = "Naruto, you're an idiot!"
# Detects: Naruto is present
# Does NOT detect: Who said it (could be Sakura, Sasuke, anyone)

Narrative vs Visual Presence

text = "Naruto is in the village right now."
# Detects: Naruto mention
# Issue: Speaker is talking ABOUT Naruto, not TO him
# We don't distinguish whether Naruto is visually on-screen

Ambiguous Titles

text = "The Hokage will decide."
# Could be: Tsunade (Fifth), Kakashi (Sixth), or historical reference
# Current: if "Hokage" is listed as an alias, it scores 1.0 (6 characters);
# the length heuristic does not lower confidence for ambiguous titles
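
One way to surface such titles before matching is to look for aliases that map to more than one character in the alias index. This is a sketch under the assumption that the index is a list of (alias, character_id) pairs, as returned by build_alias_index. The function name and the sample ids are hypothetical.

```python
from collections import defaultdict

def ambiguous_aliases(alias_index):
    """Return aliases (lower-cased, since matching ignores case) owned by several ids."""
    owners = defaultdict(set)
    for alias, character_id in alias_index:
        owners[alias.lower()].add(character_id)
    return {alias: sorted(ids) for alias, ids in owners.items() if len(ids) > 1}

# Illustrative index: "Hokage" is listed for two characters
index = [
    ("Tsunade", "tsunade"),
    ("Hokage", "tsunade"),
    ("Hokage", "kakashi"),
    ("Kakashi", "kakashi"),
]
collisions = ambiguous_aliases(index)
print(collisions)
```

Aliases reported here could be removed, given a lower score downstream, or resolved with episode-level context.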

Integration with Pipeline

Detection happens after segmentation:

# 1. Parse & normalize
events = AssReader('episode.ass').read_events()
for event in events:
    event.text = strip_ass_tags(event.text)

# 2. Segment
scenes = segment_scenes(events)

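# 3. Detect characters in each scene
with open('character_aliases.json', encoding='utf-8') as f:
    alias_dict = json.load(f)
# (alias, character_id) pairs, longest alias first; canonical name used as the id
alias_index = sorted(
    ((alias, name) for name, aliases in alias_dict.items() for alias in aliases),
    key=lambda pair: len(pair[0]),
    reverse=True,
)

for scene in scenes:
    scene_characters = set()
    for i, event in enumerate(scene.events):
        # utterance_id is a placeholder here; pass your own event ids if available
        mentions = detect_mentions(event.text, utterance_id=str(i), alias_index=alias_index)
        scene_characters.update(m.character_id for m in mentions)

    scene.characters = list(scene_characters)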

# 4. Build edges from scene co-presence
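
Step 4 belongs to the Build module, whose code is not shown here. As a rough illustration of what "edges from scene co-presence" can mean, the sketch below counts how many scenes each pair of characters shares. The function name and the weighting scheme are assumptions, not the project's actual implementation.

```python
from collections import Counter
from itertools import combinations

def copresence_edges(scene_character_lists):
    """Count, for each unordered character pair, the scenes in which both appear."""
    edges = Counter()
    for characters in scene_character_lists:
        # sorted(set(...)) gives each pair a single canonical orientation
        for a, b in combinations(sorted(set(characters)), 2):
            edges[(a, b)] += 1
    return edges

scenes = [
    ["Naruto Uzumaki", "Jiraiya"],
    ["Jiraiya", "Tsunade", "Naruto Uzumaki"],
    ["Tsunade"],
]
edges = copresence_edges(scenes)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```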

Performance Notes

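  • Regex compilation: detect_mentions builds each pattern string per call and relies on the re module's internal cache of compiled patterns; precompiling once per alias avoids depending on that cache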
  • Speed: Matching 1000 dialogue lines against 87 character aliases takes ~50ms
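
If regex compilation ever shows up in profiling, the patterns can be compiled once and reused across all dialogue lines. The following is a minimal sketch of that idea with hypothetical helper names. It returns plain tuples instead of Mention objects to stay self-contained.

```python
import re

def compile_alias_patterns(alias_index):
    """Compile one case-insensitive, word-bounded pattern per (alias, character_id) pair."""
    return [
        (re.compile(r"\b" + re.escape(alias) + r"\b", re.IGNORECASE), alias, cid)
        for alias, cid in alias_index
    ]

def find_all(text, compiled):
    """Return (character_id, alias, start, end) for every match of every pattern."""
    return [
        (cid, alias, m.start(), m.end())
        for pattern, alias, cid in compiled
        for m in pattern.finditer(text)
    ]

compiled = compile_alias_patterns([
    ("Pervy Sage", "Jiraiya"),
    ("Naruto", "Naruto Uzumaki"),
])
hits = find_all("Naruto, where is pervy sage?", compiled)
print(hits)
```

Results follow the order of the alias index, not the order of appearance in the text.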

Customization

Adding New Characters

Edit data/character_aliases.json:

{
  "Kakashi Hatake": [
    "Kakashi",
    "Kakashi-sensei",
    "Copy Ninja",
    "Sixth Hokage",
    "Kakashi of the Sharingan"
  ]
}

Then reload the file (and rebuild any alias index derived from it):

with open('data/character_aliases.json', encoding='utf-8') as f:
    alias_dict = json.load(f)

Custom Confidence Scoring

detect_mentions does not accept a scoring callback, so a custom heuristic is applied to the results after detection:

def custom_confidence(alias: str, canonical: str) -> float:
    """Custom confidence scoring logic."""
    if canonical.lower() in alias.lower():
        return 1.0  # Alias contains name
    elif len(alias.split()) > 1:
        return 0.8  # Multi-word nickname
    else:
        return 0.5  # Single-word title

# Apply after detection, e.g. to the dicts collected in all_mentions
for m in all_mentions:
    m['confidence'] = custom_confidence(m['alias_matched'], m['character'])

Related Modules

  • Normalize: Provides clean text for detection
  • Segment: Provides scene context for mentions
  • Build: Uses detected characters to construct edges