QC Module

The qc module provides quality control and validation reporting for the subtitle processing pipeline.


Overview

This module handles:

  • Generating per-episode processing reports
  • Validating alias match frequency
  • Detecting potential alias shadowing issues
  • Summarizing pipeline coverage

Functions

alias_qc(mentions_df, *, top_n=50)

Summarize alias matches to support ambiguity debugging.

Source code in src/naruto_net/qc/reports.py
def alias_qc(mentions_df: pd.DataFrame, *, top_n: int = 50) -> pd.DataFrame:
    """Summarize alias matches to support ambiguity debugging."""
    if mentions_df.empty:
        return pd.DataFrame(columns=["alias_matched", "count"])
    c = Counter(mentions_df["alias_matched"].astype(str).tolist())
    rows = [{"alias_matched": a, "count": n} for a, n in c.most_common(top_n)]
    return pd.DataFrame(rows)
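
A minimal usage sketch, assuming the mentions table holds one row per detected mention with an alias_matched column, as the source above implies. The function body is repeated only so the snippet runs on its own, and the sample rows are invented for illustration.

```python
from collections import Counter

import pandas as pd


def alias_qc(mentions_df: pd.DataFrame, *, top_n: int = 50) -> pd.DataFrame:
    """Copy of the function shown above, repeated so this example is self-contained."""
    if mentions_df.empty:
        return pd.DataFrame(columns=["alias_matched", "count"])
    c = Counter(mentions_df["alias_matched"].astype(str).tolist())
    rows = [{"alias_matched": a, "count": n} for a, n in c.most_common(top_n)]
    return pd.DataFrame(rows)


# Invented sample data: one row per detected mention
sample_mentions = pd.DataFrame({
    "alias_matched": ["Naruto", "Jiraiya", "Naruto", "Pervy Sage", "Jiraiya", "Naruto"],
})

# Keep only the two most frequent aliases
top_aliases = alias_qc(sample_mentions, top_n=2)
print(top_aliases)
```

An empty input returns an empty frame with the same two columns, so downstream code can rely on the column names in either case.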

Usage Examples

Episode-Level QC Report

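import pandas as pd

# process_episode_with_stats stands in for your own pipeline wrapper (not defined in this module);
# it is assumed to return a dict with 'events', 'scenes', 'characters' and 'edges' counts.

# After processing multiple episodes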
episode_stats = []
for episode in [14, 15, 16]:
    stats = process_episode_with_stats(episode)
    episode_stats.append({
        'episode': episode,
        'events_parsed': stats['events'],
        'scenes_detected': stats['scenes'],
        'characters_found': stats['characters'],
        'edges_created': stats['edges']
    })

# Generate report
report = pd.DataFrame(episode_stats)
report.to_csv('data/reports/episode_qc.csv', index=False)

print(report)
#    episode  events_parsed  scenes_detected  characters_found  edges_created
# 0       14            160               55                 8              5
# 1       15            182               47                12              9
# 2       16            201               52                15             12

Alias Match Frequency Report

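from collections import Counter

import pandas as pd

# After detecting characters across all episodes
all_mentions = []  # List of mention dicts with 'alias_matched' and 'character' keys

# Count alias usage
alias_counts = Counter(m['alias_matched'] for m in all_mentions)

# Map each alias to the character it resolved to (first occurrence wins)
alias_to_character = {}
for m in all_mentions:
    alias_to_character.setdefault(m['alias_matched'], m['character'])

# Generate report (explicit columns keep the sort below safe when there are no mentions)
alias_report = pd.DataFrame(
    [
        {'alias': alias, 'count': count, 'character': alias_to_character[alias]}
        for alias, count in alias_counts.items()
    ],
    columns=['alias', 'count', 'character'],
)

alias_report = alias_report.sort_values('count', ascending=False).reset_index(drop=True)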
alias_report.to_csv('data/reports/alias_qc.csv', index=False)

print(alias_report.head(10))
#         alias  count         character
# 0      Naruto    342   Naruto Uzumaki
# 1     Tsunade    178          Tsunade
# 2     Jiraiya    156          Jiraiya
# 3        Pain     89  Pain (Nagato)
# 4  Pervy Sage     67          Jiraiya

Detecting Alias Shadowing

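def detect_shadowing(alias_dict):
    """Find aliases of the same character where one contains the other."""
    issues = []

    for character, aliases in alias_dict.items():
        for i, alias_a in enumerate(aliases):
            for alias_b in aliases[i+1:]:
                # Order the pair by length so containment is checked in both directions
                shorter, longer = sorted((alias_a, alias_b), key=len)
                if shorter.lower() in longer.lower():
                    issues.append({
                        'character': character,
                        'shorter': shorter,
                        'longer': longer,
                        'issue': f"'{shorter}' may shadow '{longer}'"
                    })

    # Explicit columns keep the report shape stable when nothing is found
    return pd.DataFrame(issues, columns=['character', 'shorter', 'longer', 'issue'])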

# Run check
shadowing_report = detect_shadowing(alias_dict)
if not shadowing_report.empty:
    print("⚠️ Potential shadowing issues:")
    print(shadowing_report)
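
The check above compares aliases belonging to the same character only. Overlaps between different characters can also cause misattribution, so a companion check may be useful. The sketch below is an assumption, not part of the module: detect_cross_character_overlap and its output columns are hypothetical, and the alias dictionary is invented sample data using the same {character: [aliases]} shape.

```python
import pandas as pd


def detect_cross_character_overlap(alias_dict):
    """Hypothetical helper: flag aliases contained in another character's alias."""
    # Flatten to (character, alias) pairs so every alias can be compared with every other
    entries = [(char, alias) for char, aliases in alias_dict.items() for alias in aliases]
    issues = []
    for i, (char_a, alias_a) in enumerate(entries):
        for char_b, alias_b in entries[i + 1:]:
            if char_a == char_b:
                continue  # same-character pairs are covered by detect_shadowing
            # Order the pair by alias length so containment is tested in both directions
            (short_char, short), (long_char, long_) = sorted(
                [(char_a, alias_a), (char_b, alias_b)], key=lambda pair: len(pair[1])
            )
            if short.lower() in long_.lower():
                issues.append({
                    "alias": short,
                    "character": short_char,
                    "contained_in": long_,
                    "other_character": long_char,
                })
    return pd.DataFrame(issues, columns=["alias", "character", "contained_in", "other_character"])


# Invented sample dictionary
sample_aliases = {
    "Naruto Uzumaki": ["Naruto", "Uzumaki"],
    "Kushina Uzumaki": ["Kushina", "Kushina Uzumaki"],
}
overlaps = detect_cross_character_overlap(sample_aliases)
print(overlaps)
```

Here the bare surname "Uzumaki" is reported because it also occurs inside another character's full name. Whether that is a real problem depends on how the matcher resolves overlapping hits.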

QC Metrics

Per-Episode Metrics

Track these for each processed episode:

Metric              Description                  Expected Range
events_parsed       Dialogue lines extracted     100-300
scenes_detected     Scene boundaries found       30-60
characters_found    Unique characters detected   5-20
edges_created       Co-appearance edges          10-50
avg_scene_duration  Mean scene length (seconds)  10-30

Red flags:

  • scenes_detected < 10 → Subtitle file may be corrupted
  • characters_found == 0 → Alias matching failed
  • edges_created == 0 → No co-appearances (check scene segmentation)
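
The red flags above can be checked automatically against the episode report. This is a sketch under assumptions: flag_episodes is a hypothetical helper, the report is assumed to use the column names from the Episode-Level QC Report example, and the sample rows are invented.

```python
import pandas as pd


def flag_episodes(report: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: list every red-flag rule that an episode triggers."""
    checks = [
        ("scenes_detected < 10", report["scenes_detected"] < 10,
         "Subtitle file may be corrupted"),
        ("characters_found == 0", report["characters_found"] == 0,
         "Alias matching failed"),
        ("edges_created == 0", report["edges_created"] == 0,
         "No co-appearances (check scene segmentation)"),
    ]
    flagged = []
    for rule, mask, hint in checks:
        # One output row per (episode, triggered rule)
        for episode in report.loc[mask, "episode"]:
            flagged.append({"episode": episode, "rule": rule, "hint": hint})
    return pd.DataFrame(flagged, columns=["episode", "rule", "hint"])


# Invented sample report: episode 15 trips all three rules
report = pd.DataFrame({
    "episode": [14, 15],
    "scenes_detected": [55, 4],
    "characters_found": [8, 0],
    "edges_created": [5, 0],
})
flags = flag_episodes(report)
print(flags)
```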

Alias Quality Metrics

Track alias match patterns:

Metric               Description                 Action Threshold
match_count          Times alias was matched     == 0 → Review (every alias should match at least once)
confidence_avg       Average confidence score    < 0.6 → Review alias specificity
false_positive_rate  Manual validation sampling  > 5% → Refine alias
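
The first two metrics can be derived from the mentions table. The sketch below assumes each mention row carries alias_matched and a numeric confidence column; the confidence column, the alias_quality helper, and the sample data are assumptions made for illustration. false_positive_rate is left out because it depends on manual sampling.

```python
import pandas as pd


def alias_quality(mentions_df, alias_dict, min_confidence=0.6):
    """Hypothetical helper: per-alias match_count and confidence_avg with review flags."""
    # Every alias in the dictionary, deduplicated, so unmatched aliases still get a row
    all_aliases = list(dict.fromkeys(a for aliases in alias_dict.values() for a in aliases))
    stats = (
        mentions_df.groupby("alias_matched")["confidence"]
        .agg(match_count="count", confidence_avg="mean")
        .reindex(all_aliases)
    )
    # Aliases that never matched come back as NaN after reindexing
    stats["match_count"] = stats["match_count"].fillna(0).astype(int)
    stats["never_matched"] = stats["match_count"] == 0
    stats["low_confidence"] = stats["confidence_avg"] < min_confidence
    return stats.rename_axis("alias").reset_index()


# Invented sample data
mentions = pd.DataFrame({
    "alias_matched": ["Naruto", "Naruto", "Sage", "Sage"],
    "confidence": [0.9, 0.8, 0.5, 0.4],
})
aliases = {"Naruto Uzumaki": ["Naruto"], "Jiraiya": ["Sage", "Pervy Sage"]}

quality = alias_quality(mentions, aliases)
print(quality)
```

In this sample, "Sage" is flagged for low average confidence and "Pervy Sage" for never matching.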

Validation Workflows

Pre-Processing Validation

Before running the full pipeline:

# Check alias dictionary completeness
alias_dict = load_alias_dict('character_aliases.json')

expected_characters = pd.read_csv('data/chunin_exams_characters.csv')['name'].tolist()
missing = set(expected_characters) - set(alias_dict.keys())

if missing:
    print(f"⚠️ {len(missing)} characters missing from alias dict:")
    print(missing)

Post-Processing Validation

After processing:

# Sanity check: Naruto should be most connected
edge_df = pd.read_csv('data/processed/edges.csv')

degree_counts = pd.concat([
    edge_df['character_a'].value_counts(),
    edge_df['character_b'].value_counts()
], axis=1).sum(axis=1)

top_character = degree_counts.idxmax()

assert 'Naruto' in top_character, f"Expected Naruto most connected, got {top_character}"
print(f"✓ Sanity check passed: {top_character} is most connected")

Common Issues Detected

Issue: No Characters Detected

Symptoms: characters_found == 0 in episode QC report

Causes:

  1. Alias dictionary not loaded correctly
  2. Subtitle encoding issues (characters garbled)
  3. ASS tags not stripped (interfering with regex)

Debug:

# Print raw vs cleaned text
for event in events[:5]:
    print(f"Raw: {event.text}")
    cleaned = strip_ass_tags(event.text)
    print(f"Clean: {cleaned}")
    print()
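
For cause 2, it can help to confirm how the subtitle file is encoded before parsing it. The sketch below tries a short list of candidate encodings and reports the first one that decodes without error. guess_encoding and the candidate list are assumptions, not part of the pipeline, and a clean decode does not prove the guess is right, so inspect the decoded text as well.

```python
import tempfile
from pathlib import Path


def guess_encoding(path, candidates=("utf-8-sig", "utf-16")):
    """Hypothetical helper: first candidate encoding that decodes the file cleanly."""
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeError:
            continue  # this candidate cannot represent the bytes; try the next one
        return encoding
    return None  # none of the candidates worked


# Demo: write a one-line UTF-16 subtitle file and detect its encoding
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "episode.ass"
    sample.write_text(
        "Dialogue: 0,0:00:01.00,0:00:03.00,Default,,0,0,0,,Naruto!",
        encoding="utf-16",
    )
    detected = guess_encoding(sample)

print(detected)
```

Candidates are tried in order, so the stricter encoding is listed first. A permissive single-byte encoding placed early would accept almost any file and hide the problem.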

Issue: Extremely High Edge Count

Symptoms: edges_created >> expected (e.g., 500 edges from 50 scenes)

Causes:

  1. Scene segmentation threshold too loose (too many small scenes)
  2. Alias matching too permissive (false positives)

Debug:

import numpy as np

# Check scene size distribution
scene_sizes = [len(scene.events) for scene in scenes]

print(f"Min events/scene: {min(scene_sizes)}")
print(f"Max events/scene: {max(scene_sizes)}")
print(f"Mean events/scene: {np.mean(scene_sizes):.1f}")

# Many single-event scenes → gap threshold is probably too small
if min(scene_sizes) == 1 and np.mean(scene_sizes) < 3:
    print("⚠️ Many single-event scenes, consider increasing gap_threshold")
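
To judge whether an edge count is plausible, it helps to know the ceiling. Assuming the pipeline creates at most one edge per pair of characters sharing a scene, a scene with n characters contributes at most n(n-1)/2 edges. The sketch below computes that bound; max_edges_per_scene is a hypothetical helper and the character counts are invented.

```python
from math import comb


def max_edges_per_scene(characters_per_scene):
    """Upper bound on co-appearance pairs: n characters give n*(n-1)/2 pairs."""
    return [comb(n, 2) for n in characters_per_scene]


# Invented counts of distinct characters detected in four scenes
counts = [2, 3, 5, 12]
pairs = max_edges_per_scene(counts)

print(pairs)       # [1, 3, 10, 66]
print(sum(pairs))  # 80
```

The bound grows quadratically, so a single scene with many detected characters can dominate the total. That makes the largest scenes the first place to look when the edge count is unexpectedly high.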

Issue: Expected Edge Missing

Symptoms: Canonical relationship (e.g., Naruto-Sasuke) not found in edges

Causes:

  1. Characters never appeared together in processed episodes
  2. One character name not in alias dictionary
  3. Alias matching failed to detect one character

Debug:

# Check if both characters were detected
chars_detected = set(
    m['character']
    for scene in scenes
    for m in scene.mentions
)

for char in ['Naruto Uzumaki', 'Sasuke Uchiha']:
    if char not in chars_detected:
        print(f"⚠️ {char} never detected in episodes")

# Check co-presence
for scene in scenes:
    chars_in_scene = set(m['character'] for m in scene.mentions)
    if 'Naruto Uzumaki' in chars_in_scene and 'Sasuke Uchiha' in chars_in_scene:
        print(f"✓ Both appeared in scene {scene.scene_id}")

Integration with Pipeline

QC typically runs after each major stage:

# 1. Parse
events = AssReader('episode.ass').read_events()
print(f"QC: Parsed {len(events)} events")

# 2. Normalize
for event in events:
    event.text = strip_ass_tags(event.text)
print("QC: Normalized text")

# 3. Segment
scenes = segment_scenes(events)
print(f"QC: Detected {len(scenes)} scenes")

# 4. Detect
all_characters = set()
for scene in scenes:
    mentions = detect_characters_in_scene(scene)
    all_characters.update(m['character'] for m in mentions)
print(f"QC: Found {len(all_characters)} unique characters")

# 5. Build
edges = build_edges_from_scenes(scenes)
print(f"QC: Created {len(edges)} edges")

# 6. Validate
assert len(edges) > 0, "No edges created!"
print("QC: Validation passed")

Exporting Reports

Episode Processing Summary

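from datetime import datetime
from pathlib import Path

import pandas as pd

report_path = Path('data/reports/episode_qc.csv')

# Save pipeline stats for the episode just processed (ep is its episode number)
stats_df = pd.DataFrame([{
    'episode': ep,
    'events': len(events),
    'scenes': len(scenes),
    'characters': len(all_characters),
    'edges': len(edges),
    'timestamp': datetime.now().isoformat()
}])

# Append to the report; write the header only when the file is first created
stats_df.to_csv(report_path, mode='a', header=not report_path.exists(), index=False)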

Alias Match Report

# Save alias usage
alias_df = pd.DataFrame([
    {'alias': alias, 'count': count}
    for alias, count in alias_counts.items()
])

alias_df.to_csv('data/reports/alias_qc.csv', index=False)

Performance Notes

  • QC overhead: <1% of total pipeline runtime
  • Reports are lightweight (CSV files < 100KB)

Related Modules

  • Build: Edges validated by QC reports
  • Detect: Alias matching validated by QC
  • Segment: Scene counts validated by QC