QC Module¶
The qc module provides quality control and validation reporting for the subtitle processing pipeline.
Overview¶
This module handles:
- Generating per-episode processing reports
- Validating alias match frequency
- Detecting potential alias shadowing issues
- Summarizing pipeline coverage
Functions¶
alias_qc(mentions_df, *, top_n=50)¶
Summarize alias matches to support ambiguity debugging.
Source code in src/naruto_net/qc/reports.py
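The source of alias_qc is not reproduced on this page. As a rough illustration of what an alias summary of this kind could compute, the sketch below counts matches per alias and flags aliases that resolved to more than one character. The function name alias_summary_sketch and the record keys alias_matched and character are assumptions for illustration; it works on a plain list of dicts rather than a DataFrame and is not the module's actual implementation.

```python
from collections import Counter, defaultdict


def alias_summary_sketch(mentions, *, top_n=50):
    """Hypothetical stand-in: count matches per alias and flag ambiguous ones."""
    counts = Counter(m["alias_matched"] for m in mentions)
    characters = defaultdict(set)
    for m in mentions:
        characters[m["alias_matched"]].add(m["character"])

    rows = []
    # most_common orders by count (descending); ties keep first-seen order
    for alias, count in counts.most_common(top_n):
        rows.append({
            "alias": alias,
            "match_count": count,
            "characters": sorted(characters[alias]),
            # An alias that resolved to several characters is ambiguous
            "ambiguous": len(characters[alias]) > 1,
        })
    return rows


# Illustrative sample data, not real pipeline output
mentions = [
    {"alias_matched": "Naruto", "character": "Naruto Uzumaki"},
    {"alias_matched": "Sensei", "character": "Kakashi Hatake"},
    {"alias_matched": "Naruto", "character": "Naruto Uzumaki"},
    {"alias_matched": "Sensei", "character": "Iruka Umino"},
    {"alias_matched": "Pervy Sage", "character": "Jiraiya"},
]
for row in alias_summary_sketch(mentions, top_n=10):
    print(row)
```

Sorting by match count puts the aliases that influence the graph most at the top, which is usually where an ambiguity review should start.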
Usage Examples¶
Episode-Level QC Report¶
import pandas as pd
from naruto_net.qc.reports import generate_episode_qc

# After processing multiple episodes
# (process_episode_with_stats is assumed to be defined elsewhere in your pipeline)
episode_stats = []
for episode in [14, 15, 16]:
    stats = process_episode_with_stats(episode)
    episode_stats.append({
        'episode': episode,
        'events_parsed': stats['events'],
        'scenes_detected': stats['scenes'],
        'characters_found': stats['characters'],
        'edges_created': stats['edges']
    })

# Generate report
report = pd.DataFrame(episode_stats)
report.to_csv('data/reports/episode_qc.csv', index=False)
print(report)
#    episode  events_parsed  scenes_detected  characters_found  edges_created
# 0       14            160               55                 8              5
# 1       15            182               47                12              9
# 2       16            201               52                15             12
Alias Match Frequency Report¶
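from collections import Counter

import pandas as pd

# After detecting characters across all episodes
all_mentions = []  # List of all character mentions (dicts with 'alias_matched' and 'character')

# Count alias usage
alias_counts = Counter(m['alias_matched'] for m in all_mentions)

# Map each alias to the character it resolved to (first occurrence wins)
alias_to_character = {}
for m in all_mentions:
    alias_to_character.setdefault(m['alias_matched'], m['character'])

# Generate report (explicit columns keep the sort valid when there are no mentions)
alias_report = pd.DataFrame(
    [
        {'alias': alias, 'count': count, 'character': alias_to_character[alias]}
        for alias, count in alias_counts.items()
    ],
    columns=['alias', 'count', 'character'],
)
alias_report = alias_report.sort_values('count', ascending=False)
alias_report.to_csv('data/reports/alias_qc.csv', index=False)
print(alias_report.head(10))
#         alias  count       character
# 0      Naruto    342  Naruto Uzumaki
# 1     Tsunade    178         Tsunade
# 2     Jiraiya    156         Jiraiya
# 3        Pain     89   Pain (Nagato)
# 4  Pervy Sage     67         Jiraiya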
Detecting Alias Shadowing¶
import pandas as pd


def detect_shadowing(alias_dict):
    """Find aliases that may shadow each other."""
    issues = []
    for character, aliases in alias_dict.items():
        for i, alias_a in enumerate(aliases):
            for alias_b in aliases[i+1:]:
                # Order the pair by length so containment is caught in either direction
                shorter, longer = sorted((alias_a, alias_b), key=len)
                if shorter.lower() in longer.lower():
                    issues.append({
                        'character': character,
                        'shorter': shorter,
                        'longer': longer,
                        'issue': f"'{shorter}' may shadow '{longer}'"
                    })
    return pd.DataFrame(issues)


# Run check
shadowing_report = detect_shadowing(alias_dict)
if not shadowing_report.empty:
    print("⚠️ Potential shadowing issues:")
    print(shadowing_report)
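The check above compares aliases that belong to the same character. Containment across different characters can also cause a mention to be attributed to the wrong person, so a complementary check may be useful. The following is a minimal sketch under that assumption; the function name cross_character_collisions and the sample dictionary are hypothetical and not part of the module.

```python
from itertools import combinations


def cross_character_collisions(alias_dict):
    """Hypothetical helper: find aliases of one character contained in another's."""
    collisions = []
    for (char_a, aliases_a), (char_b, aliases_b) in combinations(alias_dict.items(), 2):
        for alias_a in aliases_a:
            for alias_b in aliases_b:
                # Order the pair by alias length, keeping track of each owner
                (short, short_owner), (long_, long_owner) = sorted(
                    [(alias_a, char_a), (alias_b, char_b)],
                    key=lambda pair: len(pair[0]),
                )
                if short.lower() in long_.lower():
                    collisions.append({
                        "shorter": short,
                        "shorter_owner": short_owner,
                        "longer": long_,
                        "longer_owner": long_owner,
                    })
    return collisions


# Illustrative dictionary, not the project's real alias file
sample_aliases = {
    "Hiruzen Sarutobi": ["Hiruzen", "Hokage"],
    "Minato Namikaze": ["Minato", "Fourth Hokage"],
}
for hit in cross_character_collisions(sample_aliases):
    print(f"'{hit['shorter']}' ({hit['shorter_owner']}) is contained in "
          f"'{hit['longer']}' ({hit['longer_owner']})")
```

The comparison is quadratic in the number of aliases, which should be acceptable for a dictionary of a few hundred entries.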
QC Metrics¶
Per-Episode Metrics¶
Track these for each processed episode:
| Metric | Description | Expected Range |
|---|---|---|
| `events_parsed` | Dialogue lines extracted | 100-300 |
| `scenes_detected` | Scene boundaries found | 30-60 |
| `characters_found` | Unique characters detected | 5-20 |
| `edges_created` | Co-appearance edges | 10-50 |
| `avg_scene_duration` | Mean scene length (seconds) | 10-30 |
Red flags:
- `scenes_detected` < 10 → Subtitle file may be corrupted
- `characters_found` == 0 → Alias matching failed
- `edges_created` == 0 → No co-appearances (check scene segmentation)
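These ranges and red flags can be checked automatically for each episode. The sketch below is one way to do it, assuming the per-episode stats are available as a dict keyed by the metric names above; the function name check_episode is hypothetical.

```python
# Expected ranges copied from the per-episode metrics table; treat them as rough guides
EXPECTED_RANGES = {
    "events_parsed": (100, 300),
    "scenes_detected": (30, 60),
    "characters_found": (5, 20),
    "edges_created": (10, 50),
    "avg_scene_duration": (10, 30),
}


def check_episode(stats):
    """Return (red_flags, out_of_range) message lists for one episode's stats dict."""
    red_flags = []
    if "scenes_detected" in stats and stats["scenes_detected"] < 10:
        red_flags.append("scenes_detected < 10: subtitle file may be corrupted")
    if stats.get("characters_found") == 0:
        red_flags.append("characters_found == 0: alias matching failed")
    if stats.get("edges_created") == 0:
        red_flags.append("edges_created == 0: no co-appearances, check scene segmentation")

    # Softer warnings: metrics that are present but outside the expected range
    out_of_range = [
        f"{name}={stats[name]} outside {low}-{high}"
        for name, (low, high) in EXPECTED_RANGES.items()
        if name in stats and not low <= stats[name] <= high
    ]
    return red_flags, out_of_range


episode_14 = {"events_parsed": 160, "scenes_detected": 55,
              "characters_found": 8, "edges_created": 5}
flags, warnings = check_episode(episode_14)
print("Red flags:", flags)
print("Out of range:", warnings)
```

Keeping red flags separate from range warnings lets a batch run stop on likely failures while only logging values that are merely unusual.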
Alias Quality Metrics¶
Track alias match patterns:
| Metric | Description | Action Threshold |
|---|---|---|
| `match_count` | Times alias was matched | == 0 → Investigate (all aliases should match at least once) |
| `confidence_avg` | Average confidence score | < 0.6 → Review alias specificity |
| `false_positive_rate` | False-positive share from manual validation sampling | > 5% → Refine alias |
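The thresholds in this table translate directly into a small review helper. This is a minimal sketch, assuming each alias has a dict of metrics with the keys shown in the table; the function name review_alias is hypothetical, and metrics that were not measured are skipped.

```python
def review_alias(metrics):
    """Return suggested actions for one alias based on the thresholds in the table."""
    actions = []
    if metrics.get("match_count") == 0:
        actions.append("never matched: every alias should match at least once")
    confidence = metrics.get("confidence_avg")
    if confidence is not None and confidence < 0.6:
        actions.append("low average confidence: review alias specificity")
    fp_rate = metrics.get("false_positive_rate")
    # The rate is expressed as a fraction here, so 5% is 0.05
    if fp_rate is not None and fp_rate > 0.05:
        actions.append("high false-positive rate: refine alias")
    return actions


# Illustrative values, not measured results
print(review_alias({"match_count": 67, "confidence_avg": 0.55, "false_positive_rate": 0.08}))
print(review_alias({"match_count": 0}))
```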
Validation Workflows¶
Pre-Processing Validation¶
Before running the full pipeline:
import pandas as pd

# Check alias dictionary completeness
alias_dict = load_alias_dict('character_aliases.json')
expected_characters = pd.read_csv('data/chunin_exams_characters.csv')['name'].tolist()
missing = set(expected_characters) - set(alias_dict.keys())
if missing:
    print(f"⚠️ {len(missing)} characters missing from alias dict:")
    print(missing)
Post-Processing Validation¶
After processing:
import pandas as pd

# Sanity check: Naruto should be most connected
edge_df = pd.read_csv('data/processed/edges.csv')
# Count how many edge rows each character appears in, on either side of the edge
degree_counts = pd.concat([
    edge_df['character_a'].value_counts(),
    edge_df['character_b'].value_counts()
], axis=1).sum(axis=1)
top_character = degree_counts.idxmax()
assert 'Naruto' in top_character, f"Expected Naruto most connected, got {top_character}"
print(f"✓ Sanity check passed: {top_character} is most connected")
Common Issues Detected¶
Issue: No Characters Detected¶
Symptoms: characters_found == 0 in episode QC report
Causes:
- Alias dictionary not loaded correctly
- Subtitle encoding issues (characters garbled)
- ASS tags not stripped (interfering with regex)
Debug:
# Print raw vs cleaned text
for event in events[:5]:
    print(f"Raw: {event.text}")
    cleaned = strip_ass_tags(event.text)
    print(f"Clean: {cleaned}")
    print()
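To rule out the third cause independently of the pipeline's own normalizer, the raw text can be run through a simple stand-in cleaner. The sketch below assumes that tags follow the usual ASS convention of override blocks in braces; the name strip_ass_tags_sketch is hypothetical and it may behave differently from the project's strip_ass_tags.

```python
import re

# ASS override tags are wrapped in braces, e.g. {\i1} or {\pos(320,40)}
ASS_TAG_RE = re.compile(r"\{[^}]*\}")


def strip_ass_tags_sketch(text):
    """Hypothetical stand-in: drop override blocks and line-break codes."""
    text = ASS_TAG_RE.sub("", text)
    # \N and \n are line breaks and \h is a hard space in ASS dialogue text
    text = re.sub(r"\\[Nnh]", " ", text)
    # Collapse any runs of whitespace left behind
    return " ".join(text.split())


print(strip_ass_tags_sketch(r"{\i1}Naruto!{\i0}\NWhere is {\b1}Sasuke{\b0}?"))
# Naruto! Where is Sasuke?
```

If names become matchable after this stand-in but not after the pipeline's normalization step, the normalizer is the likely culprit.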
Issue: Extremely High Edge Count¶
Symptoms: edges_created >> expected (e.g., 500 edges from 50 scenes)
Causes:
- Scene segmentation gap threshold too low (episode split into too many small scenes)
- Alias matching too permissive (false positives)
Debug:
import numpy as np

# Check scene size distribution
scene_sizes = [len(scene.events) for scene in scenes]
print(f"Min events/scene: {min(scene_sizes)}")
print(f"Max events/scene: {max(scene_sizes)}")
print(f"Mean events/scene: {np.mean(scene_sizes):.1f}")
# Too many tiny scenes → raise the gap threshold so nearby events merge
if min(scene_sizes) == 1 and np.mean(scene_sizes) < 3:
    print("⚠️ Many single-event scenes, consider increasing gap_threshold")
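Permissive alias matching inflates edge counts quickly because co-appearance pairs grow quadratically: a scene with k detected characters yields up to k(k-1)/2 pairs. The sketch below finds scenes that dominate the pair count, assuming the pipeline links every pair of characters detected in the same scene; the function name dominant_scenes and the 50% share threshold are illustrative choices.

```python
from math import comb


def dominant_scenes(character_counts, share_threshold=0.5):
    """Return (scene_index, n_characters, n_pairs) for scenes holding a large share of pairs."""
    pair_counts = [comb(k, 2) for k in character_counts]
    total_pairs = sum(pair_counts)
    if total_pairs == 0:
        return []
    return [
        (idx, k, pairs)
        for idx, (k, pairs) in enumerate(zip(character_counts, pair_counts))
        if pairs / total_pairs > share_threshold
    ]


# Number of distinct characters detected in each scene (illustrative values)
chars_per_scene = [2, 3, 12, 2, 4]
for idx, k, pairs in dominant_scenes(chars_per_scene):
    print(f"Scene {idx}: {k} characters -> {pairs} candidate pairs")
# Scene 2: 12 characters -> 66 candidate pairs
```

A single scene with an implausibly long character list is a good place to start looking for false-positive alias matches.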
Issue: Expected Edge Missing¶
Symptoms: Canonical relationship (e.g., Naruto-Sasuke) not found in edges
Causes:
- Characters never appeared together in processed episodes
- One character name not in alias dictionary
- Alias matching failed to detect one character
Debug:
# Check if both characters were detected
chars_detected = set(
    m['character']
    for scene in scenes
    for m in scene.mentions
)
for char in ['Naruto Uzumaki', 'Sasuke Uchiha']:
    if char not in chars_detected:
        print(f"⚠️ {char} never detected in episodes")

# Check co-presence
for scene in scenes:
    chars_in_scene = set(m['character'] for m in scene.mentions)
    if 'Naruto Uzumaki' in chars_in_scene and 'Sasuke Uchiha' in chars_in_scene:
        print(f"✓ Both appeared in scene {scene.scene_id}")
Integration with Pipeline¶
QC typically runs after each major stage:
# 1. Parse
events = AssReader('episode.ass').read_events()
print(f"QC: Parsed {len(events)} events")

# 2. Normalize
for event in events:
    event.text = strip_ass_tags(event.text)
print("QC: Normalized text")

# 3. Segment
scenes = segment_scenes(events)
print(f"QC: Detected {len(scenes)} scenes")

# 4. Detect
all_characters = set()
for scene in scenes:
    mentions = detect_characters_in_scene(scene)
    all_characters.update(m['character'] for m in mentions)
print(f"QC: Found {len(all_characters)} unique characters")

# 5. Build
edges = build_edges_from_scenes(scenes)
print(f"QC: Created {len(edges)} edges")

# 6. Validate
assert len(edges) > 0, "No edges created!"
print("QC: Validation passed")
Exporting Reports¶
Episode Processing Summary¶
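import os
from datetime import datetime

import pandas as pd

# Save pipeline stats (ep is the episode number currently being processed)
stats_df = pd.DataFrame([{
    'episode': ep,
    'events': len(events),
    'scenes': len(scenes),
    'characters': len(all_characters),
    'edges': len(edges),
    'timestamp': datetime.now().isoformat()
}])
report_path = 'data/reports/episode_qc.csv'
# Append to the report; write the header only when the file does not exist yet
stats_df.to_csv(report_path, mode='a', header=not os.path.exists(report_path), index=False)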
Alias Match Report¶
# Save alias usage
alias_df = pd.DataFrame([
    {'alias': alias, 'count': count}
    for alias, count in alias_counts.items()
])
alias_df.to_csv('data/reports/alias_qc.csv', index=False)
Performance Notes¶
- QC overhead: <1% of total pipeline runtime
- Reports are lightweight (CSV files < 100KB)
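The overhead figure can be checked on your own data by timing QC calls separately from the pipeline stages. The following is a minimal sketch using only the standard library; the StageTimer class and the placeholder workloads are hypothetical and stand in for real pipeline and QC functions.

```python
import time


class StageTimer:
    """Hypothetical helper: accumulate wall-clock time per labeled stage."""

    def __init__(self):
        self.totals = {}

    def run(self, label, func, *args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        self.totals[label] = self.totals.get(label, 0.0) + elapsed
        return result

    def share(self, label):
        """Fraction of all recorded time spent under the given label."""
        total = sum(self.totals.values())
        return self.totals.get(label, 0.0) / total if total else 0.0


timer = StageTimer()
# Placeholder workloads standing in for a pipeline stage and a QC step
data = timer.run("pipeline", lambda: [i * i for i in range(200_000)])
timer.run("qc", len, data)
print(f"QC share of runtime: {timer.share('qc'):.2%}")
```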