Normalize Module¶
The normalize module provides text cleaning utilities for subtitle data.
Overview¶
This module handles:
- Stripping ASS formatting tags ({\i1}, {\b1}, etc.)
- Normalizing newlines (\N → space)
- Splitting multi-speaker dialogue lines
- Cleaning whitespace
Functions¶
strip_ass_tags(text)¶
normalize_ass_newlines(text, *, newline=' ')¶
clean_dialogue_text(text_raw, *, newline=' ')¶
Clean ASS dialogue text for matching (no tags, normalized newlines, collapsed whitespace).
Source code in src/naruto_net/normalize/ass_text.py
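The three steps named in the description (tag removal, newline normalization, whitespace collapsing) can be sketched roughly as follows. This is a minimal stand-in, not the code in ass_text.py: the function name clean_dialogue_text_sketch and all three regular expressions are assumptions made for illustration.

```python
import re

# Assumption: override blocks are brace-delimited and never nested.
_TAG_RE = re.compile(r"\{[^{}]*\}")
# Assumption: both \N (hard break) and \n (soft break) count as line breaks.
_BREAK_RE = re.compile(r"\\[Nn]")
# Collapse runs of spaces and tabs only, so a custom `newline` survives.
_SPACE_RE = re.compile(r"[ \t]+")


def clean_dialogue_text_sketch(text_raw: str, *, newline: str = " ") -> str:
    """Hypothetical stand-in: strip tags, replace ASS breaks, collapse spaces."""
    text = _TAG_RE.sub("", text_raw)
    # A callable replacement avoids backslash-escape handling in `newline`.
    text = _BREAK_RE.sub(lambda _m: newline, text)
    return _SPACE_RE.sub(" ", text).strip()


print(clean_dialogue_text_sketch(r"{\i1}I will  avenge\N{\i0}Pervy Sage!"))
# I will avenge Pervy Sage!
```

Stripping tags before replacing breaks keeps a tag that sits between a break and the next word from leaving a stray double space.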
split_event_to_utterances(ev)¶
Convert one subtitle event into 1..N utterances (not speaker attribution).
Source code in src/naruto_net/normalize/utterances.py
Usage Examples¶
Stripping ASS Tags¶
from naruto_net.normalize.ass_text import strip_ass_tags
# Remove inline formatting tags (italic and bold in this example)
text = r"{\i1}Naruto!{\i0} You're {\b1}late{\b0} again!"
clean = strip_ass_tags(text)
print(clean)
# Output: "Naruto! You're late again!"
Normalizing Newlines¶
from naruto_net.normalize.ass_text import normalize_ass_newlines
# ASS uses \N for line breaks; a raw string keeps the backslash literal
text = r"First line\NSecond line\NThird line"
normalized = normalize_ass_newlines(text)
print(normalized)
# Output: "First line Second line Third line"
Combined Cleaning¶
from naruto_net.normalize.ass_text import strip_ass_tags, normalize_ass_newlines
def clean_subtitle_text(text: str) -> str:
    """Apply all normalization steps."""
    text = strip_ass_tags(text)
    text = normalize_ass_newlines(text)
    return text.strip()
raw = r"{\i1}I will avenge\N{\i0}Pervy Sage!"
clean = clean_subtitle_text(raw)
print(clean)
# Output: "I will avenge Pervy Sage!"
Splitting Multi-Speaker Lines¶
from naruto_net.normalize.utterances import split_multi_speaker
# Some subtitles combine multiple speakers
text = "Naruto: I'm hungry! Sakura: You're always hungry."
utterances = split_multi_speaker(text)
for speaker, line in utterances:
    print(f"{speaker}: {line}")
# Output:
# Naruto: I'm hungry!
# Sakura: You're always hungry.
ASS Formatting Tags¶
Common tags removed by strip_ass_tags():
| Tag | Meaning |
|---|---|
| {\i1}...{\i0} | Italic text |
| {\b1}...{\b0} | Bold text |
| {\u1}...{\u0} | Underline |
| {\c&HFFFFFF&} | Text color (hex) |
| {\fs24} | Font size |
| {\pos(x,y)} | Screen position |
| {\fad(in,out)} | Fade in/out |
Why remove them?
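Character detection uses regex on plain text. Formatting tags interfere with pattern matching, most visibly for patterns anchored to the start of the text or to word boundaries:
# With tags (an anchored pattern such as ^Naruto does not match)
text = r"{\i1}Naruto{\i0} is late"
# After cleaning (the same pattern matches)
text = "Naruto is late"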
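To make the interference concrete, the sketch below runs two plausible detection patterns against tagged and cleaned text. Both patterns and the inline cleaning step are assumptions for illustration, not the project's detection code.

```python
import re

tagged = r"{\i1}Naruto{\i0} is late\NSasuke left"
# Assumed cleaning: drop brace-delimited override blocks, turn \N into a space.
cleaned = re.sub(r"\{[^{}]*\}", "", tagged).replace(r"\N", " ")

starts_with_name = re.compile(r"^Naruto\b")  # name at the start of the text
whole_word = re.compile(r"\bSasuke\b")       # name as a standalone word

# On the raw text the leading tag defeats ^, and the N of \N glues onto
# "Sasuke", so no word boundary exists in front of the name.
print(bool(starts_with_name.search(tagged)))   # False
print(bool(whole_word.search(tagged)))         # False

# On the cleaned text both patterns match.
print(bool(starts_with_name.search(cleaned)))  # True
print(bool(whole_word.search(cleaned)))        # True
```

An unanchored pattern such as plain Naruto would still match the tagged text, so the damage is specific to anchors and word boundaries.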
Newline Handling¶
ASS uses \N (backslash-N) for line breaks, not \n:
# ASS format
text = r"Line 1\NLine 2" # raw string: \N is two literal characters
# After normalization
text = "Line 1 Line 2" # Converted to space
Why convert to space?
For character detection, preserving semantic meaning is more important than visual layout. Newlines within a single subtitle event usually don't indicate scene boundaries.
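The keyword-only newline parameter in the normalize_ass_newlines signature suggests the replacement string is configurable. A minimal sketch of that idea follows. The helper name normalize_breaks_sketch is hypothetical, and its handling of the soft break \n and the hard space \h is an assumption that the real function may not share.

```python
import re

# In ASS, \N is a hard line break and \n a soft one; \h is a hard space.
_BREAK_RE = re.compile(r"\\[Nn]")


def normalize_breaks_sketch(text: str, *, newline: str = " ") -> str:
    """Hypothetical helper: replace ASS break markers with `newline`."""
    # A callable replacement keeps backslashes in `newline` from being
    # interpreted as regex escapes.
    text = _BREAK_RE.sub(lambda _m: newline, text)
    return text.replace(r"\h", " ")


print(normalize_breaks_sketch(r"Line 1\NLine 2"))
# Line 1 Line 2
print(normalize_breaks_sketch(r"Line 1\NLine 2", newline=" / "))
# Line 1 / Line 2
```

A visible separator such as " / " can be useful when the original line layout needs to stay recoverable for debugging.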
Multi-Speaker Detection¶
Some fansub groups use speaker labels, for example "Naruto: I'm hungry! Sakura: You're always hungry."
The split_multi_speaker() function detects these patterns and separates them.
Supported formats:
- Name: prefix (e.g., "Naruto: Let's go!")
- Multiple speakers in one line (e.g., "A: ... B: ...")
Limitations:
- Doesn't handle unlabeled speakers (most subtitles)
- Heuristic-based (may split on false positives such as "Time: 3pm")
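The label heuristic and its false-positive limitation can be illustrated with a small sketch. The function name split_labeled_speakers_sketch and the label pattern are hypothetical, and the module's actual rules may differ.

```python
import re

# Assumption: a label is a capitalized word followed by ":" and whitespace,
# located at the start of the text or right after whitespace.
_LABEL_RE = re.compile(r"(?:^|(?<=\s))([A-Z][\w'-]*):\s+")


def split_labeled_speakers_sketch(text: str) -> list[tuple[str, str]]:
    """Hypothetical helper: return (speaker, line) pairs for "Name: ..." text."""
    matches = list(_LABEL_RE.finditer(text))
    pairs = []
    for i, m in enumerate(matches):
        # Each utterance runs until the next label or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pairs.append((m.group(1), text[m.end():end].strip()))
    return pairs


print(split_labeled_speakers_sketch("Naruto: I'm hungry! Sakura: You're always hungry."))
# [('Naruto', "I'm hungry!"), ('Sakura', "You're always hungry.")]
print(split_labeled_speakers_sketch("Time: 3pm"))
# [('Time', '3pm')]  <- false positive
print(split_labeled_speakers_sketch("Let's go!"))
# []  <- unlabeled text yields nothing
```

Checking candidate labels against a list of known character names would be one way to reduce false positives, at the cost of missing unlisted speakers.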
Integration with Pipeline¶
Normalization typically happens after parsing:
from naruto_net.io.subtitles import AssReader
from naruto_net.normalize.ass_text import strip_ass_tags, normalize_ass_newlines
# 1. Parse
reader = AssReader('episode.ass')
events = reader.read_events()
# 2. Normalize
for event in events:
    event.text = strip_ass_tags(event.text)
    event.text = normalize_ass_newlines(event.text)
    event.text = event.text.strip()
# 3. Continue to detection...
Performance Notes¶
- Regex compilation: Patterns are compiled once, reused across all events
- Speed: Cleaning 1000 subtitle lines takes <10ms
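A rough way to check a timing figure like this locally is sketched below. It measures a hypothetical stand-in (strip_tags_sketch with an assumed regex), not the module itself, and the result depends on hardware and Python version.

```python
import re
import timeit

# Compiled once at import time and reused for every call.
_TAG_RE = re.compile(r"\{[^{}]*\}")


def strip_tags_sketch(text: str) -> str:
    """Hypothetical stand-in for the module's tag stripping."""
    return _TAG_RE.sub("", text)


lines = [r"{\i1}Naruto!{\i0} You're {\b1}late{\b0} again!"] * 1000

# Best of 5 runs; each run cleans all 1000 lines once.
elapsed = min(
    timeit.repeat(lambda: [strip_tags_sketch(s) for s in lines], number=1, repeat=5)
)
print(f"1000 lines: {elapsed * 1000:.2f} ms")
```

Taking the minimum over several runs reduces noise from other processes. Swapping in the real functions from naruto_net.normalize.ass_text would give the figure that matters.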