Normalize Module

The normalize module provides text cleaning utilities for subtitle data.

Overview

This module handles:

Stripping ASS formatting tags ({\i1}, {\b1}, etc.)
Normalizing newlines (\N → space)
Splitting multi-speaker dialogue lines
Cleaning whitespace

Functions

`strip_ass_tags(text)`

Remove ASS override tag blocks like {\i1}.

Source code in src/naruto_net/normalize/ass_text.py

def strip_ass_tags(text: str) -> str:
    r"""Remove ASS override tag blocks like `{\i1}`."""
    return _TAG_BLOCK_RE.sub("", text)
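
The private `_TAG_BLOCK_RE` pattern is not shown on this page. As a minimal sketch, assuming it matches any brace-delimited block, the behavior can be reproduced like this (`TAG_BLOCK_RE` and `strip_tags_sketch` are hypothetical names, not part of the module):

```python
import re

# Assumption: stand-in for the module's private _TAG_BLOCK_RE.
# It matches one {...} block with no nested braces, e.g. {\i1} or {\pos(10,20)}.
TAG_BLOCK_RE = re.compile(r"\{[^{}]*\}")


def strip_tags_sketch(text: str) -> str:
    """Remove every {...} override block and leave the surrounding text as-is."""
    return TAG_BLOCK_RE.sub("", text)


print(strip_tags_sketch(r"{\i1}Naruto!{\i0} You're {\b1}late{\b0} again!"))
# Naruto! You're late again!
print(strip_tags_sketch(r"{\fad(200,200)\pos(320,40)}Episode 1"))
# Episode 1
```

Under this assumption, removing a tag does not touch the whitespace around it, so a tag that sat between two spaces leaves a double space behind. Collapsing whitespace is left to `clean_dialogue_text`.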

`normalize_ass_newlines(text, *, newline=' ')`

Convert ASS \N to a provided newline token.

Source code in src/naruto_net/normalize/ass_text.py

def normalize_ass_newlines(text: str, *, newline: str = " ") -> str:
    r"""Convert ASS `\N` to a provided newline token."""
    return _ASS_NEWLINE_RE.sub(newline, text)
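
To illustrate the `newline` keyword, here is a small sketch that assumes `_ASS_NEWLINE_RE` matches the two literal characters `\N`. The names `ASS_NEWLINE_RE` and `normalize_newlines_sketch` are hypothetical stand-ins. The sketch uses a callable replacement, which is a deliberate deviation from the source: it sidesteps the template escaping that `re.sub` applies to replacement strings containing backslashes.

```python
import re

# Assumption: stand-in for the module's private _ASS_NEWLINE_RE.
# The pattern matches a literal backslash followed by a capital N.
ASS_NEWLINE_RE = re.compile(r"\\N")


def normalize_newlines_sketch(text: str, *, newline: str = " ") -> str:
    """Replace every ASS \\N break with the given token."""
    # A callable replacement inserts `newline` verbatim, with no escape processing.
    return ASS_NEWLINE_RE.sub(lambda _match: newline, text)


raw = r"First line\NSecond line"
print(normalize_newlines_sketch(raw))                 # First line Second line
print(normalize_newlines_sketch(raw, newline=" / "))  # First line / Second line
print(repr(normalize_newlines_sketch(raw, newline="\n")))
# 'First line\nSecond line'
```

Passing `newline="\n"` keeps the visual line structure, which is how `split_event_to_utterances` builds its display text.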

`clean_dialogue_text(text_raw, *, newline=' ')`

Clean ASS dialogue text for matching (no tags, normalized newlines, collapsed whitespace).

Source code in src/naruto_net/normalize/ass_text.py

def clean_dialogue_text(text_raw: str, *, newline: str = " ") -> str:
    """Clean ASS dialogue text for matching (no tags, normalized newlines, collapsed whitespace)."""
    text = strip_ass_tags(text_raw)
    text = normalize_ass_newlines(text, newline=newline)
    text = " ".join(text.split())
    return text.strip()
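
One consequence of the whitespace collapse is worth seeing in action: because `" ".join(text.split())` splits on all whitespace, a whitespace-only `newline` token such as `"\n"` is flattened to a single space, while a visible token survives. The sketch below mirrors the three steps with assumed stand-in patterns (`TAG_BLOCK_RE`, `ASS_NEWLINE_RE`, and `clean_sketch` are hypothetical names):

```python
import re

# Assumptions: stand-ins for the module's private patterns, which are not shown here.
TAG_BLOCK_RE = re.compile(r"\{[^{}]*\}")
ASS_NEWLINE_RE = re.compile(r"\\N")


def clean_sketch(text_raw: str, *, newline: str = " ") -> str:
    """Strip tags, replace \\N with `newline`, then collapse runs of whitespace."""
    text = TAG_BLOCK_RE.sub("", text_raw)
    text = ASS_NEWLINE_RE.sub(lambda _match: newline, text)
    return " ".join(text.split())


raw = r"{\i1}I will avenge \N  {\i0}Pervy Sage!"
print(clean_sketch(raw))                 # I will avenge Pervy Sage!
print(clean_sketch(raw, newline="\n"))   # I will avenge Pervy Sage!
print(clean_sketch(raw, newline=" / "))  # I will avenge / Pervy Sage!
```

Since the join already removes leading and trailing whitespace, the final `.strip()` in the source has nothing left to trim.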

`split_event_to_utterances(ev)`

Convert one subtitle event into 1..N utterances (not speaker attribution).

Source code in src/naruto_net/normalize/utterances.py

def split_event_to_utterances(ev: SubtitleEvent) -> list[Utterance]:
    """Convert one subtitle event into 1..N utterances (not speaker attribution)."""
    text_raw = ev.text_raw
    display = normalize_ass_newlines(strip_ass_tags(text_raw), newline="\n").strip()
    clean_flat = clean_dialogue_text(text_raw, newline=" ")

    lines = [ln.strip() for ln in display.split("\n") if ln.strip()]
    is_multi = any(_MULTI_SPEAKER_RE.search("\n" + ln) for ln in lines) and len(lines) >= 2

    out: list[Utterance] = []
    if is_multi:
        for i, ln in enumerate(lines):
            ln2 = re.sub(r"^\s*-\s+", "", ln).strip()
            out.append(Utterance(
                utterance_id=f"{ev.event_id}:{i}",
                event_id=ev.event_id,
                utterance_index=i,
                start_ms=ev.start_ms,
                end_ms=ev.end_ms,
                text_clean=" ".join(ln2.split()),
                text_display=ln2,
                is_multi_speaker_line=True,
            ))
    else:
        out.append(Utterance(
            utterance_id=f"{ev.event_id}:0",
            event_id=ev.event_id,
            utterance_index=0,
            start_ms=ev.start_ms,
            end_ms=ev.end_ms,
            text_clean=clean_flat,
            text_display=display.replace("\n", " "),
            is_multi_speaker_line=False,
        ))
    return out
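
The function depends on `SubtitleEvent`, `Utterance`, and `_MULTI_SPEAKER_RE`, none of which are defined on this page. The following self-contained sketch follows the same control flow under stated assumptions: `Event` is a hypothetical stand-in for `SubtitleEvent`, the result is simplified to `(utterance_id, text)` pairs, and `MULTI_SPEAKER_RE` is a guess that treats a dash-prefixed line as the multi-speaker signal.

```python
import re
from dataclasses import dataclass

# Assumption: stand-in for the private _MULTI_SPEAKER_RE (a line starting with "- ").
MULTI_SPEAKER_RE = re.compile(r"\n\s*-\s+")
TAG_BLOCK_RE = re.compile(r"\{[^{}]*\}")  # assumed tag pattern


@dataclass
class Event:
    """Hypothetical, minimal stand-in for SubtitleEvent."""
    event_id: str
    start_ms: int
    end_ms: int
    text_raw: str


def split_sketch(ev: Event) -> list[tuple[str, str]]:
    """Return (utterance_id, text) pairs, one per detected utterance."""
    display = TAG_BLOCK_RE.sub("", ev.text_raw).replace(r"\N", "\n").strip()
    lines = [ln.strip() for ln in display.split("\n") if ln.strip()]
    # Multi-speaker needs at least two lines and at least one pattern hit.
    is_multi = len(lines) >= 2 and any(
        MULTI_SPEAKER_RE.search("\n" + ln) for ln in lines
    )
    if not is_multi:
        return [(f"{ev.event_id}:0", " ".join(display.split()))]
    return [
        (f"{ev.event_id}:{i}", re.sub(r"^\s*-\s+", "", ln).strip())
        for i, ln in enumerate(lines)
    ]


two_speakers = Event("e1", 0, 2000, r"- I'm hungry!\N- You're always hungry.")
one_speaker = Event("e2", 2000, 4000, r"I will never\Ngo back on my word")
print(split_sketch(two_speakers))
# [('e1:0', "I'm hungry!"), ('e1:1', "You're always hungry.")]
print(split_sketch(one_speaker))
# [('e2:0', 'I will never go back on my word')]
```

Note that a two-line event with no pattern hit stays a single utterance, and that every utterance from one event shares that event's timing.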

Usage Examples

Stripping ASS Tags

from naruto_net.normalize.ass_text import strip_ass_tags

# Remove italic/bold/color tags
text = r"{\i1}Naruto!{\i0} You're {\b1}late{\b0} again!"
clean = strip_ass_tags(text)
print(clean)
# Output: "Naruto! You're late again!"

Normalizing Newlines

from naruto_net.normalize.ass_text import normalize_ass_newlines

# ASS uses \N for line breaks; a raw string keeps the backslash intact in Python
text = r"First line\NSecond line\NThird line"
normalized = normalize_ass_newlines(text)
print(normalized)
# Output: "First line Second line Third line"

Combined Cleaning

from naruto_net.normalize.ass_text import strip_ass_tags, normalize_ass_newlines

# clean_dialogue_text() applies the same steps and also collapses repeated whitespace
def clean_subtitle_text(text: str) -> str:
    """Apply all normalization steps."""
    text = strip_ass_tags(text)
    text = normalize_ass_newlines(text)
    return text.strip()

raw = r"{\i1}I will avenge\N{\i0}Pervy Sage!"
clean = clean_subtitle_text(raw)
print(clean)
# Output: "I will avenge Pervy Sage!"

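Splitting Multi-Speaker Lines

from naruto_net.normalize.utterances import split_event_to_utterances

# `ev` is a SubtitleEvent produced by the parser. Suppose its text_raw holds
# two dash-prefixed lines separated by \N (assuming this is a form the
# module's multi-speaker pattern matches):
#   - I'm hungry!\N- You're always hungry.
utterances = split_event_to_utterances(ev)

for utt in utterances:
    print(f"{utt.utterance_id}: {utt.text_display}")
# If the event is detected as multi-speaker, each line becomes its own
# utterance with the leading dash removed and an ID of the form
# <event_id>:<index>. Speakers are not identified.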

ASS Formatting Tags

Common tags removed by strip_ass_tags():

Tag	Meaning
`{\i1}...{\i0}`	Italic text
`{\b1}...{\b0}`	Bold text
`{\u1}...{\u0}`	Underline
`{\c&HFFFFFF&}`	Text color (hex)
`{\fs24}`	Font size
`{\pos(x,y)}`	Screen position
`{\fad(in,out)}`	Fade in/out

Why remove them?

Character detection runs regular expressions over plain text. A formatting tag that falls inside a name breaks the match:

# With tags (a pattern for "Naruto Uzumaki" won't match)
text = r"Naruto {\i1}Uzumaki{\i0} is late"
# After cleaning (matches "Naruto Uzumaki")
text = "Naruto Uzumaki is late"
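
The effect is clearest when a tag sits between the words of a multi-word name. The sketch below checks this directly; `NAME_RE` is a hypothetical detection pattern and `TAG_BLOCK_RE` an assumed stand-in for the module's tag pattern.

```python
import re

NAME_RE = re.compile(r"\bNaruto Uzumaki\b")  # hypothetical detection pattern
TAG_BLOCK_RE = re.compile(r"\{[^{}]*\}")      # assumed stand-in for _TAG_BLOCK_RE

tagged = r"Naruto {\i1}Uzumaki{\i0} is late"
cleaned = TAG_BLOCK_RE.sub("", tagged)

print(bool(NAME_RE.search(tagged)))   # False
print(cleaned)                        # Naruto Uzumaki is late
print(bool(NAME_RE.search(cleaned)))  # True
```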

Newline Handling

ASS uses \N (a backslash followed by a capital N) for line breaks, not \n:

# ASS format
text = r"Line 1\NLine 2"  # raw string: \N is two literal characters

# After normalization
text = "Line 1 Line 2"   # Converted to space

Why convert to space?

For character detection, preserving semantic meaning is more important than visual layout. Newlines within a single subtitle event usually don't indicate scene boundaries.
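
A short sketch of why flattening helps detection: a name that wraps across a `\N` break only matches once the break becomes a space. `NAME_RE` is a hypothetical pattern, and plain `str.replace` stands in for `normalize_ass_newlines`.

```python
import re

NAME_RE = re.compile(r"\bPervy Sage\b")  # hypothetical detection pattern

raw = r"I will avenge Pervy\NSage!"
flattened = raw.replace(r"\N", " ")  # stand-in for normalize_ass_newlines(raw)

print(bool(NAME_RE.search(raw)))        # False
print(flattened)                        # I will avenge Pervy Sage!
print(bool(NAME_RE.search(flattened)))  # True
```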

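Multi-Speaker Detection

Some subtitle events carry lines from more than one speaker, separated by \N:

- Dialogue here.\N- Response.

The split_event_to_utterances() function detects such events and returns one utterance per line.

How it works:

The event text is split on \N, and empty lines are dropped
An event counts as multi-speaker only if it has at least two lines and at least one line matches the module's multi-speaker pattern (_MULTI_SPEAKER_RE, not shown on this page)
A leading "- " is removed from each line
Every utterance keeps the event's start_ms and end_ms and gets the ID <event_id>:<index>

Limitations:

Doesn't attribute speakers; it only splits the text
Can't separate two speakers who share a single line, because splitting happens only at \N
Heuristic-based (may split events whose lines merely resemble the pattern)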
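
The `split_event_to_utterances` source shown earlier removes a leading dash marker with the pattern `^\s*-\s+`. This small sketch shows which lines that pattern does and does not change (`strip_leading_dash` is a hypothetical helper name):

```python
import re

# Same pattern as in the split_event_to_utterances source above.
LEADING_DASH_RE = re.compile(r"^\s*-\s+")


def strip_leading_dash(line: str) -> str:
    """Drop one leading "- " marker; the dash must be followed by whitespace."""
    return LEADING_DASH_RE.sub("", line).strip()


for sample in ["- Let's go!", "  -   Wait for me!", "-No space after dash", "Well-known ninja"]:
    print(repr(strip_leading_dash(sample)))
# "Let's go!"
# 'Wait for me!'
# '-No space after dash'
# 'Well-known ninja'
```

A dash with no following space is left alone, as are hyphens inside words.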

Integration with Pipeline

Normalization typically happens after parsing:

from naruto_net.io.subtitles import AssReader
from naruto_net.normalize.ass_text import strip_ass_tags, normalize_ass_newlines

# 1. Parse
reader = AssReader('episode.ass')
events = reader.read_events()

# 2. Normalize
for event in events:
    event.text = strip_ass_tags(event.text)
    event.text = normalize_ass_newlines(event.text)
    event.text = event.text.strip()

# 3. Continue to detection...

Performance Notes

Regex compilation: Patterns are compiled once, reused across all events
Speed: Cleaning 1000 subtitle lines takes <10ms
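
Timings depend on the machine and the input, so it may help to measure locally. The sketch below times a stand-in cleaner over 1000 synthetic lines with `timeit`. The patterns and `clean_sketch` are assumptions rather than the module's own code, so the result only approximates the real functions.

```python
import re
import timeit

# Assumptions: stand-ins for the module's private patterns, compiled once at import.
TAG_BLOCK_RE = re.compile(r"\{[^{}]*\}")
ASS_NEWLINE_RE = re.compile(r"\\N")


def clean_sketch(text_raw: str) -> str:
    """Strip tags, flatten \\N to a space, and collapse whitespace."""
    text = TAG_BLOCK_RE.sub("", text_raw)
    text = ASS_NEWLINE_RE.sub(" ", text)
    return " ".join(text.split())


# 1000 synthetic subtitle lines with tags and one \N break each.
lines = [r"{\i1}Line %d\Nsecond half{\i0}" % i for i in range(1000)]

runs = 10
elapsed = timeit.timeit(lambda: [clean_sketch(s) for s in lines], number=runs) / runs
print(f"{elapsed * 1000:.2f} ms per 1000 lines")
```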

Related Modules

IO: Provides raw subtitle events to normalize
Detect: Uses normalized text for character matching
