
Normalize Module

The normalize module provides text cleaning utilities for subtitle data.


Overview

This module handles:

  • Stripping ASS formatting tags ({\i1}, {\b1}, etc.)
  • Normalizing newlines (\N → space)
  • Splitting multi-speaker dialogue lines
  • Cleaning whitespace

Functions

strip_ass_tags(text)

Remove ASS override tag blocks like {\i1}.

Source code in src/naruto_net/normalize/ass_text.py
def strip_ass_tags(text: str) -> str:
    r"""Remove ASS override tag blocks like `{\i1}`."""
    return _TAG_BLOCK_RE.sub("", text)

normalize_ass_newlines(text, *, newline=' ')

Convert ASS \N to a provided newline token.

Source code in src/naruto_net/normalize/ass_text.py
def normalize_ass_newlines(text: str, *, newline: str = " ") -> str:
    r"""Convert ASS `\N` to a provided newline token."""
    return _ASS_NEWLINE_RE.sub(newline, text)

clean_dialogue_text(text_raw, *, newline=' ')

Clean ASS dialogue text for matching (no tags, normalized newlines, collapsed whitespace).

Source code in src/naruto_net/normalize/ass_text.py
def clean_dialogue_text(text_raw: str, *, newline: str = " ") -> str:
    """Clean ASS dialogue text for matching (no tags, normalized newlines, collapsed whitespace)."""
    text = strip_ass_tags(text_raw)
    text = normalize_ass_newlines(text, newline=newline)
    text = " ".join(text.split())
    return text.strip()
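
The functions above rely on two module-level patterns, _TAG_BLOCK_RE and _ASS_NEWLINE_RE, whose definitions are not shown on this page. The following is a minimal self-contained sketch of the same three steps, assuming the simplest plausible patterns. The pattern bodies and the `_sketch` name are assumptions for illustration, not the module's actual code.

```python
import re

# Assumed patterns; the real definitions in ass_text.py are not shown on this page.
TAG_BLOCK_RE = re.compile(r"\{[^}]*\}")   # any {...} override block
ASS_NEWLINE_RE = re.compile(r"\\N")       # a literal backslash followed by N


def clean_dialogue_text_sketch(text_raw: str, *, newline: str = " ") -> str:
    """Mirror the three steps: strip tags, convert line breaks, collapse whitespace."""
    text = TAG_BLOCK_RE.sub("", text_raw)
    text = ASS_NEWLINE_RE.sub(newline, text)
    return " ".join(text.split())


print(clean_dialogue_text_sketch(r"{\i1}Believe   it!\N{\i0}  Dattebayo! "))
# Believe it! Dattebayo!
```

Because the last step splits on all whitespace, a whitespace newline token such as "\n" is also folded into a single space. Callers that need the line layout should use normalize_ass_newlines() directly, as split_event_to_utterances() does.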

split_event_to_utterances(ev)

Convert one subtitle event into 1..N utterances (not speaker attribution).

Source code in src/naruto_net/normalize/utterances.py
def split_event_to_utterances(ev: SubtitleEvent) -> list[Utterance]:
    """Convert one subtitle event into 1..N utterances (not speaker attribution)."""
    text_raw = ev.text_raw
    display = normalize_ass_newlines(strip_ass_tags(text_raw), newline="\n").strip()
    clean_flat = clean_dialogue_text(text_raw, newline=" ")

    lines = [ln.strip() for ln in display.split("\n") if ln.strip()]
    is_multi = any(_MULTI_SPEAKER_RE.search("\n" + ln) for ln in lines) and len(lines) >= 2

    out: list[Utterance] = []
    if is_multi:
        for i, ln in enumerate(lines):
            ln2 = re.sub(r"^\s*-\s+", "", ln).strip()
            out.append(Utterance(
                utterance_id=f"{ev.event_id}:{i}",
                event_id=ev.event_id,
                utterance_index=i,
                start_ms=ev.start_ms,
                end_ms=ev.end_ms,
                text_clean=" ".join(ln2.split()),
                text_display=ln2,
                is_multi_speaker_line=True,
            ))
    else:
        out.append(Utterance(
            utterance_id=f"{ev.event_id}:0",
            event_id=ev.event_id,
            utterance_index=0,
            start_ms=ev.start_ms,
            end_ms=ev.end_ms,
            text_clean=clean_flat,
            text_display=display.replace("\n", " "),
            is_multi_speaker_line=False,
        ))
    return out
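
To make the splitting rule easier to follow, here is a reduced, self-contained sketch of the same control flow. The `Event` dataclass is a hypothetical stand-in for SubtitleEvent, the function returns plain tuples instead of Utterance objects, and the multi-speaker pattern is an assumption (a line that starts with a dash). The real _MULTI_SPEAKER_RE is not shown on this page.

```python
import re
from dataclasses import dataclass

# Assumption: a line counts as "multi-speaker" when it starts with a dash.
MULTI_SPEAKER_RE = re.compile(r"\n\s*-\s+")


@dataclass
class Event:  # hypothetical stand-in for SubtitleEvent
    event_id: str
    start_ms: int
    end_ms: int
    text_raw: str


def split_lines_sketch(ev: Event) -> list[tuple[str, str]]:
    """Return (utterance_id, text) pairs, one per line for multi-speaker events."""
    # Strip {...} blocks and turn \N into real newlines (assumed patterns).
    display = re.sub(r"\{[^}]*\}", "", ev.text_raw).replace("\\N", "\n").strip()
    lines = [ln.strip() for ln in display.split("\n") if ln.strip()]
    is_multi = len(lines) >= 2 and any(
        MULTI_SPEAKER_RE.search("\n" + ln) for ln in lines
    )
    if not is_multi:
        # Single utterance: flatten the whole event onto one line.
        return [(f"{ev.event_id}:0", " ".join(display.split()))]
    # One utterance per line, with the leading dash removed.
    return [
        (f"{ev.event_id}:{i}", " ".join(re.sub(r"^\s*-\s+", "", ln).split()))
        for i, ln in enumerate(lines)
    ]


ev = Event("ep1-42", 1000, 3500, r"- I'm hungry!\N- You're always hungry.")
for utterance_id, text in split_lines_sketch(ev):
    print(utterance_id, text)
# ep1-42:0 I'm hungry!
# ep1-42:1 You're always hungry.
```

Note that a single-line event is never split, whatever it contains, because the rule requires at least two non-empty lines.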

Usage Examples

Stripping ASS Tags

from naruto_net.normalize.ass_text import strip_ass_tags

# Remove italic/bold/color tags
text = r"{\i1}Naruto!{\i0} You're {\b1}late{\b0} again!"
clean = strip_ass_tags(text)
print(clean)
# Output: "Naruto! You're late again!"

Normalizing Newlines

from naruto_net.normalize.ass_text import normalize_ass_newlines

# ASS uses \N for line breaks; a raw string keeps the backslash literal
text = r"First line\NSecond line\NThird line"
normalized = normalize_ass_newlines(text)
print(normalized)
# Output: "First line Second line Third line"

Combined Cleaning

from naruto_net.normalize.ass_text import strip_ass_tags, normalize_ass_newlines

# clean_dialogue_text() performs the same steps and also collapses repeated whitespace
def clean_subtitle_text(text: str) -> str:
    """Apply tag stripping and newline normalization."""
    text = strip_ass_tags(text)
    text = normalize_ass_newlines(text)
    return text.strip()

raw = r"{\i1}I will avenge\N{\i0}Pervy Sage!"
clean = clean_subtitle_text(raw)
print(clean)
# Output: "I will avenge Pervy Sage!"

Splitting Multi-Speaker Lines

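from naruto_net.normalize.utterances import split_event_to_utterances

# Some subtitle events combine lines from multiple speakers, e.g. a text_raw of
#   r"- I'm hungry!\N- You're always hungry."
# `ev` is a SubtitleEvent produced by the parser (construction not shown here)
utterances = split_event_to_utterances(ev)

for utt in utterances:
    print(f"{utt.utterance_id}: {utt.text_clean}")
# Prints one line per utterance; IDs have the form "<event_id>:<index>".
# Whether an event is split depends on _MULTI_SPEAKER_RE.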

ASS Formatting Tags

Common tags removed by strip_ass_tags():

Tag              Meaning
{\i1}...{\i0}    Italic text
{\b1}...{\b0}    Bold text
{\u1}...{\u0}    Underline
{\c&HFFFFFF&}    Text color (hex)
{\fs24}          Font size
{\pos(x,y)}      Screen position
{\fad(in,out)}   Fade in/out

Why remove them?

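Character detection runs regular expressions over plain text. Formatting tags put extra characters around or inside the words being matched, so patterns written for clean text can miss:

# With tags (a pattern anchored at the start, such as ^Naruto, won't match)
text = r"{\i1}Naruto{\i0} is late"
# After cleaning (the same pattern matches)
text = "Naruto is late"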
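
One concrete way tags get in the way is with anchored patterns. The sketch below uses a hypothetical detector that looks for a character name at the start of a line. Both the detector and the tag-stripping regex are assumptions for illustration, not the project's actual matcher.

```python
import re

# Hypothetical detector: a line that opens with a character's name.
STARTS_WITH_NAME = re.compile(r"^Naruto\b")

tagged = r"{\i1}Naruto{\i0} is late"
plain = re.sub(r"\{[^}]*\}", "", tagged)  # assumed equivalent of strip_ass_tags()

print(bool(STARTS_WITH_NAME.search(tagged)))  # False
print(bool(STARTS_WITH_NAME.search(plain)))   # True
```

The tagged string begins with "{", so the anchor fails even though the name is present. After stripping, the name sits at position 0 and the pattern matches.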

Newline Handling

ASS uses \N (backslash-N) for line breaks, not \n:

# ASS format
text = r"Line 1\NLine 2"  # raw string: \N stays as two literal characters

# After normalization
text = "Line 1 Line 2"   # Converted to space

Why convert to space?

For character detection, preserving semantic meaning is more important than visual layout. Newlines within a single subtitle event usually don't indicate scene boundaries.
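
The newline keyword lets callers choose between a flat string for matching and a layout-preserving string for display. Here is a small sketch of both uses, assuming the ASS line break is matched by a plain backslash-N pattern. The pattern and the `_sketch` function are stand-ins, not the module's code.

```python
import re

ASS_NEWLINE_RE = re.compile(r"\\N")  # assumed pattern; the real one is not shown


def normalize_ass_newlines_sketch(text: str, *, newline: str = " ") -> str:
    """Replace each ASS line break with the given token."""
    return ASS_NEWLINE_RE.sub(newline, text)


raw = r"Line 1\NLine 2"  # raw string: backslash and N stay as two characters
flat = normalize_ass_newlines_sketch(raw)                 # for matching
shown = normalize_ass_newlines_sketch(raw, newline="\n")  # keeps the line layout

print(flat)               # Line 1 Line 2
print(shown.split("\n"))  # ['Line 1', 'Line 2']
```

The raw-string prefix matters: in an ordinary Python string literal, \N begins a named Unicode escape, so "Line 1\NLine 2" is rejected as a syntax error.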


Multi-Speaker Detection

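Some subtitle events carry dialogue from more than one speaker, typically one speaker per line with a leading dash:

- Dialogue here.\N- Response.

split_event_to_utterances() splits the event text on \N and checks each line against _MULTI_SPEAKER_RE (defined in utterances.py, not shown on this page). If the event has at least two non-empty lines and at least one of them matches, every line becomes its own utterance.

Behavior:

  • A leading "- " is stripped from each split line
  • Every utterance keeps the event's start_ms and end_ms
  • Utterance IDs have the form event_id:index
  • Events that don't qualify produce a single utterance with newlines replaced by spaces

Limitations:

  • Doesn't attribute speakers; it only separates utterances
  • Heuristic-based (may split events that are not actually multi-speaker dialogue)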

Integration with Pipeline

Normalization typically happens after parsing:

from naruto_net.io.subtitles import AssReader
from naruto_net.normalize.ass_text import clean_dialogue_text

# 1. Parse
reader = AssReader('episode.ass')
events = reader.read_events()

# 2. Normalize: strip tags, convert \N to spaces, collapse whitespace
cleaned = [clean_dialogue_text(event.text_raw) for event in events]

# 3. Continue to detection...

Performance Notes

  • Regex compilation: Patterns are compiled once, reused across all events
  • Speed: Cleaning 1000 subtitle lines takes <10ms
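
To check the timing on your own machine, a rough benchmark along these lines can help. It uses stand-in patterns compiled once at module level (assumptions, since the real patterns are not shown on this page) and measures one pass over 1000 synthetic lines. Results vary with hardware and line length.

```python
import re
import timeit

TAG_BLOCK_RE = re.compile(r"\{[^}]*\}")   # assumed pattern, compiled once
ASS_NEWLINE_RE = re.compile(r"\\N")       # assumed pattern, compiled once

lines = [r"{\i1}Naruto!{\i0}\NYou're {\b1}late{\b0} again!"] * 1000


def clean_all(batch):
    """Strip tags, convert line breaks, and collapse whitespace for each line."""
    return [
        " ".join(ASS_NEWLINE_RE.sub(" ", TAG_BLOCK_RE.sub("", s)).split())
        for s in batch
    ]


# Best of 5 single passes over 1000 lines, in milliseconds.
best_ms = min(timeit.repeat(lambda: clean_all(lines), number=1, repeat=5)) * 1000
print(f"1000 lines cleaned in {best_ms:.2f} ms")
```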

Related Modules

  • IO: Provides raw subtitle events to normalize
  • Detect: Uses normalized text for character matching