IO Module

The io module provides utilities for parsing .ass (Advanced SubStation Alpha) subtitle files.


Overview

This module handles:

  • Loading .ass files as UTF-8, with undecodable bytes replaced
  • Extracting Dialogue: lines
  • Parsing timestamps and text content
  • Returning structured SubtitleEvent objects

Classes

AssReader

Minimal but robust ASS/SSA (.ass) Dialogue parser.

Correctness requirements:

  • Only reads the [Events] section
  • Respects the Format: field order
  • Splits Dialogue payload with maxsplit=(n-1) to avoid breaking on commas in Text
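
The bounded split in the last requirement can be illustrated with the sample Dialogue line from the File Format section. This is a minimal sketch that assumes the standard ten-field Events layout; in a real file the field list comes from the Format: line, not from a hard-coded list.

```python
# Assumed field order (the standard ASS Events layout); AssReader reads it from "Format:".
fmt = ["Layer", "Start", "End", "Style", "Name",
       "MarginL", "MarginR", "MarginV", "Effect", "Text"]

payload = "0,0:11:16.77,0:11:20.27,Default,,0,0,0,,So what should we do, Lady Hokage?"

# A plain split also cuts the Text field at its comma, producing one field too many.
naive = payload.split(",")

# Limiting the split to len(fmt) - 1 cuts leaves everything after the
# last separator intact, so Text keeps its comma.
bounded = payload.split(",", len(fmt) - 1)
record = dict(zip(fmt, bounded))

print(len(naive), len(bounded))
print(record["Text"])
```

With the naive split, zip would silently drop the trailing piece of the Text field, which is why the parser caps the number of splits.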

Source code in src/naruto_net/io/subtitles.py
class AssReader:
    """Minimal but robust ASS/SSA (.ass) Dialogue parser.

    Correctness requirements:
    - Only reads the [Events] section
    - Respects the Format: field order
    - Splits Dialogue payload with maxsplit=(n-1) to avoid breaking on commas in Text
    """

    def read_text(self, text: str, episode_id: str) -> list[SubtitleEvent]:
        lines = text.splitlines()
        in_events = False
        fmt: Optional[list[str]] = None
        out: list[SubtitleEvent] = []

        for raw_line in lines:
            line = raw_line.strip()
            if not line:
                continue

            if line.lower() == "[events]":
                in_events = True
                continue

            if not in_events:
                continue

            if line.startswith("[") and line.endswith("]") and line.lower() != "[events]":
                break

            if line.startswith("Format:"):
                fmt = [c.strip() for c in line[len("Format:") :].split(",")]
                continue

            if not line.startswith("Dialogue:"):
                continue

            if not fmt:
                raise ValueError("ASS parser error: Dialogue encountered before Format.")

            payload = line[len("Dialogue:") :].lstrip()
            parts = payload.split(",", len(fmt) - 1)
            if len(parts) != len(fmt):
                parts += [""] * (len(fmt) - len(parts))

            record = dict(zip(fmt, parts))
            start_ms = ass_time_to_ms(record.get("Start", "0:00:00.00"))
            end_ms = ass_time_to_ms(record.get("End", "0:00:00.00"))
            text_raw = record.get("Text", "")

            event_index = len(out)
            event_id = self._make_event_id(
                episode_id=episode_id,
                event_index=event_index,
                start_ms=start_ms,
                end_ms=end_ms,
                text_raw=text_raw,
            )

            out.append(
                SubtitleEvent(
                    episode_id=episode_id,
                    event_index=event_index,
                    event_id=event_id,
                    layer=str(record.get("Layer", "")).strip(),
                    start_ms=start_ms,
                    end_ms=end_ms,
                    style=str(record.get("Style", "")).strip(),
                    name=str(record.get("Name", "")).strip(),
                    margin_l=str(record.get("MarginL", "")).strip(),
                    margin_r=str(record.get("MarginR", "")).strip(),
                    margin_v=str(record.get("MarginV", "")).strip(),
                    effect=str(record.get("Effect", "")).strip(),
                    text_raw=text_raw,
                )
            )

        return out

    def read_file(self, path: Path, episode_id: str) -> list[SubtitleEvent]:
        text = path.read_text(encoding="utf-8", errors="replace")
        return self.read_text(text=text, episode_id=episode_id)

    @staticmethod
    def _make_event_id(*, episode_id: str, event_index: int, start_ms: int, end_ms: int, text_raw: str) -> str:
        key = f"{episode_id}|{event_index}|{start_ms}|{end_ms}|{text_raw}"
        return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
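
The event_id scheme in _make_event_id can be exercised on its own. The sketch below copies the hashing logic into a free function (make_event_id, a name used here only for illustration) to show that IDs are deterministic and that changing one input, here event_index, changes the ID. The episode_id value is made up for the example.

```python
import hashlib


def make_event_id(episode_id: str, event_index: int, start_ms: int,
                  end_ms: int, text_raw: str) -> str:
    # Same key layout as AssReader._make_event_id: fields joined with "|".
    key = f"{episode_id}|{event_index}|{start_ms}|{end_ms}|{text_raw}"
    # SHA-1 hex digest truncated to 16 hex characters (64 bits).
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


text = "So what should we do, Lady Hokage?"
a = make_event_id("episode_014", 0, 676770, 680270, text)
b = make_event_id("episode_014", 0, 676770, 680270, text)
c = make_event_id("episode_014", 1, 676770, 680270, text)

print(len(a), a == b, a != c)
```

Because event_index is part of the key, adding or removing an earlier Dialogue line in a file changes the IDs of every later event.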

SubtitleEvent dataclass

Source code in src/naruto_net/io/subtitles.py
@dataclass(frozen=True)
class SubtitleEvent:
    episode_id: str
    event_index: int
    event_id: str
    layer: str
    start_ms: int
    end_ms: int
    style: str
    name: str
    margin_l: str
    margin_r: str
    margin_v: str
    effect: str
    text_raw: str

    @property
    def duration_ms(self) -> int:
        return max(0, self.end_ms - self.start_ms)

Usage Examples

Basic Parsing

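from pathlib import Path

from naruto_net.io.subtitles import AssReader

# AssReader takes no constructor arguments; the path is passed to read_file
reader = AssReader()

# Parse all Dialogue events; episode_id is a caller-chosen label stored on each event
events = reader.read_file(
    Path('data/naruto-subtitle-files/episode_014.ass'),
    episode_id='episode_014',
)

# Inspect first event
first = events[0]
print(first.start_ms, first.end_ms, first.text_raw)
# e.g. 676770 680270 So what should we do, Lady Hokage?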

Filtering by Time Range

# Get events that start between 10:00 and 15:00
target_start_ms = 10 * 60 * 1000  # 10 minutes in milliseconds
target_end_ms = 15 * 60 * 1000    # 15 minutes in milliseconds

filtered = [
    e for e in events
    if target_start_ms <= e.start_ms <= target_end_ms
]

print(f"Found {len(filtered)} events in time range")

Converting to DataFrame

import pandas as pd

# Convert to pandas for analysis (all times are in milliseconds)
df = pd.DataFrame([
    {
        'start_ms': e.start_ms,
        'end_ms': e.end_ms,
        'duration_ms': e.duration_ms,
        'text': e.text_raw
    }
    for e in events
])

print(df.describe())

File Format

.ass files follow the Advanced SubStation Alpha specification. Dialogue lines look like:

Dialogue: 0,0:11:16.77,0:11:20.27,Default,,0,0,0,,So what should we do, Lady Hokage?

Parsed fields:

  • 0:11:16.77 → start_ms (converted to milliseconds: 676770)
  • 0:11:20.27 → end_ms (converted to milliseconds: 680270)
  • So what should we do, Lady Hokage? → text_raw

Remaining fields: Layer, Style, Name, MarginL, MarginR, MarginV, and Effect are stored on SubtitleEvent as stripped strings, although character detection does not need them.
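
The timestamp conversion is done by ass_time_to_ms, whose source is not shown on this page. The following is a minimal sketch of such a conversion, assuming timestamps of the form H:MM:SS.cc (centiseconds); the name ass_time_to_ms_sketch is hypothetical and the real helper may handle more cases.

```python
def ass_time_to_ms_sketch(stamp: str) -> int:
    """Hypothetical stand-in for ass_time_to_ms: 'H:MM:SS.cc' -> milliseconds."""
    hours, minutes, seconds = stamp.strip().split(":")
    whole, _, frac = seconds.partition(".")
    # Pad or truncate the fractional part to exactly three digits so that
    # "77" (centiseconds) becomes 770 ms. Integer math avoids float rounding.
    frac_ms = int((frac + "000")[:3])
    total_seconds = (int(hours) * 60 + int(minutes)) * 60 + int(whole)
    return total_seconds * 1000 + frac_ms


start_ms = ass_time_to_ms_sketch("0:11:16.77")
end_ms = ass_time_to_ms_sketch("0:11:20.27")
print(start_ms, end_ms, end_ms - start_ms)
```

The sketch stays in integer arithmetic throughout, which keeps values such as 676770 exact instead of relying on floating-point seconds.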


Error Handling

Encoding Issues

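read_file always decodes as UTF-8 with errors="replace", so it does not raise UnicodeDecodeError; bytes that are not valid UTF-8 show up as U+FFFD replacement characters instead. AssReader has no encoding parameter. For a file in another encoding, decode it yourself and pass the text to read_text:

from pathlib import Path

from naruto_net.io.subtitles import AssReader

reader = AssReader()

# UTF-8 with BOM
text = Path('file.ass').read_text(encoding='utf-8-sig')
events = reader.read_text(text, episode_id='file')

# Or auto-detect (chardet may return None for the encoding)
import chardet
raw = Path('file.ass').read_bytes()
encoding = chardet.detect(raw)['encoding'] or 'utf-8'
events = reader.read_text(raw.decode(encoding, errors='replace'), episode_id='file')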

Malformed Lines

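read_text does not log warnings. Inside [Events], any line that does not start with Dialogue: is skipped silently, and a Dialogue line with fewer fields than the Format: line is padded with empty strings. A Dialogue line that appears before any Format: line raises ValueError:

from pathlib import Path

from naruto_net.io.subtitles import AssReader

try:
    events = AssReader().read_file(Path('messy_file.ass'), episode_id='messy_file')
except ValueError as exc:
    # Raised when a Dialogue line precedes the Format: line
    print(f"Could not parse file: {exc}")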

Performance Notes

  • Memory: Events are loaded into memory. For large files (>10MB), consider streaming.
  • Speed: Parsing ~500 subtitle lines takes <50ms on typical hardware.
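
The streaming suggestion in the memory note could look roughly like the generator below. This is only a sketch: iter_dialogue_records is a hypothetical helper, not part of the module, and it yields plain dicts, not SubtitleEvent objects. It follows the same section, Format:, and bounded-split rules as AssReader.read_text but consumes lines lazily.

```python
import io
from typing import Iterable, Iterator, Optional


def iter_dialogue_records(lines: Iterable[str]) -> Iterator[dict[str, str]]:
    """Hypothetical helper: lazily yield one dict per Dialogue line."""
    in_events = False
    fmt: Optional[list[str]] = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        if line.lower() == "[events]":
            in_events = True
            continue
        if not in_events:
            continue
        if line.startswith("[") and line.endswith("]"):
            break  # another section starts, so [Events] is finished
        if line.startswith("Format:"):
            fmt = [c.strip() for c in line[len("Format:"):].split(",")]
            continue
        if not line.startswith("Dialogue:"):
            continue
        if not fmt:
            raise ValueError("Dialogue encountered before Format.")
        parts = line[len("Dialogue:"):].lstrip().split(",", len(fmt) - 1)
        parts += [""] * (len(fmt) - len(parts))  # pad short lines
        yield dict(zip(fmt, parts))


# An in-memory stand-in for an open file; any iterable of lines works.
sample = io.StringIO(
    "[Script Info]\n"
    "Title: demo\n"
    "[Events]\n"
    "Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text\n"
    "Dialogue: 0,0:11:16.77,0:11:20.27,Default,,0,0,0,,So what should we do, Lady Hokage?\n"
)
records = list(iter_dialogue_records(sample))
print(records[0]["Start"], records[0]["Text"])
```

For a file on disk, the same generator can be fed an open file handle (for example one opened with encoding="utf-8" and errors="replace"), so only one line is held in memory at a time.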

See Also

  • Normalize: Clean ASS formatting tags from parsed text
  • Segment: Group events into scenes using timing gaps