Subtitle Text Analysis with subtools

knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)

library(subtools)

Overview

subtools reads and manipulates video subtitle files from a variety of formats (SubRip .srt, WebVTT .vtt, SubStation Alpha .ass/.ssa, SubViewer .sub, MicroDVD .sub) and exposes them as tidy tibbles ready for text analysis.

This vignette walks through:

  1. Reading subtitle files
  2. Exploring subtitle objects
  3. Cleaning subtitle text
  4. Combining subtitles from multiple files
  5. Reading an entire series
  6. Adjusting timecodes
  7. Writing subtitles back to disk
  8. Tokenising and analysing text with tidytext
  9. Analysing dialogue across episodes

1. Reading subtitles

From a file

read_subtitles() is the main entry point. It auto-detects the file format from the extension and returns a subtitles object — a tibble with four core columns: ID, Timecode_in, Timecode_out, and Text_content.

f_srt <- system.file("extdata", "ex_subrip.srt", package = "subtools")
subs <- read_subtitles(file = f_srt)
subs

The same call works for every supported format. Use format = "auto" (default) or supply the format explicitly.

f_vtt <- system.file("extdata", "ex_webvtt.vtt", package = "subtools")
read_subtitles(file = f_vtt, format = "webvtt")
f_ass <- system.file("extdata", "ex_substation.ass", package = "subtools")
read_subtitles(file = f_ass, format = "substation")

Attaching metadata at read time

Any descriptive information — season, episode, source, language — can be attached as a one-row tibble via the metadata argument. The values are repeated for every subtitle line, keeping the tidy structure intact.

subs_meta <- read_subtitles(
  file = f_srt,
  metadata = tibble::tibble(Season = 1L, Episode = 3L, Language = "en")
)
subs_meta

Metadata columns travel with the object through all subtools operations.
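
For example, a cleaning step leaves the metadata columns in place (shown here with clean_tags(); other subtools verbs behave the same way):

```r
# The Season, Episode and Language columns added above are still present
names(clean_tags(x = subs_meta))
```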

From a character vector

as_subtitle() parses an in-memory character vector, which is useful when the subtitle text is already loaded or generated programmatically.

raw <- c(
  "1",
  "00:00:01,000 --> 00:00:03,500",
  "Hello, world.",
  "",
  "2",
  "00:00:04,000 --> 00:00:06,000",
  "This is subtools."
)
as_subtitle(x = raw, format = "srt")

2. Exploring the subtitles object

Quick summary

get_subtitles_info() prints a compact summary: line count, overall duration, and attached metadata fields.

s <- read_subtitles(
  file = system.file("extdata", "ex_subrip.srt", package = "subtools")
)
get_subtitles_info(x = s)

Raw text extraction

get_raw_text() collapses all subtitle lines into a single character string, useful when passing the whole transcript to external natural language processing tools.

transcript <- get_raw_text(x = s)
transcript

# One line per subtitle, separated by newlines
cat(get_raw_text(x = s, collapse = "\n"))

Accessing individual columns

Because a subtitles object is a tibble, all dplyr verbs work directly:

library(dplyr)

# Lines spoken after the first 30 seconds
s |>
  filter(Timecode_in > hms::as_hms("00:00:30"))

# Duration of each subtitle cue (in seconds)
s |>
  mutate(duration_s = as.numeric(Timecode_out - Timecode_in)) |>
  select(ID, Text_content, duration_s)

3. Cleaning subtitles

Subtitle files frequently contain formatting tags, closed-caption descriptions, and other non-speech artefacts that should be removed before text analysis.

Remove formatting tags

clean_tags() strips HTML-style tags (used in SRT and WebVTT) and curly-brace override blocks (used in SubStation Alpha).

tagged <- as_subtitle(
  x = c(
    "1",
    "00:00:01,000 --> 00:00:03,000",
    "<i>This is <b>important</b>.</i>",
    "",
    "2",
    "00:00:04,000 --> 00:00:06,000",
    "<font color=\"red\">Warning!</font>"
  ),
  format = "srt",
  clean.tags = FALSE   # keep tags so we can demonstrate cleaning
)
tagged$Text_content

clean_tags(x = tagged)$Text_content

Remove closed captions

clean_captions() removes text enclosed in parentheses or square brackets — typically sound descriptions and speaker identifiers used in accessibility captions.

bb <- read_subtitles(
  file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
  clean.tags = FALSE
)
bb$Text_content

clean_captions(x = bb)$Text_content

Remove arbitrary patterns

clean_patterns() accepts any regular expression, giving full flexibility for project-specific cleaning.

# Remove speaker labels such as "WALTER:" or "JESSE:"
s_labeled <- as_subtitle(
  x = c(
    "1", "00:00:01,000 --> 00:00:03,000", "WALTER: We need to cook.",
    "",
    "2", "00:00:04,000 --> 00:00:06,000", "JESSE: Yeah, Mr. White!"
  ),
  format = "srt", clean.tags = FALSE
)

clean_patterns(x = s_labeled, pattern = "^[A-Z]+: ")$Text_content

Chaining cleaning steps

Because each cleaning function returns a subtitles object, steps can be piped:

s_clean <- read_subtitles(file = f_srt, clean.tags = FALSE) |>
  clean_tags() |>
  clean_captions() |>
  clean_patterns(pattern = "^-\\s*")   # remove leading dialogue dashes

s_clean$Text_content

4. Combining subtitles

Collapsing multiple objects into one

bind_subtitles() merges any number of subtitles (or multisubtitles) objects. With collapse = TRUE (default), timecodes are shifted so that each file follows the previous one sequentially.

s1 <- read_subtitles(
  file = system.file("extdata", "ex_subrip.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 1L)
)
s2 <- read_subtitles(
  file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 2L)
)

combined <- bind_subtitles(s1, s2)
nrow(combined)
range(combined$Timecode_in)

Keeping a list structure

Set collapse = FALSE to get a multisubtitles object — a named list of subtitles — when you want to process episodes independently before merging.

multi <- bind_subtitles(s1, s2, collapse = FALSE)
class(multi)
print(multi)

get_subtitles_info() also works on multisubtitles:

get_subtitles_info(x = multi)

5. Reading an entire series

For TV series organised in a standard directory tree, subtools provides convenience readers that handle the hierarchy automatically and extract Season/Episode metadata from folder and file names.

Series_Collection/
|-- BreakingBad/
|   |-- Season_01/
|   |   |-- S01E01.srt
|   |   |-- S01E02.srt
|   |-- Season_02/
|       |-- S02E01.srt

# Read a single season
season1 <- read_subtitles_season(dir = "BreakingBad/Season_01/")

# Read an entire series (all seasons)
bb_all <- read_subtitles_serie(dir = "BreakingBad/")

# Read multiple series at once
collection <- read_subtitles_multiseries(dir = "Series_Collection/")

Each function returns a single collapsed subtitles object by default (bind = TRUE), with Serie, Season, and Episode columns populated from the directory structure. Pass bind = FALSE to get a multisubtitles list instead.
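
As a sketch (the path is illustrative and assumes the directory tree shown above), requesting the list form looks like this:

```r
# Returns a multisubtitles list instead of one collapsed tibble,
# so each part of the series can be processed independently
bb_list <- read_subtitles_serie(dir = "BreakingBad/", bind = FALSE)
```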


6. Adjusting timecodes

move_subtitles() shifts all timecodes by a fixed number of seconds. Positive values shift forward; negative values shift backward. This is useful when the subtitle file is out of sync with the video.

subs_shifted <- move_subtitles(x = subs, lag = 2.5)

# Compare first cue before and after
subs$Timecode_in[1]
subs_shifted$Timecode_in[1]

move_subtitles() also works on multisubtitles:

multi_shifted <- move_subtitles(x = multi, lag = -1.0)
multi_shifted[[1]]$Timecode_in[1]

7. Writing subtitles back to disk

write_subtitles() serialises a subtitles object to a SubRip .srt file.

write_subtitles(x = subs_shifted, file = "synced_episode.srt")
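
Reading the file back is a quick way to check the round trip; here the output is written to a temporary path so the example leaves no file behind:

```r
out <- file.path(tempdir(), "synced_episode.srt")
write_subtitles(x = subs_shifted, file = out)

# The re-imported object should show the shifted timecodes
read_subtitles(file = out)
```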

8. Text analysis with tidytext

Tokenising into words

unnest_tokens() extends tidytext::unnest_tokens() with subtitle-aware timecode remapping: each token inherits a proportional slice of the original cue's time window, enabling timeline-based analyses.

words <- unnest_tokens(tbl = subs)
words

The Timecode_in / Timecode_out columns now reflect the estimated position of each word within its cue.
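
Because each token carries its own slice of the cue window, the per-word time allocation can be inspected directly:

```r
# Approximate duration (in seconds) assigned to each word
words |>
  mutate(word_duration_s = as.numeric(Timecode_out - Timecode_in)) |>
  select(Text_content, Timecode_in, word_duration_s)
```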

Tokenising into sentences or n-grams

# Bigrams
bigrams <- unnest_tokens(tbl = subs, output = Word, input = Text_content,
                         token = "ngrams", n = 2)
bigrams$Word

Word frequency

library(dplyr)

words |>
  count(Text_content, sort = TRUE) |>
  head(10)

9. Advanced: cross-episode analysis

The metadata columns added at read time make it straightforward to compare episodes or seasons. The example below simulates a two-episode corpus and computes per-episode word counts — a pattern that scales directly to a full series loaded with read_subtitles_serie().

ep1 <- read_subtitles(
  file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 1L)
)
ep2 <- read_subtitles(
  file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 2L)
)
ep3 <- read_subtitles(
  file = system.file("extdata", "ex_webvtt.vtt", package = "subtools"),
  metadata = tibble::tibble(Episode = 3L)
)

corpus <- bind_subtitles(ep1, ep2, ep3)

token_counts <- unnest_tokens(corpus) |>
  count(Episode, Text_content, sort = TRUE)

token_counts |>
  slice_max(n, n = 5, by = Episode)

TF-IDF across episodes

TF-IDF highlights words that are distinctive to each episode compared with the rest of the corpus.

token_counts |>
  tidytext::bind_tf_idf(Text_content, Episode, n) |>
  arrange(Episode, desc(tf_idf)) |>
  slice_max(tf_idf, n = 5, by = Episode)

Dialogue timeline

Because timecodes are preserved through unnest_tokens(), words can be plotted along a timeline, e.g. to visualise how vocabulary density evolves across a film.

words_ep1 <- unnest_tokens(tbl = ep1) |>
  mutate(minute = as.numeric(Timecode_in) / 60)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(words_ep1, aes(x = minute)) +
    geom_histogram(binwidth = 0.5, fill = "steelblue", colour = "white") +
    labs(
      title = "Word density over time",
      x     = "Time (minutes)",
      y     = "Word count"
    ) +
    theme_minimal()
}

Summary

| Task | Function |
|------|----------|
| Read a subtitle file | read_subtitles() |
| Parse in-memory text | as_subtitle() |
| Read a full season/series | read_subtitles_season() / read_subtitles_serie() / read_subtitles_multiseries() |
| Print a summary | get_subtitles_info() |
| Extract plain text | get_raw_text() |
| Remove HTML/ASS tags | clean_tags() |
| Remove closed captions | clean_captions() |
| Remove custom patterns | clean_patterns() |
| Merge subtitle objects | bind_subtitles() |
| Shift timecodes | move_subtitles() |
| Write to .srt | write_subtitles() |
| Tokenise (words, n-grams, …) | unnest_tokens() |




subtools documentation built on March 24, 2026, 5:07 p.m.