```r
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)
library(subtools)
```
subtools reads and manipulates video subtitle files from a variety of formats
(SubRip .srt, WebVTT .vtt, SubStation Alpha .ass/.ssa, SubViewer .sub,
MicroDVD .sub) and exposes them as tidy tibbles ready for text analysis.
This vignette walks through reading subtitle files, cleaning their text, combining multiple files, shifting and writing timecodes, and tokenising subtitles for analysis with tidytext.

read_subtitles() is the main entry point. It auto-detects the file format from
the extension and returns a subtitles object — a tibble with four core
columns: ID, Timecode_in, Timecode_out, and Text_content.
```r
f_srt <- system.file("extdata", "ex_subrip.srt", package = "subtools")
subs <- read_subtitles(file = f_srt)
subs
```
The same call works for every supported format. Use format = "auto" (default)
or supply the format explicitly.
```r
f_vtt <- system.file("extdata", "ex_webvtt.vtt", package = "subtools")
read_subtitles(file = f_vtt, format = "webvtt")
```
```r
f_ass <- system.file("extdata", "ex_substation.ass", package = "subtools")
read_subtitles(file = f_ass, format = "substation")
```
Any descriptive information — season, episode, source, language — can be
attached as a one-row tibble via the metadata argument. The values are
repeated for every subtitle line, keeping the tidy structure intact.
```r
subs_meta <- read_subtitles(
  file = f_srt,
  metadata = tibble::tibble(Season = 1L, Episode = 3L, Language = "en")
)
subs_meta
```
Metadata columns travel with the object through all subtools operations.
as_subtitle() parses an in-memory character vector, which is useful when the
subtitle text is already loaded or generated programmatically.
```r
raw <- c(
  "1",
  "00:00:01,000 --> 00:00:03,500",
  "Hello, world.",
  "",
  "2",
  "00:00:04,000 --> 00:00:06,000",
  "This is subtools."
)
as_subtitle(x = raw, format = "srt")
```
get_subtitles_info() prints a compact summary: line count, overall duration,
and attached metadata fields.
```r
s <- read_subtitles(
  file = system.file("extdata", "ex_subrip.srt", package = "subtools")
)
get_subtitles_info(x = s)
```
get_raw_text() collapses all subtitle lines into a single character string,
which is useful for passing the whole transcript to external natural language processing (NLP) tools.
```r
transcript <- get_raw_text(x = s)
transcript

# One line per subtitle, separated by newlines
cat(get_raw_text(x = s, collapse = "\n"))
```
Because a subtitles object is a tibble, all dplyr verbs work directly:
```r
library(dplyr)

# Lines spoken after the first 30 seconds
s |>
  filter(Timecode_in > hms::as_hms("00:00:30"))

# Duration of each subtitle cue (in seconds)
s |>
  mutate(duration_s = as.numeric(Timecode_out - Timecode_in)) |>
  select(ID, Text_content, duration_s)
```
Subtitle files frequently contain formatting tags, closed-caption descriptions, and other non-speech artefacts that should be removed before text analysis.
clean_tags() strips HTML-style tags (used in SRT and WebVTT) and curly-brace
override blocks (used in SubStation Alpha).
```r
tagged <- as_subtitle(
  x = c(
    "1",
    "00:00:01,000 --> 00:00:03,000",
    "<i>This is <b>important</b>.</i>",
    "",
    "2",
    "00:00:04,000 --> 00:00:06,000",
    "<font color=\"red\">Warning!</font>"
  ),
  format = "srt",
  clean.tags = FALSE  # keep tags so we can demonstrate cleaning
)
tagged$Text_content
clean_tags(x = tagged)$Text_content
```
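The kind of substitution clean_tags() performs can be sketched with two regular expressions in base R. This is an illustrative sketch only, not the package's actual implementation:

```r
# Strip HTML-style tags (SRT/WebVTT) and curly-brace override blocks
# (SubStation Alpha). Illustrative only; clean_tags() is the supported API.
strip_tags <- function(x) {
  x <- gsub("<[^>]*>", "", x)      # HTML-style tags such as <i> or <font ...>
  x <- gsub("\\{[^}]*\\}", "", x)  # SSA overrides such as {\an8}
  x
}

strip_tags("<i>This is <b>important</b>.</i>")
#> [1] "This is important."
```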
clean_captions() removes text enclosed in parentheses or square brackets —
typically sound descriptions and speaker identifiers used in accessibility
captions.
```r
bb <- read_subtitles(
  file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
  clean.tags = FALSE
)
bb$Text_content
clean_captions(x = bb)$Text_content
```
clean_patterns() accepts any regular expression, giving full flexibility for
project-specific cleaning.
```r
# Remove speaker labels such as "WALTER:" or "JESSE:"
s_labeled <- as_subtitle(
  x = c(
    "1",
    "00:00:01,000 --> 00:00:03,000",
    "WALTER: We need to cook.",
    "",
    "2",
    "00:00:04,000 --> 00:00:06,000",
    "JESSE: Yeah, Mr. White!"
  ),
  format = "srt",
  clean.tags = FALSE
)
clean_patterns(x = s_labeled, pattern = "^[A-Z]+: ")$Text_content
```
Because each cleaning function returns a subtitles object, steps can be piped:
```r
s_clean <- read_subtitles(file = f_srt, clean.tags = FALSE) |>
  clean_tags() |>
  clean_captions() |>
  clean_patterns(pattern = "^-\\s*")  # remove leading dialogue dashes
s_clean$Text_content
```
bind_subtitles() merges any number of subtitles (or multisubtitles)
objects. With collapse = TRUE (default), timecodes are shifted so that each
file follows the previous one sequentially.
```r
s1 <- read_subtitles(
  file = system.file("extdata", "ex_subrip.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 1L)
)
s2 <- read_subtitles(
  file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 2L)
)
combined <- bind_subtitles(s1, s2)
nrow(combined)
range(combined$Timecode_in)
```
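The sequential shift performed when collapsing can be sketched with plain numeric timecodes in seconds. This mimics the idea only; bind_subtitles() operates on hms timecodes and may differ in detail:

```r
# Shift the second file's cue times so it starts where the first file ends.
# Illustrative sketch with numeric seconds, not the package internals.
ep1_out <- c(3.5, 6.0)   # Timecode_out values of episode 1 cues
ep2_in  <- c(1.0, 4.0)   # Timecode_in values of episode 2 cues

offset <- max(ep1_out)   # episode 2 begins after episode 1 ends
ep2_in_shifted <- ep2_in + offset
ep2_in_shifted
#> [1]  7 10
```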
Set collapse = FALSE to get a multisubtitles object — a named list of
subtitles — when you want to process episodes independently before merging.
```r
multi <- bind_subtitles(s1, s2, collapse = FALSE)
class(multi)
print(multi)
```
get_subtitles_info() also works on multisubtitles:
```r
get_subtitles_info(x = multi)
```
For TV series organised in a standard directory tree, subtools provides
convenience readers that handle the hierarchy automatically and extract
Season/Episode metadata from folder and file names.
```
Series_Collection/
|-- BreakingBad/
|   |-- Season_01/
|   |   |-- S01E01.srt
|   |   |-- S01E02.srt
|   |-- Season_02/
|       |-- S02E01.srt
```
```r
# Read a single season
season1 <- read_subtitles_season(dir = "BreakingBad/Season_01/")

# Read an entire series (all seasons)
bb_all <- read_subtitles_serie(dir = "BreakingBad/")

# Read multiple series at once
collection <- read_subtitles_multiseries(dir = "Series_Collection/")
```
Each function returns a single collapsed subtitles object by default
(bind = TRUE), with Serie, Season, and Episode columns populated from
the directory structure. Pass bind = FALSE to get a multisubtitles list
instead.
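The "SxxEyy" naming convention used above can be parsed with a simple regular expression. This sketch is illustrative only; the readers perform their own extraction internally:

```r
# Extract season and episode numbers from an "S01E02"-style filename.
# Illustrative base-R sketch, not the package's actual parsing code.
fname <- "S01E02.srt"
m <- regmatches(fname, regexec("S(\\d{2})E(\\d{2})", fname))[[1]]
season  <- as.integer(m[2])
episode <- as.integer(m[3])
c(season = season, episode = episode)
```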
move_subtitles() shifts all timecodes by a fixed number of seconds. Positive
values shift forward; negative values shift backward. This is useful when the
subtitle file is out of sync with the video.
```r
subs_shifted <- move_subtitles(x = subs, lag = 2.5)

# Compare first cue before and after
subs$Timecode_in[1]
subs_shifted$Timecode_in[1]
```
move_subtitles() also works on multisubtitles:
```r
multi_shifted <- move_subtitles(x = multi, lag = -1.0)
multi_shifted[[1]]$Timecode_in[1]
```
write_subtitles() serialises a subtitles object to a SubRip .srt file.
```r
write_subtitles(x = subs_shifted, file = "synced_episode.srt")
```
unnest_tokens() extends tidytext::unnest_tokens() with subtitle-aware
timecode remapping: each token inherits a proportional slice of the original
cue's time window, enabling timeline-based analyses.
```r
words <- unnest_tokens(tbl = subs)
words
```
The Timecode_in / Timecode_out columns now reflect the estimated position
of each word within its cue.
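The proportional remapping can be sketched in base R: an n-token cue is divided into n equal time slices. This is an illustration of the idea; subtools' exact remapping rule may differ in detail:

```r
# Split a cue's [t_in, t_out] window into equal slices, one per token.
# Illustrative sketch of proportional timecode remapping.
slice_cue <- function(t_in, t_out, n_tokens) {
  edges <- seq(t_in, t_out, length.out = n_tokens + 1)
  data.frame(
    token_in  = edges[-(n_tokens + 1)],  # slice starts
    token_out = edges[-1]                # slice ends
  )
}

# A 2.5-second cue containing five words: each word gets a 0.5 s slice
slice_cue(t_in = 1.0, t_out = 3.5, n_tokens = 5)
```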
```r
# Bigrams
bigrams <- unnest_tokens(
  tbl = subs,
  output = Word,
  input = Text_content,
  token = "ngrams",
  n = 2
)
bigrams$Word
```
```r
library(dplyr)
words |>
  count(Text_content, sort = TRUE) |>
  head(10)
```
The metadata columns added at read time make it straightforward to compare
episodes or seasons. The example below simulates a two-episode corpus and
computes per-episode word counts — a pattern that scales directly to a full
series loaded with read_subtitles_serie().
```r
ep1 <- read_subtitles(
  file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 1L)
)
ep2 <- read_subtitles(
  file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 2L)
)
ep3 <- read_subtitles(
  file = system.file("extdata", "ex_webvtt.vtt", package = "subtools"),
  metadata = tibble::tibble(Episode = 3L)
)
corpus <- bind_subtitles(ep1, ep2, ep3)

token_counts <- unnest_tokens(corpus) |>
  count(Episode, Text_content, sort = TRUE)
token_counts |>
  slice_max(n, n = 5, by = Episode)
```
TF-IDF highlights words that are distinctive to each episode compared with the rest of the corpus.
```r
token_counts |>
  tidytext::bind_tf_idf(Text_content, Episode, n) |>
  arrange(Episode, desc(tf_idf)) |>
  slice_max(tf_idf, n = 5, by = Episode)
```
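To see what the scores mean, tf-idf can be computed by hand on a toy two-document corpus, using the usual definitions (tf = count / document total; idf = natural log of the number of documents over the number of documents containing the term). The words and counts below are made up for illustration:

```r
# Hand-computed tf-idf for a toy two-document corpus (invented counts).
counts <- data.frame(
  doc  = c(1, 1, 2, 2),
  word = c("cook", "yeah", "cook", "school"),
  n    = c(3, 1, 1, 2)
)

doc_totals <- tapply(counts$n, counts$doc, sum)  # words per document
counts$tf  <- as.numeric(counts$n / doc_totals[as.character(counts$doc)])

docs_with  <- table(counts$word)                 # documents containing each word
counts$idf <- log(2 / as.numeric(docs_with[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
counts
```

"cook" appears in both documents, so its idf (and tf-idf) is zero; words unique to one document get positive scores, which is exactly what makes them "distinctive".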
Because timecodes are preserved through unnest_tokens(), words can be plotted
along a timeline, e.g. to visualise how vocabulary density evolves across a
film.
```r
words_ep1 <- unnest_tokens(tbl = ep1) |>
  mutate(minute = as.numeric(Timecode_in) / 60)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(words_ep1, aes(x = minute)) +
    geom_histogram(binwidth = 0.5, fill = "steelblue", colour = "white") +
    labs(
      title = "Word density over time",
      x = "Time (minutes)",
      y = "Word count"
    ) +
    theme_minimal()
}
```
| Task | Function |
|------|----------|
| Read a subtitle file | read_subtitles() |
| Parse in-memory text | as_subtitle() |
| Read a full season/series | read_subtitles_season() / read_subtitles_serie() / read_subtitles_multiseries() |
| Print a summary | get_subtitles_info() |
| Extract plain text | get_raw_text() |
| Remove HTML/ASS tags | clean_tags() |
| Remove caption annotations | clean_captions() |
| Remove custom patterns | clean_patterns() |
| Merge subtitle objects | bind_subtitles() |
| Shift timecodes | move_subtitles() |
| Write to .srt | write_subtitles() |
| Tokenise (words, n-grams, …) | unnest_tokens() |