```r
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)
library(subtools)
```
subtools reads and manipulates video subtitle files from a variety of formats
(SubRip .srt, WebVTT .vtt, SubStation Alpha .ass/.ssa, SubViewer .sub,
MicroDVD .sub) and exposes them as tidy tibbles ready for text analysis.
This vignette walks through reading subtitle files, cleaning their text, combining multiple files, shifting and writing timecodes, and tokenising subtitles for analysis with tidytext.

read_subtitles() is the main entry point. It auto-detects the file format from
the extension and returns a subtitles object — a tibble with four core
columns: ID, Timecode_in, Timecode_out, and Text_content.
```r
f_srt <- system.file("extdata", "ex_subrip.srt", package = "subtools")
subs <- read_subtitles(file = f_srt)
subs
```
The same call works for every supported format. Use format = "auto" (default)
or supply the format explicitly.
```r
f_vtt <- system.file("extdata", "ex_webvtt.vtt", package = "subtools")
read_subtitles(file = f_vtt, format = "webvtt")
```
```r
f_ass <- system.file("extdata", "ex_substation.ass", package = "subtools")
read_subtitles(file = f_ass, format = "substation")
```
Any descriptive information — season, episode, source, language — can be
attached as a one-row tibble via the metadata argument. The values are
repeated for every subtitle line, keeping the tidy structure intact.
```r
subs_meta <- read_subtitles(
  file = f_srt,
  metadata = tibble::tibble(Season = 1L, Episode = 3L, Language = "en")
)
subs_meta
```
Metadata columns travel with the object through all subtools operations.
as_subtitle() parses an in-memory character vector, which is useful when the
subtitle text is already loaded or generated programmatically.
```r
raw <- c(
  "1",
  "00:00:01,000 --> 00:00:03,500",
  "Hello, world.",
  "",
  "2",
  "00:00:04,000 --> 00:00:06,000",
  "This is subtools."
)
as_subtitle(x = raw, format = "srt")
```
get_subtitles_info() prints a compact summary: line count, overall duration,
and attached metadata fields.
```r
s <- read_subtitles(
  file = system.file("extdata", "ex_subrip.srt", package = "subtools")
)
get_subtitles_info(x = s)
```
get_raw_text() collapses all subtitle lines into a single character string,
which is useful for passing the whole transcript to external natural language processing (NLP) tools.
```r
transcript <- get_raw_text(x = s)
transcript

# One line per subtitle, separated by newlines
cat(get_raw_text(x = s, collapse = "\n"))
```
Because a subtitles object is a tibble, all dplyr verbs work directly:
```r
library(dplyr)

# Lines spoken after the first 30 seconds
s |>
  filter(Timecode_in > hms::as_hms("00:00:30"))

# Duration of each subtitle cue (in seconds)
s |>
  mutate(duration_s = as.numeric(Timecode_out - Timecode_in)) |>
  select(ID, Text_content, duration_s)
```
Subtitle files frequently contain formatting tags, closed-caption descriptions, and other non-speech artefacts that should be removed before text analysis.
clean_tags() strips HTML-style tags (used in SRT and WebVTT) and curly-brace
override blocks (used in SubStation Alpha).
```r
tagged <- as_subtitle(
  x = c(
    "1",
    "00:00:01,000 --> 00:00:03,000",
    "<i>This is <b>important</b>.</i>",
    "",
    "2",
    "00:00:04,000 --> 00:00:06,000",
    "<font color=\"red\">Warning!</font>"
  ),
  format = "srt",
  clean.tags = FALSE  # keep tags so we can demonstrate cleaning
)
tagged$Text_content
clean_tags(x = tagged)$Text_content
```
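The kind of substitution clean_tags() performs can be sketched with two regular expressions in base R. This is an illustrative sketch only, not the package's actual implementation:

```r
# Strip HTML-style tags (SRT/WebVTT) and curly-brace override blocks
# (SubStation Alpha). Illustrative only; clean_tags() is the supported API.
strip_tags <- function(x) {
  x <- gsub("<[^>]*>", "", x)      # HTML-style tags such as <i> or <font ...>
  x <- gsub("\\{[^}]*\\}", "", x)  # SSA overrides such as {\an8}
  x
}

strip_tags("<i>This is <b>important</b>.</i>")
#> [1] "This is important."
```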
clean_captions() removes text enclosed in parentheses or square brackets —
typically sound descriptions and speaker identifiers used in accessibility
captions.
```r
bb <- read_subtitles(
  file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
  clean.tags = FALSE
)
bb$Text_content
clean_captions(x = bb)$Text_content
```
clean_patterns() accepts any regular expression, giving full flexibility for
project-specific cleaning.
```r
# Remove speaker labels such as "WALTER:" or "JESSE:"
s_labeled <- as_subtitle(
  x = c(
    "1",
    "00:00:01,000 --> 00:00:03,000",
    "WALTER: We need to cook.",
    "",
    "2",
    "00:00:04,000 --> 00:00:06,000",
    "JESSE: Yeah, Mr. White!"
  ),
  format = "srt",
  clean.tags = FALSE
)
clean_patterns(x = s_labeled, pattern = "^[A-Z]+: ")$Text_content
```
Because each cleaning function returns a subtitles object, steps can be piped:
```r
s_clean <- read_subtitles(file = f_srt, clean.tags = FALSE) |>
  clean_tags() |>
  clean_captions() |>
  clean_patterns(pattern = "^-\\s*")  # remove leading dialogue dashes
s_clean$Text_content
```
bind_subtitles() merges any number of subtitles (or multisubtitles)
objects. With collapse = TRUE (default), timecodes are shifted so that each
file follows the previous one sequentially.
```r
s1 <- read_subtitles(
  file = system.file("extdata", "ex_subrip.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 1L)
)
s2 <- read_subtitles(
  file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 2L)
)
combined <- bind_subtitles(s1, s2)
nrow(combined)
range(combined$Timecode_in)
```
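The sequential shift performed when collapsing can be sketched with plain numeric timecodes in seconds. This mimics the idea only; bind_subtitles() operates on hms timecodes and may differ in detail:

```r
# Shift the second file's cue times so it starts where the first file ends.
# Illustrative sketch with numeric seconds, not the package internals.
ep1_out <- c(3.5, 6.0)   # Timecode_out values of episode 1 cues
ep2_in  <- c(1.0, 4.0)   # Timecode_in values of episode 2 cues

offset <- max(ep1_out)   # episode 2 begins after episode 1 ends
ep2_in_shifted <- ep2_in + offset
ep2_in_shifted
#> [1]  7 10
```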
Set collapse = FALSE to get a multisubtitles object — a named list of
subtitles — when you want to process episodes independently before merging.
```r
multi <- bind_subtitles(s1, s2, collapse = FALSE)
class(multi)
print(multi)
```
get_subtitles_info() also works on multisubtitles:
```r
get_subtitles_info(x = multi)
```
For TV series organised in a standard directory tree, subtools provides
convenience readers that handle the hierarchy automatically and extract
Season/Episode metadata from folder and file names.
```
Series_Collection/
|-- BreakingBad/
|   |-- Season_01/
|   |   |-- S01E01.srt
|   |   |-- S01E02.srt
|   |-- Season_02/
|       |-- S02E01.srt
```
```r
# Read a single season
season1 <- read_subtitles_season(dir = "BreakingBad/Season_01/")

# Read an entire series (all seasons)
bb_all <- read_subtitles_serie(dir = "BreakingBad/")

# Read multiple series at once
collection <- read_subtitles_multiseries(dir = "Series_Collection/")
```
Each function returns a single collapsed subtitles object by default
(bind = TRUE), with Serie, Season, and Episode columns populated from
the directory structure. Pass bind = FALSE to get a multisubtitles list
instead.
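The "SxxEyy" naming convention used above can be parsed with a simple regular expression. This sketch is illustrative only; the readers perform their own extraction internally:

```r
# Extract season and episode numbers from an "S01E02"-style filename.
# Illustrative base-R sketch, not the package's actual parsing code.
fname <- "S01E02.srt"
m <- regmatches(fname, regexec("S(\\d{2})E(\\d{2})", fname))[[1]]
season  <- as.integer(m[2])
episode <- as.integer(m[3])
c(season = season, episode = episode)
```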
move_subtitles() shifts all timecodes by a fixed number of seconds. Positive
values shift forward; negative values shift backward. This is useful when the
subtitle file is out of sync with the video.
```r
subs_shifted <- move_subtitles(x = subs, lag = 2.5)

# Compare first cue before and after
subs$Timecode_in[1]
subs_shifted$Timecode_in[1]
```
move_subtitles() also works on multisubtitles:
```r
multi_shifted <- move_subtitles(x = multi, lag = -1.0)
multi_shifted[[1]]$Timecode_in[1]
```
write_subtitles() serialises a subtitles object to a SubRip .srt file.
```r
write_subtitles(x = subs_shifted, file = "synced_episode.srt")
```
unnest_tokens() extends tidytext::unnest_tokens() with subtitle-aware
timecode remapping: each token inherits a proportional slice of the original
cue's time window, enabling timeline-based analyses.
```r
words <- unnest_tokens(tbl = subs)
words
```
The Timecode_in / Timecode_out columns now reflect the estimated position
of each word within its cue.
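The proportional remapping can be sketched in base R: an n-token cue is divided into n equal time slices. This is an illustration of the idea; subtools' exact remapping rule may differ in detail:

```r
# Split a cue's [t_in, t_out] window into equal slices, one per token.
# Illustrative sketch of proportional timecode remapping.
slice_cue <- function(t_in, t_out, n_tokens) {
  edges <- seq(t_in, t_out, length.out = n_tokens + 1)
  data.frame(
    token_in  = edges[-(n_tokens + 1)],  # slice starts
    token_out = edges[-1]                # slice ends
  )
}

# A 2.5-second cue containing five words: each word gets a 0.5 s slice
slice_cue(t_in = 1.0, t_out = 3.5, n_tokens = 5)
```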
```r
# Bigrams
bigrams <- unnest_tokens(
  tbl = subs,
  output = Word,
  input = Text_content,
  token = "ngrams",
  n = 2
)
bigrams$Word
```
```r
library(dplyr)
words |>
  count(Text_content, sort = TRUE) |>
  head(10)
```
The metadata columns added at read time make it straightforward to compare
episodes or seasons. The example below simulates a two-episode corpus and
computes per-episode word counts — a pattern that scales directly to a full
series loaded with read_subtitles_serie().
```r
ep1 <- read_subtitles(
  file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 1L)
)
ep2 <- read_subtitles(
  file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 2L)
)
ep3 <- read_subtitles(
  file = system.file("extdata", "ex_webvtt.vtt", package = "subtools"),
  metadata = tibble::tibble(Episode = 3L)
)
corpus <- bind_subtitles(ep1, ep2, ep3)

token_counts <- unnest_tokens(corpus) |>
  count(Episode, Text_content, sort = TRUE)
token_counts |>
  slice_max(n, n = 5, by = Episode)
```
TF-IDF highlights words that are distinctive to each episode compared with the rest of the corpus.
```r
token_counts |>
  tidytext::bind_tf_idf(Text_content, Episode, n) |>
  arrange(Episode, desc(tf_idf)) |>
  slice_max(tf_idf, n = 5, by = Episode)
```
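To see what the scores mean, tf-idf can be computed by hand on a toy two-document corpus, using the usual definitions (tf = count / document total; idf = natural log of the number of documents over the number of documents containing the term). The words and counts below are made up for illustration:

```r
# Hand-computed tf-idf for a toy two-document corpus (invented counts).
counts <- data.frame(
  doc  = c(1, 1, 2, 2),
  word = c("cook", "yeah", "cook", "school"),
  n    = c(3, 1, 1, 2)
)

doc_totals <- tapply(counts$n, counts$doc, sum)  # words per document
counts$tf  <- as.numeric(counts$n / doc_totals[as.character(counts$doc)])

docs_with  <- table(counts$word)                 # documents containing each word
counts$idf <- log(2 / as.numeric(docs_with[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
counts
```

"cook" appears in both documents, so its idf (and tf-idf) is zero; words unique to one document get positive scores, which is exactly what makes them "distinctive".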
Because timecodes are preserved through unnest_tokens(), words can be plotted
along a timeline, e.g. to visualise how vocabulary density evolves across a
film.
```r
words_ep1 <- unnest_tokens(tbl = ep1) |>
  mutate(minute = as.numeric(Timecode_in) / 60)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(words_ep1, aes(x = minute)) +
    geom_histogram(binwidth = 0.5, fill = "steelblue", colour = "white") +
    labs(
      title = "Word density over time",
      x = "Time (minutes)",
      y = "Word count"
    ) +
    theme_minimal()
}
```
| Task | Function |
|------|----------|
| Read a subtitle file | read_subtitles() |
| Parse in-memory text | as_subtitle() |
| Read a full season/series | read_subtitles_season() / read_subtitles_serie() / read_subtitles_multiseries() |
| Print a summary | get_subtitles_info() |
| Extract plain text | get_raw_text() |
| Remove HTML/ASS tags | clean_tags() |
| Remove caption annotations | clean_captions() |
| Remove custom patterns | clean_patterns() |
| Merge subtitle objects | bind_subtitles() |
| Shift timecodes | move_subtitles() |
| Write to .srt | write_subtitles() |
| Tokenise (words, n-grams, …) | unnest_tokens() |