Hi! Here you will find some basic information to get started with
subtools. For more details, you can check the package documentation.
subtools is an R package to read, write and manipulate subtitles in R.
This allows the full range of tools offered by the R ecosystem to be
used for the analysis of subtitles. With version 1.0, subtools adopts
the main principles of the tidyverse and integrates directly with
tidytext for a tidy approach to subtitle text mining.
To install the package from GitHub, you can use devtools:
devtools::install_github("fkeck/subtools")
library(subtools)
library(tidytext)
The main goal of subtools is to provide a seamless way to import
subtitle files directly into R. This task can be performed with the
function read_subtitles():
rushmore_sub <- read_subtitles("ex_Rushmore.srt")
oss_sub <- read_subtitles("ex_OSS_117.srt")
rushmore_sub
#> # A tibble: 4 x 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 180 20'40.969" 20'48.269" Rushmore deserves an aquarium. A first cl…
#> 2 181 20'48.269" 20'50.870" - I don't know. What do you think, Ernie …
#> 3 182 20'50.946" 20'57.370" - What kind of fish? - Barracudas. Stingr…
#> 4 183 20'58.051" 21'01.770" - Piranhas? Really? - Yes, I'm talking to…
oss_sub
#> # A tibble: 3 x 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 264 20'22.967" 20'27.427" Si vous voulez. Ça sera surtout l'occasio…
#> 2 265 20'30.347" 20'32.297" Et non pas le gratin de pommes de terre.
#> 3 266 20'35.587" 20'37.697" Parce que ça ressemble à carotte, cairote.
The function read_subtitles() returns an object of class subtitles.
This is a simple tibble with at least four columns (“ID”,
“Timecode_in”, “Timecode_out” and “Text_content”).
Metadata are handled by adding extra columns, which can then be used
during the analysis. You can add metadata by adding columns manually
(e.g. using mutate()). You can also provide a 1-row data.frame of
metadata to the function read_subtitles().
bb_meta <- data.frame(Name = "Breaking Bad", Season = 1, Episode = 1)
bb_sub <- read_subtitles("ex_Breaking_Bad.srt", metadata = bb_meta)
bb_sub
#> # A tibble: 5 x 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <fct> <dbl> <dbl>
#> 1 5 01'09.236" 01'12.780" Oh, my God. Christ! Break… 1 1
#> 2 6 01'15.993" 01'18.661" Shit. Break… 1 1
#> 3 7 01'18.829" 01'21.205" [SIRENS WAILING IN … Break… 1 1
#> 4 8 01'24.918" 01'27.378" Oh, God. Oh, my God. Break… 1 1
#> 5 9 01'27.546" 01'30.840" Oh, my God. Oh, my … Break… 1 1
If you want to analyze subtitles of series with different seasons and
episodes, you will have to import many files at once. The
read_subtitles_season(), read_subtitles_serie() and
read_subtitles_multiseries() functions can make your life much easier,
by making it possible to automatically import files and extract metadata
from a structured directory. You can check the manual for more details.
Finally, if you have a collection of movies in .mkv format, you can
extract the subtitle tracks of MKV files with read_subtitles_mkv().
Often, the workflow begins with a cleaning step to get rid of irrelevant
information that might be present in text content. Three functions can
be used for this task. First, clean_tags() cleans formatting tags. By
default, this function is automatically executed by the
read_subtitles*() functions, so you probably don’t need to run it
again. Second, clean_captions() can be used to suppress closed
captions, i.e. descriptions of non-speech elements in parentheses or
square brackets. Finally, clean_patterns() is a more general function
to clean subtitles based on regex pattern matching.
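To make the idea of caption removal concrete, here is a minimal base-R sketch of the kind of pattern matching involved. It is only an approximation for illustration: the sample lines are invented, the regular expression is an assumption, and this is not the code used by clean_captions() or clean_patterns().

```r
# Illustrative only: invented sample lines, not taken from a real subtitle file.
text <- c("Oh, my God. Christ!",
          "[DOOR SLAMS]",
          "(sighs) Oh, God.")

# Remove anything enclosed in square brackets or in parentheses.
# The pattern is an assumption made for this sketch.
no_captions <- gsub("\\[[^]]*\\]|\\([^)]*\\)", "", text)

# Trim leftover whitespace and drop lines that are now empty,
# mirroring how a caption-only subtitle line disappears after cleaning.
no_captions <- trimws(no_captions)
no_captions <- no_captions[nzchar(no_captions)]

no_captions
```

With the package itself, the same result is obtained by calling clean_captions() on a subtitles object, as in the example that follows.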
bb_sub
#> # A tibble: 5 x 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <fct> <dbl> <dbl>
#> 1 5 01'09.236" 01'12.780" Oh, my God. Christ! Break… 1 1
#> 2 6 01'15.993" 01'18.661" Shit. Break… 1 1
#> 3 7 01'18.829" 01'21.205" [SIRENS WAILING IN … Break… 1 1
#> 4 8 01'24.918" 01'27.378" Oh, God. Oh, my God. Break… 1 1
#> 5 9 01'27.546" 01'30.840" Oh, my God. Oh, my … Break… 1 1
bb_sub_clean <- clean_captions(bb_sub)
bb_sub_clean
#> # A tibble: 4 x 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <fct> <dbl> <dbl>
#> 1 5 01'09.236" 01'12.780" Oh, my God. Christ! Break… 1 1
#> 2 6 01'15.993" 01'18.661" Shit. Break… 1 1
#> 3 8 01'24.918" 01'27.378" Oh, God. Oh, my God. Break… 1 1
#> 4 9 01'27.546" 01'30.840" Oh, my God. Oh, my … Break… 1 1
Sometimes you will need to bind several subtitle objects together. This
can be achieved with the function bind_subtitles(). This function is
very similar to bind_rows() from dplyr (they both bind rows of
tibbles), but bind_subtitles() can also recalculate timecodes so that
they follow the concatenation order (this can be disabled by setting
sequential to FALSE).
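The sequential recalculation can be pictured as shifting each object by the end time of everything bound before it. This reading is inferred from the example output in this document (the first OSS 117 line moves from 20'22.967" to 41'24.737", i.e. by the final Rushmore timecode of 21'01.770"), not from the package source. The sketch below uses plain seconds in a data.frame rather than real subtitles objects, and shift_sequential() is a hypothetical helper.

```r
# Illustrative only: timecodes are plain seconds, not the <time> columns
# of a subtitles object. shift_sequential() is a hypothetical helper.
rushmore <- data.frame(t_in  = c(1240.969, 1258.051),   # 20'40.969", 20'58.051"
                       t_out = c(1248.269, 1261.770))   # 20'48.269", 21'01.770"
oss      <- data.frame(t_in  = 1222.967,                # 20'22.967"
                       t_out = 1227.427)                # 20'27.427"

shift_sequential <- function(subs) {
  offset <- 0
  for (i in seq_along(subs)) {
    # Shift the current object by the end of everything bound so far.
    subs[[i]]$t_in  <- subs[[i]]$t_in  + offset
    subs[[i]]$t_out <- subs[[i]]$t_out + offset
    # The next object starts counting from the latest end time.
    offset <- max(subs[[i]]$t_out)
  }
  do.call(rbind, subs)
}

bound <- shift_sequential(list(rushmore, oss))
bound
```

The third row of bound starts at 2484.737 seconds, which is 41'24.737".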
bind_subtitles(rushmore_sub, oss_sub, bb_sub_clean)
#> # A tibble: 11 x 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <fct> <dbl> <dbl>
#> 1 180 20'40.969" 20'48.269" Rushmore deserves a… <NA> NA NA
#> 2 181 20'48.269" 20'50.870" - I don't know. Wha… <NA> NA NA
#> 3 182 20'50.946" 20'57.370" - What kind of fish… <NA> NA NA
#> 4 183 20'58.051" 21'01.770" - Piranhas? Really?… <NA> NA NA
#> 5 264 41'24.737" 41'29.197" Si vous voulez. Ça … <NA> NA NA
#> 6 265 41'32.117" 41'34.067" Et non pas le grati… <NA> NA NA
#> 7 266 41'37.357" 41'39.467" Parce que ça ressem… <NA> NA NA
#> 8 5 42'48.703" 42'52.247" Oh, my God. Christ! Brea… 1 1
#> 9 6 42'55.460" 42'58.128" Shit. Brea… 1 1
#> 10 8 43'04.385" 43'06.845" Oh, God. Oh, my God. Brea… 1 1
#> 11 9 43'07.013" 43'10.307" Oh, my God. Oh, my … Brea… 1 1
Under certain conditions, some functions can also return a list of
subtitle objects (class multisubtitles). The function bind_subtitles()
can also be used on such an object to bind all of its elements into a
new subtitle object, i.e. something similar to do.call(rbind, x).
multi_sub <- bind_subtitles(rushmore_sub, bb_sub_clean, collapse = FALSE, sequential = FALSE)
multi_sub
#> A multisubtitles object with 2 elements
#> subtitles object [[1]]
#> # A tibble: 4 x 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 180 20'40.969" 20'48.269" Rushmore deserves an aquarium. A first cl…
#> 2 181 20'48.269" 20'50.870" - I don't know. What do you think, Ernie …
#> 3 182 20'50.946" 20'57.370" - What kind of fish? - Barracudas. Stingr…
#> 4 183 20'58.051" 21'01.770" - Piranhas? Really? - Yes, I'm talking to…
#>
#>
#> subtitles object [[2]]
#> # A tibble: 4 x 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <fct> <dbl> <dbl>
#> 1 5 01'09.236" 01'12.780" Oh, my God. Christ! Break… 1 1
#> 2 6 01'15.993" 01'18.661" Shit. Break… 1 1
#> 3 8 01'24.918" 01'27.378" Oh, God. Oh, my God. Break… 1 1
#> 4 9 01'27.546" 01'30.840" Oh, my God. Oh, my … Break… 1 1
bind_subtitles(multi_sub)
#> # A tibble: 8 x 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <fct> <dbl> <dbl>
#> 1 180 20'40.969" 20'48.269" Rushmore deserves an… <NA> NA NA
#> 2 181 20'48.269" 20'50.870" - I don't know. What… <NA> NA NA
#> 3 182 20'50.946" 20'57.370" - What kind of fish?… <NA> NA NA
#> 4 183 20'58.051" 21'01.770" - Piranhas? Really? … <NA> NA NA
#> 5 5 22'11.006" 22'14.550" Oh, my God. Christ! Brea… 1 1
#> 6 6 22'17.763" 22'20.431" Shit. Brea… 1 1
#> 7 8 22'26.688" 22'29.148" Oh, God. Oh, my God. Brea… 1 1
#> 8 9 22'29.316" 22'32.610" Oh, my God. Oh, my G… Brea… 1 1
The tidy text format as defined by Julia Silge and David Robinson is a
table with one-token-per-row, a token being a meaningful unit of text,
such as a word or a sentence. The objects returned by read_subtitles*()
are in some ways already tidy (each row being a subtitle block
associated with a timecode). However, this unit is not always the most
relevant for data analysis. To perform tokenization, the tidytext
package provides the generic function unnest_tokens(). The package
subtools adds a new method to unnest_tokens() to handle subtitles
objects. The main difference with the data.frame method is the
possibility to perform timecode remapping according to the tokenization
process.
rushmore_sub
#> # A tibble: 4 x 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 180 20'40.969" 20'48.269" Rushmore deserves an aquarium. A first cl…
#> 2 181 20'48.269" 20'50.870" - I don't know. What do you think, Ernie …
#> 3 182 20'50.946" 20'57.370" - What kind of fish? - Barracudas. Stingr…
#> 4 183 20'58.051" 21'01.770" - Piranhas? Really? - Yes, I'm talking to…
unnest_tokens(rushmore_sub)
#> # A tibble: 49 x 4
#> ID Timecode_in Timecode_out Text_content
#> <chr> <time> <time> <chr>
#> 1 180 20'40.9700" 20'41.4858" rushmore
#> 2 180 20'41.4868" 20'42.0026" deserves
#> 3 180 20'42.0036" 20'42.1318" an
#> 4 180 20'42.1328" 20'42.6486" aquarium
#> 5 180 20'42.6496" 20'42.7132" a
#> 6 180 20'42.7142" 20'43.0363" first
#> 7 180 20'43.0373" 20'43.3593" class
#> 8 180 20'43.3603" 20'43.8761" aquarium
#> 9 180 20'43.8771" 20'44.1991" where
#> 10 180 20'44.2001" 20'44.8451" scientists
#> # … with 39 more rows
unnest_tokens(bb_sub_clean, token = "sentences")
#> # A tibble: 8 x 7
#> ID Timecode_in Timecode_out Text_content Name Season Episode
#> <chr> <time> <time> <chr> <fct> <dbl> <dbl>
#> 1 5 01'09.2370" 01'11.4018" oh, my god. Breaking… 1 1
#> 2 5 01'11.4028" 01'12.7800" christ! Breaking… 1 1
#> 3 6 01'15.9940" 01'18.6610" shit. Breaking… 1 1
#> 4 8 01'24.9190" 01'25.9538" oh, god. Breaking… 1 1
#> 5 8 01'25.9548" 01'27.3780" oh, my god. Breaking… 1 1
#> 6 9 01'27.5470" 01'28.4087" oh, my god. Breaking… 1 1
#> 7 9 01'28.4097" 01'29.2714" oh, my god. Breaking… 1 1
#> 8 9 01'29.2724" 01'30.8400" think, think, th… Breaking… 1 1
Note that unlike the data.frame method, the input and output arguments
are optional. This is because here the Text_content column can be
assumed to be the column of interest.
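One way to picture timecode remapping is to split the interval of a subtitle block among its tokens in proportion to their length in characters. The timecodes printed above appear consistent with such a scheme (apart from a small gap between tokens), but this is an assumption made for illustration, not a description of the package internals. split_interval() is a hypothetical helper and the timings are made up.

```r
# Illustrative only: split_interval() is a hypothetical helper, not part of
# subtools, and the proportional-to-characters rule is an assumption.
split_interval <- function(tokens, t_in, t_out) {
  # Share of the block duration given to each token, by character count.
  weights <- nchar(tokens) / sum(nchar(tokens))
  ends    <- t_in + cumsum(weights) * (t_out - t_in)
  # Each token starts where the previous one ends.
  starts  <- c(t_in, head(ends, -1))
  data.frame(token = tokens, t_in = starts, t_out = ends)
}

# A made-up block lasting 7 seconds, tokenized into lowercase words.
tokens   <- strsplit(tolower("Oh my God"), " ", fixed = TRUE)[[1]]
remapped <- split_interval(tokens, t_in = 0, t_out = 7)
remapped
```

Here "oh" and "my" each receive 2 of the 7 seconds and "god" receives the remaining 3.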
Once your data are ready, you can analyze them. I recommend having a look at Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. It is a great place to get started with text mining in R.
A list of cool projects using subtools.
Note that these projects used the 0.x branch of subtools. The API is
totally different in subtools 1.0.
You beautiful, naïve, sophisticated newborn series by ma_salmon
A tidy text analysis of Rick and Morty by tudosgar
Rick and Morty and Tidy Data Principles (part 1) (part 2) (part 3) by pachamaltese
Term Frequencies by Season by tdawry