View source: R/unnest_tokens.R

Description
This function extends unnest_tokens to subtitles objects. The main difference from the data.frame method is that it can remap timecodes according to how the input column is split.
Usage
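The signature below is a reconstructed sketch, not the package's verbatim usage block: the argument order follows the Arguments list, the defaults for output, input and time.remapping are assumptions inferred from the Examples, and the remaining defaults are those of the tidytext generic.

## S3 method for class 'subtitles' (reconstructed sketch; defaults partly assumed)
unnest_tokens(tbl, output = "Word", input = "Text_content", token = "words",
  format = c("text", "man", "latex", "html", "xml"), time.remapping = TRUE,
  to_lower = TRUE, drop = TRUE, collapse = NULL, ...)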
Arguments

tbl
A data frame.

output
Output column to be created as string or symbol.

input
Input column that gets split as string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols (see the sketch after this argument list).

token
Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLs), and "ptb" (Penn Treebank). If a function, it should take a character vector and return a list of character vectors of the same length (also shown in that sketch).
format
Either "text", "man", "latex", "html", or "xml". If not "text", this uses the hunspell tokenizer and can only tokenize by "word".

time.remapping
A logical. If TRUE, timecodes are remapped according to the split of the input column (see the sketch after the Examples).
to_lower
Whether to convert tokens to lowercase. If tokens include URLs (such as with token = "tweets"), such converted URLs may no longer be correct.
drop
Whether original input column should get dropped. Ignored if the original input and new output column have the same name.

collapse
Whether to combine text with newlines first in case tokens (such as sentences or paragraphs) span multiple lines. If NULL, collapses when token method is "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex".
...
Extra arguments passed on to tokenizers, such as n and k for "ngrams" and "skip_ngrams".
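A minimal sketch of column selection and custom tokenizing, reusing the example file shipped with subtools (see Examples below); split_on_space is a purely illustrative helper, not part of the package.

library(subtools)
library(tidytext)

s <- read_subtitles(system.file("extdata", "ex_webvtt.vtt", package = "subtools"))

# Columns passed as bare symbols
unnest_tokens(s, Word, Text_content)

# The same call with strings unquoted via !! (quasiquotation)
out_col <- "Word"
in_col <- "Text_content"
unnest_tokens(s, !!out_col, !!in_col)

# A custom tokenizing function: takes a character vector and returns a
# list of character vectors of the same length (here, splitting on whitespace)
split_on_space <- function(x) strsplit(x, "\\s+")
unnest_tokens(s, Word, Text_content, token = split_on_space)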
Value

A tibble.
Examples

f <- system.file("extdata", "ex_webvtt.vtt", package = "subtools")
s <- read_subtitles(f, metadata = data.frame(test = "Test"))
require(tidytext)
unnest_tokens(s)
unnest_tokens(s, Word, Text_content, drop = FALSE)
unnest_tokens(s, Word, Text_content, token = "lines")
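A further sketch, assuming (as the Arguments section indicates) that time.remapping is a logical switch and that turning it off keeps the timecodes of the original cues; it reuses s from the examples above.

# Remapped timecodes (assumed default behaviour)
unnest_tokens(s, Word, Text_content, time.remapping = TRUE)

# Assumption: with time.remapping = FALSE each token keeps its cue's timecodes
unnest_tokens(s, Word, Text_content, time.remapping = FALSE)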