View source: R/unnest_tokens.R

Description
This function extends unnest_tokens to subtitles objects. The main difference from the data.frame method is that it can remap timecodes according to how the input column is split.
Usage
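The signature below is a reconstructed sketch, not the package's verbatim usage block: the argument order follows the Arguments list, the defaults for output, input and time.remapping are assumptions inferred from the Examples, and the remaining defaults are those of the tidytext generic.

## S3 method for class 'subtitles' (reconstructed sketch; defaults partly assumed)
unnest_tokens(tbl, output = "Word", input = "Text_content", token = "words",
  format = c("text", "man", "latex", "html", "xml"), time.remapping = TRUE,
  to_lower = TRUE, drop = TRUE, collapse = NULL, ...)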
Arguments

tbl
A data frame.

output
Output column to be created as string or symbol.

input
Input column that gets split as string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols (see the sketch after this argument list).

token
Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLs), and "ptb" (Penn Treebank). If a function, it should take a character vector and return a list of character vectors of the same length (also shown in that sketch).
format
Either "text", "man", "latex", "html", or "xml". If not "text", this uses the hunspell tokenizer and can only tokenize by "word".

time.remapping
A logical. If TRUE, timecodes are remapped according to the split of the input column (see the sketch after the Examples).
to_lower
Whether to convert tokens to lowercase. If tokens include URLs (such as with token = "tweets"), such converted URLs may no longer be correct.
drop
Whether original input column should get dropped. Ignored if the original input and new output column have the same name.

collapse
Whether to combine text with newlines first in case tokens (such as sentences or paragraphs) span multiple lines. If NULL, collapses when token method is "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex".
...
Extra arguments passed on to tokenizers, such as n and k for "ngrams" and "skip_ngrams".
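A minimal sketch of column selection and custom tokenizing, reusing the example file shipped with subtools (see Examples below); split_on_space is a purely illustrative helper, not part of the package.

library(subtools)
library(tidytext)

s <- read_subtitles(system.file("extdata", "ex_webvtt.vtt", package = "subtools"))

# Columns passed as bare symbols
unnest_tokens(s, Word, Text_content)

# The same call with strings unquoted via !! (quasiquotation)
out_col <- "Word"
in_col <- "Text_content"
unnest_tokens(s, !!out_col, !!in_col)

# A custom tokenizing function: takes a character vector and returns a
# list of character vectors of the same length (here, splitting on whitespace)
split_on_space <- function(x) strsplit(x, "\\s+")
unnest_tokens(s, Word, Text_content, token = split_on_space)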
Value

A tibble.
Examples

f <- system.file("extdata", "ex_webvtt.vtt", package = "subtools")
s <- read_subtitles(f, metadata = data.frame(test = "Test"))
require(tidytext)
unnest_tokens(s)
unnest_tokens(s, Word, Text_content, drop = FALSE)
unnest_tokens(s, Word, Text_content, token = "lines")
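A further sketch, assuming (as the Arguments section indicates) that time.remapping is a logical switch and that turning it off keeps the timecodes of the original cues; it reuses s from the examples above.

# Remapped timecodes (assumed default behaviour)
unnest_tokens(s, Word, Text_content, time.remapping = TRUE)

# Assumption: with time.remapping = FALSE each token keeps its cue's timecodes
unnest_tokens(s, Word, Text_content, time.remapping = FALSE)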