View source: R/unnest_tokens.R
Split a column into tokens using the tokenizers package, reshaping the table to one token per row. This function supports non-standard evaluation through the tidyeval framework.
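As a rough illustration of the tidyeval support, the sketch below keeps the column names in variables and unquotes them with `!!`. The data frame `df` and the variables `out_col` and `in_col` are made up for this example, and the call is only attempted if tidytext is installed.

```r
# Column names held in variables (hypothetical names for illustration)
out_col <- "word"
in_col <- "txt"

if (requireNamespace("tidytext", quietly = TRUE)) {
  # Small made-up data frame with a single text column
  df <- data.frame(txt = "Hello tidy world", stringsAsFactors = FALSE)

  # Unquote the strings so they are used as the output and input columns
  tokens <- tidytext::unnest_tokens(df, !!out_col, !!in_col)
  print(tokens)
}
```

Unquoting is mainly useful when the column names are only known at run time, for example inside a wrapper function.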
tbl | A data frame.
output | Output column to be created, as a string or symbol.
input | Input column that gets split, as a string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols.
token | Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLs), and "ptb" (Penn Treebank). If a function, it should take a character vector and return a list of character vectors of the same length.
format | Either "text", "man", "latex", "html", or "xml". If not "text", this uses the hunspell tokenizer and can tokenize only by "word".
to_lower | Whether to convert tokens to lowercase. If tokens include URLs (such as with
drop | Whether the original input column should be dropped. Ignored if the original input and new output column have the same name.
collapse | Whether to combine text with newlines first in case tokens (such as sentences or paragraphs) span multiple lines. If NULL, collapses when the token method is "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex".
... | Extra arguments passed on to tokenizers, such as
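The contract for a custom tokenizing function (a character vector in, a list of character vectors of the same length out) can be sketched in base R. The helper `split_on_dash_or_space` and the sample `docs` vector are hypothetical and exist only for this illustration; the final call is only attempted if tidytext is installed.

```r
# Hypothetical custom tokenizer: split on hyphens and whitespace.
# It takes a character vector and returns a list holding one character
# vector of tokens per input element.
split_on_dash_or_space <- function(x, ...) {
  tokens <- strsplit(x, "[-[:space:]]+")
  # Drop empty strings produced by leading separators
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

docs <- c("well-known fact", " state-of-the-art")
result <- split_on_dash_or_space(docs)

# The list has one entry per input string
length(result) == length(docs)

if (requireNamespace("tidytext", quietly = TRUE)) {
  df <- data.frame(txt = docs, stringsAsFactors = FALSE)
  # Pass the function itself as the `token` argument
  print(tidytext::unnest_tokens(df, word, txt, token = split_on_dash_or_space))
}
```

Accepting `...` in the helper lets any extra arguments given to `unnest_tokens` pass through without error, even if the helper ignores them.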
If the unit for tokenizing is ngrams, skip_ngrams, sentences, lines,
paragraphs, or regex, the entire input will be collapsed together before
tokenizing unless collapse = FALSE.
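Why collapsing matters can be shown with a small base R sketch. It does not call the package; it only mimics the idea of joining rows with newlines before tokenizing. The `split_sentences` helper is a deliberately naive, hypothetical sentence splitter and not the one the tokenizers package uses.

```r
# Two rows of text where the first sentence spans both rows
rows <- c("It is a truth universally", "acknowledged. Indeed it is.")

# Naive sentence splitter for illustration only: split on whitespace
# that follows ".", "!" or "?"
split_sentences <- function(x) strsplit(x, "(?<=[.!?])\\s+", perl = TRUE)

# Without collapsing, each row is tokenized on its own, so the
# sentence that spans two rows is cut in two
per_row <- unlist(split_sentences(rows))

# With collapsing, rows are joined with newlines first, so the
# spanning sentence stays in one piece
collapsed <- unlist(split_sentences(paste(rows, collapse = "\n")))

length(per_row)
length(collapsed)
```

The trade-off is that, once the input is collapsed, tokens can no longer be traced back to the individual rows they came from.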
If format is anything other than "text", this uses the
hunspell_parse tokenizer instead of the tokenizers package.
That tokenizer does not yet support tokenizing by any unit other than words.
library(dplyr)
library(tidytext)
library(janeaustenr)
d <- tibble(txt = prideprejudice)
d
d %>%
  unnest_tokens(word, txt)
d %>%
  unnest_tokens(sentence, txt, token = "sentences")
d %>%
  unnest_tokens(ngram, txt, token = "ngrams", n = 2)
d %>%
  unnest_tokens(chapter, txt, token = "regex", pattern = "Chapter [\\d]")
d %>%
  unnest_tokens(shingle, txt, token = "character_shingles", n = 4)
# custom function
d %>%
  unnest_tokens(word, txt, token = stringr::str_split, pattern = " ")
# tokenize HTML
h <- tibble(row = 1:2,
            text = c("<h1>Text <b>is</b>", "<a href='example.com'>here</a>"))
h %>%
  unnest_tokens(word, text, format = "html")