unnest_tokens: Split a column into tokens using the tokenizers package
In igorscarvalho/tidytext: Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools


Split a column into tokens using the tokenizers package, reshaping the table into one token per row. This function supports non-standard evaluation through the tidyeval framework.

unnest_tokens(
  tbl,
  output,
  input,
  token = "words",
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)

`tbl`	A data frame
`output`	Output column to be created as string or symbol.
`input`	Input column that gets split as string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols.
`token`	Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLs), and "ptb" (Penn Treebank). If a function, it should take a character vector and return a list of character vectors of the same length.
`format`	Either "text", "man", "latex", "html", or "xml". If not "text", this uses the hunspell tokenizer, and can tokenize only by "words".
`to_lower`	Whether to convert tokens to lowercase. If tokens include URLs (such as with `token = "tweets"`), the lowercased URLs may no longer be correct.
`drop`	Whether original input column should get dropped. Ignored if the original input and new output column have the same name.
`collapse`	Whether to combine text with newlines first in case tokens (such as sentences or paragraphs) span multiple lines. If NULL, collapses when token method is "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex".
`...`	Extra arguments passed on to tokenizers, such as `strip_punct` for "words" and "tweets", `n` and `k` for "ngrams" and "skip_ngrams", `strip_url` for "tweets", and `pattern` for "regex".
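
The contract for a custom `token` function (a character vector in, a list of character vectors of the same length out) can be checked in isolation. The following is a minimal sketch in base R; the name `whitespace_tokenizer` is hypothetical and is not part of tidytext or tokenizers.

```r
# Hypothetical custom tokenizer: splits each string on runs of whitespace.
# It takes a character vector and returns a list holding one character
# vector of tokens per input element.
whitespace_tokenizer <- function(x) {
  strsplit(x, "\\s+")
}

docs <- c("the quick brown fox", "jumps over")
tokens <- whitespace_tokenizer(docs)

# The contract described for `token`: a list, the same length as the input.
is.list(tokens)
length(tokens) == length(docs)

# Number of tokens produced for each input element.
lengths(tokens)
```

A function of this shape should be usable as `token = whitespace_tokenizer`, in the same way the examples pass `stringr::str_split`.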

If the unit for tokenizing is ngrams, skip_ngrams, sentences, lines, paragraphs, or regex, the entire input will be collapsed together before tokenizing unless collapse = FALSE.
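
To see why collapsing matters for units that can span rows, here is a rough base R sketch. It is not how tidytext or tokenizers implement this: `split_sentences` is a hypothetical stand-in for a sentence tokenizer, and the two-row input is made up for illustration.

```r
# Two rows of text; the first sentence spans both rows.
txt <- c("It is a truth universally", "acknowledged. A single man.")

# Stand-in sentence splitter (an assumption, not the tokenizers package):
# split on whitespace that directly follows a period.
split_sentences <- function(x) {
  strsplit(x, "(?<=[.])\\s+", perl = TRUE)
}

# Without collapsing, each row is tokenized on its own, so the sentence
# that spans both rows is cut at the row boundary.
per_row <- unlist(split_sentences(txt))

# With collapsing, the rows are joined with newlines before tokenizing,
# so the spanning sentence stays in one piece.
collapsed <- unlist(split_sentences(paste(txt, collapse = "\n")))

length(per_row)
length(collapsed)
```

The per-row version yields more fragments than the collapsed one because the row break is treated as a token boundary.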

If format is anything other than "text", this uses the hunspell_parse tokenizer instead of the tokenizers package. That tokenizer does not yet support tokenizing by any unit other than words.

library(dplyr)
library(janeaustenr)

d <- tibble(txt = prideprejudice)
d

d %>%
  unnest_tokens(word, txt)

d %>%
  unnest_tokens(sentence, txt, token = "sentences")

d %>%
  unnest_tokens(ngram, txt, token = "ngrams", n = 2)

d %>%
  unnest_tokens(chapter, txt, token = "regex", pattern = "Chapter [\\d]")

d %>%
  unnest_tokens(shingle, txt, token = "character_shingles", n = 4)

# custom function
d %>%
  unnest_tokens(word, txt, token = stringr::str_split, pattern = " ")

# tokenize HTML
h <- tibble(row = 1:2,
            text = c("<h1>Text <b>is</b>", "<a href='example.com'>here</a>"))

h %>%
  unnest_tokens(word, text, format = "html")

igorscarvalho/tidytext documentation built on Aug. 23, 2020, 12:44 a.m.
