step_tokenize: Tokenization of character variables


Description

'step_tokenize' creates a *specification* of a recipe step that will convert a character predictor into a list of tokens.

Usage

step_tokenize(recipe, ..., role = NA, trained = FALSE,
  columns = NULL, options = list(), token = "words",
  custom_token = NULL, skip = FALSE, id = rand_id("tokenize"))

## S3 method for class 'step_tokenize'
tidy(x, ...)

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

One or more selector functions to choose variables. For 'step_tokenize', this indicates the variables to be encoded into a list column. See [recipes::selections()] for more details. For the 'tidy' method, these are not currently used.

role

Not used by this step since no new variables are created.

trained

A logical to indicate if the recipe has been trained by [recipes::prep.recipe()].

columns

A character string of the selected variable names. This is 'NULL' until the step is trained by [recipes::prep.recipe()].

options

A list of options passed to the tokenizer.
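
As a minimal sketch (assuming the default "words" tokenizer, whose backend [tokenizers::tokenize_words()] accepts a 'lowercase' argument):

recipe(~ ., data = okc_text) %>%
  # keep the original casing instead of the tokenizer's default lowercasing
  step_tokenize(essay0, options = list(lowercase = FALSE))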

token

Unit for tokenizing. Built-in options from the [tokenizers] package are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLs), "ptb" (Penn Treebank), and "word_stems".
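
For example, a sketch that combines 'token' and 'options' to produce bigrams (the 'n' argument belongs to [tokenizers::tokenize_ngrams()]):

recipe(~ ., data = okc_text) %>%
  # tokenize essay0 into two-word (bigram) tokens
  step_tokenize(essay0, token = "ngrams", options = list(n = 2))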

custom_token

User-supplied tokenizer. Use of this argument will override the 'token' argument. Must take a character vector as input and return a list of character vectors.
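
As a sketch, any function with that contract will work; here base R's strsplit(), which already returns a list of character vectors:

recipe(~ ., data = okc_text) %>%
  # split each essay on commas instead of using a built-in tokenizer
  step_tokenize(essay0, custom_token = function(x) strsplit(x, ",", fixed = TRUE))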

skip

A logical. Should the step be skipped when the recipe is baked by [recipes::bake.recipe()]? While all operations are baked when [recipes::prep.recipe()] is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using 'skip = TRUE' as it may affect the computations for subsequent operations.

id

A character string that is unique to this step to identify it.

x

A 'step_tokenize' object.

Details

Tokenization is the act of splitting a character string into smaller parts to be further analysed. This step uses the 'tokenizers' package, which includes heuristics to split the text into units such as paragraph tokens and word tokens, among others. 'textrecipes' keeps the tokens in a list-column, and other steps will do their tasks on those list-columns before transforming them back to numeric.

Working with 'textrecipes' will always start by calling 'step_tokenize', followed by modifying and filtering steps.
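
As a sketch of that workflow (assuming the companion steps 'step_tokenfilter' and 'step_tf' from 'textrecipes'):

recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0) %>%                       # character -> list of tokens
  step_tokenfilter(essay0, max_tokens = 100) %>%  # keep the 100 most frequent tokens
  step_tf(essay0)                                 # tokens -> term-frequency features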

Value

An updated version of 'recipe' with the new step added to the sequence of existing steps (if any).

See Also

[step_untokenize]

Examples

library(recipes)
library(textrecipes)

data(okc_text)

# tokenize the essay0 column into a list-column of word tokens
okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0)
  
okc_obj <- okc_rec %>%
  prep(training = okc_text, retain = TRUE)

juice(okc_obj, essay0) %>%
  slice(1:2)

juice(okc_obj) %>%
  slice(2) %>%
  pull(essay0)
  
tidy(okc_rec, number = 1)
tidy(okc_obj, number = 1)

# tokenize into individual characters instead of words
okc_obj_chars <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0, token = "characters") %>%
  prep(training = okc_text, retain = TRUE)

juice(okc_obj_chars) %>%
  slice(2) %>%
  pull(essay0)
