step_ngram: Generate n-grams From Token Variables
In textrecipes: Extra 'Recipes' for Text Processing

Description

step_ngram() creates a specification of a recipe step that will convert a token variable into a token variable of n-grams.

Usage

step_ngram(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  num_tokens = 3L,
  min_num_tokens = 3L,
  delim = "_",
  skip = FALSE,
  id = rand_id("ngram")
)

Arguments

`recipe`	A recipe object. The step will be added to the sequence of operations for this recipe.
`...`	One or more selector functions to choose which variables are affected by the step. See `recipes::selections()` for more details.
`role`	Not used by this step since no new variables are created.
`trained`	A logical to indicate if the quantities for preprocessing have been estimated.
`columns`	A character string of variable names that will be populated (eventually) by the `terms` argument. This is `NULL` until the step is trained by `recipes::prep.recipe()`.
`num_tokens`	The number of tokens in the n-gram. This must be an integer greater than or equal to 1. Defaults to 3.
`min_num_tokens`	The minimum number of tokens in the n-gram. This must be an integer greater than or equal to 1 and no larger than `num_tokens`. Defaults to 3.
`delim`	The separator between words in an n-gram. Defaults to "_".
`skip`	A logical. Should the step be skipped when the recipe is baked by `recipes::bake.recipe()`? While all operations are baked when `recipes::prep.recipe()` is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using `skip = TRUE`, as it may affect the computations for subsequent operations.
`id`	A character string that is unique to this step to identify it.
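
As a rough illustration of how `num_tokens`, `min_num_tokens` and `delim` interact, the sketch below builds 1-grams and 2-grams joined by a space. The `toy` data frame and its `text` column are made up for this example; `show_tokens()` is the textrecipes helper for inspecting the tokens a recipe produces.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data, used only for this illustration
toy <- data.frame(text = c("oil paint on canvas", "ink on paper"))

toy_rec <- recipe(~ text, data = toy) %>%
  step_tokenize(text) %>%
  # Keep single tokens and pairs, joined by a space instead of "_"
  step_ngram(text, num_tokens = 2L, min_num_tokens = 1L, delim = " ")

# Preps the recipe and shows the resulting tokens for the first rows
show_tokens(toy_rec, text)
```

A space as delimiter makes the n-grams easier to read, but the default "_" is less ambiguous when tokens are later turned into column names.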

Details

After this step, the order of the tokens no longer reflects their position in the original text. If min_num_tokens < num_tokens, the n-grams are ordered by increasing size. For example, with min_num_tokens = 1 and num_tokens = 3 the output contains all the 1-grams, followed by all the 2-grams, followed by all the 3-grams.
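
The ordering described above can be mimicked in base R. The function below is a minimal sketch for illustration only: `ngram_sketch()` is a hypothetical helper, not the implementation used by textrecipes, and it works on a plain character vector rather than a token variable.

```r
# Hypothetical helper, not part of textrecipes: builds n-grams from a
# character vector of tokens, smallest n first.
ngram_sketch <- function(tokens, num_tokens = 3L, min_num_tokens = 3L,
                         delim = "_") {
  stopifnot(min_num_tokens >= 1L, min_num_tokens <= num_tokens)
  out <- character(0)
  for (n in seq(min_num_tokens, num_tokens)) {
    # Skip sizes longer than the token sequence itself
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1L)], collapse = delim),
      character(1)
    )
    # Append this size after all smaller sizes
    out <- c(out, grams)
  }
  out
}

tokens <- c("oil", "paint", "on", "canvas")
ngram_sketch(tokens, num_tokens = 3L, min_num_tokens = 1L)
# "oil" "paint" "on" "canvas"
# "oil_paint" "paint_on" "on_canvas"
# "oil_paint_on" "paint_on_canvas"
```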

Value

An updated version of recipe with the new step added to the sequence of existing steps (if any).

Tidying

When you tidy() this step, a tibble with the column terms (the selectors or variables selected) is returned.

Tuning Parameters

This step has 1 tuning parameter:

num_tokens: Number of tokens (type: integer, default: 3)
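
A sketch of how `num_tokens` might be marked for tuning, assuming the hardhat package is available (it is installed alongside recipes); `hardhat::tune()` is the same placeholder that the tune package re-exports as `tune()`. The recipe would then typically be passed to a tuning function rather than prepped directly.

```r
library(recipes)
library(textrecipes)
library(modeldata)
data(tate_text)

tune_rec <- recipe(~ medium, data = tate_text) %>%
  step_tokenize(medium) %>%
  # Placeholder: the value of num_tokens is chosen later during tuning
  step_ngram(medium, num_tokens = hardhat::tune())

tune_rec
```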

Case weights

The underlying operation does not allow for case weights.

Examples

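library(recipes)
library(textrecipes)
library(modeldata)
data(tate_text)

# Tokenize `medium`, then combine the tokens into 3-grams (the default)
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium) %>%
  step_ngram(medium)

# Estimate the recipe on the training data
tate_obj <- tate_rec %>%
  prep()

# First two rows of the processed `medium` column
bake(tate_obj, new_data = NULL, medium) %>%
  slice(1:2)

# n-grams of the second row
bake(tate_obj, new_data = NULL) %>%
  slice(2) %>%
  pull(medium)

# Step details before and after training
tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)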

textrecipes documentation built on Nov. 16, 2023, 5:06 p.m.