```{r}
#| label: setup
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

textrecipes


Introduction

textrecipes contains extra steps for the recipes package for preprocessing text data.

Installation

You can install the released version of textrecipes from CRAN with:

```{r}
#| eval: false
install.packages("textrecipes")
```

Install the development version from GitHub with:

```{r}
#| label: installation
#| eval: false
# install.packages("pak")
pak::pak("tidymodels/textrecipes")
```

Example

In the following example we will go through the steps needed to convert a character variable to the TF-IDF of its tokenized words, after removing stopwords and limiting ourselves to the 10 most used words. The preprocessing will be applied to the variables medium and artist.

```{r}
#| message: false
library(recipes)
library(textrecipes)
library(modeldata)

data("tate_text")

tate_rec <- recipe(~ medium + artist, data = tate_text) |>
  step_tokenize(medium, artist) |>
  step_stopwords(medium, artist) |>
  step_tokenfilter(medium, artist, max_tokens = 10) |>
  step_tfidf(medium, artist)

tate_obj <- tate_rec |>
  prep()

str(bake(tate_obj, tate_text))
```
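After a recipe has been prepped, `tidy()` can be used to inspect what each step learned from the training data. The following is a minimal sketch using a smaller, hypothetical recipe (a single variable, no stopword removal) to keep the output short:

```r
library(recipes)
library(textrecipes)
library(modeldata)

data("tate_text")

# A small illustrative recipe: tokenize, keep the 10 most
# frequent tokens, then compute TF-IDF
rec <- recipe(~medium, data = tate_text) |>
  step_tokenize(medium) |>
  step_tokenfilter(medium, max_tokens = 10) |>
  step_tfidf(medium) |>
  prep()

# List the steps in the recipe
tidy(rec)

# Inspect step 2 to see which tokens step_tokenfilter() retained
tidy(rec, number = 2)
```

Passing `number` selects a single step, which is convenient for checking that the filtering behaved as expected before fitting a model.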

Breaking changes

As of version 0.4.0, step_lda() no longer accepts character variables and instead takes tokenlist variables.

The following recipe

```{r}
#| eval: false
recipe(~text_var, data = data) |>
  step_lda(text_var)
```

can be replaced with the following recipe to achieve the same results

```{r}
#| eval: false
lda_tokenizer <- function(x) text2vec::word_tokenizer(tolower(x))
recipe(~text_var, data = data) |>
  step_tokenize(text_var,
    custom_token = lda_tokenizer
  ) |>
  step_lda(text_var)
```
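To see what a tokenlist variable looks like, one can tokenize and bake without any further steps. The following is a small sketch using made-up example data (the data frame and its contents are hypothetical, for illustration only):

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data for illustration
data <- data.frame(
  text_var = c("hello world", "textrecipes turns text into tokens")
)

tokenized <- recipe(~text_var, data = data) |>
  step_tokenize(text_var) |>
  prep() |>
  bake(new_data = NULL)

# text_var is now a tokenlist column rather than a character vector
tokenized$text_var
```

This is the representation that step_lda() expects as of version 0.4.0, which is why a step_tokenize() call must now come before it.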

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.



EmilHvitfeldt/textrecipes documentation built on June 10, 2025, 1:21 a.m.