In topepo/recipes: Preprocessing and Feature Engineering Steps for Modeling

Recipe steps offer several ways to select the variables or features they should act on.

Three main characteristics of a variable can be queried:

the name of the variable
the data type (e.g. numeric or nominal)
the role that was declared by the recipe

The manual pages for ?selections and ?has_role have details about the available selection methods.

To illustrate this, the Palmer penguins data will be used:

library(recipes)
library(modeldata)

data("penguins")
str(penguins)

rec <- recipe(body_mass_g ~ ., data = penguins)
rec

Before any steps are used the information on the original variables is:

summary(rec, original = TRUE)

This shows the types and roles. Each variable can have one or more types, so we can print them out separately:

summary(rec, original = TRUE)$type

Notice that integer variables have the types "integer" and "numeric", and the factor variables have the types "factor", "unordered", and "nominal". This allows for some neat selections: the selector all_numeric() selects both double and integer variables, while more specific selectors such as all_integer() select only integer variables. A full hierarchy of types can be seen in ?has_role.
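The difference between a broad and a narrow type selector can be sketched as follows. This is a minimal illustration, not part of the original vignette: the object names num_rec, int_rec, num_terms and int_terms are made up here, and tidy() is used only to list which columns each step ended up selecting.

```r
library(recipes)
library(modeldata)

data("penguins")
rec <- recipe(body_mass_g ~ ., data = penguins)

# Broad selector: matches every column typed "numeric" (integers and doubles)
num_rec <- rec %>% step_normalize(all_numeric()) %>% prep()

# Narrow selector: matches only columns typed "integer"
int_rec <- rec %>% step_normalize(all_integer()) %>% prep()

# tidy() returns one row per statistic, so keep the unique column names
num_terms <- unique(tidy(num_rec, number = 1)$terms)
int_terms <- unique(tidy(int_rec, number = 1)$terms)

num_terms
int_terms
```

The integer selection should be a subset of the numeric one, since every integer column also carries the "numeric" type.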

We can add a step to normalize numeric data:

dummied <- rec %>% step_normalize(all_numeric())

This will capture any variables that are either integers or doubles: bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g. However, since body_mass_g is our outcome, we might want to keep it on its original scale, so we can subtract that variable out either by name or by role:

dummied <- rec %>% step_normalize(bill_length_mm, bill_depth_mm, 
                                  flipper_length_mm) # or
dummied <- rec %>% step_normalize(all_numeric(), - body_mass_g) # or
dummied <- rec %>% step_normalize(all_numeric_predictors()) # recommended

Whenever possible, it is recommended to use the more specific *_predictors() variants to avoid accidentally selecting the outcomes.

rec %>%
  step_dummy(sex) %>%
  prep() %>%
  bake(new_data = NULL) # juice() is superseded by bake(new_data = NULL)

Using the last definition:

dummied <- prep(dummied, training = penguins)
with_dummy <- bake(dummied, new_data = penguins)
with_dummy

body_mass_g is unaffected.
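The role-based exclusion mentioned earlier can be sketched in the same way. This is an illustrative example under the assumption that the outcome carries the default "outcome" role assigned by the formula interface; the names by_role and by_has_role are hypothetical.

```r
library(recipes)
library(modeldata)

data("penguins")
rec <- recipe(body_mass_g ~ ., data = penguins)

# Drop the outcome from the selection using the role-based helper
by_role <- rec %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  prep()

# The same idea, querying the declared role directly
by_has_role <- rec %>%
  step_normalize(all_numeric(), -has_role("outcome")) %>%
  prep()

# Columns that each step actually normalized
role_terms <- unique(tidy(by_role, number = 1)$terms)
has_role_terms <- unique(tidy(by_has_role, number = 1)$terms)

role_terms
has_role_terms
```

Both selections should leave body_mass_g out, matching the result of all_numeric_predictors().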

One important aspect of selecting variables in steps is that the variable names and types may change as steps are being executed. In the above example, sex is a factor variable. If step_dummy() were used on it, sex would be removed and the binary variable sex_male would take its place. One reason to have general selection routines like all_predictors() or contains() is to be able to select variables that have not been created yet.
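Selecting columns that do not exist yet can be sketched as below. This example is an assumption-laden illustration rather than the vignette's own code: it uses species instead of sex (to avoid missing values in the factor), and the name rec_int is hypothetical. The indicator columns only come into existence when step_dummy() runs, yet the later step can already refer to them through starts_with().

```r
library(recipes)
library(modeldata)

data("penguins")

rec_int <- recipe(body_mass_g ~ ., data = penguins) %>%
  # Replaces species with indicator columns prefixed "species_"
  step_dummy(species) %>%
  # Refers to those indicator columns before they have been created
  step_interact(terms = ~ starts_with("species_"):bill_length_mm)

baked <- rec_int %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
```

Because the selector is resolved when the recipe is prepped, the interaction step sees the column names produced by the dummy step rather than the original species column.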

All steps in the recipes package support empty selections. This means that if all_date_predictors() is used in a step and no date variables are found in the data set, the step is applied without error and the calculations inside it are skipped. This allows for quite relaxed recipes, as you don't have to make sure that the variables exist at that point in the recipe.
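An empty selection can be sketched as follows. This is a minimal example that assumes a reasonably recent version of recipes (one that provides all_date_predictors()); step_date() is chosen here only for illustration, and the name no_dates is hypothetical.

```r
library(recipes)
library(modeldata)

data("penguins")
rec <- recipe(body_mass_g ~ ., data = penguins)

# penguins contains no date columns, so this selection matches nothing
no_dates <- rec %>%
  step_date(all_date_predictors()) %>%
  prep()

baked <- bake(no_dates, new_data = NULL)

# The step was a no-op: no columns were added or removed
names(baked)
```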

topepo/recipes documentation built on April 10, 2024, 10:30 p.m.
