knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
customsteps
0.7.0.0 is now available on CRAN.
This is the first official release, and below I will demonstrate, how
customsteps
can be used to create recipe steps, that apply custom
transformations to a data set.
Note, you should already be fairly familiar with the recipes
package
before you continue reading this post or give customsteps
a spin!
customsteps
packageAlong with the recipes
package distribution comes a number of pre-specified
steps, that enables the user to manipulate data sets in various ways. The
resulting data sets (/design matrices) can then be used as inputs into
statistical or machine learning models.
If you want to apply a specific transformation to your data set, that is not
supported by the pre-specified steps, you have two options. You can write an
entire custom recipe step from
scratch.
This however takes quite a bit of work and code. An alternative - and sometimes
better - approach is to apply the customsteps
package.
# install.packages("customsteps") library(customsteps)
customsteps
contains a set of customizable higher-order recipe step functions,
that create specifications of recipe steps, that will transform or filter the
data in accordance with custom input functions.
Let me just remind you of the definition of higher-order functions:
In mathematics and computer science, a higher-order function is a function that does at least one of the following: 1. takes one or more functions as arguments, 2. returns a function as its result.
Next, I will present an example of how to use the customsteps
package in order
to create a recipe step, that will apply a custom transformation to a data set.
Assume, that I want to transform a variable ${\mathbf{x}}$ like this:
The transformed variable $\hat{\mathbf{x}}$ can then be derived as (try to do it yourself):
$\hat{\mathbf{x}} = \alpha + (\mathbf{x} - \bar{\mathbf{x}})\frac{\beta}{s_\mathbf{x}}$
where $\bar{\mathbf{x}}$ is the mean of $\mathbf{x}$, and $s_\mathbf{x}$ is the standard deviation of ${\mathbf{x}}$.
Note that centering ${\mathbf{x}}$ around 0 and scaling it in order to arrive at a standard deviation of 1 is just a special case of the above transformation with parameters $\alpha = 0, \beta = 1$.
prep
helper functionFirst, I need to write a function, that estimates the relevant statistical
parameters from an initial data set. I call this function the prep
helper
function.
Obviously, the above transformation requires the mean $\bar{\mathbf{x}}$ and
standard deviation $s_\mathbf{x}$ to be learned from the initial data set.
Therefore I define a function compute_means_sd
, that estimates the two
parameters for (an arbitrary number of) numeric variables.
By convention the prep
helper function must take the argument x
: the subset
of selected variables from the initial data set.
library(purrr) compute_means_sd <- function(x) { map(.x = x, ~ list(mean = mean(.x), sd = sd(.x))) }
Let us see the function in action. I will apply it to a subset of the famous
mtcars
data set.
library(dplyr) # divide 'mtcars' into two data sets. cars_initial <- mtcars[1:16, ] cars_new <- mtcars[17:nrow(mtcars), ] # learn parameters from initial data set. params <- cars_initial %>% select(mpg, disp) %>% compute_means_sd(.) # display parameters. as.data.frame(params)
It works like a charm. Great, we are halfway there!
bake
helper functionSecond, I have to specify a bake
helper function, that defines how to apply
the transformation to a new data set using the parameters estimated from the
intial data set.
By convention the bake
helper function must take the following arguments:
x
: the new data set, that the step will be applied to.prep_output
: the output from the prep
helper function containing
any parameters estimated from the initial data set.I define the function center_scale
, that will serve as my bake
helper
function. It will center and scale variables of a new data set.
center_scale <- function(x, prep_output, alpha, beta) { # extract only the relevant variables from the new data set. new_data <- select(x, names(prep_output)) # apply transformation to each of these variables. # variables are centered around 'alpha' and scaled to have a standard # deviation of 'beta'. map2(.x = new_data, .y = prep_output, ~ alpha + (.x - .y$mean) * beta / .y$sd) }
My first (sanity) check of the function is to apply it to the initial data set, that was used for estimation of the means and standard deviations.
library(tibble) # center and scale variables of new data set to have a mean of zero # and a standard deviation of one. cars_initial_transformed <- center_scale(x = cars_initial, prep_output = params, alpha = 0, beta = 1) # display transformed variables. cars_initial_transformed %>% compute_means_sd(.) %>% as.data.frame(.)
Results are correct within computational precision.
Also, I will just check the function out on the other subset of mtcars
.
# center and scale variables of new data set to have a mean of zero # and a standard deviation of one. cars_new_transformed <- center_scale(x = cars_new, prep_output = params, alpha = 0, beta = 1) # display transformed variables. cars_new_transformed %>% as.tibble(.) %>% head(.)
Looks right! All that is left now is to put the pieces together into my new very own custom recipe step.
The function step_custom_transformation
takes prep
and bake
helper
functions as inputs and turns them into a complete recipe step, that can be used
out of the box.
I create the specification of the recipe step from the new functions compute_means_sd
and
center_scale
by invoking step_custom_transformation
.
library(recipes) rec <- recipe(cars_initial) %>% step_custom_transformation(mpg, disp, prep_function = compute_means_sd, bake_function = center_scale, bake_options = list(alpha = 0, beta = 1), bake_how = "replace")
And that is all there is to it! Easy.
Note, by setting 'bake_options' to "replace", the selected terms will be replaced with the transformed variables, when the recipe is baked.
I will just check, that the recipe works as expected. First I will prep
(/train)
the recipe.
# prep recipe. rec <- prep(rec) # print recipe. rec
I will go right ahead and bake the new recipe.
# bake recipe. cars_baked <- rec %>% bake(cars_new) %>% select(mpg, disp) # display results. cars_baked %>% head(.)
Results are as expected (same as before). Great succes!
You should now be able to create your very own recipe steps to do (almost) whatever transformation to your data, that you want.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.