knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
suppressPackageStartupMessages( require(tidyverse) )
suppressPackageStartupMessages( require(recipes) )
suppressPackageStartupMessages( require(mlbench) )
recipes is a package meant to simplify data preparation and transformation steps, as well as to predefine the role of each variable in the dataset (predictor or response).
There is a very good tutorial. In this proof of concept we are going to perform several preparation steps using recipes.
Usually no single dataset will require all of these preparation steps, therefore we will use two different datasets.
The dataset consists of a number of chemical elements and their concentrations in different types of glass. The type of glass is the response.
data(Glass)
data = Glass %>% as_tibble()
summary(data)
Skewed data can be a problem for all modelling algorithms that require normally distributed data.
data %>% summarise_if( is.numeric, e1071::skewness ) %>% knitr::kable()
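To make the skewness values in the table easier to interpret, here is a minimal base R sketch of the plain moment-based skewness estimator. The function name `skewness_moment` is our own, and the result may differ slightly from `e1071::skewness`, which by default applies a finite-sample variant of this estimator.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness_moment <- function(x) {
  x <- x[!is.na(x)]          # ignore missing values
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

symmetric    <- c(-2, -1, 0, 1, 2)
right_skewed <- c(1, 1, 1, 2, 10)

skewness_moment(symmetric)     # zero for a symmetric sample
skewness_moment(right_skewed)  # positive for a long right tail
```

Values near zero indicate a roughly symmetric distribution, positive values a long right tail and negative values a long left tail.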
Highly correlated variables are problematic for some modelling algorithms such as regression. Even tree-based ensemble methods such as random forests, which can handle highly correlated variables, will assign them decreased importance when we investigate the contribution of each variable to the final model. So we might want to remove them.
The corrplot package offers different methods of ordering the variables on the correlation matrix plot. Here we chose hclust.
require(corrplot)
correlations = cor( select_if( data, is.numeric ) )
corrplot(correlations, method = 'number', order = 'hclust')
We first build the recipe by adding the data. Adding a formula as well assigns the roles of predictor and response to the variables.
rec = recipes::recipe(data, Type~.)
summary(rec)
We can add a number of transformation steps to the recipe.
The Yeo-Johnson transformation is similar to the Box-Cox transformation, which one can think of as a more fine-tuned version of a log transformation. However, unlike the Box-Cox and log transformations, Yeo-Johnson can handle zero and negative values.
Adding steps is as easy as that. We simply must not forget to specify which variables should be affected by the step. In order to provide an example of the [syntax](https://topepo.github.io/recipes/reference/selections.html) we specify all numeric variables via all_numeric() and exclude the response via all_outcomes(). Since the response is a factor variable, it would not actually be included in the all_numeric() selection anyway.
rec = rec %>% step_center( all_numeric(), - all_outcomes() ) %>% step_scale( all_numeric(), - all_outcomes() ) %>% step_YeoJohnson( all_numeric(), - all_outcomes() )
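For reference, the Yeo-Johnson transformation for a fixed parameter lambda can be sketched in base R as follows. The function name `yeo_johnson` is our own and the sketch assumes there are no missing values; step_YeoJohnson additionally estimates a suitable lambda for each variable from the training data, which is not shown here.

```r
# Yeo-Johnson transformation for a given lambda (no missing values assumed).
# Non-negative values get a Box-Cox-like transform of (y + 1),
# negative values get a mirrored transform with exponent (2 - lambda).
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(y[pos])
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log1p(-y[!pos])
  }
  out
}

y <- c(-3, -1, 0, 1, 3)
yeo_johnson(y, lambda = 1)  # lambda = 1 leaves the data unchanged
yeo_johnson(y, lambda = 0)  # log-like on the non-negative side
```

Because the negative branch is defined separately, the transformation stays well defined for centred data, where roughly half of the values are below zero.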
rec = rec %>% step_corr( all_numeric(), - all_outcomes(), threshold = 0.5 )
rec
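To illustrate the idea behind a correlation filter, here is a simplified greedy sketch in base R. The helper `drop_correlated` and the toy data are our own, and this is not necessarily the exact algorithm that step_corr uses: it repeatedly finds the most strongly correlated pair and drops the member with the larger mean absolute correlation to all other variables.

```r
# Greedy correlation filter (illustrative sketch, not the recipes implementation)
drop_correlated <- function(df, threshold = 0.5) {
  keep <- names(df)
  repeat {
    if (length(keep) < 2) break
    cors <- abs(cor(df[keep]))
    diag(cors) <- 0
    if (max(cors) <= threshold) break
    # Locate the most strongly correlated pair of variables
    pair <- which(cors == max(cors), arr.ind = TRUE)[1, ]
    # Drop the member with the larger mean absolute correlation overall
    mean_cor <- colMeans(cors)
    drop <- if (mean_cor[pair[1]] >= mean_cor[pair[2]]) pair[1] else pair[2]
    keep <- keep[-drop]
  }
  df[keep]
}

# Toy data: x1 and x2 are almost identical, x3 is only weakly related to both
toy <- data.frame(
  x1 = 1:10,
  x2 = 1:10 + 0.1 * rep(c(1, -1), 5),
  x3 = rep(c(1, -1), 5)
)
filtered <- drop_correlated(toy, threshold = 0.5)
names(filtered)
```

Only one of the two nearly identical variables survives, while the weakly related variable is kept.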
Here we can prepare the recipe, using a training dataset which in our case will be the full data set. We can use the same prepared recipe and apply it to future datasets. This comes in handy if we train a model on transformed data and then want to apply the same transformations on future data. This is especially relevant for the transformation steps.
prep_rec = prep( rec, training = data )
prep_rec
data_trans = bake( prep_rec, data )
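The split between preparing and applying a recipe can be illustrated with a small base R sketch. The helpers `estimate_scaling` and `apply_scaling` are hypothetical names of our own and not part of the recipes API: the first learns the centering and scaling parameters from the training data, the second applies those stored parameters to any dataset.

```r
# Hypothetical helpers illustrating the prepare / apply split (not the recipes API)

# "Prepare": estimate the parameters on the training data only
estimate_scaling <- function(training) {
  list(center = colMeans(training), scale = apply(training, 2, sd))
}

# "Apply": reuse the stored parameters on training or future data
apply_scaling <- function(params, new_data) {
  scaled <- sweep(as.matrix(new_data), 2, params$center, "-")
  scaled <- sweep(scaled, 2, params$scale, "/")
  as.data.frame(scaled)
}

train  <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
future <- data.frame(a = c(2, 4),    b = c(20, 40))

params <- estimate_scaling(train)
apply_scaling(params, train)   # centred and scaled with its own parameters
apply_scaling(params, future)  # scaled with the training parameters
```

The future data is deliberately not re-centred on its own mean, so a model trained on the transformed training data sees future observations on the same scale.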
sum_old = data %>%
  summarise_if( is.numeric, e1071::skewness ) %>%
  mutate( data = 'untransformed') %>%
  select( data, everything() )

sum_trans = data_trans %>%
  summarise_if( is.numeric, e1071::skewness ) %>%
  mutate( data = 'transformed') %>%
  select( data, everything() )

sum_old %>%
  bind_rows( sum_trans) %>%
  knitr::kable()
Here we can see a great reduction in skewness (all values are closer to zero after the transformation).
We can already see from the recipe summary that Ca and Ba have been removed; however, let's look at the correlation matrix.
correlations = cor( select_if( data_trans, is.numeric ) )
corrplot( correlations, method = 'number', order = 'hclust' )
Here we can see that none of the absolute correlations is now greater than 0.5. For modelling we might want to reduce the threshold even further, since 0.5 is still a fairly high correlation coefficient.
Here we have different attributes of different soybean plants. The response variable here is Class and most of the predictors are factor variables.
data(Soybean)
data = Soybean %>% as_tibble()
summary(data)
Amelia::missmap( data )
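Variables with near-zero variance are of little use for modelling: when we split the data into test and validation sets, there is a large chance that we end up with no occurrences of a specific class inside one of the splits.
We define a variable as degenerate if both of the following are true: the ratio of the frequency of the most common value to the frequency of the second most common value is large (freq_cut, 95/5 in the recipe step used here), and the percentage of unique values relative to the number of observations is low (unique_cut, 10 in the recipe step used here).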
summary( data[, caret::nearZeroVar(data) ] )
rec = recipe(data, Class~.)
There are several techniques for dealing with missing data. We can either drop the observations that have incomplete values or drop the variables with incomplete measurements. Looking back at the missingness map, for this dataset that would mean dropping almost all of the variables or one third of all observations. So we can either assume that all missing values take the mean or median value or, more elegantly, try to impute the missing values based on similar observations that cluster in the same way.
rec = rec %>% step_knnimpute( all_predictors() )
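The idea of imputing from similar observations can be sketched in base R for categorical data. The helper `impute_knn_mode` and the toy data are our own; the distance used here is simply the share of predictors on which two rows differ, which is a simplified stand-in and not the distance measure implemented in step_knnimpute.

```r
# Hypothetical helper: fill missing values of one column with the most common
# value among the k most similar rows that have the column observed.
impute_knn_mode <- function(df, target, k = 3) {
  predictors <- setdiff(names(df), target)
  values <- as.matrix(df[predictors])      # character matrix of predictor values
  donors <- which(!is.na(df[[target]]))    # rows that can lend a value
  for (i in which(is.na(df[[target]]))) {
    # Distance = share of predictors on which the two rows differ
    dist <- vapply(donors, function(j) {
      mean(values[i, ] != values[j, ], na.rm = TRUE)
    }, numeric(1))
    nearest <- donors[order(dist)[seq_len(min(k, length(donors)))]]
    votes <- table(df[[target]][nearest])
    df[[target]][i] <- names(votes)[which.max(votes)]
  }
  df
}

toy <- data.frame(
  leaf = c("a", "a", "a", "b", "b", "b", "a"),
  stem = c("x", "x", "x", "y", "y", "y", "x"),
  seed = c("s1", "s1", "s1", "s2", "s2", "s2", NA),
  stringsAsFactors = FALSE
)
imputed <- impute_knn_mode(toy, target = "seed", k = 3)
imputed$seed
```

The last row matches the first three rows on both predictors, so its missing seed value is filled with their shared value.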
rec = rec %>% step_nzv( all_predictors(), options = list(freq_cut = 95/5, unique_cut = 10 ) )
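The two thresholds passed to step_nzv correspond to two simple quantities that can be computed in base R. In this sketch the helper names `nzv_metrics` and `is_nzv` are our own, missing values are simply dropped, and we assume, as described in the step_nzv documentation, that a variable is flagged only when both conditions hold.

```r
# Frequency ratio and percentage of unique values for a single variable
nzv_metrics <- function(x) {
  x <- x[!is.na(x)]                               # sketch: ignore missing values
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  percent_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, percent_unique = percent_unique)
}

# Flag a variable when the most common value dominates AND there are few unique values
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  m <- nzv_metrics(x)
  m[["freq_ratio"]] > freq_cut && m[["percent_unique"]] <= unique_cut
}

rare     <- c(rep("a", 97), rep("b", 3))  # one value dominates
balanced <- rep(c("a", "b"), 50)          # two equally common values

nzv_metrics(rare)
is_nzv(rare)
is_nzv(balanced)
```

Both toy variables have only two unique values, but only the one dominated by a single value exceeds the frequency ratio threshold.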
prep_rec = prep( rec, training = data )
prep_rec
Note that one degenerate variable was not removed. Probably the imputation balanced it out.
data_trans = bake( prep_rec, data )
summary( data[, caret::nearZeroVar(data, freqCut = 96/5) ] )
summary( data_trans[, caret::nearZeroVar(data_trans, freqCut = 95/5) ] )
Indeed, it is as we suspected: the imputation of the missing values has balanced out the other two variables.
Amelia::missmap(data_trans)