knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
Before training a model, it's often necessary and prudent to preprocess your input data. We provide a function (preprocess_data()) to preprocess input data. The defaults we chose are based on best practices used in FIDDLE [@tang_democratizing_2020]. Feel free to check out FIDDLE for more information about data preprocessing!
preprocess_data() takes an input dataset where the rows are the samples and the columns are the outcome variable and features. We preprocess the data as follows:
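- Remove samples with a missing outcome value.
- Convert binary and categorical features to binary (0/1) columns, named by joining the feature name and the value with an underscore (_).
- Normalize continuous features with caret::preProcess() based on the method provided.
- Replace missing feature values.
- Remove features with near-zero variance.
- Collapse perfectly correlated features into groups.
Since I assume a lot of you won't read this entire vignette, I'm going to say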
this at the beginning. If the preprocess_data() function is running super slow, you should consider parallelizing it so it goes faster! preprocess_data() can also report live progress updates. See vignette("parallel") for details.
We're going to start off simple and get more complicated, but if you want the whole shebang at once, just scroll to the bottom.
First, we have to load mikropml:
library(mikropml)
Let's start with only binary variables:
# raw binary dataset
bin_df <- data.frame(
  outcome = c("normal", "normal", "cancer"),
  var1 = c("no", "yes", "no"),
  var2 = c(0, 1, 1),
  var3 = factor(c("a", "a", "b"))
)
bin_df
In addition to the dataframe itself, you have to provide the name of the outcome column to preprocess_data(). Here's what the preprocessed data looks like:
# preprocess raw binary data
preprocess_data(dataset = bin_df, outcome_colname = "outcome")
The output is a list: dat_transformed which has the transformed data, grp_feats which is a list of grouped features, and removed_feats which is a list of features that were removed. Here, grp_feats is NULL because there are no perfectly correlated features (e.g. c(0,1,0) and c(0,1,0), or c(0,1,0) and c(1,0,1) - see below for more details).
The first feature (var1) is a character and is changed in dat_transformed to var1_yes, which has zeros (no) and ones (yes). The values of the second feature (var2) stay the same because it's already binary, but the name changes to var2_1. The third feature (var3) is a factor and is also changed to binary where b is 1 and a is 0, as denoted by the new column name var3_b.
On to non-binary categorical data:
# raw categorical dataset
cat_df <- data.frame(
  outcome = c("normal", "normal", "cancer"),
  var1 = c("a", "b", "c")
)
cat_df
# preprocess raw categorical data
preprocess_data(dataset = cat_df, outcome_colname = "outcome")
As you can see, this variable was split into 3 different columns - one for each type (a, b, and c). And again, grp_feats is NULL.
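The split into one binary column per level can be sketched in base R. This is only an illustration of the idea, not how preprocess_data() is implemented internally; the object name one_hot is made up for this example.

```r
# One binary (0/1) column per level of a categorical variable.
# Base R only; `one_hot` is a hypothetical name for this sketch.
var1 <- c("a", "b", "c")
lvls <- sort(unique(var1))

# For each level, mark the samples that have that level with 1
one_hot <- vapply(
  lvls,
  function(l) as.integer(var1 == l),
  integer(length(var1))
)
colnames(one_hot) <- paste0("var1_", lvls)
one_hot
```

Each row has exactly one 1, in the column that matches that sample's original value.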
Now, looking at continuous variables:
# raw continuous dataset
cont_df <- data.frame(
  outcome = c("normal", "normal", "cancer"),
  var1 = c(1, 2, 3)
)
cont_df
# preprocess raw continuous data
preprocess_data(dataset = cont_df, outcome_colname = "outcome")
Wow! Why did the numbers change? This is because the default is to normalize the data using "center" and "scale". While this is often best practice, you may not want to normalize the data, or you may want to normalize the data in a different way. If you don't want to normalize the data, you can use method=NULL:
# preprocess raw continuous data, no normalization
preprocess_data(dataset = cont_df, outcome_colname = "outcome", method = NULL)
You can also normalize the data in different ways. You can choose any method supported by the method argument of caret::preProcess() (see the caret::preProcess() docs for details). Note that these methods are only applied to continuous variables.
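For intuition, here is the arithmetic behind "center" and "scale" applied to the var1 column of cont_df, written out in base R. It is a minimal sketch that assumes the usual definitions (subtract the mean, then divide by the standard deviation) and does not call caret::preProcess() itself.

```r
# Centering and scaling by hand, base R only.
var1 <- c(1, 2, 3)

centered <- var1 - mean(var1)      # mean(var1) is 2
normalized <- centered / sd(var1)  # sd(var1) is 1
normalized
#> [1] -1  0  1
```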
Another feature of preprocess_data() is that if you provide continuous variables as characters, they will be converted to numeric:
# raw continuous dataset as characters
cont_char_df <- data.frame(
  outcome = c("normal", "normal", "cancer"),
  var1 = c("1", "2", "3")
)
cont_char_df
# preprocess raw continuous character data as numeric
preprocess_data(dataset = cont_char_df, outcome_colname = "outcome")
If you don't want this to happen, and you want character data to remain character data even if it can be converted to numeric, you can use to_numeric=FALSE and they will be kept as categorical:
# preprocess raw continuous character data as characters
preprocess_data(dataset = cont_char_df, outcome_colname = "outcome", to_numeric = FALSE)
As you can see from this output, in this case the features are treated as categories rather than numbers (e.g. they are not normalized).
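One way to think about "can be converted to numeric" is sketched below in base R. The helper can_be_numeric() is hypothetical and written only for this illustration; it is an assumption about the general idea, not the check that preprocess_data() actually performs.

```r
# Hypothetical helper: TRUE if every non-missing value parses as a number.
can_be_numeric <- function(v) {
  converted <- suppressWarnings(as.numeric(v))
  # An NA that was not already NA in the input means a value failed to convert
  !any(is.na(converted) & !is.na(v))
}

can_be_numeric(c("1", "2", "3"))   # every value parses
can_be_numeric(c("a", "2", "3"))   # "a" does not parse
```

With to_numeric=FALSE, even a column like the first one stays categorical.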
By default, preprocess_data() collapses features that are perfectly positively or negatively correlated. This is because having multiple copies of those features does not add information to machine learning, and collapsing them makes run_ml() faster.
# raw correlated dataset
corr_df <- data.frame(
  outcome = c("normal", "normal", "cancer"),
  var1 = c("no", "yes", "no"),
  var2 = c(0, 1, 0),
  var3 = c(1, 0, 1)
)
corr_df
# preprocess raw correlated dataset
preprocess_data(dataset = corr_df, outcome_colname = "outcome")
As you can see, we end up with only one variable, as all 3 are grouped together. Also, the second element in the list is no longer NULL. Instead, it tells you that grp1 contains var1, var2, and var3.
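To see why these three features end up in one group, you can compute their pairwise correlations in base R. This is a rough sketch of the concept only; the tolerance tol and the object names are assumptions made for this example, and preprocess_data() may detect perfect correlation differently.

```r
# Pairwise correlations of the three features, base R only.
var1 <- as.integer(c("no", "yes", "no") == "yes")  # binary-encoded var1
var2 <- c(0, 1, 0)
var3 <- c(1, 0, 1)
feats <- cbind(var1, var2, var3)

corr_mat <- cor(feats)
tol <- 1e-8  # compare with a tolerance instead of == 1

# Perfectly correlated in either direction (positive or negative)
perfect_any <- abs(abs(corr_mat) - 1) < tol
# Perfectly positively correlated only
perfect_pos <- abs(corr_mat - 1) < tol

perfect_any
perfect_pos
```

Every pair is perfectly correlated once the sign is ignored, but var3 is only negatively correlated with var1 and var2.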
If you want to group positively correlated features, but not negatively correlated features (e.g. for interpretability, or another downstream application), you can do that by using group_neg_corr=FALSE:
# preprocess raw correlated dataset; don't group negatively correlated features
preprocess_data(dataset = corr_df, outcome_colname = "outcome", group_neg_corr = FALSE)
Here, var3 is kept on its own because it's negatively correlated with var1 and var2. You can also choose to keep all features separate, even if they are perfectly correlated, by using collapse_corr_feats=FALSE:
# preprocess raw correlated dataset; don't collapse perfectly correlated features
preprocess_data(dataset = corr_df, outcome_colname = "outcome", collapse_corr_feats = FALSE)
In this case, grp_feats will always be NULL.
What if we have variables that are all zero, or all "no"? Those won't contribute any information, so we remove them:
# raw dataset with non-variable features
nonvar_df <- data.frame(
  outcome = c("normal", "normal", "cancer"),
  var1 = c("no", "yes", "no"),
  var2 = c(0, 1, 1),
  var3 = c("no", "no", "no"),
  var4 = c(0, 0, 0),
  var5 = c(12, 12, 12)
)
nonvar_df
Here, var3, var4, and var5 all have no variability, so these variables are removed during preprocessing:
# remove features with near-zero variance
preprocess_data(dataset = nonvar_df, outcome_colname = "outcome")
You can read the caret::preProcess() documentation for more information. By default, we remove features with "near-zero variance" (remove_var='nzv'). This uses the default arguments from caret::nearZeroVar(). However, particularly with smaller datasets, you might not want to remove features with near-zero variance. If you want to remove only features with zero variance, you can use remove_var='zv':
# remove features with zero variance
preprocess_data(dataset = nonvar_df, outcome_colname = "outcome", remove_var = "zv")
If you want to include all features, you can use the argument remove_var=NULL. For this to work, you cannot collapse correlated features (otherwise it errors out because of the underlying caret function we use).
# don't remove features with near-zero or zero variance
preprocess_data(dataset = nonvar_df, outcome_colname = "outcome", remove_var = NULL, collapse_corr_feats = FALSE)
If you want to be more nuanced in how you remove near-zero variance features (e.g. change the default 10% cutoff for the percentage of distinct values out of the total number of samples), you can use the caret::preProcess() function after running preprocess_data() with remove_var=NULL (see the caret::nearZeroVar() function for more information).
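To get a feel for what those cutoffs measure, the two quantities described in the caret::nearZeroVar() documentation (the frequency ratio of the most common value to the second most common, and the percentage of distinct values) can be computed by hand. The helper nzv_metrics() below is hypothetical and is only a base R approximation of those definitions, not caret's code; returning Inf for a single-valued feature is a choice made for this sketch.

```r
# Hypothetical helper approximating the two near-zero variance metrics.
nzv_metrics <- function(v) {
  counts <- sort(table(v), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common's
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Percentage of distinct values out of the total number of samples
  pct_unique <- 100 * length(counts) / length(v)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# A feature that is almost always zero
x <- c(rep(0, 99), 1)
metrics <- nzv_metrics(x)
metrics

# Flag the feature using the documented defaults (freqCut = 95/5, uniqueCut = 10)
is_nzv <- metrics[["freq_ratio"]] > 95 / 5 && metrics[["pct_unique"]] <= 10
is_nzv
```

Raising or lowering either cutoff changes which features get flagged, which is the kind of tuning the paragraph above refers to.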
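preprocess_data() also deals with missing data. It:
- Removes samples with a missing outcome value.
- Keeps a feature non-variable if its only variation comes from missing values (so the feature is removed when removing features with near-zero variance).
- Replaces missing binary and categorical values with zero (after splitting into multiple columns).
- Replaces missing continuous values with the median value of that feature.
If you'd like to deal with missing data in a different way, please do that prior to inputting the data to preprocess_data().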
# raw dataset with missing outcome value
miss_oc_df <- data.frame(
  outcome = c("normal", "normal", "cancer", NA),
  var1 = c("no", "yes", "no", "no"),
  var2 = c(0, 1, 1, 1)
)
miss_oc_df
# preprocess raw dataset with missing outcome value
preprocess_data(dataset = miss_oc_df, outcome_colname = "outcome")
# raw dataset with missing value in non-variable feature
miss_nonvar_df <- data.frame(
  outcome = c("normal", "normal", "cancer"),
  var1 = c("no", "yes", "no"),
  var2 = c(NA, 1, 1)
)
miss_nonvar_df
# preprocess raw dataset with missing value in non-variable feature
preprocess_data(dataset = miss_nonvar_df, outcome_colname = "outcome")
Here, the non-variable feature with missing data is removed because we removed features with near-zero variance. If we maintained that feature, it'd be all ones:
# preprocess raw dataset with missing value in non-variable feature
preprocess_data(dataset = miss_nonvar_df, outcome_colname = "outcome", remove_var = NULL, collapse_corr_feats = FALSE)
# raw dataset with missing value in categorical feature
miss_cat_df <- data.frame(
  outcome = c("normal", "normal", "cancer"),
  var1 = c("no", "yes", NA),
  var2 = c(NA, 1, 0)
)
miss_cat_df
# preprocess raw dataset with missing value in categorical feature
preprocess_data(dataset = miss_cat_df, outcome_colname = "outcome")
Here each binary variable is split into two, and the missing value is considered zero for both of them.
# raw dataset with missing value in continuous feature
miss_cont_df <- data.frame(
  outcome = c("normal", "normal", "cancer", "normal"),
  var1 = c(1, 2, 2, NA),
  var2 = c(1, 2, 3, NA)
)
miss_cont_df
Here we're not normalizing continuous features so it's easier to see what's going on (i.e. the median value is used):
# preprocess raw dataset with missing value in continuous feature
preprocess_data(dataset = miss_cont_df, outcome_colname = "outcome", method = NULL)
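The median imputation itself is simple enough to reproduce by hand. The sketch below uses base R only and a hypothetical helper, impute_median(), to show which values the missing entries in miss_cont_df are replaced with; it illustrates the idea rather than the package's internal code.

```r
# Median imputation by hand, base R only.
miss_cont_df <- data.frame(
  outcome = c("normal", "normal", "cancer", "normal"),
  var1 = c(1, 2, 2, NA),
  var2 = c(1, 2, 3, NA)
)

# Hypothetical helper: replace NAs with the median of the observed values
impute_median <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

feature_cols <- c("var1", "var2")
imputed <- miss_cont_df
imputed[feature_cols] <- lapply(miss_cont_df[feature_cols], impute_median)
imputed
```

Both features have a median of 2 among their observed values, so the missing entry in each column becomes 2.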
Here's some more complicated example raw data that puts everything we discussed together:
test_df <- data.frame(
  outcome = c("normal", "normal", "cancer", NA),
  var1 = 1:4,
  var2 = c("a", "b", "c", "d"),
  var3 = c("no", "yes", "no", "no"),
  var4 = c(0, 1, 0, 0),
  var5 = c(0, 0, 0, 0),
  var6 = c("no", "no", "no", "no"),
  var7 = c(1, 1, 0, 0),
  var8 = c(5, 6, NA, 7),
  var9 = c(NA, "x", "y", "z"),
  var10 = c(1, 0, NA, NA),
  var11 = c(1, 1, NA, NA),
  var12 = c("1", "2", "3", "4")
)
test_df
Let's throw this into the preprocessing function with the default values:
preprocess_data(dataset = test_df, outcome_colname = "outcome")
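As you can see, we got several messages:
- One of the samples (row 4) was removed because its outcome value was missing.
- A missing value in a feature with no variation was replaced with the non-varying value (var11).
- Four missing categorical values were replaced with zero (var9). There are 4 missing rather than just 1 (like in the raw data) because we split the categorical variable into 4 different columns first.
- One missing continuous value was imputed using the median value of that feature (var8).
Additionally, you can see that the continuous variables were normalized, the categorical variables were all changed to binary, and several features were grouped together. The variables in each group can be found in grp_feats.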
After you preprocess your data (either using preprocess_data() or by preprocessing the data on your own), you're ready to train and evaluate machine learning models! Please see run_ml() for information about training models.