For this example, we will use the UnsupervisedTreatment
, but the same parameters can be used with the other treatment plans as well.
Here we create a simple data set where the inputs have missing values.
library(vtreat)
d = data.frame( "x" = c(0, 1, 1000, NA), "w" = c(3, 6, NA, 100), "y" = c(0, 0, 1, 1) ) knitr::kable(d)
Some of the summary statistics of d
. We're primarily interested in the inputs x
and w
.
summary(d)
By default, vtreat
fills in missing values with the mean value of the column, and adds an advisory *_isBAD
column to mark the location of the original missing values.
transform_design = vtreat::UnsupervisedTreatment( var_list = c('x', 'w'), # columns to transform cols_to_copy = 'y') # copy the y column over # learn treatments from data # normally we would unpack the treatments too, # but we don't need them for this example unpack[d_treated = cross_frame] <- fit_prepare(transform_design, d) knitr::kable(d_treated)
If you do not want to use the mean to fill in missing values, you can change the imputation function globally by using the parameter missingness_imputation
in the UnsupervisedTreatment
parameter list. Here, we fill in missing values with the median.
median2 <- function(x, wts) { median(x) } newparams = unsupervised_parameters( list(missingness_imputation = median2) ) transform_design = vtreat::UnsupervisedTreatment( var_list = c('x', 'w'), # columns to transform cols_to_copy = 'y', # copy the y column over params = newparams ) unpack[d_treated = cross_frame] <- fit_prepare(transform_design, d) knitr::kable(d_treated)
You can also use a constant value instead of a function. Here we replace missing values with the value -1.
newparams = unsupervised_parameters( list(missingness_imputation = -1) ) transform_design = vtreat::UnsupervisedTreatment( var_list = c('x', 'w'), # columns to transform cols_to_copy = 'y', # copy the y column over params = newparams ) unpack[d_treated = cross_frame] <- fit_prepare(transform_design, d) knitr::kable(d_treated)
You can control the imputation strategy per column via the map imputation_map
. Any column not named in the imputation map will use the imputation strategy specified by the missingness_imputation
parameter (which is the mean by default).
Here we use the maximum value to fill in the missing values for x
and the value 0
to fill in the missing values for w
.
max2 <- function(x, wts) { max(x) } transform_design = vtreat::UnsupervisedTreatment( var_list = c('x', 'w'), # columns to transform cols_to_copy = 'y', # copy the y column over imputation_map = list( x = max2, w = 0 ) ) unpack[d_treated = cross_frame] <- fit_prepare(transform_design, d) knitr::kable(d_treated)
If we don't specify a column, vtreat
looks atmissingness_imputation
(in this case, -1
).
newparams = unsupervised_parameters( list(missingness_imputation = -1) ) transform_design = vtreat::UnsupervisedTreatment( var_list = c('x', 'w'), # columns to transform cols_to_copy = 'y', # copy the y column over params = newparams, # default missingness value imputation_map = list( # custom imputations x = max2 ) ) unpack[d_treated = cross_frame] <- fit_prepare(transform_design, d) knitr::kable(d_treated)
If missingness_imputation
is not specified, vtreat
uses a weighted mean.
transform_design = vtreat::UnsupervisedTreatment( var_list = c('x', 'w'), # columns to transform cols_to_copy = 'y', # copy the y column over imputation_map = list( # custom imputations x = max2 ) ) unpack[d_treated = cross_frame] <- fit_prepare(transform_design, d) knitr::kable(d_treated)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.