Nina Zumel and John Mount updated February 2020
This article documents vtreat's "fit_prepare" variation for unsupervised problems. This API was inspired by the pyvtreat API, which was in turn based on the .fit(), .transform(), .fit_transform() workflow of scikit-learn in Python.

The same example in the original R vtreat notation can be found here. The same example in the Python version of vtreat can be found here.
Load modules/packages.
library(vtreat)
packageVersion('vtreat')
suppressPackageStartupMessages(library(ggplot2))
library(WVPlots)
library(rqdatatable)
Generate example data.
- y is a noisy sinusoidal plus linear function of the variable x
- xc is a categorical variable that represents a discretization of y, along with some NaNs
- x2 is a pure noise variable with no relationship to the output
- x3 is a constant variable

set.seed(2020)

make_data <- function(nrows) {
    d <- data.frame(x = 5*rnorm(nrows))
    d['y'] = sin(d[['x']]) + 0.01*d[['x']] + 0.1*rnorm(n = nrows)
    d[4:10, 'x'] = NA                    # introduce NAs
    d['xc'] = paste0('level_', 5*round(d$y/5, 1))
    d['x2'] = rnorm(n = nrows)
    d['x3'] = 1
    d[d['xc']=='level_-1', 'xc'] = NA    # introduce a NA level
    return(d)
}

d = make_data(500)

d %.>%
    head(.) %.>%
    knitr::kable(.)
Check how many levels xc has, and their distribution (including NaN).
unique(d['xc'])
table(d$xc, useNA = 'always')
The vtreat package is primarily intended for data treatment prior to supervised learning, as detailed in the Classification and Regression examples. In these situations, vtreat specifically uses the relationship between the inputs and the outcomes in the training data to create certain types of synthetic variables. We call these more complex synthetic variables y-aware variables.
However, you may also want to use vtreat for basic data treatment for unsupervised problems, when there is no outcome variable. Or, you may not want to create any y-aware variables when preparing the data for supervised modeling. For these applications, vtreat is a convenient alternative to model.matrix() that keeps information about the levels of factor variables observed in the training data, and can manage novel levels that appear in future data.

In any case, we still want training data where all the input variables are numeric and have no missing values or NaNs.
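As a quick illustration of why this matters (this check is not part of the original example), model.matrix() with R's default na.action silently drops rows that contain missing values, so modeling directly on d would lose data:

# not part of the original example: with the default na.action,
# model.matrix() drops the rows of d where x or xc is NA
nrow(d)                                   # all rows
nrow(model.matrix(~ x + xc, data = d))    # fewer rows: NA rows were dropped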
First create the data treatment transform object, in this case a treatment for an unsupervised problem.
transform_design <- vtreat::UnsupervisedTreatment(
    var_list = setdiff(colnames(d), c('y')),  # columns to transform
    cols_to_copy = 'y'                        # copy y to the treated data
)

# learn transform from data
treatment_plan <- fit(transform_design, d)

# prepare the data using the treatment plan
d_prepared <- prepare(treatment_plan, d)

# for unsupervised problems, fit_transform(transform_design, d)
# will produce the same treatment plan and treated data set
# as the above, in one step
#
# unpack[treatment_plan = treatments,
#        d_prepared = cross_frame] <- fit_prepare(transform_design, d)

# list the derived variables
get_feature_names(treatment_plan)
The treated training set should be clean: completely numeric, with no missing values.
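As a quick check (not part of the original example), we can confirm that every column of d_prepared is numeric and that nothing is missing:

# confirm the prepared frame is all-numeric with no missing values
sapply(d_prepared, class)
sum(is.na(d_prepared))   # should be 0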
Now examine the score frame, which gives information about each new variable, including its type and which original variable it is derived from. Some of the columns of the score frame (rsq, sig) are not relevant to the unsupervised case; those columns are used by the Regression and Classification transforms.
score_frame <- get_score_frame(treatment_plan)

knitr::kable(score_frame)
Notice that the variable xc has been converted to multiple variables:

- an indicator variable for each possible level, including NA or missing (xc_lev*)
- a variable indicating how prevalent each level of xc is in the training data (xc_catP)

The numeric variable x has been converted to two variables:

- a clean version of x that has no NaNs or missing values
- a variable indicating when x was NaN or NA in the original data (x_isBAD)

Any or all of these new variables are available for downstream modeling.
Also note that the variable x3 does not appear in the score frame (or in the treated data), as it had no range (it didn't vary), so the unsupervised treatment dropped it.
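One way to see this mapping directly (a small sketch, not part of the original example) is to group the derived variable names in the score frame by the original variable they came from; x3 should not appear among the origins:

# group derived variable names by the original column they were derived from
split(score_frame$varName, score_frame$origName)

# x3 had no variation, so it should not appear as an origin
'x3' %in% score_frame$origName   # expect FALSE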
Let's look at the top of d_prepared, which includes all the new variables, plus y (and excludes x3).
d_prepared %.>%
    head(.) %.>%
    knitr::kable(.)
Of course, what we really want to do with the prepared training data is to model.
Let's start with an unsupervised analysis: clustering.
# don't use y to cluster
not_variables <- c('y')
model_vars <- setdiff(colnames(d_prepared), not_variables)

clusters = kmeans(d_prepared[, model_vars, drop = FALSE], centers = 5)

d_prepared['clusterID'] <- clusters$cluster
head(d_prepared$clusterID)

ggplot(data = d_prepared, aes(x = x, y = y, color = as.character(clusterID))) +
    geom_point() +
    ggtitle('y as a function of x, points colored by (unsupervised) clusterID') +
    theme(legend.position = "none") +
    scale_colour_brewer(palette = "Dark2")
Since in this case we have an outcome variable, y, we can try fitting a linear regression model to d_prepared.
f <- wrapr::mk_formula('y', model_vars)

model = lm(f, data = d_prepared)

# now predict
d_prepared['prediction'] = predict(
    model,
    newdata = d_prepared)

# look at the fit (on the training data)
WVPlots::ScatterHist(
    d_prepared, xvar = 'prediction', yvar = 'y',
    smoothmethod = 'identity',
    estimate_sig = TRUE,
    title = 'Relationship between prediction and y')
Now apply the model to new data.
# create the new data
dtest <- make_data(450)

# prepare the new data with vtreat
dtest_prepared = prepare(treatment_plan, dtest)
dtest_prepared$y = dtest$y

# apply the model to the prepared data
dtest_prepared['prediction'] = predict(
    model,
    newdata = dtest_prepared)

# compare the predictions to the outcome (on the test data)
WVPlots::ScatterHist(
    dtest_prepared, xvar = 'prediction', yvar = 'y',
    smoothmethod = 'identity',
    estimate_sig = TRUE,
    title = 'Relationship between prediction and y')

# get r-squared
sigr::wrapFTest(dtest_prepared,
                predictionColumnName = 'prediction',
                yColumnName = 'y',
                nParameters = length(model_vars) + 1)
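A small check, not in the original example: rows of dtest where x was missing should come through prepare() with the x_isBAD indicator set, and with x filled in by the imputation learned during fit (by default, the mean of x in the training data):

# rows of the new data where x was missing in the original
missing_x_rows <- which(is.na(dtest$x))

# for those rows, x should be filled in and x_isBAD should be 1
head(dtest_prepared[missing_x_rows, c('x', 'x_isBAD')])

# the filled-in value should match the training-data mean of x
mean(d$x, na.rm = TRUE)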
Parameters for UnsupervisedTreatment

We've tried to set the defaults for all parameters so that vtreat is usable out of the box for most applications.
# show the default parameter settings
unsupervised_parameters()
Some parameters of note include:
- codeRestriction: The types of synthetic variables that vtreat will (potentially) produce. By default, all applicable types are produced. See Types of prepared variables below; a short sketch using codeRestriction appears after that list.
- minFraction (default: 0): For categorical variables, indicator variables (type lev) are only produced for levels that are present at least minFraction of the time. A consequence of this is that 1/minFraction is the maximum number of indicators that will be produced for a given categorical variable. By default, all possible indicator variables are produced.
- missingness_imputation: The function or value that vtreat uses to impute or "fill in" missing numerical values. The default is mean. To change the imputation function, or to use different functions/values for different columns, see the Imputation example.
- customCoders: For passing in user-defined transforms for custom data preparation. This won't be needed in most situations, but see here for an example of applying a GAM transform to input variables.
# calculate the prevalence of each level of xc by hand, including NA
table(d$xc, useNA = "ifany")/nrow(d)

# create a parameter list, overriding the default for minFraction
newparams = unsupervised_parameters(
    list(minFraction = 0.2)  # only make indicators for levels that appear more than 20% of the time
)

transform_common = UnsupervisedTreatment(
    var_list = setdiff(colnames(d), c('y')),  # columns to transform
    params = newparams                        # set the parameters
)

# learn transform from data
treatment_plan <- fit(transform_common, d)

# prepare the data using the treatment plan
d_prepared <- prepare(treatment_plan, d)

# examine the score frame
knitr::kable(get_score_frame(treatment_plan))
In this case, the unsupervised treatment only created indicator variables for the two most common levels, level_1 and NA, which are both present more than 20% of the time.
In unsupervised situations, this may only be desirable when there is an unworkably large number of possible levels (for example, when using ZIP code as a variable). It is more useful in conjunction with the y-aware variables produced by NumericOutcomeTreatment, BinomialOutcomeTreatment, or MultinomialOutcomeTreatment.
Types of prepared variables

- clean: Produced from numerical variables: a clean numerical variable with no NaNs or missing values
- lev: Produced from categorical variables, one for each level: for each level of the variable, indicates if that level was "on"
- catP: Produced from categorical variables: indicates how often each level of the variable was "on" (its prevalence)
- isBAD: Produced for numerical variables: an indicator variable that marks when the original variable was missing or NaN
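As a short sketch (not in the original example, and assuming codeRestriction accepts the code names shown in the score frame), the codeRestriction parameter described above can limit the treatment to a subset of these variable types; here we keep only the clean and isBAD variables:

# restrict the unsupervised treatment to 'clean' and 'isBAD' variables only
restricted_params = unsupervised_parameters(
    list(codeRestriction = c('clean', 'isBAD'))
)

transform_restricted = UnsupervisedTreatment(
    var_list = setdiff(colnames(d), c('y')),
    params = restricted_params
)

treatment_plan_restricted <- fit(transform_restricted, d)

# the derived variables should now all be of type clean or isBAD
get_score_frame(treatment_plan_restricted)[, c('varName', 'code')]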
In all cases (classification, regression, unsupervised, and multinomial classification) the intent is that vtreat transforms are essentially one-liners.
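For instance, the unsupervised preparation above can be written as a single expression (a compact restatement of the earlier steps, not an additional feature):

# the fit-then-prepare steps from earlier, as one expression
d_prepared_again <- prepare(
    fit(vtreat::UnsupervisedTreatment(
            var_list = setdiff(colnames(d), c('y')),
            cols_to_copy = 'y'),
        d),
    d)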
The preparation commands are organized as follows:

- R regression example, fit/prepare interface
- R regression example, design/prepare/experiment interface
- Python regression example
- R classification example, fit/prepare interface
- R classification example, design/prepare/experiment interface
- Python classification example
- R unsupervised example, fit/prepare interface
- R unsupervised example, design/prepare/experiment interface
- Python unsupervised example
- R multinomial classification example, fit/prepare interface
- R multinomial classification example, design/prepare/experiment interface
- Python multinomial classification example

These current revisions of the examples are designed to be small, yet complete. So as a set they have some overlap, but the user can rely mostly on a single example for a single task type.