Conditioner

Starting with version 1.6.0, the R version of vtreat exposes an additional fit_prepare interface, modeled on the API of the Python version of vtreat.

The idea comes from scikit-learn's Pipeline. It works as follows.

For each of the common modeling tasks, the user constructs the appropriate, uninitialized data treatment object (a "spec"):

BinomialOutcomeTreatment (binary classification)
NumericOutcomeTreatment (regression)
UnsupervisedTreatment (unsupervised problems)
MultinomialOutcomeTreatment (multiclass classification)
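As a minimal sketch of constructing the four specs, the following assumes the constructor arguments var_list, outcome_name, and outcome_target from the vtreat package. The column names (x1, x2, y, yc, ym) are hypothetical and stand in for columns of your own data frame. No data is supplied at this point, so each object is still unfitted.

```r
library(vtreat)

# Hypothetical input columns, used only for illustration.
vars <- c("x1", "x2")

# Regression: numeric outcome in the (hypothetical) column "y".
spec_n <- NumericOutcomeTreatment(var_list = vars, outcome_name = "y")

# Binary classification: outcome column "yc", with TRUE as the class of interest.
spec_c <- BinomialOutcomeTreatment(
  var_list = vars,
  outcome_name = "yc",
  outcome_target = TRUE
)

# Multiclass classification: outcome column "ym" holds the class label.
spec_m <- MultinomialOutcomeTreatment(var_list = vars, outcome_name = "ym")

# Unsupervised problems: no outcome column is named.
spec_u <- UnsupervisedTreatment(var_list = vars)
```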

Each object defines three primary methods:

fit()
prepare()
fit_prepare()

They work as follows.

fit(): Takes a spec and training data and returns the data preparation plan ("treatment plan") fitted to that data.
prepare(): Takes a treatment plan and new data and returns new treated data. This is notationally identical to vtreat's existing prepare() function.
fit_prepare(): Takes a spec and training data and returns both a treatment plan and a treated data set suitable for fitting a downstream model.
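The three methods can be sketched on a small regression problem. This is an illustration under stated assumptions, not a definitive recipe: the data is synthetic, the column names x, xc, and y are made up, and the result of fit_prepare() is assumed to be a list with elements treatments and cross_frame. A fresh spec is built for each call so the two fits stay independent.

```r
library(vtreat)

set.seed(2020)

# Synthetic data; all column names here are illustrative only.
make_data <- function(n) {
  d <- data.frame(
    x  = rnorm(n),
    xc = sample(c("a", "b", "c"), n, replace = TRUE),
    stringsAsFactors = FALSE
  )
  d$x[sample.int(n, 5)] <- NA  # a few missing values for vtreat to handle
  d$y <- 2 * ifelse(is.na(d$x), 0, d$x) + (d$xc == "a") + rnorm(n)
  d
}
d_train <- make_data(200)
d_test  <- make_data(50)

new_spec <- function() {
  NumericOutcomeTreatment(var_list = c("x", "xc"), outcome_name = "y")
}

# fit(): spec + training data -> treatment plan.
plan <- fit(new_spec(), d_train)

# prepare(): treatment plan + new data -> treated data.
test_treated <- prepare(plan, d_test)

# fit_prepare(): spec + training data -> treatment plan plus a
# cross-validated treated copy of the training data.
fp <- fit_prepare(new_spec(), d_train)
plan_cv       <- fp$treatments
train_treated <- fp$cross_frame
```

A downstream model would be fit on train_treated and then applied to prepare(plan_cv, d_test).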

fit_prepare() performs the cross-validated work required to avoid nested-model bias. The nested-model bias we are working to avoid is overfitting caused by using data to design a data transform and then naively treating that same data with the transform for downstream modeling. Note that (except in the unsupervised case) fit_prepare(spec, d) is not a shorthand for fit(spec, d) %.>% prepare(., d), but in fact a different method that takes extra steps to make sure the fit and treatment plan are jointly correct.
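The effect can be illustrated in base R without vtreat. The sketch below is a simplified stand-in for the idea, not vtreat's actual implementation: it impact-codes a high-cardinality categorical variable that is pure noise, once naively (coding the same rows used to estimate the level means) and once with 5-fold cross-validation (each row is coded from level means estimated on the other folds). The variable names, the number of levels, and the number of folds are all arbitrary choices for the demonstration.

```r
set.seed(2023)
n <- 1000
d <- data.frame(
  xc = sample(paste0("lev_", 1:200), n, replace = TRUE),  # pure noise variable
  y  = rnorm(n),                                          # outcome unrelated to xc
  stringsAsFactors = FALSE
)

# Impact code: per-level mean of y estimated on fit_rows, minus the grand
# mean, looked up for the rows in apply_rows.
impact_code <- function(d, fit_rows, apply_rows) {
  grand <- mean(d$y[fit_rows])
  level_means <- sapply(split(d$y[fit_rows], d$xc[fit_rows]), mean)
  code <- unname(level_means[d$xc[apply_rows]]) - grand
  code[is.na(code)] <- 0  # levels unseen in fit_rows get a neutral code
  code
}

# Naive: design the code and apply it on the same rows.
all_rows <- seq_len(n)
naive_code <- impact_code(d, all_rows, all_rows)

# Cross-validated: each row is coded using only the other folds.
fold <- sample(rep(1:5, length.out = n))
cv_code <- numeric(n)
for (k in 1:5) {
  held_out <- which(fold == k)
  cv_code[held_out] <- impact_code(d, which(fold != k), held_out)
}

naive_cor <- cor(naive_code, d$y)  # inflated: the code has memorized y
cv_cor    <- cor(cv_code, d$y)     # close to zero, as a noise variable should be
```

The naive code looks predictive on the training data even though xc carries no signal, which is the bias a downstream model would inherit. The cross-validated code does not show this spurious relationship.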

This corresponds to the classic R vtreat notations as follows:

plan <- fit(*Treatment(), d) ~ plan <- designTreatments*(d)
prepare(plan, d) ~ prepare(plan, d)
fit_prepare(*Treatment(), d) ~ mkCrossFrame*Experiment(d)
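For the regression case, the correspondence might look like the following sketch. It assumes the classic functions designTreatmentsN() and mkCrossFrameNExperiment() (the latter returning a list with elements treatments and crossFrame), and that fit_prepare() returns a list with elements treatments and cross_frame. The data and column names are synthetic.

```r
library(vtreat)

set.seed(2021)
d <- data.frame(
  x  = rnorm(100),
  xc = sample(c("a", "b"), 100, replace = TRUE),
  stringsAsFactors = FALSE
)
d$y <- d$x + (d$xc == "a") + rnorm(100)
d_train <- d[1:80, ]
d_new   <- d[81:100, ]
vars <- c("x", "xc")

# Classic R notation.
plan_classic <- designTreatmentsN(d_train, vars, "y", verbose = FALSE)
cfe <- mkCrossFrameNExperiment(d_train, vars, "y", verbose = FALSE)
# cfe$treatments is the plan, cfe$crossFrame the cross-validated training frame.

# fit_prepare() notation.
plan_new <- fit(
  NumericOutcomeTreatment(var_list = vars, outcome_name = "y"),
  d_train
)
fp <- fit_prepare(
  NumericOutcomeTreatment(var_list = vars, outcome_name = "y"),
  d_train
)
# fp$treatments is the plan, fp$cross_frame the cross-validated training frame.

# In both notations a plan is applied to new data with prepare().
new_classic <- prepare(plan_classic, d_new)
new_fp      <- prepare(plan_new, d_new)
```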

We introduced this notation into the R version of vtreat for consistency of notation, to take advantage of the excellent scikit-learn paradigm, and to compensate for some unfortunate name choices during the early development of vtreat in R. Both notations have the same underlying implementation, and we expect to teach and maintain both paradigms.

Examples of typical modeling tasks in both notations can be found here:

Regression: R notation, fit_prepare() notation.
Binary Classification: R notation, fit_prepare() notation.
Unsupervised Coding: R notation, fit_prepare() notation.
Multinomial Classification: R notation, fit_prepare() notation.

WinVector/vtreat documentation built on Aug. 29, 2023, 4:49 a.m.


WinVector/vtreat
A Statistically Sound 'data.frame' Processor/Conditioner

Examples/fit_transform/fit_prepare_api.md
In WinVector/vtreat: A Statistically Sound 'data.frame' Processor/Conditioner
