---
title: "The vtreat R package: a statistically sound data processor for predictive modeling"
tags:
  - R
  - data science
  - predictive modeling
  - classification
  - regression
  - data preparation
  - significance
  - dimensionality reduction
  - reproducible research
  - cross-validation
authors:
  - name: John Mount
    orcid: 0000-0002-3696-2012
    email: jmount@win-vector.com
    affiliation: 1
  - name: Nina Zumel
    orcid: 0000-0001-8831-0190
    email: nzumel@win-vector.com
    affiliation: 1
affiliations:
  - name: Win-Vector, LLC
    index: 1
date: 9 February 2018
bibliography: paper.bib
---
When applying statistical methods or machine learning techniques to real-world data, there are common data issues that can cause modeling to fail. The vtreat package (@vtreat) is an R data frame processor that prepares messy real-world data for predictive modeling in a reproducible and statistically sound manner.
The package's objective is to produce clean data frames that preserve the original information and are safe for model training and model application. Vtreat does this by collecting statistics from training data in order to produce a treatment plan. Vtreat then uses this treatment plan to process subsequent data frames prior to both model training and model application. The processed data frame is guaranteed to be purely numeric, with no missing or NaN values, and no string or categorical values.
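The design-then-apply workflow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' own example: the data, the column names `x_num`, `x_cat`, and `y`, and the helper `make_data()` are hypothetical. `designTreatmentsN()` and `prepare()` are the vtreat functions for a numeric outcome, and the sketch assumes the package is installed.

```r
# Assumes vtreat is installed: install.packages("vtreat")
library(vtreat)

set.seed(2018)

# Hypothetical helper: simulate a messy frame with missing values in both
# a numeric and a categorical column, plus a numeric outcome y.
make_data <- function(n) {
  d <- data.frame(
    x_num = rnorm(n),
    x_cat = sample(c("a", "b", "c"), n, replace = TRUE),
    stringsAsFactors = FALSE
  )
  # Build the outcome before injecting NAs so y itself is complete.
  d$y <- 2 * d$x_num + ifelse(d$x_cat == "a", 1, 0) + rnorm(n)
  d$x_num[sample.int(n, n %/% 10)] <- NA
  d$x_cat[sample.int(n, n %/% 10)] <- NA
  d
}

# Step 1: collect statistics from design data to build a treatment plan.
d_design <- make_data(200)
plan <- designTreatmentsN(
  d_design,
  varlist = c("x_num", "x_cat"),
  outcomename = "y",
  verbose = FALSE
)

# Step 2: apply the plan to a later frame, including a missing value
# and a categorical level that never appeared in the design data.
d_new <- data.frame(
  x_num = c(0.3, NA, -1.2),
  x_cat = c("a", "never_seen", NA),
  stringsAsFactors = FALSE
)
# pruneSig = NULL keeps every derived variable (no significance pruning).
new_treated <- prepare(plan, d_new, pruneSig = NULL)

# Every resulting column should be numeric, with no missing values.
str(new_treated)
```

Here the plan is built on one frame and applied to a different one. Because the plan uses the outcome, the package documentation advises against naively preparing the same rows used to design it; either set aside separate calibration data or use the cross-frame functions such as `mkCrossFrameNExperiment()`.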
Vtreat serves as a powerful alternative to R's native model.matrix construct. The goals of the package differ from those of training harness systems such as caret (@caret) and unsupervised ad-hoc processing systems such as recipes (@recipes).
In particular, vtreat emphasizes safe but y-aware (supervised) pre-processing of data for predictive modeling tasks. It automates:
Vtreat is careful to automate only domain-agnostic data cleaning steps that are common to many applications. This intentionally leaves domain-specific processing to the researcher and their own appropriate tools.
Using vtreat avoids the perils of ad-hoc data treatment and provides a reproducible, documented, and citable data treatment procedure.
For more details and further discussion, please see our expository article @vtreatX and the package online documentation.