The preprocessr package aims to facilitate the preprocessing portion of a machine learning pipeline.
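The examples below use the following package functions: preprocess, remove_vars, remove_constants, impute, bin, encode_one_hot, encode_numeric, and factorize, as well as the numerics, nonnumerics, and most_frequent helpers.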
Below is an example of a typical usage of the package.
We will use an artificial dataset to illustrate usage of the package.
    set.seed(123)
    data <- dplyr::data_frame(
      ID = 1:100,
      A = c(rep("a1", 50), rep("a2", 50)),
      B = sample(c("b1", "b2", NA), 100, replace=TRUE),
      C = rnorm(100),
      D = 1,
      E = sample(c(10,20,30,40,NA), 100, replace=TRUE)
    )
    data
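The dataset above contains an identifier column (ID), a categorical column with two levels (A), a categorical column with missing values (B), a numeric column (C), a constant column (D), and a numeric column with missing values (E).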
Here is one way we could preprocess this dataset:
    library(dplyr)
    library(preprocessr)

    prep <- preprocess(
      remove_vars("ID"),
      remove_constants(),
      impute(numerics, median, na.rm=TRUE),
      impute(nonnumerics, most_frequent),
      bin("C", 10),
      encode_one_hot("A"),
      encode_numeric("B")
    )
    prep$fit_transform(data) %>% as_data_frame
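To give a rough idea of what the two imputation steps presumably amount to, here is a minimal base R sketch. It is not preprocessr code: the helpers impute_median and mode_value are hypothetical names introduced only for this illustration, and the package's own behaviour may differ in its details.

```r
# Hypothetical helper, not part of preprocessr: replace the NAs in a
# numeric vector with the median of its observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Hypothetical helper, not part of preprocessr: return the most frequent
# non-missing value of a vector (table() drops NAs by default).
mode_value <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

e <- c(10, 20, NA, 40, NA)
impute_median(e)
#> [1] 10 20 20 40 20

b <- c("b1", "b2", NA, "b2")
b[is.na(b)] <- mode_value(b)
b
#> [1] "b1" "b2" "b2" "b2"
```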
Typically, you will want to apply the same preprocessing steps to a test dataset, different from the one you used to fit your preprocessing pipeline.
    test <- dplyr::data_frame(
      ID = 1:100,
      A = c(rep("a1", 50), rep("a2", 50)),
      B = sample(c("b1", "b2", "b3", NA), 100, replace=TRUE),
      C = rnorm(100),
      D = 1,
      E = sample(c(10,20,30,40,NA), 100, replace=TRUE)
    )
    prep$transform(test) %>% as_data_frame
Looking at the definition of column B in the test set, we see a value ("b3") that was not in the train set. The pipeline has no numeric code for this value when it tries to encode it, hence the NAs in the output.
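The same effect can be reproduced with base R alone. The sketch below is an illustration of the general problem, not of how encode_numeric is implemented: a lookup table is built from the training values only, so a value that appears only in the test set has nothing to match against.

```r
train_b <- c("b1", "b2", "b1")
test_b  <- c("b2", "b3", "b1")

# Learn the value-to-code mapping from the training values only.
codes <- sort(unique(train_b))   # "b1" "b2"

# match() returns NA for any value that is absent from the lookup table.
match(test_b, codes)
#> [1]  2 NA  1
```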
To take care of this potential issue, it's a good idea to use the factorize function. This function, as the name suggests, turns character variables into factors, and takes care of new levels in the test set (unseen in the train set) by replacing them with the most frequent level seen in the train set.
    prep <- preprocess(
      remove_vars("ID"),
      remove_constants(),
      impute(numerics, median, na.rm=TRUE),
      impute(nonnumerics, most_frequent),
      bin("C", 10),
      factorize(),
      encode_one_hot("A"),
      encode_numeric("B")
    )
    prep$fit(data)
    prep$transform(test) %>% as_data_frame
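To make the fallback behaviour described for factorize more concrete, here is a minimal base R sketch of the idea. It is an assumption-laden illustration, not the package's implementation: fit_levels and apply_levels are hypothetical helpers, and the sketch leaves missing values untouched.

```r
# Hypothetical helper, not part of preprocessr: record the levels seen in
# the training data and the most frequent one among them.
fit_levels <- function(x) {
  counts <- table(x)   # NAs are dropped by default
  list(
    levels   = names(counts),
    fallback = names(counts)[which.max(counts)]
  )
}

# Hypothetical helper, not part of preprocessr: replace values unseen at
# fit time with the stored fallback, then convert to a factor.
apply_levels <- function(x, fitted) {
  unseen <- !is.na(x) & !(x %in% fitted$levels)
  x[unseen] <- fitted$fallback
  factor(x, levels = fitted$levels)
}

fitted <- fit_levels(c("b1", "b2", "b2", NA))
result <- apply_levels(c("b1", "b3", "b2"), fitted)
as.character(result)
#> [1] "b1" "b2" "b2"
```

Here the unseen value "b3" is mapped to "b2", the most frequent level in the training vector, so a later numeric encoding step has a code for every value.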