The preprocessr package aims to facilitate the preprocessing portion of a machine learning pipeline.

Here is what the package provides:

Below is an example of a typical usage of the package.

Data

We will use an artificial dataset to illustrate usage of the package.

set.seed(123)

data <- 
  dplyr::tibble(
    ID = 1:100,
    A = c(rep("a1", 50), rep("a2", 50)),
    B = sample(c("b1", "b2", NA), 100, replace=TRUE),
    C = rnorm(100),
    D = 1,
    E = sample(c(10,20,30,40,NA), 100, replace=TRUE)
  )

data

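The dataset above contains an identifier column (ID), a character column with two values and no missing data (A), a character column with missing values (B), a numeric column (C), a constant column (D), and a numeric column with missing values (E).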

Preprocessing

Here is one way we could preprocess this dataset:

library(dplyr)
library(preprocessr)

prep <- 
  preprocess(
    remove_vars("ID"),
    remove_constants(),
    impute(numerics, median, na.rm=TRUE),
    impute(nonnumerics, most_frequent),
    bin("C", 10),
    encode_one_hot("A"),
    encode_numeric("B")
  )

prep$fit_transform(data) %>% as_tibble()

Transform test data

Typically, you will want to apply the same preprocessing steps to a test dataset that is different from the one you used to fit your preprocessing pipeline.
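To make the fit/transform split concrete, here is a minimal base R sketch of the idea for a single step, median imputation. It is only an illustration and not how preprocessr is implemented internally; the vectors and the `impute_with` helper are hypothetical names introduced for this example.

```r
# Toy training and test columns (hypothetical values).
train_e <- c(10, 20, NA, 40, 30)
test_e  <- c(NA, 20, NA)

# "Fit": learn the statistic from the training data only.
fitted_median <- median(train_e, na.rm = TRUE)

# "Transform": reuse the stored statistic on any new data.
# impute_with() is a hypothetical helper, not a preprocessr function.
impute_with <- function(x, value) {
  x[is.na(x)] <- value
  x
}

test_imputed <- impute_with(test_e, fitted_median)
test_imputed
```

The missing values in the test column are filled with the median learned from the training column, not with a statistic computed on the test data itself.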

test <- 
  dplyr::tibble(
    ID = 1:100,
    A = c(rep("a1", 50), rep("a2", 50)),
    B = sample(c("b1", "b2", "b3", NA), 100, replace=TRUE),
    C = rnorm(100),
    D = 1,
    E = sample(c(10,20,30,40,NA), 100, replace=TRUE)
  )

prep$transform(test) %>% as_tibble()

Looking at the definition of column B in the test set, we see a value ("b3") that was not in the train set. The numeric encoding learned from the train set has no mapping for this value, hence the NAs in the output.
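The same effect can be reproduced in base R. The sketch below assumes a simple level-to-integer mapping built from the training values; it is not necessarily how encode_numeric works internally, and the toy vectors are hypothetical.

```r
# Toy training and test values (hypothetical).
train_b <- c("b1", "b2", "b1", "b2")
test_b  <- c("b1", "b3", "b2")

# Build the mapping from the training levels only.
train_levels <- sort(unique(train_b))

# match() returns NA for any value absent from the training levels,
# so the unseen "b3" has no integer code.
encoded <- match(test_b, train_levels)
encoded
```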

To take care of this potential issue, it's a good idea to use the factorize function. This function, as the name suggests, turns character variables into factors, and takes care of new levels in the test set (unseen in the train set) by replacing them with the most frequent level seen in the train set.
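The replacement rule described above can be sketched in base R as follows. This is an illustration of the idea only, under the assumption that unseen values are swapped for the most frequent training level before the factor is built; `most_frequent_level` is a hypothetical helper, not part of preprocessr.

```r
# Hypothetical helper: the most frequent value in a character vector.
# table() ignores NA by default.
most_frequent_level <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Toy training and test values (hypothetical).
train_b <- c("b1", "b2", "b1", "b1")
test_b  <- c("b2", "b3", "b1")

# Learn the levels and the fallback from the training data only.
train_levels <- sort(unique(train_b))
fallback <- most_frequent_level(train_b)

# Replace values not seen in training, then build the factor.
test_fixed <- ifelse(test_b %in% train_levels, test_b, fallback)
test_factor <- factor(test_fixed, levels = train_levels)
test_factor
```

Because the unseen "b3" is replaced before encoding, the resulting factor has no missing values and uses only levels known from the training data.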

prep <- 
  preprocess(
    remove_vars("ID"),
    remove_constants(),
    impute(numerics, median, na.rm=TRUE),
    impute(nonnumerics, most_frequent),
    bin("C", 10),
    factorize(),
    encode_one_hot("A"),
    encode_numeric("B")
  )

prep$fit(data)

prep$transform(test) %>% as_tibble()


rtsho/preprocessr documentation built on May 29, 2019, 8:58 a.m.