In UBC-MDS/PrepR: Makes Preprocessing Easier

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

library(PrepR)

PrepR

Package Summary

PrepR is an R package that helps with preprocessing in machine learning tasks. Certain repetitive chores come up in almost every machine learning project, and this package aims to take care of them. The most common are: finding the type of each column in a dataframe, splitting the data (into train/test sets or train/validation/test sets), one-hot encoding, and scaling features. This package helps with all of those tasks.

Please find detailed descriptions of the functions in this vignette.

Installation:

You can install the released version of PrepR from CRAN with:

# install.packages("PrepR")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("UBC-MDS/PrepR")

Features

This package has the following features:

train_valid_test_split: This function splits the data set into train, validation, and test sets.
data_type: This function identifies data types for each column/feature. It returns one dataframe for each type of data.
onehot: This function performs one-hot encoding on the categorical features and returns a dataframe with sensible column names.
scaler: This function performs standard scaling on the numerical features.

Dependencies

R version 3.6.1 and R packages:

dplyr
gensvm
tibble
nlme
magrittr

Usage

library(PrepR)

Identify features of different data types

data_type(my_data)$num

data_type(my_data)$cat

Examples:

fruits_df <- data.frame(
  fruits = c("apple", "pear", "apple", "banana", "orange"),
  count = c(1L, 2L, 3L, 4L, 5L),
  price = c(2, 3.4, 0.5, 7, 8)
)
fruits_y <- data.frame(target = c(1, 2, 3, 4, 5))

fruits_df

data_type(fruits_df)$num
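
For readers who want to see the idea without installing the package, the following is a minimal base-R sketch of splitting a dataframe by column type. It is not PrepR's implementation; the list name by_type and the rule "numeric versus everything else" are assumptions made for illustration.

```r
# Base-R illustration only; this is not how PrepR implements data_type().
fruits_df <- data.frame(
  fruits = c("apple", "pear", "apple", "banana", "orange"),
  count = c(1L, 2L, 3L, 4L, 5L),
  price = c(2, 3.4, 0.5, 7, 8)
)

# Flag numeric columns (integer and double both count as numeric).
is_num <- vapply(fruits_df, is.numeric, logical(1))

# Hypothetical result shape: one dataframe per type.
by_type <- list(
  num = fruits_df[, is_num, drop = FALSE],
  cat = fruits_df[, !is_num, drop = FALSE]
)

names(by_type$num)
names(by_type$cat)
```

Here count and price land in the numeric dataframe and fruits in the categorical one; drop = FALSE keeps a single-column result as a dataframe rather than a vector.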

Train, validation, and test split

train_valid_test_split(X, y, test_size, valid_size, train_size, stratify, random_state, shuffle)

Examples:

train_valid_test_split(fruits_df, fruits_y)$x_train
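
The mechanics of a three-way split can be sketched in base R as follows. This is only an illustration under assumed settings (a 60/20/20 split, a fixed seed, simulated data, and hypothetical element names other than x_train); it does not reproduce PrepR's defaults or its stratification logic.

```r
# Base-R sketch of a train/validation/test split; not PrepR's implementation.
set.seed(123)  # assumed seed, used only to make the shuffle reproducible

n <- 100
X <- data.frame(feature = rnorm(n))
y <- data.frame(target = rnorm(n))

# Assumed proportions: 60% train, 20% validation, 20% test.
n_train <- round(0.6 * n)
n_valid <- round(0.2 * n)

# Shuffle the row indices once, then cut them into three disjoint blocks.
idx <- sample(n)
train_idx <- idx[seq_len(n_train)]
valid_idx <- idx[n_train + seq_len(n_valid)]
test_idx <- idx[(n_train + n_valid + 1):n]

# Hypothetical result shape: a named list of dataframes.
parts <- list(
  x_train = X[train_idx, , drop = FALSE],
  x_valid = X[valid_idx, , drop = FALSE],
  x_test = X[test_idx, , drop = FALSE],
  y_train = y[train_idx, , drop = FALSE],
  y_valid = y[valid_idx, , drop = FALSE],
  y_test = y[test_idx, , drop = FALSE]
)

vapply(parts, nrow, integer(1))
```

Shuffling the indices once and slicing them guarantees that no row appears in more than one set, and that X and y stay aligned.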

One-hot encode features of categorical type

onehot(my_data)

Examples:

onehot(data_type(fruits_df)$cat)
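
To show what one-hot encoding produces, here is a base-R sketch using model.matrix from the stats package. It is a stand-in technique, not PrepR's onehot(); the underscore naming scheme (fruits_apple and so on) is an assumption, and PrepR's actual column names may differ.

```r
# Base-R illustration of one-hot encoding; not PrepR's onehot().
cat_df <- data.frame(
  fruits = factor(c("apple", "pear", "apple", "banana", "orange"))
)

# "- 1" drops the intercept so that every level gets its own indicator column.
encoded <- as.data.frame(model.matrix(~ fruits - 1, data = cat_df))

# Assumed naming scheme: <column>_<level>.
names(encoded) <- sub("^fruits", "fruits_", names(encoded))

encoded
```

Each row has exactly one 1, in the column matching that row's fruit, and the columns follow the alphabetical order of the factor levels.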

Standard scaling of numerical features

scaler(x_train, x_valid, x_test, colnames)$x_train

scaler(x_train, x_valid, x_test, colnames)$x_test

Examples:

num_var <- data_type(fruits_df)$num

scaler(num_var, num_var, num_var, c("count", "price"))$x_train
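
Standard scaling itself can be sketched in base R as below. This is an illustration, not PrepR's scaler(): the toy train and test dataframes are made up, and the choice to compute the mean and standard deviation on the training set only and reuse them for the other set is the usual convention, assumed here rather than confirmed from the package.

```r
# Base-R sketch of standard scaling; not PrepR's scaler().
train <- data.frame(count = c(1, 2, 3, 4), price = c(2, 3.4, 0.5, 7))
test <- data.frame(count = 5, price = 8)
cols <- c("count", "price")

# Estimate the statistics on the training set only.
mu <- colMeans(train[cols])
sigma <- vapply(train[cols], sd, numeric(1))  # sample sd (n - 1 denominator)

# Apply the same centering and scaling to both sets.
train_scaled <- train
test_scaled <- test
train_scaled[cols] <- as.data.frame(scale(train[cols], center = mu, scale = sigma))
test_scaled[cols] <- as.data.frame(scale(test[cols], center = mu, scale = sigma))

train_scaled
test_scaled
```

After scaling, each training column has mean 0 and standard deviation 1, while the test values are expressed in training-set standard deviations.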

Our package in the R ecosystem

No single package does everything that PrepR does, although PrepR makes use of existing R packages for machine learning. The caret package covers some of these preprocessing steps, such as the train/test split, but its split does not appear to offer a validation set, an option that would be useful to anyone doing machine learning. caret also has no function that takes a dataframe and returns multiple dataframes based on column type; this is the gap that PrepR's data_type function fills. Finally, caret does one-hot encoding with the function dummyVars, though onehot() in PrepR is more intuitive: dummyVars does not return meaningful, human-readable column names. It also removes one column by default, which is better for fast computation but worse for human readability.

Overall, this package fits in well with the R ecosystem and helps make machine learning a little easier.