knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" ) library(PrepR)
PrepR
is a package for R to help preprocessing in machine learning tasks.
There are certain repetitive tasks that come up often when doing a machine learning project and this package aims to alleviate those chores.
Some of the issues that come up regularly are: finding the types of each column in a dataframe, splitting the data (whether into train/test sets or train/test/validation sets, one-hot encoding, and scaling features.
This package will help with all of those tasks.
Please find detailed descriptions of the functions in this vignette.
You can install the released version of PrepR from CRAN with:
# install.packages("PrepR")
And the development version from GitHub with:
# install.packages("devtools") devtools::install_github("UBC-MDS/PrepR")
This package has the following features:
train_valid_test_split
: This function splits the data set into train, validation, and test sets.
data_type
: This function identifies data types for each column/feature. It returns one dataframe for each type of data.
one-hot
: This function performs one-hot encoding on the categorical features and returns a dataframe with sensible column names.
scaler
: This function performs standard scaling on the numerical features.
R version 3.6.1 and R packages:
library(PrepR)
Identify features of different data types
data_type(my_data)$num
data_type(my_data)$cat
Examples:
fruits_df <- data.frame(fruits = c("apple", "pear", "apple", "banana", "orange"), count = c(1L, 2L, 3L, 4L, 5L), price = c(2, 3.4, 0.5, 7, 8)) fruits_y <- data.frame(target = c(1, 2, 3, 4, 5)) fruits_df
data_type(fruits_df)$num
Train, validation, and test split
train_valid_test_split(X, y, test_size, valid_size, train_size, stratify, random_state, shuffle)
Examples:
train_valid_test_split(fruits_df, fruits_y)$x_train
One-hot encode features of categorical type
onehot(my_data)
Examples:
onehot(data_type(fruits_df)$cat)
Standard Scaling of categorical features
scaler(x_train, x_test, colnames)$x_train
scaler(x_train, x_test, colnames)$x_test
Examples:
num_var <- data_type(fruits_df)$num scaler(num_var, num_var, num_var, c("count", "price"))$x_train
There are no packages that do everything that this one does, but there are packages for machine learning in R that this package will make use of.
The caret
package in R does some of these preprocessing steps, such as train/test split.
It does not, however, have a function that takes a dataframe and returns multiple dataframes based on their column type;
this is the issue that this package's data_type
function seeks to solve.
However, it does not seem that there is an option for having a validation set in caret
's version of train/test split.
This is an option that would be useful to anyone doing machine learning.
The caret
package also does one-hot encoding with the function dummyVars
, though onehot()
in the PrepR package is more intuitive:
its function does not return meaningful, human-readable column names.
It also removes one column by default, which is better for fast computation, but worse for human-readability.
Overall, this package fits in well with the R ecosystem and helps make machine learning a little easier.
The official documentation is hosted on Read the Docs: https://ubc-mds.github.io/PrepR/
This package was created using the Whole Game Chapter from the R Packages handbook by Hadley Wickham
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.