In UBC-MDS/laundRy: laundRy - preprocessing package for dataframes

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

laundRy

The laundRy package performs many standard preprocessing techniques for Tidyverse tibbles, before use in statistical analysis and machine learning. The package functionality includes categorizing column types, handling missing data and imputation, transforming/standardizing columns and feature selection. The laundRy package aims to remove much of the grunt work in the typical data science workflow, allowing the analyst maximum time and energy to devote to modelling!

Installation

You can install the released version of laundRy from CRAN with:

install.packages("laundRy")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("UBC-MDS/laundRy")

Features

categorize: This function will take in a dataframe, and output a list of lists with column types as list labels (numerical, categorical, text), and each list containing the column names associated with each column type.
fill_missing: This function takes in a dataframe and depending on user input, will either remove all rows with missing values, or will fill missing values using mean, median, or regression imputation.
transform_columns: This function will take in a dataframe and apply pre-processing techniques to each column. Categorical columns will be transformed with a One Hot Encoding and numerical columns will be scaled.
feature_selector: This function takes in a dataframe which has X and y columns specified, a target task (Regression or Classification), and a maximum number of features to select. The function returns the most important features for the target task.

Dependencies

TO DO

Usage

TODO

LaundRy in the R ecosystem

mice offers similar functionality for the fill_missing function, but is not integrated with a column categorizer.
The main feature selection and preprocessing package in R is caret, which carries out similar functionality to our feature_selector function though laundRy makes the workflow more efficient and adds imputation.
As far as we know, there are no similar packages for Categorizing Columns and providing a list of the categorized columns. laundRy is the first package we are aware of to abstract away the full dataframe pre-processing workflow with a unified and simple API.