knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
mealprepR offers a toolkit, made with care, to help users save time in the data preprocessing kitchen.
Recognizing that the preparation step of a data science project often requires the most time and effort, mealprepR
aims to help data science chefs of all specialties master their recipes of analysis. This package tackles pesky tasks such as classifying columns as categorical or numeric ingredients, straining NA values and outliers, and automating a preprocessing recipe pipeline.
find_fruits_veg(): This function returns the indices of numeric and categorical variables in the dataset.
find_missing_ingredients(): For each column with missing values, this function will create a reference list of row indices, count the missing values, and calculate their proportion.
find_bad_apples(): This function uses a univariate approach to outlier detection. For each column with outliers (values that are 2 or more standard deviations from the mean), this function will create a reference list of row indices with outliers and report the total number of outliers in that column.
make_recipe(): This function is used to quickly apply common data preprocessing techniques.
mealprepR complements many of the existing packages in the R ecosystem around the theme of data preprocessing. With respect to categorizing the columns of a dataframe by type, the base R code sapply(yourdataframe, class) will produce the data types of each column. However, there appears to be no easy way to divide the columns into categorical and numerical groups. find_fruits_veg() aims to fill this void.
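To make the gap concrete, here is a minimal base R sketch of grouping column indices by type. The helper split_column_types() is hypothetical and is not the mealprepR implementation; it also assumes that every non-numeric column counts as categorical, which may not match how find_fruits_veg() decides.

```r
# Hypothetical helper (not part of mealprepR): group column indices by type.
# Assumption: any column that is not numeric is treated as categorical.
split_column_types <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  list(
    numeric = unname(which(is_num)),
    categorical = unname(which(!is_num))
  )
}

types <- split_column_types(iris)
types$numeric      # indices of the four measurement columns
types$categorical  # index of the Species factor column
```

The sketch relies on is.numeric() rather than on the class names returned by sapply(yourdataframe, class), so integer and double columns fall into the same group.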
In terms of missing values, the package na.tools provides a suite of tools to check the number and proportion of missing values in a dataset as well as methods controlling how they may be replaced. The forecast package includes the function na.interp() which provides a similar replacement tool for time series data. The gap between these packages is that neither provides a summary of the missing values including the list of indices where they occur. find_missing_ingredients() augments these tools by providing a summary dataframe detailing which columns have missing values, as well as their count and proportion.
The base R summary() function is a popular first function to run during the data exploration stage of a project because it returns the mean, median, minimum, and maximum of each variable in a dataset, which allows users to easily see whether there are possible outliers. However, the output of this function does not tell you which rows of data these outliers are found in, or how many outliers are present in the dataframe. Packages like outliers and OutlierDetection provide many more ways of detecting outliers using both univariate and multivariate outlier detection methods, and some functions such as the outlierTest() function from the car package also output the row within a dataframe which has the most extreme observation given a model. The find_bad_apples() function provides more detail than the base R summary() function for outlier detection, but does not provide multivariate outlier detection like other functions from packages specifically made to detect outliers.
Lastly, there are many great tools in the data science ecosystem for preprocessing data, such as caret in R. However, you may find yourself frequently writing the same lengthy code for common preprocessing tasks (e.g., scaling numeric features and one-hot encoding categorical features). make_recipe() provides a shortcut function to apply your favourite recipes quickly to preprocess data in one line of code.
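For a sense of the boilerplate this refers to, the sketch below scales numeric features and one-hot encodes a categorical feature by hand in base R. It is an illustration only, not what make_recipe() does internally; the 80/20 split, the seed, the choice of columns, and the helper names scale_with() and ohe() are all assumptions made for the example.

```r
# Manual preprocessing sketch in base R (illustrative, not mealprepR internals).
set.seed(1)  # assumed seed, only to make the split reproducible

cars <- mtcars
cars$am <- as.factor(cars$am)

# Assumed 80/20 train/test split
train_idx <- sample(nrow(cars), size = floor(0.8 * nrow(cars)))
train <- cars[train_idx, ]
test <- cars[-train_idx, ]

# Learn scaling parameters on the training rows only
num_cols <- c("mpg", "hp", "wt")
centers <- colMeans(train[num_cols])
scales <- apply(train[num_cols], 2, sd)

# Hypothetical helpers: standardize numeric columns, one-hot encode `am`
scale_with <- function(df) scale(df[num_cols], center = centers, scale = scales)
ohe <- function(df) model.matrix(~ am - 1, data = df)

X_train <- cbind(scale_with(train), ohe(train))
X_test <- cbind(scale_with(test), ohe(test))
head(X_train)
```

The test rows are scaled with the training means and standard deviations so that no information from the test set leaks into the transformation.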
You can install the released version of mealprepR from CRAN with:
install.packages("mealprepR")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("UBC-MDS/mealprepR")
To view mealprepR documentation online, please visit https://ubc-mds.github.io/mealprepR/
To load the package:
library(mealprepR)
library(devtools)
load_all()
find_fruits_veg()
find_fruits_veg() returns the indices of the columns of a given type.
The example below shows how to use find_fruits_veg() to find the indices of the numeric and categorical columns of the iris data frame.
head(iris)
find_fruits_veg(iris, 'num')   # find the numeric column indices
find_fruits_veg(iris, 'categ') # find the categorical column indices
find_missing_ingredients()
Before launching into a new data analysis, running the function find_missing_ingredients() on a data frame of interest will produce a report on each column with missing values.
To demonstrate how this function works, follow the example below with toy data:
df <- data.frame("letters" = c(1, 2, NA, "b"), "numbers" = c(3, 4, NA, NA))
df
find_missing_ingredients(df)
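To cross-check the report by hand, the same kind of summary can be assembled in base R. This is a sketch under stated assumptions: the object names toy, na_rows, and na_summary and the output column names are made up for illustration and need not match what find_missing_ingredients() returns.

```r
# Same toy data as above, rebuilt so the block runs on its own
toy <- data.frame(letters = c(1, 2, NA, "b"), numbers = c(3, 4, NA, NA))

# Row indices of missing values, per column
na_rows <- lapply(toy, function(col) which(is.na(col)))

# Keep only the columns that actually contain missing values
na_rows <- na_rows[lengths(na_rows) > 0]

# Assumed summary layout: one row per column with missing values
na_summary <- data.frame(
  column = names(na_rows),
  na_count = unname(lengths(na_rows)),
  na_proportion = unname(lengths(na_rows)) / nrow(toy)
)

na_rows
na_summary
```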
find_bad_apples()
If you don't already have a dataframe to work with, run this code to set up a toy dataframe (df) for testing.
df <- data.frame(
  'A' = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  'B' = c(1, 1, 1, 1, 1, 1, 1, 1, 1, -100, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 100),
  'C' = c(19, 1, 1, 1, 17, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 20, 1, 1, 1)
)
df
To find the outliers in the dataframe, run find_bad_apples(df) and you'll get a dataframe that shows which columns have outliers.
find_bad_apples(df)
To see the nested tibbles that show the indices of the outliers, add $indices to the end of the function call. This output shows that column 'B' has outliers in rows 10 and 30, and column 'C' has outliers in rows 1, 5, and 27.
find_bad_apples(df)$indices
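The rows reported above can be checked against the rule of 2 or more standard deviations from the mean with a few lines of base R. The helper find_outlier_rows() is hypothetical and only sketches the rule as described in this README; in particular, skipping constant columns (standard deviation of zero) is an assumption, not documented behaviour of find_bad_apples().

```r
# Hypothetical helper (not part of mealprepR): row indices of values lying
# n_sd or more standard deviations from the column mean.
find_outlier_rows <- function(x, n_sd = 2) {
  s <- sd(x, na.rm = TRUE)
  # Assumption: a constant (or all-NA) column has no outliers
  if (is.na(s) || s == 0) return(integer(0))
  which(abs(x - mean(x, na.rm = TRUE)) >= n_sd * s)
}

# Same values as the toy dataframe above, built compactly
toy <- data.frame(
  A = rep(1, 30),
  B = replace(rep(1, 30), c(10, 30), c(-100, 100)),
  C = replace(rep(1, 30), c(1, 5, 27), c(19, 17, 20))
)

outlier_rows <- lapply(toy, find_outlier_rows)
outlier_counts <- lengths(outlier_rows)

outlier_rows
outlier_counts
```

Column 'A' is constant, so the guard returns no rows for it rather than flagging every value.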
make_recipe()
Do you find yourself applying the same data preprocessing techniques time and time again? make_recipe can help by applying your favourite preprocessing recipes in only a few lines of code.
Below, make_recipe applies the "ohe_and_standard_scaler" recipe in only one line of code.
First load the classic mtcars data set.
library(dplyr)
X <- dplyr::as_tibble(mtcars) %>%
  mutate(
    carb = as.factor(carb),
    gear = as.factor(gear),
    vs = as.factor(vs),
    am = as.factor(am)
  )
head(X)
Use make_recipe to quickly split your data and apply your favourite preprocessing techniques!
library(mealprepR)
mtcars_splits <- make_recipe(
  X = X,
  y = "gear",
  recipe = "ohe_and_standard_scaler",
  splits_to_return = "train_test"
)
head(mtcars_splits$X_train)
Calling make_recipe will then return a list containing all of your transformed data.
str(mtcars_splits)