
Preparation R

This package is used to temporarily relieve swelling, burning, pain, and itching caused by data preparation. It is heavily influenced by the sklearn preprocessing module. As such, it aims to implement the Transformer API and to allow for pipelines that can be saved and applied to new datasets.

Isn’t there already a package that does this?

Yes, and it’s pretty comprehensive. Check out the recipes package here: https://tidymodels.github.io/recipes/. So why reinvent the wheel? Well, I am not a huge fan of the tidyverse. I like that it turns new users on to R, and the folks at RStudio have done so much for the R community. But the tidyverse is very opinionated and still evolving. I prefer to stick to base R when I can, and I especially like understanding how things work under the hood. Hence this package.

Prep Functions

Processing pipelines are nothing new, so it’s no surprise that this package follows a similar approach. You can create a pipeline explicitly using the pipeline function, or in a magrittr style by using the pipeline operator, %|>%, to pipe multiple prep functions into each other.

data(iris)

p1 <- pipeline(
  prep_minmax(~.-Species),
  prep_onehot(~sel_factor()),
  sink_matrix()
)

p2 <-
  prep_minmax(~.-Species) %|>%
  prep_onehot(~sel_factor()) %|>%
  sink_matrix()

all.equal(p1, p2)

## [1] TRUE

## print out
p1

## [ Pipeline ] [isfit:  no ]
## |--[ MinMaxScaler ] [isfit:  no ]
## |--[ OnehotEncoder ] [isfit:  no ]
## |--[ Sink ] [isfit:  no ]
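
For readers curious how an operator like %|>% can build the same object as an explicit constructor call, here is a minimal base R sketch. It is not the package's implementation: new_step, new_pipe, and steps_of are hypothetical stand-ins invented for illustration, and the steps do nothing except carry a name.

```r
# Hypothetical stand-ins: a "step" is a named list, a "pipe" is a list of steps.
new_step <- function(name) structure(list(name = name), class = "step")
new_pipe <- function(...) structure(list(steps = list(...)), class = "pipe")

# Return the steps of an existing pipe, or wrap a single step in a list.
steps_of <- function(x) if (inherits(x, "pipe")) x$steps else list(x)

# User-defined infix operator: append the right-hand steps to the left-hand ones.
`%|>%` <- function(lhs, rhs) {
  do.call(new_pipe, c(steps_of(lhs), steps_of(rhs)))
}

explicit <- new_pipe(new_step("scale"), new_step("encode"), new_step("sink"))
piped    <- new_step("scale") %|>% new_step("encode") %|>% new_step("sink")

# Both routes yield the same structure.
all.equal(explicit, piped)
```

Because user-defined infix operators in R are left-associative, the chain is evaluated pairwise from the left, and each application flattens its operands into a single list of steps.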

Fitting

The purpose of creating these pipelines is to fit them to data and save them to apply on different datasets. The fit method is used to fit a pipeline. It works by fitting each transform in sequence and passing the transformed data down the pipe. Once a pipeline has been trained, its isfit member is set to TRUE (shown as yes in the printout).

p1$fit(iris)
p1

## [ Pipeline ] [isfit:  yes ]
## |--[ MinMaxScaler ] [isfit:  yes ]
## |--[ OnehotEncoder ] [isfit:  yes ]
## |--[ Sink ] [isfit:  yes ]
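
The "fit each transform in sequence and pass the transformed data down the pipe" idea can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: make_step and fit_pipeline are hypothetical names, and environments stand in for whatever mutable objects the package actually uses.

```r
# Hypothetical step constructor: `learn` derives settings from data,
# `apply_fn` uses those settings to transform data.
make_step <- function(learn, apply_fn) {
  self <- new.env()
  self$isfit <- FALSE
  self$fit <- function(x) {
    self$settings <- learn(x)
    self$isfit <- TRUE
    invisible(self)
  }
  self$transform <- function(x) apply_fn(x, self$settings)
  self
}

# Fit each step on the output of the previous step, then pass the
# transformed data down the pipe.
fit_pipeline <- function(steps, x) {
  for (s in steps) {
    s$fit(x)
    x <- s$transform(x)
  }
  invisible(steps)
}

center  <- make_step(function(x) mean(x), function(x, m) x - m)
rescale <- make_step(function(x) max(abs(x)), function(x, s) x / s)

fit_pipeline(list(center, rescale), c(2, 4, 6, 8))

center$settings   # 5: mean of the raw data
rescale$settings  # 3: largest absolute value after centering, not 8
```

The second step learns 3 rather than 8 because it only ever sees the centered data, which is why the order of steps in a pipeline matters.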

Transforming

Once a pipeline has been fit, the transform method can be called and passed a new dataset. The settings saved during the training process will be applied to the new dataset, ensuring a reproducible workflow with little micromanagement.

z <- p1$transform(iris)
knitr::kable(head(z), digits = 2)
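
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species=setosa | Species=versicolor | Species=virginica |
|-------------:|------------:|-------------:|------------:|---------------:|-------------------:|------------------:|
|        -0.56 |        0.25 |        -0.86 |       -0.92 |              1 |                  0 |                 0 |
|        -0.67 |       -0.17 |        -0.86 |       -0.92 |              1 |                  0 |                 0 |
|        -0.78 |        0.00 |        -0.90 |       -0.92 |              1 |                  0 |                 0 |
|        -0.83 |       -0.08 |        -0.83 |       -0.92 |              1 |                  0 |                 0 |
|        -0.61 |        0.33 |        -0.86 |       -0.92 |              1 |                  0 |                 0 |
|        -0.39 |        0.58 |        -0.76 |       -0.75 |              1 |                  0 |                 0 |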
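
The numeric columns in the output above are consistent with rescaling each column to the range [-1, 1] using the minimum and maximum learned during fitting. The base R sketch below reproduces that arithmetic for the first three rows. It is an assumption about what the scaler computes, not code taken from the package, and fit_minmax and apply_minmax are hypothetical helper names.

```r
data(iris)

# Learn the per-column minimum and maximum from the training data only.
fit_minmax <- function(x) {
  list(lo = vapply(x, min, numeric(1)), hi = vapply(x, max, numeric(1)))
}

# Rescale to [-1, 1] using the stored ranges, not the ranges of `x` itself.
apply_minmax <- function(x, settings) {
  x[] <- Map(function(col, lo, hi) 2 * (col - lo) / (hi - lo) - 1,
             x, settings$lo, settings$hi)
  x
}

num      <- iris[, 1:4]
settings <- fit_minmax(num)

# Apply the saved settings to a subset standing in for "new" data.
new_rows <- num[1:3, ]
scaled   <- apply_minmax(new_rows, settings)
round(scaled, 2)
```

Since the ranges come from `settings` rather than from `new_rows`, the same input value always maps to the same output no matter which rows are passed in, which is what makes the workflow reproducible.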

Zelazny7/prepr documentation built on May 6, 2019, 7:02 p.m.