Zelazny7/prepr: Data Preparation Pipeline and Manifest

This package is used to temporarily relieve swelling, burning, pain, and itching caused by data preparation. It is heavily influenced by the sklearn preprocessing module. As such, it aims to implement the Transformer API and allow for pipelines that can be saved and applied to new datasets.

Does a package for this already exist? Yes, and it's pretty comprehensive. Check out the recipes package here: https://tidymodels.github.io/recipes/. So why reinvent the wheel? Well, I am not a huge fan of the tidyverse. I like that it turns new users on to R, and the folks at RStudio have done so much for the R community. But the tidyverse is very opinionated and still evolving. I prefer to stick to base R when I can, and I especially like understanding how things work under the hood. Hence this package.

Processing pipelines are nothing new, so it's no surprise that this package follows a similar approach. You can create a pipeline explicitly using the pipeline function, or in a magrittr style by using the pipeline operator, %|>%, to pipe multiple prep functions into each other.

library(prepr)
data(iris)

p1 <- pipeline(
  prep_minmax(~.-Species),
  prep_onehot(~sel_factor()),
  sink_matrix()
)

p2 <-
  prep_minmax(~.-Species) %|>%
  prep_onehot(~sel_factor()) %|>%
  sink_matrix()

all.equal(p1, p2)

## [1] TRUE

## print out
p1

## [ Pipeline ] [isfit:  no ]
## |--[ MinMaxScaler ] [isfit:  no ]
## |--[ OnehotEncoder ] [isfit:  no ]
## |--[ Sink ] [isfit:  no ]
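
To show why the explicit and piped forms can compare equal, here is a minimal base R sketch of an infix operator that accumulates steps into one pipeline object. Everything in it (`new_step`, `new_pipe`, `as_steps`, and the `%then%` operator) is a hypothetical stand-in made up for illustration. It is not taken from the prepr source and may differ from how `%|>%` is actually implemented.

```r
# Hypothetical constructors; these names are not part of prepr.
new_step <- function(name) structure(list(name = name), class = "step")
new_pipe <- function(steps) structure(list(steps = steps), class = "pipe")

# Treat a lone step as a one-element list so both sides can be concatenated.
as_steps <- function(x) if (inherits(x, "pipe")) x$steps else list(x)

# Custom infix operators in R are left-associative, so a chain of three
# steps first builds a two-step pipe and then appends the third step.
`%then%` <- function(lhs, rhs) new_pipe(c(as_steps(lhs), as_steps(rhs)))

explicit <- new_pipe(list(new_step("scale"), new_step("encode"), new_step("sink")))
chained  <- new_step("scale") %then% new_step("encode") %then% new_step("sink")

# Both construction styles end up with the same structure.
same <- isTRUE(all.equal(explicit, chained))
print(same)
```

Because the operator only concatenates step lists, the order of the steps is the order in which they are written, whichever style is used.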

The purpose of creating these pipelines is to fit them to data and save them to apply on different datasets. The fit method is used to fit a pipeline. It works by fitting each transform in sequence and passing the transformed data down the pipe. Once a pipeline has been trained, its isfit member will be set to TRUE.

p1$fit(iris)
p1

## [ Pipeline ] [isfit:  yes ]
## |--[ MinMaxScaler ] [isfit:  yes ]
## |--[ OnehotEncoder ] [isfit:  yes ]
## |--[ Sink ] [isfit:  yes ]
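
The "fit each transform, then pass the transformed data down the pipe" idea can be sketched in base R as follows. This is an illustration under stated assumptions, not prepr's implementation: the steps here are plain lists with hypothetical `fit` and `transform` functions, and `fit_pipeline` returns an updated copy, whereas prepr's `p1$fit(iris)` appears to update the pipeline object in place.

```r
# Hypothetical step: centers columns using the means learned during fit.
center_step <- list(
  fit = function(data) colMeans(data),
  transform = function(data, params) sweep(data, 2, params)  # default FUN is "-"
)

# Hypothetical step: divides columns by the standard deviations learned during fit.
scale_step <- list(
  fit = function(data) apply(data, 2, sd),
  transform = function(data, params) sweep(data, 2, params, "/")
)

# Fit each step on the data as it arrives, then hand the transformed
# data to the next step.
fit_pipeline <- function(steps, data) {
  for (i in seq_along(steps)) {
    steps[[i]]$params <- steps[[i]]$fit(data)
    data <- steps[[i]]$transform(data, steps[[i]]$params)
    steps[[i]]$isfit <- TRUE
  }
  steps
}

# Apply already-fitted steps to any dataset with the same columns.
apply_pipeline <- function(steps, data) {
  for (s in steps) data <- s$transform(data, s$params)
  data
}

x <- as.matrix(iris[, 1:4])
fitted <- fit_pipeline(list(center_step, scale_step), x)
z_sketch <- apply_pipeline(fitted, x)
```

The key point is that the second step learns its settings from the output of the first step, not from the raw data.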

Once a pipeline has been fit, the transform method can be called and passed a new dataset. The settings saved during the training process will be applied to the new dataset, ensuring a reproducible workflow with little micromanagement.

z <- p1$transform(iris)
knitr::kable(head(z), digits = 2)

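| Sepal.Length| Sepal.Width| Petal.Length| Petal.Width| Species=setosa| Species=versicolor| Species=virginica|
|------------:|-----------:|------------:|-----------:|--------------:|------------------:|-----------------:|
|        -0.56|        0.25|        -0.86|       -0.92|              1|                  0|                 0|
|        -0.67|       -0.17|        -0.86|       -0.92|              1|                  0|                 0|
|        -0.78|        0.00|        -0.90|       -0.92|              1|                  0|                 0|
|        -0.83|       -0.08|        -0.83|       -0.92|              1|                  0|                 0|
|        -0.61|        0.33|        -0.86|       -0.92|              1|                  0|                 0|
|        -0.39|        0.58|        -0.76|       -0.75|              1|                  0|                 0|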
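
The numeric values in this output are consistent with min-max scaling to the range [-1, 1]. That is an inference from the numbers, not something confirmed from the prepr source. The base R sketch below learns each column's minimum and maximum, saves those settings to disk, and reapplies them to other rows. The helpers `fit_minmax` and `apply_minmax` are hypothetical names introduced only for this example.

```r
# Learn the per-column minimum and maximum from the training data.
fit_minmax <- function(x) {
  list(min = apply(x, 2, min), max = apply(x, 2, max))
}

# Rescale to [-1, 1] using the stored settings rather than the range
# of whatever data is being transformed.
apply_minmax <- function(x, p) {
  unit <- sweep(sweep(x, 2, p$min), 2, p$max - p$min, "/")
  2 * unit - 1
}

train <- as.matrix(iris[, 1:4])
p <- fit_minmax(train)

# First row, rounded the same way as the table in this README.
first_row <- round(apply_minmax(train, p)[1, ], 2)
print(first_row)

# Save the fitted settings and reuse them later on other rows.
path <- tempfile(fileext = ".rds")
saveRDS(p, path)
p_loaded <- readRDS(path)
new_rows <- train[c(51, 101), , drop = FALSE]
scaled_new <- apply_minmax(new_rows, p_loaded)
```

Storing only the learned minimum and maximum is what makes the transformation reproducible: new rows are scaled against the training range, so a value outside that range will simply fall outside [-1, 1].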

Zelazny7/prepr documentation built on May 6, 2019, 7:02 p.m.
