Preprocessing is often the most time-consuming phase in data analysis, and preprocessing transformations can be interdependent in unexpected ways. This package helps to make preprocessing faster and more effective. It provides an S4 framework for creating and evaluating preprocessing combinations for classification, clustering and outlier detection. The framework supports adding user-defined preprocessors and preprocessing phases. Default preprocessors can be used for low variance removal, missing value imputation, scaling, outlier removal, noise smoothing, feature selection and class imbalance correction.
Let's start by adding contaminations to the iris data to simulate the need for preprocessing:
set.seed(1)
testdata <- iris
testdata[sample(1:150,40),3] <- NA # add missing values to the third variable
testdata[,4] <- rnorm(150, testdata[,4], 2) # add noise to the fourth variable
testdata$Irrelevant <- runif(150, 0, 1) # add an irrelevant feature
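Before preprocessing, it can help to confirm that the contaminations landed where intended. The following is a minimal base R sketch, not part of the package; it recreates the contaminated data so that it runs on its own, and the variable name na_counts is only illustrative.

```r
# Recreate the contaminated data so this block runs on its own.
set.seed(1)
testdata <- iris
testdata[sample(1:150, 40), 3] <- NA
testdata[, 4] <- rnorm(150, testdata[, 4], 2)
testdata$Irrelevant <- runif(150, 0, 1)

# Count missing values per column; only the third variable should have any.
na_counts <- colSums(is.na(testdata))
print(na_counts)

# Compare the spread of the fourth variable before and after adding noise.
print(c(original = sd(iris[, 4]), noisy = sd(testdata[, 4])))

# The added irrelevant feature brings the number of columns to six.
print(dim(testdata))
```

Because sample() draws without replacement by default, exactly 40 distinct rows of the third variable are set to NA.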
In interactive mode, preprocessing techniques can be applied in sequence with the function prepro(). The resulting object contains the preprocessing call history, the computations and the fitness of the preprocessed data for model fitting. In the example below, missing values are first imputed with meanimpute and outliers are then removed with the Orh algorithm (orhoutlier). The support vector machine svmRadial from the kernlab package is used as the classifier; the default classifier is rpart from the rpart package.
suppressMessages(library(ggplot2))
suppressMessages(library(lattice))
suppressMessages(library(kernlab))
suppressMessages(library(rpart))
library(preprocomb)
step1 <- prepro(testdata, "meanimpute", model="svmRadial")
step2 <- prepro(step1, "orhoutlier", model="svmRadial")
step2
In programmatic mode, a search for the best combinations is executed. First, a grid of preprocessing combinations and the corresponding preprocessed data sets is created. Second, the preprocessed data sets are evaluated for classification accuracy, clustering tendency and skewness of outlier scores. In the example below, the preprocessing pipeline consists of 540 combinations and their evaluations.
examplegrid <- setgrid(phases=c("imputation", "outliers", "scaling", "smoothing", "selection"), data=testdata)
exampleresult <- preprocomb(grid=examplegrid, models=c("svmRadial"), nholdout=10, cluster=TRUE, outlier=TRUE, cores=2)
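To get a feel for why the number of combinations grows quickly, a grid of combinations can be thought of as a Cartesian product with one technique chosen per phase. The base R sketch below illustrates only that idea; it does not show how setgrid() is implemented, and the phase contents and technique names (impute_a, scale_a and so on) are hypothetical placeholders rather than the package defaults.

```r
# Hypothetical phases and technique names, for illustration only;
# these are not the package's default preprocessors.
phases <- list(
  imputation = c("noaction", "impute_a", "impute_b"),
  scaling    = c("noaction", "scale_a"),
  selection  = c("noaction", "select_a", "select_b")
)

# One row per combination: the Cartesian product of the phases.
combinations <- expand.grid(phases, stringsAsFactors = FALSE)
print(head(combinations))

# The number of rows equals the product of the technique counts per phase.
print(nrow(combinations))
print(prod(lengths(phases)))
```

Adding one more phase with three techniques would triple the number of rows, which is why limiting phases to the techniques that matter for the data at hand keeps the evaluation time manageable.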
Extracting the wall-clock time of execution in minutes:
exampleresult@walltime
Extracting the best combinations for classification:
exampleresult@bestclassification
The package is intended to be used with domain-specific preprocessing phases and techniques. There is, however, a set of default options available. Phases:
Each of the phases has two or more preprocessing techniques, including "noaction". The available preprocessing techniques can be shown with:
getpreprocessor()
and the function definition of a preprocessor can be shown by giving the name of the preprocessing technique as an argument:
getpreprocessor("basicscale")
Preprocessing techniques can be added to the system in two steps:
Step 1: Function definition
scaleexample <- function(dataobject) {
  dataobject <- initializedataclassobject(data.frame(x=scale(dataobject@x), y=dataobject@y))
}
Notice that the input and output of an added preprocessing technique are both DataClass objects. The slot "y" is a factor vector containing the class labels, and the slot "x" contains the other variables, all of which must be numeric.
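The requirements on "x" and "y" can be checked on a plain data frame before it is handed to the package. The following is a minimal base R sketch under that assumption; the helper split_xy is hypothetical and not part of the package, and it returns an ordinary list rather than a DataClass object.

```r
# Hypothetical helper, not part of the package: split a data frame into
# numeric predictors (x) and a factor of class labels (y), and fail early
# if the requirements described above are not met.
split_xy <- function(data, class_column) {
  y <- data[[class_column]]
  x <- data[, setdiff(names(data), class_column), drop = FALSE]
  stopifnot(is.factor(y), all(vapply(x, is.numeric, logical(1))))
  list(x = x, y = y)
}

# The iris data satisfies both requirements.
parts <- split_xy(iris, "Species")
str(parts$x)
print(levels(parts$y))
```

A data frame whose class column is a character vector, or which contains a non-numeric predictor, stops with an error instead of failing later inside a preprocessor.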
Step 2: Adding of the function to the system
setpreprocessor("scaleexample", "scaleexample(dataobject)")
step3 <- prepro(step2, "scaleexample", model="svmRadial") # continues the example above
step3
Newly defined preprocessing techniques can be assigned to phases and used to create a new grid of combinations:
newscaling <- setphase("newscaling", c("noaction", "scaleexample"), TRUE)
newexamplegrid <- setgrid(phases=c("imputation", "newscaling"), data=testdata)