Preprocomb"
In preprocomb: Tools for Preprocessing Combinations

Preprocessing is often the most time-consuming phase in data analysis and preprocessing transformations interdependent in unexpected ways. This package helps to make preprocessing faster and more effective. It provides an S4 framework for creating and evaluating preprocessing combinations for classification, clustering and outlier detection. The framework supports adding of user-defined preprocessors and preprocessing phases. Default preprocessors can be used for low variance removal, missing value imputation, scaling, outlier removal, noise smoothing, feature selection and class imbalance correction.

Test data

Let's start by adding contaminations to Iris-data to simulate the need for preprocessing:

set.seed(1)
testdata <- iris
testdata[sample(1:150,40),3] <- NA # add missing values to the third variable
testdata[,4] <- rnorm(150, testdata[,4], 2) # add noise to the fourth variable
testdata$Irrelevant <- runif(150, 0, 1) # add an irrelevant feature

Interactive mode

In the interactive mode preprocessing techniques can be applied in a sequence with function prepro(). The resulting object contains the preprocessing call history, computations and the fitness of the preprocessed data for model fitting. In the example below missing values are imputed first with meanimpute and then outliers removed with Orh-algorithm. Support vector machine svmRadial from kernlab package is used as a classifier. The default classifier is rpart from rpart package.

suppressMessages(library(ggplot2))
suppressMessages(library(lattice))
suppressMessages(library(kernlab))
suppressMessages(library(rpart))

library(preprocomb)
step1 <- prepro(testdata, "meanimpute", model="svmRadial")
step2 <- prepro(step1, "orhoutlier", model="svmRadial")
step2

Programmatic mode

In the programmatic mode search for the best combinations is executed. First, a grid of preprocessing combinations and corresponding preprocessed data sets is created. Secondly, the preprocessed data sets are evaluated for classification accuracy, clustering tendency and skewness of outlier scores In the example below the preprocessing pipeline consists 540 combinations and their evaluations.

examplegrid <- setgrid(phases=c("imputation", "outliers", "scaling", "smoothing", "selection"), data=testdata)
exampleresult <- preprocomb(grid=examplegrid, models=c("svmRadial"), nholdout=10, cluster=TRUE, outlier=TRUE, cores=2)

Extracting the wall-clock time of execution in minutes:

exampleresult@walltime

Extracting the best combinations for classification:

exampleresult@bestclassification

Default options

The package is intended to be used with domain specific preprocessing phases and techniques. There are however a set of default options available. Phases:

"imputation", missing value imputation
"variance", low variance removal
"smoothing", noise smoothing
"scaling", value range scaling
"outliers", outlier removal
"sampling", class imbalance corrections
"selection", irrelevant feature selection

Each of the phases has two or more preprocessing techniques including "noaction". Available preprocessing techniques can be shown by:

getpreprocessor()

and preprocecssor function definition by giving the name of the preprocessing technique as argument:

getpreprocessor("basicscale")

Customization

Preproccessing techniques can be added to the system in two steps:

Step 1: Function definition

scaleexample <- function(dataobject) {
dataobject <- initializedataclassobject(data.frame(x=scale(dataobject@x), y=dataobject@y))
}

Notice that added preprocecessing technique definition input and output are both DataClass objects. The slot "y" is a factor vector containing the class labels and slot "x" the other variables, which all must be numeric.

Step 2: Adding of the function to the system

setpreprocessor("scaleexample", "scaleexample(dataobject)")
step3 <- prepro(step2, "scaleexample", model="svmRadial") # continues the example above
step3

Added preprocessing techniques can be added to phases and used in creating a new grid of combinations:

newscaling <- setphase("newscaling", c("noaction", "scaleexample"), TRUE)
newexamplegrid <- setgrid(phases=c("imputation", "newscaling"), data=testdata)