Preprocessing is often the most time-consuming phase in data analysis and preprocessing transformations interdependent in unexpected ways. This package helps to make preprocessing faster and more effective. It provides an S4 framework for creating and evaluating preprocessing combinations for classification and clustering. The framework supports adding of user-defined preprocessors and preprocessing phases. Default preprocessors can be used for low variance removal, missing value imputation, scaling, outlier removal, noise smoothing, feature selection and class imbalance correction.

Preprocessing combinations

The method implemented by package 'preprocomb' is based on a conference paper: "Vattulainen, M.(2014) A method to improve the predictive power of a business performance measurement system by data preprocessing combinations: two cases in predictive classification of service sales volume from balanced data. In Ahmad Ghazawneh, Jacob Nørbjerg and Jan Pries-Heje(eds.) Proceedings of the 37th Information Systems Research Seminar in Scandinavia (IRIS 37), Ringsted, Denmark, 10-13 August 2014. ISBN 978-87-7349-876-7 (USB)

Getting started

options(width = 400)
library(preprocomb)

Functions from the supporting packages ('preprosim', 'preproviz' and 'metaheur') are called explicitly by using the scope resolution operator ::.

Creating reproducible test data

Package 'preprosim' can be used to create contaminated data sets in a reproducible manner. In the example below Iris dataset is used as a basedata. 6561 versions of Iris data are created having various levels of contaminations.

examplesimulation <- preprosim::preprosimrun(seed=1, data=iris, fitmodels=FALSE)

A moderately contaminated data set is selected to represent the need of preprocessing.

contaminateddf <- preprosim::getpreprosimdf(examplesimulation, c(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1))
str(contaminateddf)

Default preprocessors

suppressMessages(library(ggplot2))
suppressMessages(library(lattice))
suppressMessages(library(kernlab))
suppressMessages(library(rpart))

The package is intended as a platform for domain-specific preprocessing techniques. However, 20 default preprocessors are available for demonstration. Their names and descriptions can be seen with getpreprocessor() function.

getpreprocessors()

Low-level implementation of a preprocessor in the 'preprocomb' system can be seen by proving the argument type="definition". In the example below the definition of the first preproccessing technique is shown.

getpreprocessors(type="definition", nro=1)

Interactive mode

In the interactive mode preprocessing techniques can be applied in a sequence with function prepro(). The resulting object contains the preprocessing call history, computations and the fitness of the preprocessed data for model fitting. In the example below missing values are imputed first with meanimpute and then outliers removed with Tukey's IQR 1.5 rule. Support vector machine 'svmRadial' from 'kernlab' package is used as a classifier. The default classifier is 'rpart' from 'rpart' package.

step1 <- prepro(contaminateddf, "meanimpute", model="svmRadial")
step2 <- prepro(step1, "tukeyoutlier", model="svmRadial")
step2

Programmatic mode

In the programmatic mode search for the best combinations is executed.

Step 1

First, a grid of preprocessing combinations and corresponding preprocessed data sets is created with setgrid() function. Showing the resulting GridClass object gives the data validation results.

examplecombgrid <- setgrid(phases=c("imputation", "scaling", "smoothing"), data=contaminateddf)
examplecombgrid

All the combinations can be acquired with getcombinations() function and a specifically preprocessed data set with getpreprocombdf() function. In the example below data set for combination number 7 is provided.

head(getcombinations(examplecombgrid),10)
preprocesseddf <- getpreprocombdf(examplecombgrid, 7)
str(preprocesseddf)
getpreprocombdf(examplecombgrid, 7, type="summary")

Step 2

Secondly, the preprocessed data sets are evaluated for classification accuracy.

exampleresult <- preprocomb(grid=examplecombgrid, models=c("svmRadial"), nholdout=400, cores=2)

Extracting the best and worst combinations for classification:

exampleresult@bestclassification
exampleresult@worstclassification

Results can also be plotted to see the range of classification accuracy of combinations.

preprocombplot(exampleresult, type="boxplot")

For detailed analysis of the results the raw data can be accessed as well. In the example below the worst accuracy is selected.

min(exampleresult@rawall$svmRadialMean)

Wall-clock time of computing the accuracies:

exampleresult@walltime

Combination optimization

Package 'metaheur' can be used to apply metaheuristic optimization to preprocessing grid to find near-best combinations faster:

examplemetaheur <- metaheur::metaheur(examplegrid, model="svmRadial", iterations = 30, nholdout = 400)

Extracting the near best combination:

metaheur::getbestheur(examplemetaheur)

Extracting the wall-clock time of execution in minutes:

examplemetaheur@walltime

The search result is made possible by the characteristics of the objective function:

preprocombplot(exampleresult, type="lineplot")

For further information, please see package 'metaheur' vignette.

Default options

The package is intended to be used with domain specific preprocessing phases and techniques. There are however a set of default options available. Phases:

Each of the phases has two or more preprocessing techniques including "noaction".

There are alse six default phases each with one or several techniques totalling 1080 combinations and an extended version with 6480 combinations for binary data with all values positive. For the latter serious computing resources are needed for exhaustive search.

largegrid <- setgrid(phases=preprodefault, data=contaminateddf)
largergrid <- setgrid(phases=preprodefaultextended, data=contaminateddf)

Customization

The existing preprocessing techniques can be combined in new ways:

newimputephase <- setphase("newimputephase", c("naomit", "meanimpute"), TRUE)

Extensions

Preproccessing techniques can be added to the system in two steps:

Step 1

First, new prepreprocessing techniques can be defined as functions:

scaleexample <- function(dataobject) {
dataobject <- initializedataclassobject(data.frame(x=scale(dataobject@x), y=dataobject@y))
}

Notice that added preprocecessing technique definition input and output are both DataClass objects. The slot "y" is a factor vector containing the class labels and slot "x" the other variables, which all must be numeric.

Step 2

Preprocessing functions are added to the system.

setpreprocessor("scaleexample", "scaleexample(dataobject)", "THIS AN EXAMPLE")
step3 <- prepro(step2, "scaleexample", model="svmRadial") # continues the example above
step3

Step 3

Added preprocessing techniques can be added to phases and used in creating a new grid of combinations:

newscaling <- setphase("newscaling", c("noaction", "scaleexample"), TRUE)
newexamplegrid <- setgrid(phases=c("imputation", "newscaling"), data=testdata)

Supporting packages

Preprocessing combinations is a method to resolve preprocessing problems.

For understanding the condition of the data there are two supporting packages.

Preprosim

Package 'preprosim' can be used to create contaminated data sets as seen in the beginning of this vignette. It can also be used to plot the simulation results to observe how the data behaves when contaminations are increased in a controlled manner.

simulationrun2 <- preprosim::preprosimrun(iris, param=preprosim::newparam(iris, "custom", x="misval", z="noise"), caretmodel="rpart", holdoutrounds = 2, verbose=FALSE)
preprosim::preprosimplot(simulationrun2, "xz", x="misval", z="noise")

Preproviz

Package 'preproviz' can be used to visualize data quality issue interdependencies by means of constructing features that express the quality of a data point.

viz <- preproviz::preproviz(list(contaminateddf, preprocesseddf))
preproviz::plotVARCLUST(viz)
preproviz::plotCMDS(viz)

For more details on preprosim and preproviz, please see related vignettes.

Notes

The object examplesimulation used in test data chapter of this vignette can be created with the shown commmand. The object stored as an 'examplesimulation' in package preprocomb includes only the data set number 3281 to reduce the size of the package.



mvattulainen/preprocomb documentation built on May 23, 2019, 10:54 a.m.