Preprocessing is often the most time-consuming phase in data analysis, and preprocessing transformations are interdependent in unexpected ways. This package helps to make preprocessing faster and more effective. It provides an S4 framework for creating and evaluating preprocessing combinations for classification and clustering. The framework supports adding user-defined preprocessors and preprocessing phases. Default preprocessors can be used for low variance removal, missing value imputation, scaling, outlier removal, noise smoothing, feature selection and class imbalance correction.
The method implemented by package 'preprocomb' is based on a conference paper: Vattulainen, M. (2014) A method to improve the predictive power of a business performance measurement system by data preprocessing combinations: two cases in predictive classification of service sales volume from balanced data. In Ahmad Ghazawneh, Jacob Nørbjerg and Jan Pries-Heje (eds.) Proceedings of the 37th Information Systems Research Seminar in Scandinavia (IRIS 37), Ringsted, Denmark, 10-13 August 2014. ISBN 978-87-7349-876-7 (USB)
options(width = 400)
library(preprocomb)
Functions from the supporting packages ('preprosim', 'preproviz' and 'metaheur') are called explicitly by using the scope resolution operator ::.
Package 'preprosim' can be used to create contaminated data sets in a reproducible manner. In the example below, the Iris data set is used as the base data, and 6561 versions of it are created with varying levels of contamination.
examplesimulation <- preprosim::preprosimrun(seed=1, data=iris, fitmodels=FALSE)
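The figure 6561 equals 3^8, which is what one would get from eight contamination types that each take three intensity levels. The following is a minimal base R sketch of that count only. The contamination names and the levels 0, 0.1 and 0.2 are assumptions made for illustration; the actual names and levels are defined by 'preprosim' and may differ.

```r
# Hypothetical contamination names and intensity levels (assumptions for
# illustration; 'preprosim' defines the real ones).
contaminations <- paste0("contamination", 1:8)
levels_per_contamination <- c(0, 0.1, 0.2)

# One row per combination of intensity levels across the eight contaminations.
level_grid <- expand.grid(
  setNames(rep(list(levels_per_contamination), length(contaminations)),
           contaminations)
)

# Number of contaminated versions implied by this layout.
nrow(level_grid)
```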
A moderately contaminated data set is selected to illustrate the need for preprocessing.
contaminateddf <- preprosim::getpreprosimdf(examplesimulation, c(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1))
str(contaminateddf)
suppressMessages(library(ggplot2))
suppressMessages(library(lattice))
suppressMessages(library(kernlab))
suppressMessages(library(rpart))
The package is intended as a platform for domain-specific preprocessing techniques. However, 20 default preprocessors are available for demonstration. Their names and descriptions can be seen with the getpreprocessors() function.
getpreprocessors()
The low-level implementation of a preprocessor in the 'preprocomb' system can be seen by providing the argument type="definition". In the example below the definition of the first preprocessing technique is shown.
getpreprocessors(type="definition", nro=1)
In the interactive mode, preprocessing techniques can be applied in a sequence with the function prepro(). The resulting object contains the preprocessing call history, computations and the fitness of the preprocessed data for model fitting. In the example below missing values are first imputed with meanimpute and then outliers are removed with Tukey's IQR 1.5 rule. The support vector machine 'svmRadial' from the 'kernlab' package is used as the classifier. The default classifier is 'rpart' from the 'rpart' package.
step1 <- prepro(contaminateddf, "meanimpute", model="svmRadial")
step2 <- prepro(step1, "tukeyoutlier", model="svmRadial")
step2
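To make the two techniques concrete, here is a minimal base R sketch of mean imputation followed by Tukey's 1.5 IQR rule on a toy vector. It is an illustration of the general techniques, not the package's implementation of "meanimpute" or "tukeyoutlier"; in particular, how the package handles the flagged points is not shown in this vignette, and dropping them here is an assumption.

```r
# Toy numeric vector with one missing value and one extreme value.
x <- c(4.9, 5.1, NA, 5.0, 5.2, 4.8, 12.0)

# Mean imputation: replace missing values with the mean of the observed ones.
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)

# Tukey's rule: flag points more than 1.5 * IQR below Q1 or above Q3.
q <- unname(quantile(x_imputed, probs = c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- x_imputed < q[1] - fence | x_imputed > q[2] + fence

# Assumption for this sketch: flagged points are simply dropped.
x_clean <- x_imputed[!is_outlier]
```

Note that the order matters: the imputed value depends on the mean, which is itself pulled upward by the extreme value. This is one small instance of the interdependence between preprocessing steps.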
In the programmatic mode, a search for the best combinations is executed.
First, a grid of preprocessing combinations and corresponding preprocessed data sets is created with setgrid() function. Showing the resulting GridClass object gives the data validation results.
examplecombgrid <- setgrid(phases=c("imputation", "scaling", "smoothing"), data=contaminateddf)
examplecombgrid
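Conceptually, a grid of combinations is the Cartesian product of the techniques available in each phase. The sketch below shows this idea in base R. The technique names "scale" and "smooth" are hypothetical placeholders, and the sketch does not reproduce what setgrid() does internally.

```r
# Hypothetical phases, each listing its candidate techniques.
# "scale" and "smooth" are placeholder names, not package preprocessors.
phases <- list(
  imputation = c("naomit", "meanimpute"),
  scaling    = c("noaction", "scale"),
  smoothing  = c("noaction", "smooth")
)

# One row per combination: one technique chosen from every phase.
combinations <- expand.grid(phases, stringsAsFactors = FALSE)

# The number of combinations is the product of the phase sizes.
nrow(combinations) == prod(lengths(phases))
```

Because the count is a product, each additional phase or technique multiplies the number of data sets that must be created and evaluated.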
All the combinations can be acquired with the getcombinations() function and a specific preprocessed data set with the getpreprocombdf() function. In the example below the data set for combination number 7 is retrieved.
head(getcombinations(examplecombgrid),10)
preprocesseddf <- getpreprocombdf(examplecombgrid, 7)
str(preprocesseddf)
getpreprocombdf(examplecombgrid, 7, type="summary")
Secondly, the preprocessed data sets are evaluated for classification accuracy.
exampleresult <- preprocomb(grid=examplecombgrid, models=c("svmRadial"), nholdout=400, cores=2)
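The sketch below illustrates repeated holdout evaluation, on the assumption that nholdout denotes the number of holdout rounds. It uses a nearest-centroid classifier written in base R as a simple stand-in for 'svmRadial'. The function name holdout_accuracy and the 80/20 split are choices made for this illustration, not taken from the package.

```r
set.seed(1)

# One holdout round: split, fit a nearest-centroid classifier, score accuracy.
# (Hypothetical helper; a simple stand-in for the package's model fitting.)
holdout_accuracy <- function(data, target, train_fraction = 0.8) {
  train_idx <- sample(nrow(data), size = floor(train_fraction * nrow(data)))
  train <- data[train_idx, ]
  test <- data[-train_idx, ]
  features <- setdiff(names(data), target)

  # Class centroids estimated from the training split only.
  centroids <- aggregate(train[features],
                         by = list(class = train[[target]]), FUN = mean)

  # Squared Euclidean distance from every test row to every centroid.
  distances <- sapply(seq_len(nrow(centroids)), function(i) {
    rowSums(sweep(as.matrix(test[features]), 2,
                  unlist(centroids[i, features]))^2)
  })

  # Predict the class of the closest centroid and compare with the truth.
  predicted <- as.character(centroids$class)[max.col(-distances)]
  mean(predicted == as.character(test[[target]]))
}

# Average accuracy over 20 holdout rounds on the uncontaminated Iris data.
accuracies <- replicate(20, holdout_accuracy(iris, "Species"))
mean(accuracies)
```

Averaging over many rounds reduces the influence of any single random split, which matters when combinations that differ only slightly in accuracy are being ranked.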
Extracting the best and worst combinations for classification:
exampleresult@bestclassification
exampleresult@worstclassification
Results can also be plotted to see the range of classification accuracy of combinations.
preprocombplot(exampleresult, type="boxplot")
For detailed analysis of the results the raw data can be accessed as well. In the example below the worst accuracy is selected.
min(exampleresult@rawall$svmRadialMean)
Wall-clock time of computing the accuracies:
exampleresult@walltime
Package 'metaheur' can be used to apply metaheuristic optimization to a preprocessing grid to find near-best combinations faster:
examplemetaheur <- metaheur::metaheur(examplegrid, model="svmRadial", iterations = 30, nholdout = 400)
Extracting the near-best combination:
metaheur::getbestheur(examplemetaheur)
Extracting the wall-clock time of execution in minutes:
examplemetaheur@walltime
The search result is made possible by the characteristics of the objective function:
preprocombplot(exampleresult, type="lineplot")
For further information, please see package 'metaheur' vignette.
The package is intended to be used with domain-specific preprocessing phases and techniques. There is, however, a set of default options available. Phases:
Each of the phases has two or more preprocessing techniques including "noaction".
There are also six default phases, each with one or several techniques, totalling 1080 combinations, and an extended version with 6480 combinations for binary data with all values positive. For the latter, serious computing resources are needed for an exhaustive search.
largegrid <- setgrid(phases=preprodefault, data=contaminateddf)
largergrid <- setgrid(phases=preprodefaultextended, data=contaminateddf)
The existing preprocessing techniques can be combined in new ways:
newimputephase <- setphase("newimputephase", c("naomit", "meanimpute"), TRUE)
Preprocessing techniques can be added to the system in two steps:
First, new preprocessing techniques are defined as functions:
scaleexample <- function(dataobject) {
  dataobject <- initializedataclassobject(data.frame(x=scale(dataobject@x), y=dataobject@y))
}
Notice that the input and output of an added preprocessing technique definition are both DataClass objects. The slot "y" is a factor vector containing the class labels and the slot "x" contains the other variables, which must all be numeric.
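The contract described above (a factor slot "y", an all-numeric slot "x", and the same class going in and coming out) can be sketched with a small S4 class. The class SketchData and the function scale_sketch are hypothetical stand-ins written for this illustration; they are not the package's DataClass or its constructor.

```r
library(methods)

# Hypothetical stand-in for the package's DataClass (not the package definition).
setClass(
  "SketchData",
  representation(x = "data.frame", y = "factor"),
  validity = function(object) {
    if (!all(vapply(object@x, is.numeric, logical(1)))) {
      return("all columns of slot 'x' must be numeric")
    }
    if (nrow(object@x) != length(object@y)) {
      return("slots 'x' and 'y' must describe the same number of rows")
    }
    TRUE
  }
)

# A preprocessor takes such an object and returns a new object of the same class.
scale_sketch <- function(dataobject) {
  new("SketchData",
      x = as.data.frame(scale(dataobject@x)),
      y = dataobject@y)
}

original <- new("SketchData", x = iris[, 1:4], y = iris$Species)
scaled <- scale_sketch(original)
```

Keeping the input and output types identical is what allows preprocessors to be chained in any order within a combination.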
Second, the preprocessing functions are added to the system with setpreprocessor():
setpreprocessor("scaleexample", "scaleexample(dataobject)", "THIS AN EXAMPLE")
step3 <- prepro(step2, "scaleexample", model="svmRadial") # continues the example above
step3
Newly added preprocessing techniques can be included in phases and used in creating a new grid of combinations:
newscaling <- setphase("newscaling", c("noaction", "scaleexample"), TRUE)
newexamplegrid <- setgrid(phases=c("imputation", "newscaling"), data=testdata)
Preprocessing combinations is a method to resolve preprocessing problems.
For understanding the condition of the data there are two supporting packages.
Package 'preprosim' can be used to create contaminated data sets as seen in the beginning of this vignette. It can also be used to plot the simulation results to observe how the data behaves when contaminations are increased in a controlled manner.
simulationrun2 <- preprosim::preprosimrun(iris, param=preprosim::newparam(iris, "custom", x="misval", z="noise"), caretmodel="rpart", holdoutrounds = 2, verbose=FALSE)
preprosim::preprosimplot(simulationrun2, "xz", x="misval", z="noise")
Package 'preproviz' can be used to visualize data quality issue interdependencies by means of constructing features that express the quality of a data point.
viz <- preproviz::preproviz(list(contaminateddf, preprocesseddf))
preproviz::plotVARCLUST(viz)
preproviz::plotCMDS(viz)
For more details on preprosim and preproviz, please see related vignettes.
The object examplesimulation used in the test data chapter of this vignette can be created with the command shown there. The object stored as 'examplesimulation' in package 'preprocomb' includes only data set number 3281 to reduce the size of the package.