standardPRE: A function for applying data pre-processing steps
In performanceEstimation: An Infra-Structure for Performance Estimation of Predictive Models


This function implements a series of simple data pre-processing steps and also allows users to supply their own functions to be applied to the data. The result of the function is a list containing the new (pre-processed) versions of the given train and test sets.

standardPRE(form, train, test, steps, ...)

`form`	A formula specifying the predictive task.
`train`	A data frame containing the training set.
`test`	A data frame containing the test set.
`steps`	A character vector of function names. The functions are applied, in the order they appear in this vector, to both the training and testing sets, to obtain new versions of these two data samples.
`...`	Any further parameters, which are passed to all functions specified in `steps`.

This function is mainly used by standardWF and timeseriesWF to let users of these two standard workflows specify data pre-processing steps. These are steps one wishes to apply to the different train and test samples involved in an experimental comparison, before any model is learned or any predictions are obtained.

Nevertheless, the function can also be used outside of these standard workflows for obtaining pre-processed versions of train and test samples.

The function accepts as pre-processing steps both a set of already implemented functions and any user-defined function, provided the latter follows a simple protocol. Namely, a user-defined pre-processing function will be called with a formula, a training data frame and a testing data frame as its first three arguments. Moreover, any extra arguments given in the call to standardPRE (through `...`) are also forwarded to these user-defined functions. Finally, these functions should return a list with two components, "train" and "test", containing the pre-processed versions of the supplied train and test data frames.
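The protocol above can be sketched in base R. This is only an illustration under stated assumptions, not the package's actual implementation: `applySteps`, `logPreds` and `dropNA` are hypothetical names introduced here. The sketch shows steps being applied in order, each receiving the formula, the current train and test sets, and every extra argument, which is why each step should accept `...`.

```r
## Hypothetical helper (NOT part of performanceEstimation): applies each
## named step in order, feeding the output of one step into the next.
applySteps <- function(form, train, test, steps, ...) {
    for (s in steps) {
        res <- do.call(s, list(form, train, test, ...))
        ## every step must return a list with "train" and "test"
        stopifnot(is.list(res), all(c("train", "test") %in% names(res)))
        train <- res$train
        test  <- res$test
    }
    list(train = train, test = test)
}

## Hypothetical user-defined step: log-transform the numeric predictors
logPreds <- function(form, train, test, offset = 1, ...) {
    tgt   <- deparse(form[[2]])
    preds <- setdiff(colnames(train), tgt)
    num   <- preds[sapply(preds, function(p) is.numeric(train[[p]]))]
    train[num] <- lapply(train[num], function(x) log(x + offset))
    test[num]  <- lapply(test[num],  function(x) log(x + offset))
    list(train = train, test = test)
}

## Hypothetical second step: drop rows with NAs; it ignores the extra
## arguments but must still accept them through '...'
dropNA <- function(form, train, test, ...) {
    list(train = na.omit(train), test = na.omit(test))
}

## toy data (made up for illustration)
tr <- data.frame(x = c(0, 1, NA, 3),
                 g = factor(c("a", "b", "a", "b")),
                 y = c(1, 2, 3, 4))
ts <- data.frame(x = c(2, NA), g = factor(c("a", "b")), y = c(5, 6))

out <- applySteps(y ~ ., tr, ts, steps = c("logPreds", "dropNA"), offset = 1)
str(out)
```

Note that `offset` reaches both steps, even though only `logPreds` uses it. A step without `...` in its signature would fail on arguments meant for another step.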

The function already contains implementations of the following pre-processing steps that can be used in the steps parameter:

"scale" - that scales (subtracts the mean and divides by the standard deviation) any numeric features on both the training and testing sets. Note that the mean and standard deviation are calculated using only the training sample.

"centralImp" - that fills in any NA values in both sets using the median value for numeric predictors and the mode for nominal predictors. Once again these centrality statistics are calculated using only the training set although they are applied to both train and test sets.

"knnImp" - that fills in any NA values in both sets using the median value for numeric predictors and the mode for nominal predictors, but using only the k-nearest neighbors to calculate these statistics.

"na.omit" - that uses the R function na.omit to remove any rows containing NA's from both the training and test sets.

"undersampl" - this undersamples the training cases that do not belong to the minority class (this pre-processing step is only available for classification tasks!). It takes the parameter perc.under that controls the level of undersampling (defaulting to 1, which means that there will be as many cases from the other class(es) as from the minority class).

"smote" - this operation uses the SMOTE (Chawla et al. 2002) resampling algorithm to generate a new training sample with a more "balanced" distribution of the target class (this pre-processing step is only available for classification tasks!). It takes the parameters perc.under, perc.over and k to control the algorithm. See the documentation of function smote for more details.
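Of the steps above, "scale" and "centralImp" estimate their statistics on the training set only and then apply them to both sets. A minimal base R sketch of that principle follows. It assumes numeric columns only, and `trainImpute` and `trainScale` are hypothetical helpers written for this illustration, not the package's own code.

```r
## Hypothetical helper: fill NAs with the TRAINING median in both sets
trainImpute <- function(train, test, cols) {
    for (p in cols) {
        med <- median(train[[p]], na.rm = TRUE)  # training set only
        train[[p]][is.na(train[[p]])] <- med
        test[[p]][is.na(test[[p]])]   <- med
    }
    list(train = train, test = test)
}

## Hypothetical helper: standardize both sets with the TRAINING mean and sd
trainScale <- function(train, test, cols) {
    for (p in cols) {
        mu <- mean(train[[p]], na.rm = TRUE)     # training set only
        s  <- sd(train[[p]], na.rm = TRUE)       # training set only
        train[[p]] <- (train[[p]] - mu) / s
        test[[p]]  <- (test[[p]] - mu) / s
    }
    list(train = train, test = test)
}

## toy data (made up for illustration)
tr <- data.frame(x = c(1, 2, NA, 3, 4), y = c(10, 20, 30, 40, 50))
ts <- data.frame(x = c(NA, 10), y = c(60, 70))

imp <- trainImpute(tr, ts, cols = "x")
sc  <- trainScale(imp$train, imp$test, cols = "x")

## the scaled training column has mean 0 and sd 1; the test column
## generally does not, since it was scaled with the training statistics
c(mean(sc$train$x), sd(sc$train$x))
sc$test$x
```

Using only training statistics keeps information about the test set from leaking into the pre-processing, which would otherwise bias the performance estimates.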

A list with components "train" and "test", each containing a data frame.

Luis Torgo ltorgo@dcc.fc.up.pt

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357.

Torgo, L. (2014) An Infra-Structure for Performance Estimation and Experimental Comparison of Predictive Models in R. arXiv:1412.0436 [cs.MS] http://arxiv.org/abs/1412.0436

standardPOST, standardWF, timeseriesWF

## Not run: 

##  A small example with standard pre-processing: clean NAs and scale
data(algae,package="DMwR")

idx <- sample(1:nrow(algae),150)
tr <- algae[idx,1:12]
ts <- algae[-idx,1:12]
summary(tr)
summary(ts)

preData <- standardPRE(a1 ~ ., tr, ts, steps=c("centralImp","scale"))
summary(preData$train)
summary(preData$test)

######
## Using in the context of an experiment
library(e1071)
res <- performanceEstimation(
  PredTask(a1 ~ .,algae[,1:12],"alga1"),
  Workflow(learner="svm",pre=c("centralImp","scale")),
  EstimationTask(metrics="mse")
  )

summary(res)

######
## A user-defined pre-processing function
myScale <- function(f,tr,ts,avg,std,...) {
    tgtVar <- deparse(f[[2]])
    allPreds <- setdiff(colnames(tr),tgtVar)
    numPreds <- allPreds[sapply(allPreds,
                          function(p) is.numeric(tr[[p]]))]
    tr[,numPreds] <- scale(tr[,numPreds],center=avg,scale=std)
    ts[,numPreds] <- scale(ts[,numPreds],center=avg,scale=std)
    list(train=tr,test=ts)
}

## now using it with some random averages and stds for the 8 numeric
## predictors (just for illustration)
newData <- standardPRE(a1 ~ .,tr,ts,steps="myScale",
                       avg=rnorm(8),std=rnorm(8))


## End(Not run)