pre_process: Data preprocessing
In emil: Evaluation of Modeling without Information Leakage


These functions are run in evaluate just before model fitting. They extract fitting and test sets from the entire dataset and apply transformations to pre-process the data (handling missing values, scaling, compression, etc.). They can also adapt the form of the data to a specific fitting function, e.g. pre_pamr, which transposes the dataset to make it compatible with the pamr classification method.

pre_split(x, y, fold)

pre_convert(data, x_fun, y_fun, ...)

pre_transpose(data)

pre_remove(data, feature)

pre_center(data, y = FALSE, na.rm = TRUE)

pre_scale(data, y = FALSE, na.rm = TRUE, center = TRUE)

pre_remove_constant(data, na.rm = TRUE)

pre_remove_correlated(data, cutoff)

pre_pca(data, ncomponent, scale. = TRUE, ...)

`x`	Dataset.
`y`	Response vector.
`fold`	A logical or numeric vector with `TRUE` or positive numbers for fitting observations, `FALSE` or `0` for test observations, and `NA` for observations not to be included.
`data`	Fitting and testing data sets, as returned by `pre_split`.
`x_fun`	Function to apply to the descriptors of the datasets (e.g. `x`). This function will be applied independently to the fitting and test sets.
`y_fun`	Function to apply to the response of the fitting and test sets (independently).
`...`	Sent to internal methods, see the code of each function.
`feature`	The features to be removed. Can be integer, logical or character.
`na.rm`	A logical value indicating whether `NA` values should be ignored.
`center`	Whether to center the data before scaling.
`cutoff`	See `findCorrelation`.
`ncomponent`	Number of PCA components to use. If missing, all components are used.
`scale.`	Sent to `prcomp`.
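
To illustrate the coding of the `fold` argument described above, the following base R sketch partitions a small matrix by hand. It is only an illustration of the convention, not the code of `pre_split`; the data and the names `fit_idx` and `test_idx` are made up for the example.

```r
# Hypothetical fold vector for six observations:
# positive = fitting observation, 0 = test observation, NA = excluded
fold <- c(1, 1, 0, NA, 1, 0)

# which() silently drops NA, so excluded observations end up in neither set
fit_idx  <- which(fold > 0)
test_idx <- which(fold == 0)

# Toy dataset with one row per observation
x <- matrix(1:12, nrow = 6)
x_fit  <- x[fit_idx, , drop = FALSE]
x_test <- x[test_idx, , drop = FALSE]
```

The same comparisons work for a logical `fold`, since `TRUE > 0` and `FALSE == 0` both evaluate to `TRUE` in R.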

When supplied to evaluate, pre-processing functions can be chained (i.e. executed sequentially) after an initiating call to pre_split. This can be done either with the pipe operator defined in the magrittr package or by putting all pre-processing functions in a regular list (see the examples).
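
As a rough sketch of what executing a list of functions sequentially can look like, the base R code below applies a list of steps one after another, passing each result to the next. This is an assumption about the general idea, not the implementation used by evaluate; `step_split`, `step_double`, `step_inc` and `run_chain` are hypothetical names invented for the illustration.

```r
# Hypothetical stand-ins for pre-processing steps.
# The first one creates the fit/test list, the others transform it.
step_split  <- function(x) list(fit = x[1:3], test = x[4:5])
step_double <- function(data) lapply(data, function(v) v * 2)
step_inc    <- function(data) lapply(data, function(v) v + 1)

chain <- list(step_split, step_double, step_inc)

# Call the first step with the raw arguments, then feed each
# result into the following step
run_chain <- function(steps, ...) {
    data <- steps[[1]](...)
    for (f in steps[-1]) data <- f(data)
    data
}

result <- run_chain(chain, 1:5)
```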

Note that all transformations are defined based on the fitting data only and then applied to both the fitting set and the test set. It is important not to let the test data take any part in the model fitting, including the pre-processing, since that risks information leakage and biased results!
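
The principle can be sketched in base R as follows: the centering and scaling parameters are estimated from the fitting rows only and then reused unchanged on the test rows. This is a minimal illustration with simulated data and a hypothetical 7/3 split, not the code of pre_center or pre_scale.

```r
set.seed(1)
x <- matrix(rnorm(40), nrow = 10)   # simulated data, 10 observations
fit_rows  <- 1:7                    # hypothetical fitting observations
test_rows <- 8:10                   # hypothetical test observations

# Estimate the transformation from the fitting set only
mu    <- colMeans(x[fit_rows, ], na.rm = TRUE)
sigma <- apply(x[fit_rows, ], 2, sd, na.rm = TRUE)

# Apply the same parameters to both sets
standardize <- function(m) sweep(sweep(m, 2, mu, "-"), 2, sigma, "/")
x_fit  <- standardize(x[fit_rows, ])
x_test <- standardize(x[test_rows, ])
```

After this, the columns of `x_fit` have mean 0 and standard deviation 1, whereas the columns of `x_test` generally do not, because the test data had no influence on `mu` and `sigma`.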

The imputation functions can also be used outside of evaluate by not supplying a fold to pre_split. See the code of impute_median for an example.

A list with the following components

fit: Fitting set.
test: Test set.
feature_selection: Integer vector mapping the features of the training and test sets to the original data sets.
fold: The fold that was used to split the data.

Christofer Bäcklin

pre_factor_to_logical, emil, pre_impute_knn

# Set up an example to work on
library(emil)
x <- as.matrix(iris[-5])
x[sample(600, 6)] <- NA
y <- iris$Species
cv <- resample("crossvalidation", y, nrepeat=3, nfold=4)
procedure <- modeling_procedure("lda")

# Simple dataset splitting
sets <- pre_split(x, y, cv[[1]])

# Chaining using the pipe operator
sets <- pre_split(x, y, cv[[1]]) %>%
    pre_impute_median %>%
    pre_scale

# Integration with `evaluate`
result <- evaluate(procedure, x, y, resample=cv,
    pre_process = function(...){
        pre_split(...) %>%
        pre_impute_median %>%
        pre_scale
    }
)

# or analogously with a list
result <- evaluate(procedure, x, y, resample=cv,
    pre_process = list(pre_split, pre_impute_median, pre_scale))

# Imputing without splitting
x.imputed <- impute_knn(x)

# Using a whole chain without splitting
x.processed <- pre_split(x, y=NULL) %>%
    pre_impute_median %>%
    pre_scale %>%
    (function(data) data$fit$x)