pre_process: Data preprocessing


Description

These functions are run in evaluate just prior to model fitting to extract fitting and test sets from the entire dataset and to apply transformations that pre-process the data (handling missing values, scaling, compression etc.). They can also be used to adapt the form of the data to a specific fitting function, e.g. pre_pamr, which transposes the dataset to make it compatible with the pamr classification method.

Usage

pre_split(x, y, fold)

pre_convert(data, x_fun, y_fun, ...)

pre_transpose(data)

pre_remove(data, feature)

pre_center(data, y = FALSE, na.rm = TRUE)

pre_scale(data, y = FALSE, na.rm = TRUE, center = TRUE)

pre_remove_constant(data, na.rm = TRUE)

pre_remove_correlated(data, cutoff)

pre_pca(data, ncomponent, scale. = TRUE, ...)

Arguments

x

Dataset.

y

Response vector.

fold

A logical or numeric vector with TRUE or positive numbers for fitting observations, FALSE or 0 for test observations, and NA for observations not to be included.

data

Fitting and testing data sets, as returned by pre_split.

x_fun

Function to apply to the descriptors of the datasets (i.e. x). This function will be applied independently to the fitting and testing sets.

y_fun

Function to be applied to the response of the fitting and test sets (independently).

...

Passed on to internal methods; see the code of each function.

feature

The features to be removed. Can be integer, logical or character.

na.rm

A logical value indicating whether NA values should be ignored.

center

Whether to center the data before scaling.

cutoff

See findCorrelation.

ncomponent

Number of PCA components to use. If missing, all components are used.

scale.

Sent to prcomp.
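
As an illustration of how ncomponent and scale. relate to prcomp, the following base R sketch fits a PCA on the fitting observations and projects both sets onto the leading components. It does not call pre_pca, and whether pre_pca works exactly this way is an assumption; the fold pattern and the value of ncomponent are chosen only for the example.

```r
x <- as.matrix(iris[-5])
fold <- rep(c(TRUE, TRUE, FALSE), length.out = nrow(x))  # TRUE = fitting, FALSE = test
ncomponent <- 2                                          # illustrative choice

# Fit the rotation on the fitting set only; scale. is handed to prcomp.
pca <- prcomp(x[fold, , drop = FALSE], scale. = TRUE)

# Project both sets with the fitted rotation and keep the leading components.
fit_pc  <- predict(pca, x[fold, , drop = FALSE])[, seq_len(ncomponent), drop = FALSE]
test_pc <- predict(pca, x[!fold, , drop = FALSE])[, seq_len(ncomponent), drop = FALSE]

dim(fit_pc)
dim(test_pc)
```

The centering, scaling and rotation stored in the prcomp object all come from the fitting rows, so the test rows never influence the projection.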

Details

When supplied to evaluate, pre-processing functions can be chained (i.e. executed sequentially) after an initiating call to pre_split. This can be done either with the pipe operator defined in the magrittr package or by putting all pre-processing functions in a regular list (see the examples).
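
Why the two styles are interchangeable can be sketched in base R. The functions step_split, step_center and run_chain below are hypothetical stand-ins and not part of emil, and using Reduce to execute a list of steps is an assumption for illustration, not a description of what evaluate does internally.

```r
# Hypothetical stand-ins for pre-processing steps (not emil functions).
step_split <- function(x, y, fold) {
    list(fit  = list(x = x[fold, , drop = FALSE],  y = y[fold]),
         test = list(x = x[!fold, , drop = FALSE], y = y[!fold]))
}
step_center <- function(data) {
    m <- colMeans(data$fit$x)              # statistics from the fitting set only
    data$fit$x  <- sweep(data$fit$x,  2, m)
    data$test$x <- sweep(data$test$x, 2, m)
    data
}

# Run a list of steps: the first one receives the raw arguments,
# each later step receives the output of the step before it.
run_chain <- function(steps, ...) {
    Reduce(function(data, f) f(data), steps[-1], steps[[1]](...))
}

x <- as.matrix(iris[-5])
y <- iris$Species
fold <- rep(c(TRUE, FALSE), length.out = nrow(x))

nested  <- step_center(step_split(x, y, fold))
chained <- run_chain(list(step_split, step_center), x, y, fold)
identical(nested, chained)
```

Each step takes the split data as its only argument and returns it in the same form, which is what makes the steps composable in either style.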

Note that all transformations are estimated from the fitting data only and then applied to both the fitting set and the test set. The test data must not influence model fitting in any way, including the pre-processing; otherwise information leaks and the results become biased.
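
The principle can be shown with a minimal base R sketch of centering and scaling. It does not use pre_scale; the variable names and the fold pattern are chosen only for the example.

```r
x <- as.matrix(iris[-5])
fold <- rep(c(TRUE, TRUE, FALSE), length.out = nrow(x))  # TRUE = fitting, FALSE = test

fit_x  <- x[fold, , drop = FALSE]
test_x <- x[!fold, , drop = FALSE]

# Estimate the transformation parameters on the fitting set only.
center <- colMeans(fit_x, na.rm = TRUE)
spread <- apply(fit_x, 2, sd, na.rm = TRUE)

# Apply the same parameters to both sets.
fit_scaled  <- scale(fit_x,  center = center, scale = spread)
test_scaled <- scale(test_x, center = center, scale = spread)

colMeans(fit_scaled)    # zero up to rounding error, by construction
colMeans(test_scaled)   # generally not exactly zero
```

Computing center and spread from the full matrix x instead would let the test observations shape the transformation, which is the leakage described above.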

The imputation functions can also be used outside of evaluate by not supplying a fold to pre_split. See the code of impute_median for an example.

Value

A list with the following components:

fit

Fitting set.

test

Test set.

feature_selection

Integer vector mapping the features of the fitting and test sets to the original data sets.

fold

The fold that was used to split the data.
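
The shape of this list, together with the fold encoding described under Arguments, can be mimicked in base R. The helper split_sketch is hypothetical and is not emil's pre_split; that fit and test each hold an x and a y element is inferred from the data$fit$x access in the examples and should be treated as an assumption.

```r
# Hypothetical helper, not emil's pre_split.
split_sketch <- function(x, y, fold) {
    fit_idx  <- which(!is.na(fold) & fold > 0)    # TRUE or positive: fitting
    test_idx <- which(!is.na(fold) & fold == 0)   # FALSE or 0: test; NA: excluded
    list(
        fit  = list(x = x[fit_idx, , drop = FALSE],  y = y[fit_idx]),
        test = list(x = x[test_idx, , drop = FALSE], y = y[test_idx]),
        feature_selection = seq_len(ncol(x)),     # identity map: nothing removed yet
        fold = fold
    )
}

x <- as.matrix(iris[-5])
y <- iris$Species
fold <- rep(c(1, 1, 0, NA), length.out = nrow(x))

sets <- split_sketch(x, y, fold)
names(sets)

# Dropping the second feature from both sets shifts the map accordingly.
keep <- -2
sets$fit$x  <- sets$fit$x[, keep, drop = FALSE]
sets$test$x <- sets$test$x[, keep, drop = FALSE]
sets$feature_selection <- sets$feature_selection[keep]
sets$feature_selection   # columns 1, 3 and 4 of the original data remain
```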

Author(s)

Christofer Bäcklin

See Also

pre_factor_to_logical, emil, pre_impute_knn

Examples

library(emil)
library(magrittr)  # provides the %>% pipe used below

# Set up an example to work on
x <- as.matrix(iris[-5])
x[sample(600, 6)] <- NA
y <- iris$Species
cv <- resample("crossvalidation", y, nrepeat=3, nfold=4)
procedure <- modeling_procedure("lda")

# Simple dataset splitting
sets <- pre_split(x, y, cv[[1]])

# Chaining using the pipe operator
sets <- pre_split(x, y, cv[[1]]) %>%
    pre_impute_median %>%
    pre_scale

# Integration with `evaluate`
result <- evaluate(procedure, x, y, resample=cv,
    pre_process = function(...){
        pre_split(...) %>%
        pre_impute_median %>%
        pre_scale
    }
)

# or analogously with a list
result <- evaluate(procedure, x, y, resample=cv,
    pre_process = list(pre_split, pre_impute_median, pre_scale))

# Imputing without splitting
x.imputed <- impute_knn(x)

# Using a whole chain without splitting
x.processed <- pre_split(x, y=NULL) %>%
    pre_impute_median %>%
    pre_scale %>%
    (function(data) data$fit$x)

emil documentation built on Aug. 1, 2018, 1:03 a.m.
