These functions are run in evaluate just prior to model fitting, to extract fitting and test sets from the entire dataset and apply transformations to pre-process the data (for handling missing values, scaling, compression etc.).
They can also be used to adapt the form of the data to a specific fitting function, e.g. pre_pamr, which transposes the dataset to make it compatible with the pamr classification method.
pre_split(x, y, fold)
pre_convert(data, x_fun, y_fun, ...)
pre_transpose(data)
pre_remove(data, feature)
pre_center(data, y = FALSE, na.rm = TRUE)
pre_scale(data, y = FALSE, na.rm = TRUE, center = TRUE)
pre_remove_constant(data, na.rm = TRUE)
pre_remove_correlated(data, cutoff)
pre_pca(data, ncomponent, scale. = TRUE, ...)
x | Dataset.
y | Response vector.
fold | A logical or numeric vector with
data | Fitting and testing data sets, as returned by
x_fun | Function to apply to the descriptors of the datasets (e.g.
y_fun | Function to be applied to the response of the training and test sets (independently).
... | Sent to internal methods, see the code of each function.
feature | The features to be removed. Can be integer, logical or character.
na.rm | A logical value indicating whether
center | Whether to center the data before scaling.
cutoff | See
ncomponent | Number of PCA components to use. If missing, all components are used.
scale. | Sent to
When supplied to evaluate, pre-processing functions can be chained (i.e. executed sequentially) after an initiating call to pre_split.
This can be done either with the pipe operator defined in the magrittr package or by putting all pre-processing functions in a regular list (see the examples).
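As a rough illustration of what executing a list of functions sequentially amounts to, the base R sketch below folds a list of steps over a starting value with Reduce. The helper run_chain and the toy steps are hypothetical and not part of emil; how evaluate handles such a list internally is not shown here.

```r
# Hypothetical helper (not part of emil): apply a list of functions in order,
# feeding the output of each step to the next one.
run_chain <- function(steps, input) {
    Reduce(function(value, step) step(value), steps, input)
}

# Toy steps standing in for pre-processing functions
add_one <- function(v) v + 1
double_it <- function(v) v * 2

# Equivalent to the nested call double_it(add_one(1:3))
chained <- run_chain(list(add_one, double_it), 1:3)
```

The pipe form expresses the same order of execution, reading left to right instead of inside out.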
Note that all transformations are defined based on the fitting data only and then applied to both the fitting set and the test set. It is important that the test data does not take part in the model fitting in any way, including the pre-processing, to avoid information leakage and biased results!
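The principle can be sketched in base R without emil. The example below is only an illustration of the idea, not the package's implementation: the split vector is_fit is a made-up assumption, and the centering and scaling parameters are estimated on the fitting rows alone before being applied to both sets.

```r
# Base R sketch: estimate centering/scaling parameters on the fitting set only
x <- as.matrix(iris[-5])
is_fit <- seq_len(nrow(x)) %% 4 != 0   # assumed split: every 4th row is a test observation

fit_x  <- x[is_fit, , drop = FALSE]
test_x <- x[!is_fit, , drop = FALSE]

# Parameters come from the fitting set alone
mu    <- colMeans(fit_x, na.rm = TRUE)
sigma <- apply(fit_x, 2, sd, na.rm = TRUE)

# The same parameters are then applied to both sets
fit_scaled  <- sweep(sweep(fit_x,  2, mu, "-"), 2, sigma, "/")
test_scaled <- sweep(sweep(test_x, 2, mu, "-"), 2, sigma, "/")

colMeans(fit_scaled)   # zero up to rounding error
colMeans(test_scaled)  # generally not exactly zero, since the test rows played no part
```

Computing mu and sigma on the full matrix x instead would let the test observations influence the transformation, which is the leakage the note warns about.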
The imputation functions can also be used outside of evaluate by not supplying a fold to pre_split.
See the code of impute_median for an example.
A list with the following components:
fit | Fitting set.
test | Test set.
feature_selection | Integer vector mapping the features of the training and test sets to the original data sets.
fold | The fold that was used to split the data.
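Since pre-processing functions take and return a list of this form, a custom step could plausibly be written as an ordinary function. The sketch below rests on assumptions: pre_log is a hypothetical name, not an emil function, and the mock sets object only mimics the documented components, assuming fit and test each hold a descriptor matrix x (as data$fit$x in the examples suggests) and a response y.

```r
# Hypothetical custom step (not part of emil): log-transform the descriptors
# of both sets and return the list otherwise unchanged.
pre_log <- function(data) {
    data$fit$x  <- log1p(data$fit$x)
    data$test$x <- log1p(data$test$x)
    data
}

# Mock object mimicking the documented components (assumed layout)
x <- as.matrix(iris[-5])
y <- iris$Species
is_fit <- seq_len(nrow(x)) %% 4 != 0
sets <- list(
    fit  = list(x = x[is_fit, , drop = FALSE],  y = y[is_fit]),
    test = list(x = x[!is_fit, , drop = FALSE], y = y[!is_fit]),
    feature_selection = seq_len(ncol(x)),
    fold = is_fit
)

sets <- pre_log(sets)
str(sets, max.level = 2)
```

A log transform needs no parameters estimated from the data, so applying it to both sets independently does not leak information from the test set.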
Christofer Bäcklin
pre_factor_to_logical, emil, pre_impute_knn
# Load emil, and magrittr for the pipe operator
library(emil)
library(magrittr)

# Set up an example to work on
x <- as.matrix(iris[-5])
x[sample(600, 6)] <- NA   # introduce 6 missing values among the 150 x 4 entries
y <- iris$Species
cv <- resample("crossvalidation", y, nrepeat=3, nfold=4)
procedure <- modeling_procedure("lda")

# Simple dataset splitting
sets <- pre_split(x, y, cv[[1]])

# Chaining using the pipe operator
sets <- pre_split(x, y, cv[[1]]) %>%
    pre_impute_median %>%
    pre_scale

# Integration with `evaluate`
result <- evaluate(procedure, x, y, resample=cv,
    pre_process = function(...){
        pre_split(...) %>%
            pre_impute_median %>%
            pre_scale
    }
)

# or analogously with a list
result <- evaluate(procedure, x, y, resample=cv,
    pre_process = list(pre_split, pre_impute_median, pre_scale))

# Imputing without splitting
x.imputed <- impute_knn(x)

# Using a whole chain without splitting
x.processed <- pre_split(x, y=NULL) %>%
    pre_impute_median %>%
    pre_scale %>%
    (function(data) data$fit$x)