nestcv.train  R Documentation 
This function applies nested cross-validation (CV) to the training of models
using the caret package. The function also allows the option of embedded
filtering of predictors for feature selection, nested within the outer loop of
CV. Predictions on the outer test folds are brought back together and error
estimation / accuracy is determined. The default is 10x10 nested CV.
nestcv.train(
  y,
  x,
  filterFUN = NULL,
  filter_options = NULL,
  weights = NULL,
  balance = NULL,
  balance_options = NULL,
  outer_method = c("cv", "LOOCV"),
  n_outer_folds = 10,
  outer_folds = NULL,
  cv.cores = 1,
  metric = ifelse(is.factor(y), "logLoss", "RMSE"),
  trControl = NULL,
  tuneGrid = NULL,
  savePredictions = "final",
  outer_train_predict = FALSE,
  finalCV = TRUE,
  na.option = "pass",
  ...
)
y 
Response vector. For classification this should be a factor. 
x 
Matrix or data frame of predictors 
filterFUN 
Filter function, e.g. ttest_filter or relieff_filter.
Any function can be provided; it is passed the response y and predictors x. 
filter_options 
List of additional arguments passed to the filter
function specified by filterFUN 
weights 
Weights applied to each sample, for models which can use
weights. Note weights and balance cannot be used at the same time. 
balance 
Specifies the method for dealing with imbalanced class data;
see the package documentation for the available balancing functions. 
balance_options 
List of additional arguments passed to the balancing function specified by balance 
outer_method 
String of either "cv" or "LOOCV" specifying the method for the outer CV folds 
n_outer_folds 
Number of outer CV folds 
outer_folds 
Optional list containing indices of test folds for outer
CV. If supplied, n_outer_folds is ignored. 
cv.cores 
Number of cores for parallel processing of the outer loops.
NOTE: this uses parallel::mclapply on unix/mac and parallel::parLapply on windows (see Details). 
metric 
A string that specifies what summary metric will be used to select the optimal model. By default, "logLoss" is used for classification and "RMSE" is used for regression. Note this differs from the default setting in caret which uses "Accuracy" for classification. See details. 
trControl 
A list of values generated by caret::trainControl, controlling the inner CV. 
tuneGrid 
Data frame of tuning values, see caret::train. 
savePredictions 
Indicates whether hold-out predictions for each inner
CV fold should be saved for ROC curves, accuracy etc.; see
caret::trainControl. Default is "final". 
outer_train_predict 
Logical whether to save predictions on the outer training folds, so that performance on the outer training folds can be calculated. 
finalCV 
Logical whether to perform one last round of CV on the whole
dataset to determine the final model parameters. 
na.option 
Character value specifying how NA values are handled. Default is "pass". 
... 
Arguments passed to caret::train, including method specifying the model to be fitted 
Parallelisation is performed on the outer folds using parallel::mclapply on
unix/mac and parallel::parLapply on windows.
We strongly recommend that you try calls to nestcv.train
with cv.cores = 1 first. With caret
this may flag up that specific packages are not installed
or that there are problems with the input variables y
and x which may have to
be corrected for the call to run in multicore mode.
If the outer folds are run using parallelisation, then parallelisation in
caret must be off, otherwise an error will be generated. Alternatively, if
you wish to use parallelisation in caret, then parallelisation in
nestcv.train can be fully disabled by leaving cv.cores = 1.
For classification, metric defaults to 'logLoss' with the
trControl arguments classProbs = TRUE, summaryFunction = mnLogLoss,
rather than 'Accuracy', which is the default classification metric in
caret. See trainControl. LogLoss is arguably more consistent than
Accuracy for tuning parameters in datasets with small sample size.
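If the caret default of Accuracy is preferred, the metric and trControl arguments can be overridden together. A minimal sketch, assuming the nestedcv and caret packages are installed; the binary outcome derived from iris here is purely illustrative:

```r
library(nestedcv)
library(caret)

## illustrative binary outcome from iris
data(iris)
x <- iris[, 1:4]
y <- factor(iris$Species == "setosa", labels = c("other", "setosa"))

## override the logLoss default: tune by Accuracy with standard inner CV
tc <- trainControl(method = "cv", number = 5)
fit <- nestcv.train(y, x, method = "rf",
                    metric = "Accuracy", trControl = tc,
                    n_outer_folds = 3)
summary(fit)
```

Supplying trControl here replaces the classProbs/mnLogLoss defaults described above, so inner CV folds are scored by Accuracy instead.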
Models can be fitted with a single set of fixed parameters, in which case
trControl defaults to trainControl(method = "none"),
which disables inner CV as it is unnecessary. See
https://topepo.github.io/caret/model-training-and-tuning.html#fitting-models-without-parameter-tuning
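Fitting with fixed parameters can be sketched by supplying a one-row tuneGrid, which leaves nothing to tune in the inner loop. A sketch, assuming the nestedcv package is installed; mtry = 2 is an arbitrary illustrative value:

```r
library(nestedcv)

data(iris)
x <- iris[, 1:4]
y <- iris$Species

## a one-row tuneGrid fixes mtry, so inner CV is skipped
fit <- nestcv.train(y, x, method = "rf",
                    tuneGrid = data.frame(mtry = 2),
                    n_outer_folds = 3)
fit$finalTune
```

Outer CV still runs, so unbiased performance estimates are obtained even though no hyperparameter tuning takes place.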
An object with S3 class "nestcv.train"
call 
the matched call 
output 
Predictions on the left-out outer folds 
outer_result 
List of results from each outer fold, containing predictions on the left-out outer folds, the caret result and the number of filtered predictors at each fold. 
outer_folds 
List of indices of outer test folds 
dimx 
dimensions of x 
y 
original response vector 
yfinal 
final response vector (post-balancing) 
final_fit 
Final fitted caret model using best tune parameters 
final_vars 
Column names of filtered predictors entering final model 
summary_vars 
Summary statistics of filtered predictors 
roc 
ROC AUC for binary classification where available. 
trControl 
trainControl object used for inner CV 
bestTunes 
best tuned parameters from each outer fold 
finalTune 
final parameters used for final model 
summary 
Overall performance summary. Accuracy and balanced accuracy for classification. ROC AUC for binary classification. RMSE for regression. 
Myles Lewis
## sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}

## load iris dataset and simulate a binary outcome
data(iris)
x <- iris[, 1:4]
colnames(x) <- c("marker1", "marker2", "marker3", "marker4")
x <- as.data.frame(apply(x, 2, scale))
y2 <- sigmoid(0.5 * x$marker1 + 2 * x$marker2) > runif(nrow(x))
y2 <- factor(y2, labels = c("class1", "class2"))

## Example using random forest with caret
cvrf <- nestcv.train(y2, x, method = "rf",
                     n_outer_folds = 3, cv.cores = 2)
summary(cvrf)

## Example of glmnet tuned using caret
## set up small tuning grid for quick execution
## length.out of 20-100 is usually recommended for lambda
## and more alpha values ranging from 0-1
tg <- expand.grid(lambda = exp(seq(log(2e-3), log(1e0), length.out = 5)),
                  alpha = 1)
ncv <- nestcv.train(y = y2, x = x,
                    method = "glmnet",
                    n_outer_folds = 3,
                    tuneGrid = tg, cv.cores = 2)
summary(ncv)

## plot tuning for outer fold #1
plot(ncv$outer_result[[1]]$fit, xTrans = log)

## plot final ROC curve
plot(ncv$roc)

## plot ROC for left-out inner folds
inroc <- innercv_roc(ncv)
plot(inroc)