nestcv.glmnet: Nested cross-validation with glmnet


nestcv.glmnet    R Documentation

Nested cross-validation with glmnet

Description

This function enables nested cross-validation (CV) with glmnet, including tuning of the elastic net alpha parameter. It also offers optional embedded filtering of predictors for feature selection, nested within the outer loop of CV. Predictions on the outer test folds are brought back together and used to estimate error and accuracy. The default is 10x10 nested CV.

Usage

nestcv.glmnet(
  y,
  x,
  family = c("gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian"),
  filterFUN = NULL,
  filter_options = NULL,
  balance = NULL,
  balance_options = NULL,
  outer_method = c("cv", "LOOCV"),
  n_outer_folds = 10,
  n_inner_folds = 10,
  outer_folds = NULL,
  alphaSet = seq(0, 1, 0.1),
  min_1se = 0,
  keep = TRUE,
  outer_train_predict = FALSE,
  weights = NULL,
  penalty.factor = rep(1, ncol(x)),
  cv.cores = 1,
  finalCV = TRUE,
  na.option = "omit",
  ...
)

Arguments

y

Response vector

x

Matrix of predictors. Data frames are coerced to a matrix, as glmnet requires.

family

Either a character string representing one of the built-in families, or else a glm() family object. Passed to cv.glmnet and glmnet.

filterFUN

Filter function, e.g. ttest_filter or relieff_filter. Any function can be provided and is passed y and x. Must return a character vector with names of filtered predictors.

filter_options

List of additional arguments passed to the filter function specified by filterFUN.

balance

Specifies the method for dealing with imbalanced class data. Current options are "randomsample" or "smote". See randomsample() and smote().

balance_options

List of additional arguments passed to the balancing function

outer_method

String, either "cv" or "LOOCV", specifying whether to use k-fold CV or leave-one-out CV (LOOCV) for the outer folds.

n_outer_folds

Number of outer CV folds

n_inner_folds

Number of inner CV folds

outer_folds

Optional list containing indices of test folds for outer CV. If supplied, n_outer_folds is ignored.

alphaSet

Vector of alphas to be tuned

min_1se

Value from 0 to 1 specifying the choice of optimal lambda, from 0 = lambda.min to 1 = lambda.1se.

keep

Logical indicating whether inner CV predictions are retained for calculating left-out inner CV fold accuracy etc. See argument keep in cv.glmnet.

outer_train_predict

Logical indicating whether to save predictions on the outer training folds so that performance on them can be calculated.

weights

Weights applied to each sample. Note weights and balance cannot be used at the same time. Weights are only applied in glmnet and not in filters.

penalty.factor

Separate penalty factors can be applied to each coefficient. Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model. Default is 1 for all variables. See glmnet

cv.cores

Number of cores for parallel processing of the outer loops. NOTE: this uses parallel::mclapply on Unix/macOS and parallel::parLapply on Windows.

finalCV

Logical indicating whether to perform one last round of CV on the whole dataset to determine the final model parameters. If set to FALSE, the median of the hyperparameters from the outer CV folds is used for the final model. Performance metrics are independent of this last step.

na.option

Character value specifying how NAs are dealt with. "omit" (the default) is equivalent to na.action = na.omit. "omitcol" removes cases with NA in 'y', but removes columns (predictors) of 'x' that contain NA instead of removing further cases, so that cases are preserved. With any other value, NAs are ignored (a message is given).

...

Optional arguments passed to cv.glmnet
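
As an illustration of the filterFUN and outer_folds arguments described above, the following is a minimal base R sketch, not part of the package. The function var_filter is hypothetical: it follows the stated contract (it is passed y and x and returns a character vector of predictor names), here ranking predictors by variance. The outer folds are built as a list of test-fold row indices.

```r
# Hypothetical filter (not from nestedcv): keep the `nfilter` predictors
# with the largest variance. It is passed y and x (y is unused here) and
# returns the names of the selected predictors as a character vector.
var_filter <- function(y, x, nfilter = 10, ...) {
  v <- apply(x, 2, var)
  keep <- order(v, decreasing = TRUE)[seq_len(min(nfilter, ncol(x)))]
  colnames(x)[keep]
}

set.seed(1)
n <- 30
x <- matrix(rnorm(n * 50), n, 50, dimnames = list(NULL, paste0("V", 1:50)))
y <- rnorm(n)

# Fixed outer test folds: a list with one integer vector of row indices
# per fold, each case appearing in exactly one fold
fold_id <- sample(rep(1:5, length.out = n))
outer_folds <- unname(split(seq_len(n), fold_id))

selected <- var_filter(y, x, nfilter = 5)

# These objects could then be supplied as, for example (not run here):
# nestcv.glmnet(y, x, family = "gaussian", filterFUN = var_filter,
#               filter_options = list(nfilter = 5), outer_folds = outer_folds)
```

Supplying outer_folds in this way keeps the outer test folds identical across runs, which may help when comparing different filters or alpha settings on the same data.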

Details

glmnet does not tolerate missing values, so na.option = "omit" is the default.
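
The difference between the "omit" and "omitcol" settings of na.option can be illustrated on a toy dataset. This is a base R sketch of the behaviour as described in the argument text, under the assumption that the description is complete. It is not the package's internal code.

```r
set.seed(2)
x <- matrix(rnorm(6 * 4), 6, 4, dimnames = list(NULL, paste0("V", 1:4)))
y <- c(1.2, NA, 0.7, 2.1, 1.5, 0.3)  # one missing response (case 2)
x[4, 2] <- NA                        # one missing predictor value (case 4)

# "omit"-like behaviour: drop every case with an NA in y or in any column of x
ok <- complete.cases(x) & !is.na(y)
x_omit <- x[ok, , drop = FALSE]
y_omit <- y[ok]

# "omitcol"-like behaviour: drop cases with NA in y, then drop the columns
# of x that still contain NA, so that the remaining cases are preserved
ok_y <- !is.na(y)
x_tmp <- x[ok_y, , drop = FALSE]
x_omitcol <- x_tmp[, colSums(is.na(x_tmp)) == 0, drop = FALSE]
y_omitcol <- y[ok_y]

dim(x_omit)     # fewer cases, all predictors kept
dim(x_omitcol)  # more cases, one predictor dropped
```

With many predictors and few cases, as in P >> n problems, dropping a handful of incomplete columns may be preferable to losing cases.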

Value

An object with S3 class "nestcv.glmnet"

call

the matched call

output

Predictions on the left-out outer folds

outer_result

List object of results from each outer fold containing predictions on left-out outer folds, best lambda, best alpha, fitted glmnet coefficients, list object of inner fitted cv.glmnet and number of filtered predictors at each fold.

outer_method

the outer_method argument

n_inner_folds

number of inner folds

outer_folds

List of indices of outer test folds

dimx

dimensions of x

y

original response vector

yfinal

final response vector (post-balancing)

final_param

Final mean best lambda and alpha from each fold

final_fit

Final fitted glmnet model

final_coef

Final model coefficients and mean expression

roc

ROC AUC for binary classification where available.

summary

Overall performance summary. Accuracy and balanced accuracy for classification. ROC AUC for binary classification. RMSE for regression.
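
For orientation, the metrics named in the summary element can be computed by hand as follows. This base R sketch uses common textbook definitions on toy values (balanced accuracy taken as the mean of per-class recall). Whether the package computes them in exactly this way is an assumption and is not stated on this page.

```r
# Classification: observed and predicted classes (toy values)
obs  <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1))
pred <- factor(c(0, 0, 0, 0, 0, 1, 1, 1, 0, 0), levels = levels(obs))

# Accuracy: proportion of correct predictions
accuracy <- mean(obs == pred)

# Balanced accuracy: mean of the per-class recall, which gives
# each class equal weight regardless of its size
recall <- tapply(pred == obs, obs, mean)
balanced_accuracy <- mean(recall)

# Regression: root mean squared error (toy values)
obs_num  <- c(1, 2, 3, 4)
pred_num <- c(1, 2, 3, 6)
rmse <- sqrt(mean((obs_num - pred_num)^2))
```

On imbalanced data, accuracy and balanced accuracy can differ noticeably, as they do in this toy example.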

Author(s)

Myles Lewis

Examples


library(nestedcv)

## Example binary classification problem with P >> n
x <- matrix(rnorm(150 * 2e+04), 150, 2e+04)  # predictors
y <- factor(rbinom(150, 1, 0.5))  # binary response

## Partition data into 2/3 training set, 1/3 test set
trainSet <- caret::createDataPartition(y, p = 0.66, list = FALSE)

## t-test filter using whole dataset
filt <- ttest_filter(y, x, nfilter = 100)
filx <- x[, filt]

## Train glmnet on training set only using filtered predictor matrix
library(glmnet)
fit <- cv.glmnet(filx[trainSet, ], y[trainSet], family = "binomial")
plot(fit)

## Predict response on test partition
predy <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "class")
predy <- as.vector(predy)
predyp <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "response")
predyp <- as.vector(predyp)
output <- data.frame(testy = y[-trainSet], predy = predy, predyp = predyp)

## Results on test partition
## shows bias since univariate filtering was applied to whole dataset
predSummary(output)

## Nested CV
fit2 <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1,
                      filterFUN = ttest_filter,
                      filter_options = list(nfilter = 100))
summary(fit2)
plot_lambdas(fit2, showLegend = "bottomright")

## ROC plots
library(pROC)
testroc <- roc(output$testy, output$predyp, direction = "<")
inroc <- innercv_roc(fit2)
plot(fit2$roc)
lines(inroc, col = 'blue')
lines(testroc, col = 'red')
legend('bottomright', legend = c("Nested CV", "Left-out inner CV folds", 
                                 "Test partition, non-nested filtering"), 
       col = c("black", "blue", "red"), lty = 1, lwd = 2, bty = "n")


nestedcv documentation built on Dec. 5, 2022, 5:25 p.m.