dCVnet: Double cross-validated elastic-net regularised regression

View source: R/dCVnet_main.R

Double cross-validated elastic-net regularised regression

Description

Fits and cross-validates an elastic-net regularised regression model, using an independent (inner) cross-validation loop to select the optimal alpha and lambda hyperparameters for the regularisation.

Usage

dCVnet(
  y,
  data,
  f = "~.",
  family = "binomial",
  offset = NULL,
  nrep_outer = 2,
  k_outer = 10,
  nrep_inner = 5,
  k_inner = 10,
  alphalist = c(0.2, 0.5, 0.8),
  nlambda = 100,
  type.measure = "deviance",
  opt.lambda.type = c("min", "1se"),
  opt.empirical_cutoff = FALSE,
  opt.uniquefolds = FALSE,
  opt.ystratify = TRUE,
  opt.random_seed = NULL,
  opt.use_imputation = FALSE,
  opt.imputation_usey = FALSE,
  opt.imputation_method = c("mean", "knn", "missForestPredict"),
  ...
)

Arguments

y

the outcome: a numeric vector, a factor (for the binomial / multinomial families) or a matrix (for the cox / mgaussian families). For factors see the Factor Outcomes section below.

data

a data.frame containing variables needed for the formula (f).

f

a one-sided formula. The RHS must refer to columns in data and may include interactions, transformations or expansions (like poly or log). The formula must include an intercept.
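For illustration (the column names age and bmi below are hypothetical, not columns of any dCVnet example data), a one-sided formula combining a transformation, an expansion and an interaction could look like:

f <- ~ log(bmi) + poly(age, 2) + age:bmi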

family

the model family (see glmnet)

offset

optional model offset (see glmnet)

nrep_outer

an integer, the number of repetitions (k-fold outer CV)

k_outer

an integer, the k in the outer k-fold CV.

nrep_inner

an integer, the number of repetitions (k-fold inner CV)

k_inner

an integer, the k in the inner k-fold CV.

alphalist

a numeric vector of values in (0,1]. This sets the search space for optimising hyperparameter alpha.

nlambda

an integer, number of gradations between lambda.min and lambda.max to search. See glmnet argument nlambda.

type.measure

passed to cv.glmnet. This sets the metric used for hyperparameter optimisation in the inner cross-validation. Options: "deviance", "class", "mse", "mae"

opt.lambda.type

Method for selecting optimum lambda. One of

  • "min" - returns the lambda with best CV score.

  • "1se" - returns the largest lambda whose CV score is within one standard error of the best.
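These broadly correspond to the lambda.min and lambda.1se values reported by cv.glmnet. A minimal standalone illustration using cv.glmnet directly (not dCVnet):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- rbinom(100, 1, 0.5)
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
cvfit$lambda.min  # lambda with the best cross-validated score ("min")
cvfit$lambda.1se  # largest lambda within 1 SE of the best ("1se")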

opt.empirical_cutoff

Boolean. Use the empirical proportion of cases as the classification cutoff in the outer CV (this affects reported outer CV performance only). Otherwise classification uses a 50% probability cutoff.

opt.uniquefolds

Boolean. In most circumstances folds will be unique. This option requests a check that the randomly generated folds in the inner and outer loops are unique. Currently a warning is given if non-unique folds are found.

opt.ystratify

Boolean. Outer and inner sampling is stratified by outcome. This is implemented with createFolds (from the caret package).

opt.random_seed

Interpreted as integer. This is used to control the generation of random folds.

opt.use_imputation

Boolean. Run imputation on missing predictors?

opt.imputation_usey

Boolean. Should conditional imputation methods use y in the imputation model? Note: no effect if opt.use_imputation is FALSE, or if opt.imputation_method is "mean".

opt.imputation_method

Which imputation method to use:

  • "mean" - mean imputation (unconditional)

  • "knn" - k-nearest neighbours imputation (uses preProcess).

  • "missForestPredict" - use the missForestPredict package to impute missing values.
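A rough sketch of the first two approaches using standard tools (illustrative only, not dCVnet's internal code; note that caret's knnImpute also centres and scales the data):

library(caret)
set.seed(1)
df <- data.frame(a = rnorm(20), b = rnorm(20))
df$a[c(3, 7)] <- NA

# "mean": unconditional mean imputation
df_mean <- as.data.frame(lapply(df, function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}))

# "knn": k-nearest neighbour imputation via caret::preProcess
pp <- preProcess(df, method = "knnImpute")
df_knn <- predict(pp, df)  # imputed, centred and scaled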

...

Arguments to pass through to cv.glmnet (may break things).

Details

The time-consuming double (i.e. nested) cross-validation (CV) is used because single cross-validation - which both tunes hyperparameters and estimates out-of-sample classification performance - will be optimistically biased.
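A minimal runnable sketch of this nested structure, using glmnet directly (illustrative only: dCVnet additionally repeats both loops, stratifies the folds and tunes alpha):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(200 * 10), 200, 10)
y <- rbinom(200, 1, plogis(x[, 1]))
outer_folds <- sample(rep(1:5, length.out = nrow(x)))

outer_perf <- sapply(1:5, function(k) {
  train <- outer_folds != k
  # inner CV on the outer training data only, to select lambda:
  cvfit <- cv.glmnet(x[train, ], y[train], family = "binomial", alpha = 0.5)
  # evaluate on the held-out outer fold (here: Brier score):
  p <- predict(cvfit, x[!train, ], s = "lambda.min", type = "response")
  mean((p - y[!train])^2)
})
mean(outer_perf)  # cross-validated performance estimate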

Cross-validation for both the inner and outer loop is repeated k-fold.

Both alpha and lambda hyperparameters of the elastic-net can be tuned:

  • lambda - the total regularisation penalty

  • alpha - the balance between L1 (LASSO) and L2 (Ridge) regularisation
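Concretely, in glmnet's parameterisation the penalty added to the model's loss is lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))), so alpha = 1 gives pure LASSO and alpha = 0 pure ridge. A small sketch:

enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}
enet_penalty(beta = c(0.5, -0.2, 0), lambda = 0.1, alpha = 0.5)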

Value

a dCVnet object containing:

  • input: call arguments and input data

  • prod: the production model and preprocessing information used in making new predictions.

  • folds: outer-loop CV fold membership

  • performance: Cross-validated performance

  • tuning: Outer loop CV tuning information

Coefficients

Currently all coefficients reported / used by dCVnet are semi-standardised for x, but not y. In other words, the predictor matrix x is mean-centered and scaled by the standard deviation prior to calculations.

The predict method for dCVnet stores these means/SDs, and will apply the same standardisation to x on predicting new values.

As a result the reported coefficients for dCVnet can be interpreted as (semi-)standardised effect sizes: a coefficient of 0.5 is the effect associated with a 1 SD difference in that column of the x matrix. Note this applies to all columns of x, including dummy-coded factors, so coefficients for binary variables interpreted as effect sizes will also reflect the prevalence of the category.

When running cross-validation, standardisation is based on the means and standard deviations of the training dataset, not the held-out test data. This prevents leakage from train to test data.
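A minimal sketch of that train-only standardisation (illustrative, not dCVnet's internal code):

set.seed(1)
x_train <- matrix(rnorm(50 * 3), 50, 3)
x_test  <- matrix(rnorm(20 * 3), 20, 3)

train_means <- colMeans(x_train)
train_sds   <- apply(x_train, 2, sd)
x_train_std <- scale(x_train, center = train_means, scale = train_sds)
x_test_std  <- scale(x_test,  center = train_means, scale = train_sds)  # no leakage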

This approach can be contrasted with glmnet, which internally standardises and then back-transforms coefficients to the original scale; dCVnet coefficients are always presented as (semi-)standardised.

Coefficients in the original scale can be recovered using the standard deviations and means employed in the standardisation (see: https://stats.stackexchange.com/a/75025). These means and standard deviations are retained for the production model in the preprocess slot of the dCVnet object: my_model$prod$preprocess.
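For example, if b_std are the standardised slopes, b0_std the standardised intercept, and m / s the stored means and standard deviations, then the original-scale slopes are b_std / s and the intercept becomes b0_std - sum(b_std * m / s). A hedged sketch (the helper below is not part of dCVnet; check str(my_model$prod$preprocess) for where exactly the means and SDs are stored):

unstandardise_coefs <- function(b_std, b0_std, m, s) {
  b_orig  <- b_std / s
  b0_orig <- b0_std - sum(b_std * m / s)
  c("(Intercept)" = unname(b0_orig), b_orig)
}
unstandardise_coefs(b_std = c(x1 = 0.5), b0_std = -0.2,
                    m = c(x1 = 10), s = c(x1 = 2))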

Factor Outcomes

For categorical families (binomial, multinomial) input can be:

  • numeric (integer): c(0,1,2)

  • factor: factor(1:3, labels = c("A", "B", "C"))

  • character: c("A", "B", "C")

  • other

These are treated differently.

Numeric data is used as provided. Character data will be coerced to a factor: factor(x, levels = sort(unique(x))). Factor data will be used as provided, but must have levels in alphabetical order.

In all cases the reference category must be ordered first; for the binomial family this means the 'positive' category is second.
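For example (an illustrative sketch with hypothetical class labels):

y_chr <- c("case", "control", "control", "case")
y_fct <- factor(y_chr, levels = sort(unique(y_chr)))
levels(y_fct)  # "case" "control": "case" is the reference, "control" the 'positive' class
# one way to make "case" the positive class while keeping alphabetical levels
# is to relabel it so that it sorts second, e.g.:
y_alt <- factor(ifelse(y_chr == "case", "g2_case", "g1_control"))
levels(y_alt)  # "g1_control" "g2_case"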

Why alphabetical? Previously bugs arose due to different handling of factor levels between functions called by dCVnet. These appear to be resolved in the latest versions of the packages, but this restriction will stay until I can verify this.

Notes

Sparse matrices are not supported by dCVnet.
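If your predictors are currently held in a sparse matrix (e.g. from the Matrix package), one workaround, at the cost of memory, is to densify them before constructing the data.frame passed to dCVnet:

library(Matrix)
set.seed(1)
x_sparse <- Matrix::rsparsematrix(10, 3, density = 0.3)
colnames(x_sparse) <- paste0("v", 1:3)
dat <- as.data.frame(as.matrix(x_sparse))  # dense data.frame for dCVnet's data argument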

Examples

## Not run: 

# Iris example: Setosa vs. Virginica
#
# This example is fast to run, but not very informative because it is a
#  simple problem without overfitting and the predictors work 'perfectly'.
# `help(iris)` for more information on the data.

# Make a two-class problem from the iris dataset:
siris <- droplevels(subset(iris, iris$Species != "versicolor"))
# scale the iris predictors:
siris[,1:4] <- scale(siris[,1:4])

set.seed(1) # for reproducibility
model <- dCVnet(y = siris$Species,
                f = ~ Sepal.Length + Sepal.Width +
                      Petal.Length + Petal.Width,
                data = siris,
                alphalist = c(0.2, 0.5, 1.0),
                opt.lambda.type = "1se")

# Note: in most circumstances non-default (larger) values of
#       nrep_inner and nrep_outer will be required.

# Input summary:
dCVnet::parseddata_summary(model)

# Model summary:
summary(model)

# Detailed cross-validated model performance summary:
summary(performance(model))

# hyperparameter tuning plot:
plot(model)
# as above, but zoomed in:
plot(model)$plot + ggplot2::coord_cartesian(ylim = c(0, 0.03), xlim = c(-4, -2))

# Performance ROC plot:
plot(model, type = "roc")

# predictor importance (better with more outer reps):
dCVnet::coefficients_summary(model)
#    show variability over both folds and reps:
dCVnet::plot_outerloop_coefs(model, "all")

# selected hyperparameters:
dCVnet::selected_hyperparameters(model, what = "data")

# Reference logistic regressions (unregularised & univariate):
ref_model <- dCVnet::refunreg(model)

dCVnet::report_reference_performance_summary(ref_model)


## End(Not run)
