dCVnet: Double cross-validated elastic-net regularised regression

View source: R/dCVnet_main.R

Double cross-validated elastic-net regularised regression

Description

Fits and cross-validates an elastic-net regularised regression model, using an independent (inner) cross-validation loop to select the optimal alpha and lambda hyperparameters for the regularisation.

Usage

dCVnet(
  y,
  data,
  f = "~.",
  family = "binomial",
  offset = NULL,
  nrep_outer = 2,
  k_outer = 10,
  nrep_inner = 5,
  k_inner = 10,
  alphalist = c(0.2, 0.5, 0.8),
  nlambda = 100,
  type.measure = "deviance",
  opt.lambda.type = c("min", "1se"),
  opt.empirical_cutoff = FALSE,
  opt.uniquefolds = FALSE,
  opt.ystratify = TRUE,
  opt.random_seed = NULL,
  opt.use_imputation = FALSE,
  opt.imputation_usey = FALSE,
  opt.imputation_method = c("mean", "knn", "missForestPredict"),
  ...
)

Arguments

y

the outcome: a numeric vector, a factor (for the binomial / multinomial families) or a matrix (for the cox / mgaussian families). For factors see the Factor Outcomes section below.

data

a data.frame containing variables needed for the formula (f).

f

a one-sided formula. The RHS must refer to columns in data and may include interactions, transformations or expansions (like poly or log). The formula must include an intercept.
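For illustration (the column names age and bmi below are hypothetical, not columns of any dCVnet example data), a one-sided formula combining a transformation, an expansion and an interaction could look like:

f <- ~ log(bmi) + poly(age, 2) + age:bmi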

family

the model family (see glmnet)

offset

optional model offset (see glmnet)

nrep_outer

an integer, the number of repetitions (k-fold outer CV)

k_outer

an integer, the k in the outer k-fold CV.

nrep_inner

an integer, the number of repetitions (k-fold inner CV)

k_inner

an integer, the k in the inner k-fold CV.

alphalist

a numeric vector of values in (0,1]. This sets the search space for optimising hyperparameter alpha.

nlambda

an integer, number of gradations between lambda.min and lambda.max to search. See glmnet argument nlambda.

type.measure

passed to cv.glmnet. This sets the metric used for hyperparameter optimisation in the inner cross-validation. Options: "deviance", "class", "mse", "mae"

opt.lambda.type

Method for selecting optimum lambda. One of

  • "min" - returns the lambda with best CV score.

  • "1se" - returns the largest lambda whose CV score is within one standard error of the best.
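These broadly correspond to the lambda.min and lambda.1se values reported by cv.glmnet. A minimal standalone illustration using cv.glmnet directly (not dCVnet):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- rbinom(100, 1, 0.5)
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
cvfit$lambda.min  # lambda with the best cross-validated score ("min")
cvfit$lambda.1se  # largest lambda within 1 SE of the best ("1se")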

opt.empirical_cutoff

Boolean. Use the empirical proportion of cases as the classification cutoff in the outer CV (this affects reported outer CV performance only). Otherwise classification uses a 50% probability cutoff.

opt.uniquefolds

Boolean. In most circumstances folds will be unique. This option requests a check that the randomly generated folds in the inner and outer loops are unique. Currently a warning is given if non-unique folds are found.

opt.ystratify

Boolean. Outer and inner sampling is stratified by outcome. This is implemented with createFolds (from the caret package).

opt.random_seed

Interpreted as integer. This is used to control the generation of random folds.

opt.use_imputation

Boolean. Run imputation on missing predictors?

opt.imputation_usey

Boolean. Should conditional imputation methods use y in the imputation model? Note: no effect if opt.use_imputation is FALSE, or if opt.imputation_method is "mean".

opt.imputation_method

Which imputation method to use:

  • "mean" - mean imputation (unconditional)

  • "knn" - k-nearest neighbours imputation (uses preProcess).

  • "missForestPredict" - use the missForestPredict package to impute missing values.
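A rough sketch of the first two approaches using standard tools (illustrative only, not dCVnet's internal code; note that caret's knnImpute also centres and scales the data):

library(caret)
set.seed(1)
df <- data.frame(a = rnorm(20), b = rnorm(20))
df$a[c(3, 7)] <- NA

# "mean": unconditional mean imputation
df_mean <- as.data.frame(lapply(df, function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}))

# "knn": k-nearest neighbour imputation via caret::preProcess
pp <- preProcess(df, method = "knnImpute")
df_knn <- predict(pp, df)  # imputed, centred and scaled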

...

Arguments to pass through to cv.glmnet (may break things).

Details

The time-consuming double (i.e. nested) cross-validation (CV) is used because single cross-validation - which both tunes hyperparameters and estimates out-of-sample classification performance - will be optimistically biased.
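A minimal runnable sketch of this nested structure, using glmnet directly (illustrative only: dCVnet additionally repeats both loops, stratifies the folds and tunes alpha):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(200 * 10), 200, 10)
y <- rbinom(200, 1, plogis(x[, 1]))
outer_folds <- sample(rep(1:5, length.out = nrow(x)))

outer_perf <- sapply(1:5, function(k) {
  train <- outer_folds != k
  # inner CV on the outer training data only, to select lambda:
  cvfit <- cv.glmnet(x[train, ], y[train], family = "binomial", alpha = 0.5)
  # evaluate on the held-out outer fold (here: Brier score):
  p <- predict(cvfit, x[!train, ], s = "lambda.min", type = "response")
  mean((p - y[!train])^2)
})
mean(outer_perf)  # cross-validated performance estimate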

Cross-validation for both the inner and outer loop is repeated k-fold.

Both alpha and lambda hyperparameters of the elastic-net can be tuned:

  • lambda - the total regularisation penalty

  • alpha - the balance between L1 (LASSO) and L2 (Ridge) regularisation
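Concretely, in glmnet's parameterisation the penalty added to the model's loss is lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))), so alpha = 1 gives pure LASSO and alpha = 0 pure ridge. A small sketch:

enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}
enet_penalty(beta = c(0.5, -0.2, 0), lambda = 0.1, alpha = 0.5)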

Value

a dCVnet object containing:

  • input: call arguments and input data

  • prod: the production model and preprocessing information used in making new predictions.

  • folds: outer-loop CV fold membership

  • performance: Cross-validated performance

  • tuning: Outer loop CV tuning information

Coefficients

Currently all coefficients reported / used by dCVnet are semi-standardised for x, but not y. In other words, the predictor matrix x is mean-centered and scaled by the standard deviation prior to calculations.

The predict method for dCVnet stores these means/SDs, and will apply the same standardisation to x on predicting new values.

As a result the reported coefficients for dCVnet can be interpreted as (semi-)standardised effect sizes: a coefficient of 0.5 is the effect associated with a 1 SD difference in that column of the x matrix. Note this applies to all columns of x, including dummy-coded factors, so coefficients for binary variables interpreted as effect sizes will also reflect the prevalence of the category.

When running cross-validation, standardisation is based on the means and standard deviations of the training dataset, not the held-out test data. This prevents leakage from train to test data.
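A minimal sketch of that train-only standardisation (illustrative, not dCVnet's internal code):

set.seed(1)
x_train <- matrix(rnorm(50 * 3), 50, 3)
x_test  <- matrix(rnorm(20 * 3), 20, 3)

train_means <- colMeans(x_train)
train_sds   <- apply(x_train, 2, sd)
x_train_std <- scale(x_train, center = train_means, scale = train_sds)
x_test_std  <- scale(x_test,  center = train_means, scale = train_sds)  # no leakage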

This approach can be contrasted with glmnet, which internally standardises and then back-transforms coefficients to the original scale; dCVnet coefficients are always presented as (semi-)standardised.

Coefficients in the original scale can be recovered using the standard deviations and means employed in the standardisation (see: https://stats.stackexchange.com/a/75025). These means and standard deviations are retained for the production model in the preprocess slot of the dCVnet object: my_model$prod$preprocess.
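For example, if b_std are the standardised slopes, b0_std the standardised intercept, and m / s the stored means and standard deviations, then the original-scale slopes are b_std / s and the intercept becomes b0_std - sum(b_std * m / s). A hedged sketch (the helper below is not part of dCVnet; check str(my_model$prod$preprocess) for where exactly the means and SDs are stored):

unstandardise_coefs <- function(b_std, b0_std, m, s) {
  b_orig  <- b_std / s
  b0_orig <- b0_std - sum(b_std * m / s)
  c("(Intercept)" = unname(b0_orig), b_orig)
}
unstandardise_coefs(b_std = c(x1 = 0.5), b0_std = -0.2,
                    m = c(x1 = 10), s = c(x1 = 2))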

Factor Outcomes

For categorical families (binomial, multinomial) input can be:

  • numeric (integer): c(0,1,2)

  • factor: factor(1:3, labels = c("A", "B", "C"))

  • character: c("A", "B", "C")

  • other

These are treated differently.

Numeric data is used as provided. Character data will be coerced to a factor: factor(x, levels = sort(unique(x))). Factor data will be used as provided, but must have levels in alphabetical order.

In all cases the reference category must be ordered first; for the binomial family this means the 'positive' category is second.
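For example (an illustrative sketch with hypothetical class labels):

y_chr <- c("case", "control", "control", "case")
y_fct <- factor(y_chr, levels = sort(unique(y_chr)))
levels(y_fct)  # "case" "control": "case" is the reference, "control" the 'positive' class
# one way to make "case" the positive class while keeping alphabetical levels
# is to relabel it so that it sorts second, e.g.:
y_alt <- factor(ifelse(y_chr == "case", "g2_case", "g1_control"))
levels(y_alt)  # "g1_control" "g2_case"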

Why alphabetical? Previously bugs arose due to different handling of factor levels between functions called by dCVnet. These appear to be resolved in the latest versions of the packages, but this restriction will stay until I can verify this.

Notes

Sparse matrices are not supported by dCVnet.
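If your predictors are currently held in a sparse matrix (e.g. from the Matrix package), one workaround, at the cost of memory, is to densify them before constructing the data.frame passed to dCVnet:

library(Matrix)
set.seed(1)
x_sparse <- Matrix::rsparsematrix(10, 3, density = 0.3)
colnames(x_sparse) <- paste0("v", 1:3)
dat <- as.data.frame(as.matrix(x_sparse))  # dense data.frame for dCVnet's data argument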

Examples

## Not run: 

# Iris example: Setosa vs. Virginica
#
# This example is fast to run, but not very informative because it is a
#  simple problem without overfitting and the predictors work 'perfectly'.
# `help(iris)` for more information on the data.

# Make a two-class problem from the iris dataset:
siris <- droplevels(subset(iris, iris$Species != "versicolor"))
# scale the iris predictors:
siris[,1:4] <- scale(siris[,1:4])

set.seed(1) # for reproducibility
model <- dCVnet(y = siris$Species,
                f = ~ Sepal.Length + Sepal.Width +
                      Petal.Length + Petal.Width,
                data = siris,
                alphalist = c(0.2, 0.5, 1.0),
                opt.lambda.type = "1se")

# Note: in most circumstances non-default (larger) values of
#       nrep_inner and nrep_outer will be required.

# Input summary:
dCVnet::parseddata_summary(model)

# Model summary:
summary(model)

# Detailed cross-validated model performance summary:
summary(performance(model))

# hyperparameter tuning plot:
plot(model)
# as above, but zoomed in:
plot(model)$plot + ggplot2::coord_cartesian(ylim = c(0, 0.03), xlim = c(-4, -2))

# Performance ROC plot:
plot(model, type = "roc")

# predictor importance (better with more outer reps):
dCVnet::coefficients_summary(model)
#    show variability over both folds and reps:
dCVnet::plot_outerloop_coefs(model, "all")

# selected hyperparameters:
dCVnet::selected_hyperparameters(model, what = "data")

# Reference logistic regressions (unregularised & univariate):
ref_model <- dCVnet::refunreg(model)

dCVnet::report_reference_performance_summary(ref_model)


## End(Not run)
