dCVnet is an R package to estimate doubly Cross-Validated Elastic-net regularised generalised linear models (glm) with an approachable interface. dCVnet adds nested repeated k-fold cross-validation, imputation for missing data, and assorted convenience functions to the regularised glms fit by the glmnet package.
If you use dCVnet in your research, please cite our paper: Lawrence et al. (2021).
A working installation of R is required. A fully featured interface to R (e.g. RStudio) is recommended.
dCVnet is not (yet) on CRAN, so the remotes package is useful to download and install the package from GitHub:
install.packages("remotes")
The commands below install dCVnet together with any missing package dependencies (see the Imports section of the DESCRIPTION file). Pick whichever of the three install commands suits: the released version, the development branch, or a downloaded source archive. The subsequent commands then run a toy example from the package's main function.
# released version from GitHub:
remotes::install_github("AndrewLawrence/dCVnet", dependencies = TRUE, build_vignettes = TRUE)
# development branch:
remotes::install_github("AndrewLawrence/dCVnet@dev", dependencies = TRUE, build_vignettes = TRUE)
# a downloaded source archive:
remotes::install_local("path/to/dCVnet_1.0.8.tar.gz", dependencies = TRUE, build_vignettes = TRUE)
library(dCVnet)
example(dCVnet, run.dontrun = TRUE)
# to see the usage without running calculations set run.dontrun = FALSE
browseVignettes(package = "dCVnet")
Note: this requires build_vignettes = TRUE to be set at installation.
Please search for your issue in the project Issues section. If that doesn't clear it up, please make a new issue and, if possible, give a reproducible example (see here, or here).
This project is licensed under the GPL (>= 3). See the DESCRIPTION file.
[ see the Presentations folder for slides from a talk on dCVnet given 2021-09-15 ]
The motivating problem behind dCVnet is prediction modelling [1] in data with relatively few observations (n) for the number of predictors (p), especially where there may be uninformative or redundant predictors which the analyst isn't willing, or able, to remove.
In an ideal world we would collect more observations (i.e. increase n), or better understand which predictors to include/exclude or how to best model them (i.e. reduce p), but this can be impractical or impossible.
With few observations and many predictors, several inter-related statistical problems arise. These problems become worse [2] as the ratio p/n grows.
dCVnet uses elastic-net regularisation (from glmnet) to combat these problems. Double cross-validation [3] is applied to tune the regularisation and to validly assess model performance.
A model which is overfit is tuned to the noise in the sample rather than reproducible relationships. As a result it will perform poorly in new (unseen) data. This failure to perform well in new data is termed generalisation (or out-of-sample) error.
Generalisation error can be assessed using properly conducted cross-validation. The model is repeatedly refit in subsets of the data and performance evaluated in the observations which were not used for model fitting. Appropriately cross-validated estimates of model performance are unaffected by the optimism caused by overfitting and reflect the likely performance of the model in unseen data.
There are different forms of cross-validation; in particular, dCVnet implements repeated k-fold cross-validation, sketched below.
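To make repeated k-fold concrete, here is an illustrative sketch of fold assignment (dCVnet constructs folds internally; the numbers here are arbitrary):

# repeated k-fold assignment: each observation gets a fold label (1..k),
# and the random assignment is redrawn for every repetition.
k <- 10; reps <- 5; n <- 100
folds <- replicate(reps, sample(rep_len(1:k, n)))
dim(folds)  # n observations x reps repetitions

Each column defines one complete k-fold split: every observation is held out exactly once per repetition.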
However, cross-validation only tells the analyst whether overfitting is occurring; it is not a means to reduce it. For that purpose we can apply regularisation, which produces more cautious models that are likely to generalise better to new data.
Regularisation adds a cost to the complexity of the model. Unregularised models optimise the fit of the model to the data; regularised models optimise the fit of the model given a budget of allowable complexity. This results in shrinkage of model coefficients towards zero, which makes models more cautious and can substantially improve generalisation to unseen data. dCVnet uses elastic-net regularisation from the glmnet package for R.
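The shrinkage effect can be seen directly with glmnet. The following is an illustrative sketch only (simulated data; the two penalty strengths are arbitrary values chosen for contrast):

# coefficient shrinkage under an L2 (ridge) penalty, on simulated data
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), nrow = 100)            # 100 obs, 10 predictors
y <- rbinom(100, size = 1, prob = plogis(x[, 1]))   # outcome driven by x1
fit <- glmnet(x, y, family = "binomial", alpha = 0) # alpha = 0 : ridge
# coefficients under a light vs. heavy penalty - note the shrinkage:
cbind(light = coef(fit, s = 0.01)[, 1], heavy = coef(fit, s = 1)[, 1])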
The type of regularisation used by dCVnet (Elastic-net Regularisation) is a combination of two types of regularisation with the aim of avoiding their weaknesses and benefiting from their strengths:
Ridge regression (using an L2-penalty) tolerates predictors with perfect collinearity, but every predictor contributes (the solution is not sparse).
LASSO (Least Absolute Shrinkage and Selection Operator) regression uses the L1-penalty. It produces a variable-selection effect by favouring a sparse solution (less important coefficients drop to zero); however, LASSO is unstable when working with correlated predictors.
Adding regularisation to a model introduces "algorithm hyperparameters" - settings which must be tuned/optimised for each problem.
Elastic-net regularisation requires two hyperparameters be specified:
alpha - the balance of the L1- and L2-regularisation penalties (alpha = 0: L2/Ridge only; alpha = 1: L1/LASSO only).
lambda - the penalty factor determining the combined amount of regularisation.
There are no default values for these parameters; suitable values vary depending on the problem and so should be 'tuned'.
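For reference, the penalty term as parameterised in glmnet (and so inherited by dCVnet) can be written:

\lambda \left[ \frac{(1 - \alpha)}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]

so alpha mixes the two penalty types and lambda scales the overall amount of regularisation.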
One way to tune parameters without overfitting is to use cross-validation to select values which perform well in unseen data. This is a form of model selection.
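This is what glmnet's own cv.glmnet function does for lambda. A minimal sketch, continuing with the simulated x and y from the ridge example above (alpha is held fixed here, whereas dCVnet tunes it as well):

# CV-based selection of lambda (alpha fixed at 0.5 for illustration)
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5, nfolds = 10)
cvfit$lambda.min  # lambda with the best cross-validated deviance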
If the cross-validation for hyperparameter selection is combined with the cross-validation to estimate generalisation error, this will add optimism back into our estimates of the generalisation error.
To combat this, the cross-validation for generalisation error must be completely independent of the cross-validation for hyperparameter selection; see Cawley & Talbot (2010; JMLR 11:2079-2107) for a fuller description of the issue. Nesting the hyperparameter tuning achieves this.
Double cross-validation [4] is implemented to allow principled (and independent) selection of the optimal hyperparameters for generalisability, and estimation of out-of-sample performance when hyperparameters are tuned in this way. Double cross-validation is computationally expensive, but ensures hyperparameter tuning is fully separate from performance evaluation. A bare-bones sketch of the procedure is given below.
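The following sketch shows the shape of the double loop. It is illustrative only (not dCVnet's internals): it reuses the simulated data from above, tunes only lambda, runs a single repetition, and scores simple accuracy rather than dCVnet's fuller set of performance metrics:

# double (nested) cross-validation, bare bones:
# the outer loop estimates generalisation error; all tuning happens
# inside the outer-training data via cv.glmnet's inner loop.
k_outer <- 5
outer_fold <- sample(rep_len(1:k_outer, nrow(x)))
outer_acc <- numeric(k_outer)
for (i in 1:k_outer) {
  train <- outer_fold != i
  # inner loop: tune lambda using only the outer-training data
  inner <- cv.glmnet(x[train, ], y[train], family = "binomial", alpha = 0.5)
  # evaluate the tuned model on the untouched outer fold
  p <- predict(inner, newx = x[!train, ], s = "lambda.min", type = "response")
  outer_acc[i] <- mean((p > 0.5) == y[!train])
}
mean(outer_acc)  # cross-validated estimate of out-of-sample accuracy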
Imputation of missing variables should also be nested within cross-validation. As of version 1.3 (September 2024), dCVnet supports some forms of imputation; this is described in the dCVnet-imputation vignette.
A dCVnet object includes a final "production" model which can be applied to new data to make predictions. It is this production model (fit to the complete dataset) that the cross-validated performance metrics apply to. Where coefficients are to be interpreted, the coefficients of interest are usually those of this production model.
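Schematically, this looks like the following. The object and argument names here are placeholders, and the exact interface should be checked against the package help (?dCVnet):

# fit: tunes hyperparameters, cross-validates, and fits the production model
# ('outcome_vector', 'predictor_df' and 'new_df' are hypothetical placeholders)
mod <- dCVnet(y = outcome_vector, data = predictor_df)
# predictions for new data come from the production model
predict(mod, newdata = new_df)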
This package aims to provide an approachable interface for conducting a nested (or double) cross-validation of the elastic-net solution to a two-class prediction problem. The elastic-net calculations (and some inner-loop cross-validation) are performed by the R package glmnet, on which dCVnet depends.
[1] dCVnet can be useful for inference, but this is not its main purpose. The time-consuming outer cross-validation loop is less important for inference, and other software can be used directly.
[2] Where p/n > 1, the standard least-squares regression solutions are not defined and generalised models will have convergence problems. In both cases predictors will have perfect multicollinearity. Where p/n > 0.1, common rules of thumb for sample size are violated.
[3] Double cross-validation is also called nested or nested-loop cross-validation. With less flexible models and enough data, the optimism which nested-CV addresses can be negligible; however, nested cross-validation is particularly important with smaller datasets. Demonstrating internal validity without validation leakage is important for reproducible research.
[4] Other examples of nested CV in R: MLR, TANDEM, nlcv, caret/rsample, nestedcv.