dCVnet | R Documentation
Fits and cross-validates an elastic-net regularised regression model, using nested ("double") cross-validation to select optimal alpha and lambda hyperparameters for the regularisation.
dCVnet(
y,
data,
f = "~.",
family = "binomial",
offset = NULL,
nrep_outer = 2,
k_outer = 10,
nrep_inner = 5,
k_inner = 10,
alphalist = c(0.2, 0.5, 0.8),
nlambda = 100,
type.measure = "deviance",
opt.lambda.type = c("min", "1se"),
opt.empirical_cutoff = FALSE,
opt.uniquefolds = FALSE,
opt.ystratify = TRUE,
opt.random_seed = NULL,
opt.use_imputation = FALSE,
opt.imputation_usey = FALSE,
opt.imputation_method = c("mean", "knn", "missForestPredict"),
...
)
y
the outcome: a numeric vector, a factor (for binomial / multinomial), or a matrix (for cox / mgaussian). For factors see the Factor Outcomes section below.

data
a data.frame containing the variables needed for the formula (f).

f
a one-sided formula. The RHS must refer to columns in data.

family
the model family (see glmnet).

offset
optional model offset (see glmnet).

nrep_outer
an integer, the number of repetitions of the outer k-fold CV.

k_outer
an integer, the k in the outer k-fold CV.

nrep_inner
an integer, the number of repetitions of the inner k-fold CV.

k_inner
an integer, the k in the inner k-fold CV.

alphalist
a numeric vector of values in (0,1]. This sets the search space for optimising the alpha hyperparameter.

nlambda
an integer, the number of gradations between lambda.min and lambda.max to search (see glmnet).

type.measure
the loss to use for inner-loop cross-validation; passed to cv.glmnet.

opt.lambda.type
method for selecting the optimum lambda: one of "min" or "1se" (see cv.glmnet).

opt.empirical_cutoff
Boolean. Use the empirical proportion of cases as the cutoff for outer-CV classification (affects outer-CV performance only). Otherwise classify at 50% probability.

opt.uniquefolds
Boolean. In most circumstances folds will be unique; this option checks the random folds in the inner and outer loops for uniqueness. Currently it warns if non-unique folds are found.

opt.ystratify
Boolean. Stratify the outer and inner fold sampling by outcome.

opt.random_seed
interpreted as an integer; used to control the generation of random folds.

opt.use_imputation
Boolean. Run imputation on missing predictors?

opt.imputation_usey
Boolean. Should conditional imputation methods use y in the imputation model? Has no effect for unconditional methods (e.g. opt.imputation_method = "mean").

opt.imputation_method
which imputation method: one of "mean", "knn" or "missForestPredict".

...
arguments passed through to cv.glmnet (may break things).
The time-consuming double (i.e. nested) cross-validation (CV) is used because single cross-validation - which both tunes hyperparameters and estimates out-of-sample performance - is optimistically biased. Both the inner and outer loops use repeated k-fold cross-validation.
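The fold structure can be sketched as follows. This is illustrative only; the variable names and fold-assignment details are assumptions, not dCVnet internals:

```r
# Illustrative double-CV skeleton: an inner CV tunes hyperparameters on
# each outer training set; the outer fold is used only for assessment.
set.seed(1)
n <- 20; k_outer <- 5; k_inner <- 4
outer_folds <- sample(rep(seq_len(k_outer), length.out = n))
for (k in seq_len(k_outer)) {
  train <- which(outer_folds != k)   # outer training observations
  inner_folds <- sample(rep(seq_len(k_inner), length.out = length(train)))
  # ...tune alpha/lambda by inner CV on `train`, then assess the tuned
  # model once on the held-out outer fold: which(outer_folds == k)
}
```

Because tuning only ever sees the outer training data, the outer-fold performance estimate is unbiased by hyperparameter selection.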
Both the alpha and lambda hyperparameters of the elastic-net are tuned:

lambda - the total regularisation penalty

alpha - the balance of L1 (LASSO) and L2 (ridge) regularisation
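For reference, the penalty combining these two terms (in the standard glmnet parameterisation) can be written out directly; this helper is purely illustrative:

```r
# Elastic-net penalty: lambda scales the whole term, alpha mixes
# L1 (LASSO) and L2 (ridge) components:
#   lambda * ( alpha * sum(|beta|) + (1 - alpha)/2 * sum(beta^2) )
enet_penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) / 2 * sum(beta^2))
}
enet_penalty(c(0.5, -0.2), lambda = 0.1, alpha = 1)  # pure LASSO: 0.07
enet_penalty(c(0.5, -0.2), lambda = 0.1, alpha = 0)  # pure ridge: 0.0145
```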
a dCVnet object containing:

input: call arguments and input data

prod: the production model and the preprocessing information used in making new predictions

folds: outer-loop CV fold membership

performance: cross-validated performance

tuning: outer-loop CV tuning information
Currently all coefficients reported / used by dCVnet are semi-standardised for x, but not y. In other words, the predictor matrix x is mean-centered and scaled by the standard deviation prior to calculations.
The predict method for dCVnet stores these means/SDs, and will apply the same standardisation to x on predicting new values.
As a result, the reported coefficients for dCVnet can be interpreted as (semi-)standardised effect sizes: a coefficient of 0.5 is the effect of a 1 SD difference in that element of x. Note this holds for all elements of x, even factors, so if coefficients for binary variables are interpreted as effect sizes they will include the impact of prevalence.
When running cross-validation, standardisation is based on the means and standard deviations of the training dataset, not the held-out test data. This prevents leakage from train to test data.
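A minimal sketch of this train-only standardisation (the variable names here are illustrative, not dCVnet internals):

```r
# Standardise using training-set statistics only, then apply the same
# centre/scale to held-out data, preventing train-to-test leakage:
set.seed(1)
x <- matrix(rnorm(100 * 3), ncol = 3)
train <- 1:80; test <- 81:100
mu    <- colMeans(x[train, ])       # training means only
sigma <- apply(x[train, ], 2, sd)   # training SDs only
x_train <- scale(x[train, ], center = mu, scale = sigma)
x_test  <- scale(x[test, ],  center = mu, scale = sigma)  # reuse mu/sigma
```

The held-out columns of `x_test` will not have exactly zero mean or unit SD, which is the point: no information from the test data enters the preprocessing.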
This approach can be contrasted with glmnet which internally standardises, and then back-transforms, coefficients to the original scale. In contrast dCVnet model coefficients are always presented as (semi-)standardised.
Coefficients in the original scale can be recovered using the standard deviations and means employed in the standardisation (see: https://stats.stackexchange.com/a/75025). These means and standard deviations are retained for the production model in the preprocess slot of the dCVnet object: my_model$prod$preprocess.
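As a sketch of that back-transformation (the numbers and names below are made up for illustration):

```r
# Semi-standardised slopes b_std with intercept b0_std, plus the training
# means (mu) and SDs (sigma) used in the standardisation:
mu     <- c(x1 = 5.8, x2 = 3.0)
sigma  <- c(x1 = 0.8, x2 = 0.4)
b_std  <- c(x1 = 0.5, x2 = -0.2)
b0_std <- 1.0
# Original-scale slopes divide by sigma; the intercept absorbs the centring:
b_orig  <- b_std / sigma
b0_orig <- b0_std - sum(b_std * mu / sigma)
```

This follows from expanding the linear predictor: b0_std + sum(b_std * (x - mu) / sigma) equals b0_orig + sum(b_orig * x).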
For categorical families (binomial, multinomial) the input can be:

numeric (integer): c(0, 1, 2)

factor: factor(1:3, labels = c("A", "B", "C"))

character: c("A", "B", "C")

other

These are treated differently. Numeric data is used as provided. Character data is coerced to a factor: factor(x, levels = sort(unique(x))). Factor data is used as provided, but must have its levels in alphabetical order.

In all cases the reference category must be ordered first; for the binomial family this means the 'positive' category is second.
Why alphabetical? Previously, bugs arose from inconsistent handling of factor levels between the functions called by dCVnet. These appear to be resolved in the latest versions of those packages, but the restriction will stay until I can verify this.
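The coercion rule above can be checked directly (a minimal sketch):

```r
# Character outcomes are coerced with alphabetically sorted levels:
y_chr <- c("B", "A", "B", "A")
y_fac <- factor(y_chr, levels = sort(unique(y_chr)))
levels(y_fac)  # "A" "B": reference "A" is first, so "B" is the positive class
```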
Sparse matrices are not supported by dCVnet.
## Not run:
# Iris example: Setosa vs. Virginica
#
# This example is fast to run, but not very informative because it is a
# simple problem without overfitting and the predictors work 'perfectly'.
# `help(iris)` for more information on the data.
# Make a two-class problem from the iris dataset:
siris <- droplevels(subset(iris, iris$Species != "versicolor"))
# scale the iris predictors:
siris[,1:4] <- scale(siris[,1:4])
set.seed(1) # for reproducibility
model <- dCVnet(y = siris$Species,
f = ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width,
data = siris,
alphalist = c(0.2, 0.5, 1.0),
opt.lambda.type = "1se")
# Note: in most circumstances non-default (larger) values of
# nrep_inner and nrep_outer will be required.
# Input summary:
dCVnet::parseddata_summary(model)
# Model summary:
summary(model)
# Detailed cross-validated model performance summary:
summary(performance(model))
# hyperparameter tuning plot:
plot(model)
# as above, but zoomed in:
plot(model)$plot + ggplot2::coord_cartesian(ylim = c(0, 0.03), xlim = c(-4, -2))
# Performance ROC plot:
plot(model, type = "roc")
# predictor importance (better with more outer reps):
dCVnet::coefficients_summary(model)
# show variability over both folds and reps:
dCVnet::plot_outerloop_coefs(model, "all")
# selected hyperparameters:
dCVnet::selected_hyperparameters(model, what = "data")
# Reference logistic regressions (unregularised & univariate):
ref_model <- dCVnet::refunreg(model)
dCVnet::report_reference_performance_summary(ref_model)
## End(Not run)