```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
In this example we will explore an application of dCVnet to a tutorial dataset provided with the Elements of Statistical Learning (ESL) textbook: the prostate data.
First, get required libraries, load the data and take a look:
```r
shh <- function(...) suppressPackageStartupMessages(suppressWarnings(...))
shh(library(tidyverse))
ggplot2::theme_set(theme_minimal())
library(dCVnet)

data(prostate)

psych::describe(prostate[, -10], fast = TRUE) %>%
  knitr::kable(digits = 2)
```
For consistency with examples in ESL we wish to standardise the data:
```r
# see ?prostate for more details:
sprostate <- data.frame(scale(prostate[, -10]))
```
Note that we will ignore the train-test split as this vignette will demonstrate cross-validation.
Next, noting that the base unregularised model is not very overfit, we can make the problem harder by adding noise predictors unrelated to the outcome:
```r
set.seed(1)
noise <- matrix(rnorm(nrow(sprostate) * 10), nrow = nrow(sprostate))
colnames(noise) <- paste0("noise", 1:10)

sprostate <- cbind(sprostate, noise)

# By construction, the noise variables are not correlated with the outcome:
cor(cbind(sprostate$lpsa, noise))[-1, 1]
```
Before cross-validating our estimates of prediction model performance, let's build some prediction models without cross-validation:
```r
# An ordinary linear regression:
m1 <- lm(lpsa ~ ., data = sprostate, y = TRUE, x = TRUE)
```
```r
# A LASSO-regularised regression:
m2 <- glmnet::cv.glmnet(y = sprostate$lpsa, x = as.matrix(sprostate[, -9]))
```
The resulting coefficients look like this:
```r
data.frame(
  broom::tidy(m1)[, 1:2] %>% setNames(c("Term", "lm")),
  `glmnet\nlambda.min` = c(tidy_coef.glmnet(m2, s = "lambda.min")$Coef),
  `glmnet\nlambda.1se` = c(tidy_coef.glmnet(m2, s = "lambda.1se")$Coef)
) %>%
  knitr::kable(digits = 3)
```
The performance of these models (in the data used to train them) is:
```r
predictdf <- data.frame(
  observed = sprostate$lpsa,
  lm = predict(m1),
  glmnet.lambda.min = c(predict(m2, s = "lambda.min",
                                newx = as.matrix(sprostate[, -9]))),
  glmnet.lambda.1se = c(predict(m2, s = "lambda.1se",
                                newx = as.matrix(sprostate[, -9])))
)

results_nonCV <- sapply(predictdf[, -1], caret::postResample, predictdf[, 1])

knitr::kable(results_nonCV, digits = 3)
```
Plotting the predictions against the observed values gives:
```r
predictdf %>%
  pivot_longer(cols = c("lm", "glmnet.lambda.min", "glmnet.lambda.1se"),
               values_to = "Prediction",
               names_to = "Model") %>%
  ggplot(aes(y = observed, x = Prediction)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  coord_equal() +
  facet_grid(cols = vars(Model), as.table = FALSE)
```
However, these performance results are not cross-validated. We can use dCVnet to produce cross-validated performance results and to improve the tuning of the glmnet models by:

1) repeating the k-fold CV to stabilise tuning parameter selection
2) tuning over alpha (the elastic-net hyperparameter controlling the balance between L1 and L2 penalties), sketched below
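For context on point 2, a dCVnet call that also tunes over alpha might look like the following sketch. It is not run in this vignette: the candidate alpha values are arbitrary, and we are assuming that `alphalist` accepts a vector of values to tune over.

```r
# Sketch only (not run): tune over alpha as well as lambda.
# Assumes alphalist accepts a vector of candidate alpha values;
# the particular values below are arbitrary.
m_enet <- dCVnet(y = sprostate$lpsa,
                 data = sprostate[, -9],
                 f = "~.",
                 family = "gaussian",
                 alphalist = c(0.2, 0.5, 0.8, 1.0),
                 k_inner = 10, k_outer = 10,
                 nrep_inner = 3, nrep_outer = 3)
```

The model fitted later in this vignette keeps `alphalist = 1.0` (a pure LASSO) so that the results remain directly comparable with the `cv.glmnet` model above.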
First though, we shall assess the scale of the overfitting problem by producing cross-validated estimates of lm performance:
```r
# set folds:
#   there are 97 observations,
#   so make 7 folds with 10 observations and 3 folds with 9.
folds <- rep(1:10, times = rep(c(10, 9), times = c(7, 3)))
table(folds)
# randomise order:
folds <- sample(folds)

cv_lm_results <- map_dfr(
  # For folds 1:10:
  1:10,
  function(i) {
    # split data:
    d_train <- sprostate[folds != i, ]
    d_test <- sprostate[folds == i, ]
    # train model:
    m <- lm(formula = formula(m1), data = d_train)
    # predict in held-out data:
    p <- predict(m, newdata = d_test)
    # calculate performance measures and return:
    return(as.list(caret::postResample(pred = p, obs = d_test$lpsa)))
  }
)

# Average over the 10 folds:
colMeans(cv_lm_results)
```
So there is a performance gap for the unregularised lm model (e.g. in MAE) such that the complete-data performance (`r signif(results_nonCV["MAE", "lm"], 3)`) appears better than the cross-validated performance (`r signif(colMeans(cv_lm_results)["MAE"], 3)`).
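The size of this gap can be made explicit using the objects created in the chunks above:

```r
# Difference between apparent (whole-sample) and cross-validated MAE
# for the unregularised lm:
mae_apparent <- results_nonCV["MAE", "lm"]
mae_crossval <- colMeans(cv_lm_results)[["MAE"]]

c(apparent = mae_apparent,
  cross_validated = mae_crossval,
  gap = mae_crossval - mae_apparent)
```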
We can run the equivalent model in dCVnet:
```r
m3 <- dCVnet(y = sprostate$lpsa,
             data = sprostate[, -9],
             f = "~.",
             family = "gaussian",
             alphalist = 1.0,
             k_inner = 10, k_outer = 10,
             nrep_inner = 3, nrep_outer = 3)
```
```r
report_performance_summary(m3)[c(1:2, 4), c(1:3, 7:9)] %>%
  knitr::kable(digits = 3)
```
Taken together we can see that the cross-validated MAE of the LASSO (`r signif(report_performance_summary(m3)[2, 2], 3)`) is lower (better) than the cross-validated MAE of the unregularised model (`r signif(colMeans(cv_lm_results)["MAE"], 3)`). In contrast, a comparison of the prediction performances without cross-validation would have concluded that the LASSO was less performant (MAE = `r signif(caret::postResample(predictdf$glmnet.lambda.min, predictdf$observed)[["MAE"]], 3)`) than the unregularised model (MAE = `r signif(caret::postResample(predictdf$lm, predictdf$observed)[["MAE"]], 3)`).
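The four MAE figures being compared can also be collected into a single named vector from objects created earlier. Note this assumes, as in the inline code above, that row 2, column 2 of `report_performance_summary(m3)` holds the cross-validated MAE of the LASSO.

```r
# Apparent (whole-sample) versus cross-validated MAE for both models.
# The [2, 2] indexing of report_performance_summary() follows the
# inline code above (cross-validated MAE of the LASSO).
c(lm.apparent    = results_nonCV["MAE", "lm"],
  lm.crossval    = mean(cv_lm_results$MAE),
  lasso.apparent = results_nonCV["MAE", "glmnet.lambda.min"],
  lasso.crossval = unname(unlist(report_performance_summary(m3)[2, 2])))
```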
In this example the difference is not dramatic, and it would be smaller still without the added noise predictors (without the noise there is not much overfitting in the unregularised model). Nevertheless, the principle that tuned regularisation can reduce generalisation error has been demonstrated.