knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In this example we will explore an application of dCVnet to a tutorial dataset provided with the Elements of Statistical Learning (ESL) textbook.

First, get required libraries, load the data and take a look:

shh <- function(...) suppressPackageStartupMessages(suppressWarnings(...))
shh(library(tidyverse))
ggplot2::theme_set(theme_minimal())
library(dCVnet)
data(prostate)
psych::describe(prostate[, -10], fast = TRUE) %>%
  knitr::kable(digits = 2)

For consistency with examples in ESL we wish to standardise the data:

# see ?prostate for more details:
sprostate <- data.frame(scale(prostate[, -10]))
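
As a quick check, the standardised columns should now have means of approximately zero and unit standard deviations:

# sanity check of the standardisation:
round(colMeans(sprostate), 3)
round(apply(sprostate, 2, sd), 3)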

Note that we will ignore the train-test split as this vignette will demonstrate cross-validation.
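
For reference, the split we are discarding is held in the 10th column dropped above; assuming (per ?prostate) that this is the usual ESL train/test indicator, its composition can be inspected with:

# the ESL train/test indicator we are ignoring (see ?prostate):
table(prostate[, 10])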

Next, because the baseline unregularised model is not badly overfit in these data, we make the problem harder by adding noise predictors unrelated to the outcome:

set.seed(1)
noise <- matrix(rnorm(nrow(sprostate) * 10), nrow = nrow(sprostate))
colnames(noise) <- paste0("noise", 1:10)
sprostate <- cbind(sprostate,
                   noise)

# By construction, the noise variables are unrelated to the outcome; their sample correlations with lpsa are small:
cor(cbind(sprostate$lpsa, noise))[-1, 1]

Before cross-validating our estimates of prediction model performance, let's first build some prediction models without cross-validation:

# An ordinary linear regression
m1 <- lm(lpsa ~ ., data = sprostate, y = TRUE, x = TRUE)
# A LASSO regularised regression
m2 <- glmnet::cv.glmnet(y = sprostate$lpsa, x = as.matrix(sprostate[, -9]))

The resulting coefficients look like this:

data.frame(broom::tidy(m1)[, 1:2] %>% setNames(c("Term", "lm")),
           `glmnet\nlambda.min` =
             c(tidy_coef.glmnet(m2, s = "lambda.min")$Coef),
           `glmnet\nlambda.1se` =
             c(tidy_coef.glmnet(m2, s = "lambda.1se")$Coef)) %>%
  knitr::kable(digits = 3)
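
For reference, the two penalty values used in the glmnet columns can be read directly from the fitted cv.glmnet object:

# lambda values selected by cross-validation within cv.glmnet:
c(lambda.min = m2$lambda.min, lambda.1se = m2$lambda.1se)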

The performance of these models (in the data used to train them) is:

predictdf <- data.frame(
  observed = sprostate$lpsa,
  lm = predict(m1),
  glmnet.lambda.min = c(predict(
    m2,
    s = "lambda.min",
    newx = as.matrix(sprostate[, -9])
  )),
  glmnet.lambda.1se = c(predict(
    m2,
    s = "lambda.1se",
    newx = as.matrix(sprostate[, -9])
  ))
)

results_nonCV <- sapply(predictdf[, -1], caret::postResample, predictdf[, 1])

knitr::kable(results_nonCV, digits = 3)
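
caret::postResample returns RMSE, R-squared and MAE for numeric predictions; as I understand its definitions, the same numbers for the lm column can be reproduced by hand:

# reproduce postResample's metrics for the lm predictions by hand:
err <- predictdf$lm - predictdf$observed
c(RMSE = sqrt(mean(err^2)),
  Rsquared = cor(predictdf$lm, predictdf$observed)^2,
  MAE = mean(abs(err)))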

Plotting out the predictions against the true values gives:

predictdf %>%
  pivot_longer(cols = c("lm", "glmnet.lambda.min", "glmnet.lambda.1se"),
               values_to = "Prediction",
               names_to = "Model") %>%
  ggplot(aes(y = observed, x = Prediction)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  coord_equal() +
  facet_grid(cols = vars(Model), as.table = FALSE)

However, these performance results are not cross-validated. We can use dCVnet to produce cross-validated performance estimates and improve the tuning of the glmnet models by:

1) repeating the k-fold CV to stabilise tuning parameter selection
2) tuning over alpha (the elastic-net hyperparameter governing the balance between L1 and L2 penalties)

First, though, we shall assess the scale of the overfitting problem by producing cross-validated estimates of lm performance:

# set folds:
#   there are 97 observations
#   so make 7 folds with 10, and 3 folds with 9.
folds <- rep(1:10, times = rep(c(10, 9), times = c(7, 3)))
table(folds)

# randomise order:
folds <- sample(folds)

cv_lm_results <- map_dfr(
  # For folds 1:10:
  1:10,
  function(i) {
    # split data:
    d_train <- sprostate[folds != i, ]
    d_test <- sprostate[folds == i, ]
    # train model:
    m <- lm(formula = formula(m1),
            data = d_train)
    # predict in held out data:
    p <- predict(m, newdata = d_test)
    # calculate performance measures and return:
    return(as.list(caret::postResample(pred = p, obs = d_test$lpsa)))
  }
)

# Average over 10-folds:
colMeans(cv_lm_results)

So there is a performance gap for the unregularised lm model (e.g. in MAE): the complete-data performance (r signif(results_nonCV["MAE","lm"],3)) appears better than the cross-validated performance (r signif(colMeans(cv_lm_results)["MAE"],3)).
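
The size of this optimism gap can be computed directly from the objects above:

# cross-validated minus apparent (complete-data) MAE for the lm model:
colMeans(cv_lm_results)["MAE"] - results_nonCV["MAE", "lm"]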

We can run the equivalent model in dCVnet:

m3 <- dCVnet(y = sprostate$lpsa,
             data = sprostate[, -9],
             f = "~.",
             family = "gaussian",
             alphalist = 1.0,
             k_inner = 10,
             k_outer = 10,
             nrep_inner = 3,
             nrep_outer = 3)
report_performance_summary(m3)[c(1:2, 4), c(1:3, 7:9)] %>%
  knitr::kable(digits = 3)
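
The call above fixed alphalist = 1.0, i.e. a pure LASSO penalty. To additionally tune the elastic-net alpha (point 2 of the list above), a vector of candidate values can be supplied; a sketch, where the particular alpha grid is an arbitrary illustrative choice:

# as above, but tuning alpha as well as lambda
#   (the alpha grid here is an arbitrary illustrative choice):
m4 <- dCVnet(y = sprostate$lpsa,
             data = sprostate[, -9],
             f = "~.",
             family = "gaussian",
             alphalist = c(0.2, 0.5, 1.0),
             k_inner = 10,
             k_outer = 10,
             nrep_inner = 3,
             nrep_outer = 3)
report_performance_summary(m4)[c(1:2, 4), c(1:3, 7:9)] %>%
  knitr::kable(digits = 3)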

Taken together we can see that the cross-validated MAE of the LASSO (r signif(report_performance_summary(m3)[2,2],3)) is lower (better) than the cross-validated MAE of the unregularised model (r signif(colMeans(cv_lm_results)["MAE"],3)). In contrast, a comparison of prediction performance without cross-validation would have concluded that the LASSO performed worse (MAE = r signif(caret::postResample(predictdf$glmnet.lambda.min, predictdf$observed)[["MAE"]],3)) than the unregularised model (MAE = r signif(caret::postResample(predictdf$lm, predictdf$observed)[["MAE"]],3)).
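
Pulling these numbers together (using the same indexing as above, where row 2 / column 2 of the dCVnet performance summary is assumed to hold the mean MAE):

# side-by-side apparent vs cross-validated MAE
#   (report_performance_summary(m3)[2, 2] assumed to be the mean MAE, as above):
data.frame(
  Model = c("lm", "LASSO"),
  apparent_MAE = c(results_nonCV["MAE", "lm"],
                   results_nonCV["MAE", "glmnet.lambda.min"]),
  CV_MAE = c(colMeans(cv_lm_results)[["MAE"]],
             report_performance_summary(m3)[2, 2])
) %>%
  knitr::kable(digits = 3)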

In this example the difference would not be dramatic without the added noise predictors (without them there is not much overfitting in the unregularised model). Nevertheless, the principle that tuned regularisation can produce lower generalisation error has been demonstrated.


