knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This document will compare elastic-net regularised binary logistic regression (as used by dCVnet) with unregularised binary logistic regression (fit with glm).

Practical Notes / Disclaimers

  1. For practical reasons (computation time), this vignette uses a specially selected seed and a smaller number of CV repetitions than would typically be sensible. The seed was selected because it generated random CV folds that gave results representative of a much larger number of CV repetitions.

  2. This vignette is not a recipe for using the dCVnet package. The complexities of data formatting, modelling decisions, picking the elastic-net hyperparameters to tune over, deciding the nested repeated k-fold parameters and checking the variability of CV results are all omitted, being beyond the scope of this vignette.

  3. For simplicity this vignette uses classification accuracy as a performance measure throughout. This is extracted from row 5 of a summary(performance(x)) table.

A Simulated Prediction Problem

To illustrate the benefits of elastic-net regularisation for generalisation error we will use a simulated dataset for a binary classification problem in moderately high dimension: 50 observations of 40 predictors for a binary outcome. This will be referred to as the training data set.

Predictors are all normally distributed, but come in two types: predictors 1-10 are used to construct the binary outcome (with added error), while predictors 11-40 are unrelated to the outcome labels.
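For orientation, a purely hypothetical sketch of how data with this structure could be generated is shown below. The actual blrsim data (and its true generating model) ship with the package; the weights and noise level used here are invented for illustration:

# hypothetical sketch only - NOT how blrsim was actually generated:
#   50 observations of 40 standard-normal predictors, where only the
#   first 10 predictors contribute to a noisy binary outcome.
set.seed(1)
n <- 50; p <- 40; p_signal <- 10
sim_x <- matrix(rnorm(n * p), nrow = n,
                dimnames = list(NULL, paste0("V", seq_len(p))))
sim_lp <- sim_x[, seq_len(p_signal)] %*% rep(1, p_signal) + rnorm(n, sd = 2)
sim_y <- as.integer(sim_lp > 0)
table(sim_y)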

With a simulated dataset we know exactly how the data were generated and consequently what the ideal model should be. This gives us a 'Bayes rate': the performance of an ideal / true classifier. Further, with simulation we can generate unlimited data and use a large hold-out dataset (also termed a test set) for comparison with cross-validated estimates derived within the training sample only.

Unfortunately, real data doesn't come with a Bayes rate and there is often not enough data for a hold-out sample, so cross-validation is the best we can do.

The simulated data is provided with the dCVnet package and can be loaded with `data("blrsim")`:

require(dCVnet, quietly = TRUE)

data("blrsim")

# View the help file for more information:
# ?blrsim
lapply(blrsim, class)


vapply(blrsim, function(x) {
  list(nvars = NCOL(x),
       nobs = NROW(x))
},
FUN.VALUE = list(1L, 1L))

# we have training datasets: x and y, test datasets: x.test and y.test,
#   and the recorded classification accuracy of an ideal classifier (bayes.rate)

# what is the ideal classification accuracy?
print(blrsim$bayes.rate)

Unregularised Binary Logistic Regression

Ignoring the inadequate sample size (40 predictors and 50 observations), a standard approach to produce a prediction model for data of this type would be binomial logistic regression.

We can do this in R using the glm function:

m1 <- glm(y ~ ., data = data.frame(y = blrsim$y, blrsim$x), family = "binomial")

# first 5 model coefficients:
head(coef(summary(m1)))

# we are interested in classification accuracy, dCVnet has a function to
#   extract these sorts of measures from a glm
m1.train <- summary(dCVnet::performance(m1))

#   the accuracy of this model in the data it was fit to is:
print(m1.train[5, 2])

There are a few problems here! First, we get a warning from glm about the model fit (because the model perfectly separates the two classes), but a greater concern is that the coefficients and their standard errors are inflated. In short, the glm is over-fit to the data.
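An informal check of this over-fitting (not part of the original workflow, but using only standard glm accessors) is to look at the fitted probabilities and the coefficient magnitudes:

# the fitted probabilities are pushed to (essentially) 0 and 1,
#   i.e. the training data are perfectly separated:
range(fitted(m1))

# and the coefficient magnitudes are implausibly large:
summary(abs(coef(m1)))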

A model that is overfit will perform poorly in unseen data. Let's check how well the model performs on the independent test dataset:

m1.test <- summary(
  dCVnet::performance(m1,
                      newdata = data.frame(y = blrsim$y.test,
                                           blrsim$x.test))
)
# Accuracy:
print(m1.test[5, 2])

As expected, the accuracy in the unseen test data is pretty bad. It is better than chance accuracy (0.50), but `r round(m1.test[5, 2], 2)` is worse than the optimal accuracy of `r round(blrsim$bayes.rate, 2)`.

In many circumstances we won't have extra data to act as a hold-out sample, so cross-validation is a good way to estimate how well a model will perform in new/unseen data.

The dCVnet package uses repeated k-fold cross-validation, and there is a convenience function which applies the same scheme to an unregularised glm and extracts the performance measures.

Because there are very few observations we will use 5-fold cross-validation rather than 10- or 20-fold. We will repeat this 10 times (although see Note 1).
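Before running the convenience function, here is a minimal sketch of what a single repetition of 5-fold cross-validation involves. This is not the internal code of dCVnet::cv_performance_glm (which also handles the repetitions, fold generation and the full set of performance measures); it is only to make the idea concrete:

# sketch only: one repetition of 5-fold CV for the glm, done by hand
#   (expect the same glm warnings as above for each fold's model)
dat <- data.frame(y = blrsim$y, blrsim$x)
folds <- sample(rep(1:5, length.out = nrow(dat)))
fold_acc <- sapply(1:5, function(i) {
  fit <- glm(y ~ ., data = dat[folds != i, ], family = "binomial")
  prob <- predict(fit, newdata = dat[folds == i, ], type = "response")
  pred_pos <- prob > 0.5
  obs_pos <- dat$y[folds == i] == levels(factor(dat$y))[2]
  mean(pred_pos == obs_pos)
})
# averaging over the k folds gives one repetition; repeating with fresh
#   random folds and averaging again gives repeated k-fold CV:
mean(fold_acc)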

set.seed(47)
# Run a 10x-repeated 5-fold cross-validation:
# (technically we are refitting m1 from scratch here)
m1.cv <- dCVnet::cv_performance_glm(y = blrsim$y,
                                    data = blrsim$x,
                                    f = "~.",
                                    k = 5,
                                    nrep = 10)

# print the mean accuracy over the 10 repetitions:
round(m1.cv$cv.performance[5, 2, drop = FALSE], 3)

The cross-validated accuracy (`r round(m1.cv$cv.performance[5, 2], 2)`) is even worse than the result in the hold-out data (`r round(m1.test[5, 2], 2)`). Some downward bias is expected because, in k-fold CV, each model is fit to exactly one fold less data than the model evaluated against the independent hold-out. With less data to work with the model is worse. The extent of the downward bias depends on the steepness of the learning curve for the problem at that sample size, and on the selected value of k in k-fold CV (see Chapter 7 of Elements of Statistical Learning).
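To make the 'one fold less data' point concrete (a trivial calculation based on the sample size and k used above):

# with n = 50 and k = 5, each CV model is trained on n - n/k observations:
n_obs <- NROW(blrsim$x)
k <- 5
n_obs - n_obs / k  # 40 observations per CV model, versus all 50 for the
                   #   model compared against the independent hold-out data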

Regularised Binary Logistic Regression

So, the standard model described above overfits the data; what can be done? Without regularisation, a researcher might either apply for more funding to collect more data, or try to simplify the model "by hand" by considering which predictors to remove (ideally on the basis of prior information so as not to bias results). In contrast, regularised models perform variable selection in a continuous and data-driven manner (because removing a variable from a model is the same as regularising its coefficient to zero).
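To see this continuous, data-driven selection in action we can fit a penalised coefficient path directly with glmnet (the package dCVnet builds on). This is an illustrative aside rather than a step in the dCVnet workflow:

# an elastic-net coefficient path (alpha = 0.1) over decreasing penalty strength:
gpath <- glmnet::glmnet(as.matrix(blrsim$x), blrsim$y,
                        family = "binomial", alpha = 0.1)

# gpath$df is the number of non-zero coefficients at each penalty (lambda) value:
#   heavy penalisation retains few (or no) predictors...
head(gpath$df)
#   ...while light penalisation retains most of them:
tail(gpath$df)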

In this example we will use elastic-net regularisation with a single, mostly ridge-like, alpha (0.1). In real data we might consider multiple alphas, and there would be a preliminary stage where we consider which ones to include. As with the glm we will set k = 5 for the k-fold cross-validation (although, to save time, fewer repetitions are used than would typically be sensible; see Note 1 above).
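For reference, the penalty applied by glmnet (and therefore by dCVnet) is

$$\lambda \left[ \frac{(1 - \alpha)}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]$$

so with alpha = 0.1 only a small share of the penalty comes from the lasso ($\ell_1$) term, which is why the model is described as 'mostly ridge-like'. The overall penalty strength $\lambda$ is the hyperparameter tuned in the inner cross-validation loop.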

set.seed(47)
# fit the model and run cross-validation in one step:
m2 <- dCVnet::dCVnet(y = blrsim$y,
                     data = blrsim$x,
                     k_outer = 5,
                     k_inner = 5,
                     nrep_outer = 10,
                     nrep_inner = 5,
                     alphalist = 0.1)

# What was the performance of this model in the data it was fit to?
m2.train <- summary(performance(m2$prod$performance))

# specifically the accuracy?:
print(m2.train[5, 2])

So a regularised model will still overfit the data used to train it (the accuracy in the training set is `r m2.train[5, 2]`, which is greater than the Bayes rate of `r round(blrsim$bayes.rate, 2)`). However, the key test comes next: whether regularisation helps the model predict unseen data, either in the test set or through cross-validation.

# Test-set accuracy:
#   (manually assemble a dCVnet "performance" object from the test-set
#    predictions so that summary() returns the same table of measures as above)
m2.test <- summary(
  structure(
    data.frame(
      prediction = predict(m2, newx = blrsim$x.test, type = "response")[, 1],
      classification = as.factor(predict(m2,
                                         newx = blrsim$x.test,
                                         type = "class")[, 1]),
      reference = as.factor(blrsim$y.test),
      label = "m2.test",
      stringsAsFactors = FALSE
    ),
    family = "binomial",
    class = c("performance", "data.frame")
  )
)

# print the test set accuracy:
print(m2.test[5, 2])

# We cross-validated the results when we produced m2:
m2.cv <- report_performance_summary(m2)

# print the accuracy:
print(m2.cv[m2.cv$Measure == "Accuracy", 2])

To get an overview, we can bring all the performance numbers together:

# merge together results:
results <- c(
  bayes.rate = blrsim$bayes.rate,
  #
  glm.train = m1.train[5, 2],
  glm.test = m1.test[5, 2],
  glm.cv = m1.cv$cv.performance[5, 2],
  #
  enet.train = m2.train[5, 2],
  enet.test = m2.test[5, 2],
  enet.cv = m2.cv[5, 2]
)

# print accuracy:
print(results, digits = 2)

# subtract the bayes rate accuracy:
print(results - results[1], digits = 2)

So far, this vignette has shown:

  1. An unregularised glm badly overfits this training data and performs poorly in independent test data.
  2. A regularised (elastic-net) model generalises better to the independent test data.
  3. Cross-validation provides an estimate of generalisation performance without needing a hold-out sample.

Random Labels

We have shown that regularisation with elastic net can produce models which generalise better, but is the procedure robust? What happens when it is asked to predict noise/nonsense?

In the next section (note: output suppressed for brevity) the class labels of the training data will be shuffled to break the relationship with the predictors. Ideally it should no longer be possible to predict the outcome at greater than chance accuracy. The Bayes rate of this shuffled problem is that of a trivial majority-class classifier (i.e. 0.5 in this balanced prediction problem).

# shuffle group membership:
set.seed(48)
ry <- sample(blrsim$y)

# glm with shuffled labels:
rm1 <- glm(y ~ ., data = data.frame(y = ry, blrsim$x), family = "binomial")
rm1.train <- summary(dCVnet::performance(rm1))

rm1.cv <- dCVnet::cv_performance_glm(y = ry,
                                     data = blrsim$x,
                                     f = "~.",
                                     k = 5,
                                     nrep = 10)

rm1.test <- summary(
  dCVnet::performance(rm1,
                      newdata = data.frame(y = blrsim$y.test,
                                           blrsim$x.test))
)
# dCVnet with shuffled labels:
rm2 <- dCVnet::dCVnet(y = ry,
                      data = blrsim$x,
                      k_outer = 5,
                      k_inner = 5,
                      nrep_outer = 10,
                      nrep_inner = 5,
                      alphalist = 0.1,
                      type.measure = "class")

rm2.train <- summary(performance(rm2$prod$performance))

rm2.test <- summary(
  structure(
    data.frame(
      prediction = predict(rm2, newx = blrsim$x.test, type = "response")[, 1],
      classification = as.factor(predict(rm2,
                                         newx = blrsim$x.test,
                                         type = "class")[, 1]),
      reference = as.factor(blrsim$y.test),
      label = "m2.test",
      stringsAsFactors = FALSE
    ),
    family = "binomial",
    class = c("performance", "data.frame")
  )
)

rm2.cv <- report_performance_summary(rm2)

r_results <- c(
  bayes.rate = 0.5,
  #
  glm.train = rm1.train[5, 2],
  glm.test = rm1.test[5, 2],
  glm.cv = rm1.cv$cv.performance[5, 2],
  #
  enet.train = rm2.train[5, 2],
  enet.test = rm2.test[5, 2],
  enet.cv = rm2.cv[5, 2]
)

Now we will inspect the results:

# print accuracy:
print(r_results, digits = 2)

# subtract the bayes rate accuracy:
print(r_results - r_results[["bayes.rate"]], digits = 2)

A binary classification model trained on random labels should have (on average) a chance level (majority class frequency) accuracy in test data. Above we verify that this is the case for glmnet in this simulated data.

So the nesting of the cross-validation is working as expected.


Note that the cross-validated estimates from the dCVnet elastic-net model were actually slightly lower than chance level (`r signif(r_results[["enet.cv"]], 2)`). This is because a model where full regularisation is selected by the hyperparameter tuning will be dominated by the intercept (as the intercept is not penalised). Since all predictors are internally centred and scaled, the intercept-only model effectively predicts the prevalence in the training data, and for the binomial family an intercept-only model will predict the majority class. When cross-validation is applied to a binomial model where the prevalence is 0.5 (as is the case here), the majority class will vary from fold to fold, and whichever class is in the majority in the data used for training will necessarily be in the minority in the held-out fold. If the prevalence of a class is higher in the training set then it is lower in the test set, so predictions will be slightly worse than chance accuracy. A similar effect is seen for AUC.
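A small standalone simulation (independent of dCVnet) illustrates this below-chance effect: with 50 perfectly balanced observations, unstratified held-out folds of 10, and a majority-class prediction rule, the expected held-out accuracy is slightly under 0.5:

# sketch: majority-class predictions under unstratified folds of a
#   perfectly balanced outcome average slightly below chance accuracy
set.seed(1)
yy <- rep(0:1, each = 25)
acc <- replicate(10000, {
  test <- sample(50, 10)        # one unstratified held-out fold of 10
  p_train <- mean(yy[-test])    # prevalence of class 1 in the remaining 40
  maj <- if (p_train == 0.5) rbinom(1, 1, 0.5) else as.integer(p_train > 0.5)
  mean(yy[test] == maj)         # accuracy of predicting the training majority
})
mean(acc)  # a little below 0.5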


