knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The hypothesis this note will eventually check is whether cross-validation can reduce overfitting in EBMA. For now, it only demonstrates overfitting on the demo example.

suppressPackageStartupMessages({
  library("dplyr")
  library("tidyr")
  library("forcats")
  library("EBMAforecast")
  library("EBMAhelper")
})

Continuous outcome demo: presidential forecasts

The EBMAforecast package includes two demo data sets that are used to illustrate how EBMA performs. The first consists of forecasts of vote share in US presidential elections.

These data consist of 6 different predictions for the incumbent party vote share in 15 US presidential elections, along with the actual vote share.

data(presidentialForecast)

dplyr::glimpse(presidentialForecast)

The original application [@montegomery:etal:2014] uses only one case, the 2012 election, as an out-of-sample test. I would like a more robust estimate of out-of-sample performance, so I am going to use cross-validation with 3 of the 15 cases left out for testing at each split. To avoid the random element of how splits are assigned, I am going to do this for all 455 distinct^[15 choose 3 = 455] calibration/test splits and run them all.
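
As a quick check on that count, base R's choose function gives the number of distinct splits:

# Number of distinct ways to hold out 3 of the 15 elections for testing
choose(15, 3)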

To assess performance, I will look at calibration- and test-period MAE and RMSE for the component models, EBMA, and a baseline simple-average ensemble. The relevant questions are whether EBMA overfits, i.e. whether its test error is notably worse than its calibration error, and how it compares out of sample to the component models and to the simple average.
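
For reference, here are the two error metrics written out as small helper functions. This is just an illustrative sketch; the loop below computes the same quantities inline rather than calling these hypothetical helpers.

# Error metric helpers (illustrative only; the loop below inlines these formulas)
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae  <- function(pred, obs) mean(abs(pred - obs))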

x_data <- presidentialForecast[, c(1:6)]
y_data <- presidentialForecast[, 7]

# All 455 distinct calibration/test splits (15 choose 3)
all_splits <- combn(1:15, 3, simplify = FALSE)

output <- vector("list", length(all_splits))
for (i in seq_along(all_splits)) {
  #cat(paste0(i, "."))
  #if (i %% 10 == 0) cat("\n")

  # Split 12/3 for calibration/test
  test_idx  <- all_splits[[i]]
  train_idx <- setdiff(1:15, all_splits[[i]])

  # Calibrate EBMA
  ebma_fit <- ebma(y = y_data[train_idx], x = x_data[train_idx, ],
                   model_type = "normal", 
                   useModelParams = FALSE, tol = 0.0001, const = 0,
                   method = "EM")

  # Save calibration/test RMSE/MAE for each model
  df  <- summary(ebma_fit, period="calibration", showCoefs=FALSE)@summaryData
  out <- data.frame(Period = "calib", Model = rownames(df), df[, c("rmse", "mae")])
  df  <- summary(ebma_fit, period="test", showCoefs=FALSE)@summaryData
  out <- rbind(out, data.frame(Period = "test", Model = rownames(df), df[, c("rmse", "mae")]))

  # Baseline: simple (equal-weight) average of the component forecasts
  simple_avg <- rowMeans(x_data)
  fit <- data.frame(Period = c("calib", "test"),
                    Model = "Simple avg", 
                    rmse = c(
                      sqrt(mean((simple_avg[train_idx] - y_data[train_idx])^2)),
                      sqrt(mean((simple_avg[test_idx]  - y_data[test_idx])^2))
                    ),
                    mae = c(
                      mean(abs(simple_avg[train_idx] - y_data[train_idx])),
                      mean(abs(simple_avg[test_idx]  - y_data[test_idx]))
                    ))

  out <- rbind(out, fit)
  out$index <- i

  output[[i]] <- out
}
output <- do.call(rbind, output)
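
Before summarizing, a quick look at the pooled per-split results; by construction, each row is one model-period combination for one of the 455 splits:

head(output)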

Here are the results, ordered from best to worst test RMSE.

# Mean RMSE/MAE across all splits, by model and period
output %>%
  tidyr::gather(stat, value, rmse:mae) %>%
  dplyr::group_by(Model, Period, stat) %>% 
  dplyr::summarize(mean = mean(value)) %>% 
  dplyr::ungroup() %>%
  dplyr::mutate(
    Period = forcats::fct_recode(Period, Calib = "calib", Test = "test"),
    stat = toupper(stat)
  ) %>%
  tidyr::unite(stat, Period, stat) %>%
  tidyr::spread(stat, mean) %>%
  dplyr::arrange(Test_RMSE) %>%
  dplyr::select(Model, ends_with("MAE"), ends_with("RMSE")) %>%
  knitr::kable(digits = 2)
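
Since the point here is the gap between calibration and test error, it can also help to tabulate that gap directly. The sketch below is an extra view, not part of the original analysis; it reuses the pooled output data frame from above and only looks at RMSE.

# Overfitting gap: mean test RMSE minus mean calibration RMSE, by model
output %>%
  dplyr::group_by(Model, Period) %>%
  dplyr::summarize(rmse = mean(rmse), .groups = "drop") %>%
  tidyr::spread(Period, rmse) %>%
  dplyr::mutate(RMSE_gap = test - calib) %>%
  dplyr::arrange(dplyr::desc(RMSE_gap)) %>%
  knitr::kable(digits = 2)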

References


