Analysis with the 2010 tourism forecasting competition data"
In Tcomp: Data from the 2010 Tourism Forecasting Competition

Tourism data

This vignette introduces the Tcomp R package, which is a convenient wrapper around the univariate versions of the data used in the Tourism Forecasting competition described in Athanasopoulos et al (2011), originally in the International Journal of Forecasting (2011) 27(3), 822-844, available in full from Hyndman's blog. Unfortunately, the explanatory variables are not available, so only univariate forecasting methods can be tested with this data collection. The data available here are the same as those used in the Kaggle competition.

The original data are also available from Hyndman's site, in CSV format, and have been wrapped into the Tcomp package with his permission. Any errors that occurred in the wrapping process are the package author's responsibility and if there is doubt, the R version should be checked against the CSV originals. However, testing has been pretty thorough, and further down in this vignette we successfully reproduce exactly the mean absolute percentage error of the naive forecasts that are inter alia reported in that 2011 article. Please report errors as issues on this package's GitHub page.

The data

library(Tcomp)
library(knitr)
opts_chunk$set(comment=NA, fig.width=7, fig.height=4)
options(verbose = FALSE, xtable.type = "html")

The data are all in a lazy-loaded list called tourism which consists of r length(tourism) objects. tourism is a list of class Mcomp, a class inherited from Hyndman et al.'s Mcomp package which has the data from the M1 and M3 forecasting competitions.

length(tourism)
class(tourism)

Each of the r length(tourism) objects in tourism is a list of class Mdata and it has the same elements as the data in the Mcomp package. The example here is one of the annual time series:

tourism$Y5

Here is the full structure of each Mdata object, again using the fifth yearly series as an example:

str(tourism$Y5)

The Mcomp package also provides a convenient plotting method for Mdata objects, combining the training and testing periods of the data. Shown here with one of the monthly time series:

par(bty = "l")
plot(tourism$M11); grid()

Summary information

Some of the descriptive statistics from Table 3 of Athanasopoulos et al. are reproduced below.

library(dplyr, warn.conflicts = FALSE)
library(xtable)
lengthsm <- sapply(subset(tourism, "monthly"), function(s){length(s$x) + length(s$xx)})
lengthsq <- sapply(subset(tourism, "quarterly"), function(s){length(s$x) + length(s$xx)})
lengthsy <- sapply(subset(tourism, "yearly"), function(s){length(s$x) + length(s$xx)})

tmp <- data.frame(series_length = c(lengthsm, lengthsq, lengthsy),
                          series_type = c(rep("Monthly", length(lengthsm)),
                                          rep("Quarterly", length(lengthsq)),
                                          rep("Yearly", length(lengthsy))
                          )) %>%
  group_by(series_type) %>%
  summarise(
    TotalNumber = length(series_length),
    MeanLength = round(mean(series_length), 1),
    MedianLength = median(series_length),
    MinLength = min(series_length),
    MaxLength = max(series_length)
  ) %>%
  select(-series_type) %>%
  t() %>%
  as.matrix()
colnames(tmp) <- c("Monthly", "Quarterly", "Yearly")
xtable(tmp, digits = 1)

Comparison with the original shows that while the number of series, median length, minimum length and maximum length match exactly; mean length only matches if the results above are truncated rather than rounded (ie 298.6 becomes 298). Given the other statistics match, and the mean absolute percentage error of naive forecasts match (see below), this discrepancy is assumed to be a minor rounding error in Athanasopoulos et al.

Analysis

The Tcomp package provides forecast_comp(), a wrapper around functions from Hyndman's forecast package which performs forecasts based on:

exponential smoothing state space
ARIMA
theta
naive or seasonal naive

methods and returns mean absolute percentage error (MAPE) and mean absolute scaled error (MASE) for the forecasts compared to the test set, at given testing horizons

forecast_comp(tourism[[300]], tests = list(6, 12, 24))

The first four rows are MAPE and the last four are MASE; note that this is not possible to tell from the output without knowing in advance.

For this particular series (which is a monthly one), taking the MASE (which I think is the better measure), only the ARIMA model out-performed the seasonal naive model (which is forecasts values as being the samne as the most recent observed value for that month).

forecast_comp() comes with an optional plot function which compares the four forecasting methods to actuals:

forecast_comp(tourism[[1000]], tests = list(1,2,3,4), plot = TRUE)

In this case, we see that actual growth exceeded the 80% prediction interval for each method (but not always the 95% prediction interval).

Reproducing some of the competition results

forecast_comp() is designed to be used in a larger scale than one series at a time. Here is a more extended use, aimed at re-creating some of the results reported in Athanasoupolus et al.

library(Tcomp)
library(dplyr)
library(tidyr)
library(parallel)

# this function runs the four standard models in forecast_comp
# on a large chunk of the competition series from either Mcomp or Tcomp.  
# The aim is to help comparisons with Athanasopoulos et al.  
#
# The use of makePSOCKcluster and parLapply speeds up the analysis nearly four fold on my laptop
# eg running the test on all the yearly tourism series takes 12 seconds rather than 44 seconds.

#' @param dataobj a list of class Mcomp such as M3 or tourism
#' @param cond1 a condition for subsetting dataobj eg "yearly"
#' @param tests a list of different horizons at which to return the MASE for four different models
#' 
#' @return a data.frame with \code{length(tests) + 2} columns and 8 rows
accuracy_measures <- function(dataobj, cond1, tests){
  cores <- detectCores()

  cluster <- makePSOCKcluster(max(1, cores - 1))

  clusterEvalQ(cluster, {
    library(Tcomp)
    library(forecast)
  })

  results <- parLapply(cluster,
                       subset(dataobj, cond1), 
                       forecast_comp, 
                       tests = tests)

  results_mat <- do.call(rbind, results)
  nr <- nrow(results_mat)

  tmp <- as.data.frame(results_mat) %>%
    mutate(measure = rep(rep(c("MAPE", "MASE"), times = c(4, 4)), times = nr / 8)) %>%
    mutate(method = rownames(results_mat)) %>%
    gather(horizon, mase, -method, -measure) %>%
    group_by(method, measure, horizon) %>%
    summarise(result = round(mean(mase), 2)) %>%
    ungroup() %>%
    mutate(horizon = factor(horizon, levels = colnames(results[[1]]))) %>%
    spread(horizon, result) %>%
    arrange(measure) %>%
    as.data.frame()

  stopCluster(cluster)

  return(tmp)
}

accuracy_measures(tourism, "monthly", list(1, 2, 3, 6, 12, 18, 24, 1:3, 1:12, 1:24))
accuracy_measures(tourism, "quarterly", list(1, 2, 3, 4, 6, 8, 1:4, 1:8))
accuracy_measures(tourism, "yearly", list(1, 2, 3, 4, 1:2, 1:4))

The code above isn't run in the build of the vignette for reasons of time (takes a few minutes to run on my laptop, making it to intensive for a CRAN build). The results are available in the package however as a list called Tcomp_reproduction

xtable(Tcomp_reproduction$monthly)
xtable(Tcomp_reproduction$quarterly)
xtable(Tcomp_reproduction$yearly)

Comparing the results above to Tables 4, 5 and 6 in Athanasopoulos et al. we get exact matches for the mean absolute percentage error (MAPE) of the Naive forecasts, but not for any other rows. It is expected that the methods for ARIMA, ETS and Theta forecasts in the current forecast package differ slightly from the methods used in the 2011 article. However, I don't have an explanation for why the mean absolute scaled error estimates above for the naive forecasts don't match those in the 2011 article. The numbers I obtain are slightly highter than the published numbers; obviously there is a small difference in how MASE is being calculated (my code above uses forecast::accuracy() with the default values). I'm confident the data is identical because of the exact match of the MAPE values of all three sets of naive forecasts (there is less discretion in exactly how to measure MAPE).