knitr::opts_chunk$set( collapse = TRUE, eval=TRUE, comment = "#>" )
This page explains the methodoly behind the sumbission. Is a part of the reproducibility submission. The second part is the technical description and how to reproduce the results. The reproducibility part can be seen here.
This approach can be described as a linear combination of statistical forecasting methods, where the weights of the combination are calculated by a learning model based on gradient tree boosting over a set of features extracted from the individual series.
A basic flowchart of the process is included to support the detailed description.
This methodological description page is divided in 3 sections, the Forecasting methods, where the individual forecasting methods are described, the Learning model, where the method for generating the weights is explained, for both Mean and Interval predictions, and the Features, where the features used in the learning model are described.
A set of 9 forecasting methods is used. These methods are fitted on each
individual series and produce the forecasts at the asked horizon. These forecasts are linearly combined to produce the final point forecast.
The rationale behind the pool of methods is to use easily available forecasting methods that can be fitted automatically. All the forecasting methods are well known forecasting methods, and are available in the forecast
R package. For clarity, the specific R calls for fitting the models are given.
forecast::auto.arima(x, stepwise=FALSE, approximation=FALSE)
forecast::ets(x)
NNETAR A feed-forward neural network with a single hidden layer is fitted to the lags. The number of lags is automatically selected. The code for fitting the model is forecast::nnetar(x)
TBATS The Exponential smoothing state space Trigonometric, Box-Cox transformation, ARMA errors, Trend and Seasonal components model. Parameters like the application of a Box-Cox transformation, to include trend, etc. are automatically fitted. The code for fitting the model is forecast::tbats(x)
forecast::stlm(x, modelfunction = stats::ar)
forecast::rwf(x, drift=TRUE)
forecast::thetaf(x)
forecast::naive(x)
forecast::snaive(x)
In the case of an error when fitting the series (e.g. a series is constant), the SNAIVE forecast method is used instead.
This section explains the learning model that produces the weights used to combine the individual forecast methods. We can distingush two methods:
Model for producing the mean forecasts: In essence, it is a feature-based gradient tree boosting approach where the loss or error function to minimize is tailored to the OWI error used in the M4 competition. The implementation of gradient tree boosting is xgboost
, a tool that is computationally efficient and allows a high degree of customization. The set of features is explained in the next section.
Model for producing the prediction intervals: This method uses the mean forecasts (the result of our mean approach) as the center of the interval, and the finds a linear combination of the 95% prediction intervals of 3 forecasting methods in our original set, THETAF, NAIVE and SEASONAL NAIVE. One set of weights to combine this three intervals is produced per horizon, and is the same for all series, differing from the mean approach, where one set of weights is produced per series. The weights are calculated by minimizing the MSIS error of the linear combination of these three intervals in the training set.
In order to train the learning model, a dataset where the OWI errors of the methods can be measured is required. In our case, we applied temporal holdout to the available M4 dataset to create this training set. The specific temporal holdout procedure is as follows:
In this new training set we can apply the forecasting methods and measure their OWI errors in a scenario that will be 'close' to the real one. We call this training set meta_M4
Let $a$ be the features extracted from a time series $x$, $f$ the set of $M$ vectors with the forecasts for each method applied to $x$ and $xx$ the true future values of $x$. The gradient boosting method produces a vector of length $M$, real numbers, as a function of the features, $y(a)$.
We transform the output of the trees $y(a)$ from real numbers to probabilities by appling the $softmax$ transform. The error or 'loss' function for a series $x$ with features, $a$, forecasts set $f$ and the true future values $xx$ is:
$$ L_{OWA}(y,a,f,xx) = \sum_{i=1}^{M} \frac{e^{y(a)_i}}{\sum e^{y(a)}} OWA (f_i, xx) $$ This error function has three benefits:
The gradient tree boosting approach implemented in xgboost
works by additively approximating the function $y$ so that the overall error is minimized:
$$ argmin_y \sum_{j}^N L_{OWA}(y,A_j,F_j,XX_j) $$
where $A_j$, $F_j$, $XX_j$ are the features, forecasts and future values oth the $j$-eth series training set. The Gradient and Hessian of $L_{OWA}$ with respect to $y$ must be provided to xbgoost
to proceed with the minimization.
xgboost
tree models have a range of hyperparameters that must be tuned to produce good results.
We have performed a search in a subset of the hyperparameter space by using Bayesian Optimization, measuring the OWA via a 10-fold crossvalidation of the training set meta_M4
. The range of hyperparameters is as follows:
max_depth
The maximum depth of a tree. From 6 to 14.eta
The learning rate, the scale of contribution of each tree. From 0.001 to 1.subsample
The proportion of the training set used to calculate the trees each iteration. From 0.5 to 1.colsample_bytree
The proportion of the features used to calculate the trees each iteration. From 0.5 to 1.nrounds
The number of iterations of the algorithm. From 1 to 250.The final set of hyperparameters was:
max_depth = 14
, eta=0.58
, subsample=0.92
, colsample_bytree=0.77
and nrounds = 94
. This set of features was not the best performing on the crossvalidation, but it produced very similar OWA error while being more regularized, less prone to overfitting.
The learning model, (that is the function $y$), is trained in meta_M4
.
For a series $x$ in the original M4 dataset, the vector of features, $a$ is extracted from it and the individual methods' forecasts $f$ are calculated. The model then produces its output $y(a)$ and it is transformed to probabilties. Finally the individual method's forecasts are combined by linear combination using these probabilities as 'weights' to produce the final submitted forecast. $$combi_forecast(x) = \sum_{i}^{M} \frac{e^{y(a)_i}}{e^{y(a)}} f_i$$ Note that we omit the forecasting horizon $h$ for simplicity. It would be an extra input parameter.
The prediction intervals are also a linear combination of intervals of single methods. The weights of the combination are calculated the following way:
In the training set we calculate the 95% prediction intervals of the methods THETA, NAIVE and SEASONAL NAIVE, produced by their implementation in the forecast
R package. This subset of methods was chose for comptutational cost reasons. For each of these methods, we get the 'radius' of the interval, the difference from the upper bound to the mean forecast (we consider symmetric intervals). These radiuses are also vectors of length $h$, the required forecasting horizon.
Using the mean forecast produced by our approach described in the previous section, the MSIS is minimized the following way:
A set of weights $w$ will be calculated per forecasting horizon. For a given series $x$ we produce the mean combination forecast $c$, and the radius of the intervals of the $M=3$ methods, $r$. NOTE: This combination forecast $c$ used for training the interval is based on a model trained on a disjoint subset of the training set, to avoid overfitting, the final combination $c$ is based on all the training set. We start by calculaing the denomitator of the MSIS error formula, common to al horizons: $$D = \sum_{t=1+m}^n x_t - x_{t-m} $$
Since we will calculate one set of weights per horizon, we will simplify the notation of the MSIS formula and show the minimization of one specific horizon. This means that $w$ from now on are the set of $M=3$ weights for one specific horizon, eg. 13, $xx$ would therefore be 13th true future value of $x$ in the training set. The weights are calculated by minimizing the following formula:
$$ L(w,x) = \sum_{i=1}^{M}c - w_ir_i$$ $$U(w,x) = \sum_{i=1}^{M}c + w_ir_i$$
$$ MSIS(w,x,xx) = \frac{ 2 \sum_{1}^{M} r_iw_i + \frac{2}{\alpha}(L - xx)I(L
$$ argmin_w \sum_{j=1}^N MSIS(w, X_j, XX_j) $$
with $X$ and $XX$ the set of series and true future values in the training set meta_M4
respectively.
This minimization is repeated for all forecasting horizons, 48 in the M4 competitions, to produce 48 sets of $M=3$ weights.
The optimization algorithm used is the standard Conjugate Gradient implemented in the optim
R function.
For a given series $x$ we calculate the radius $r$ based on the three methods and the mean forecast $c$ of our mean approach. The precalculated set of weights $W$ is a $3 \times h$ matrix, just like $r$.
The prediction intervals for $x$ are: $$U = c + diag(W'r)$$ and $$L = c - diag(W'r)$$ With $W'$ denoting the transpose of $W$ and diag the diagonal elements of the matrix.
A final step, common to both means and prediction interval forecasts, is setting negative values to 0. This is due to no obervations in the dataset being less than 0.
A set of 42 features is used in the metalearning model. For each series the following 42 features are calculated.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.