```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
This document shows how to use the `etalonnage` package to facilitate GDP forecasting/nowcasting using real French data.

```r
library(etalonnage)
```
First, load a data frame containing the regressors (`fr_x`) and another one containing the target (`fr_y`). To emulate a real forecast situation, `fr_x` covers a wider period (one more quarter) than `fr_y`.

```r
data("fr_x", "fr_y")
```
`fr_x` is composed of:

```r
head(fr_x, n = 3)
```
`fr_y` is composed of real GDP values at a quarterly frequency.

```r
head(fr_y, n = 3)
```
During a forecasting exercise, the preprocessing steps are often the same (set the predictors to the same frequency as the target, add dummies, ...). The `etalonnage` package provides functions to facilitate these treatments.

Convert the target to a growth rate using `build_target`:

```r
fr_y <- build_target(fr_y, growth_rate = TRUE, date_freq = "quarter")
head(fr_y, n = 3)
```
Add the first difference of every regressor to `fr_x` using `add_diff`, then pivot the regressors so that each month of a quarter gets its own column (thus matching the frequency of the target) using `month_to_quarter`:

```r
fr_x <- fr_x %>%
  add_diff(exclude = "date") %>%
  month_to_quarter()
```
Now suppose that one wants to forecast the French GDP growth rate in 2000Q2. Depending on the horizon, the forecast is computed conditional on the information released until April 2000, May 2000 or June 2000. Consider June 2000. At this horizon, 3 months (April, May and June 2000) of Insee survey variables are available, since these variables are released with no delay (i.e. during the month to which they relate), but only 2 months (April and May 2000) of Banque de France survey variables are available, since these variables are released with a delay of 1 month. As a result, the final dataset contains some values that should not be observed and must be dropped. Regarding household consumption and IPI, only one month is available, but it would be too costly to remove these variables given their importance to the forecast. To deal with them, the `etalonnage` package contains a function `acquis` that transforms series containing `NA` by computing a "granted" growth (*acquis de croissance*).

```r
fr_x <- fr_x %>%
  acquis(cols = c("ipi", "conso"), month = 1) %>%
  dplyr::select(-dplyr::contains(c("Bdf_fd1_3", "Bdf_3")))
```
To deal with `NA` or structural breaks, it is common to add dummies to the regressors. This can be done using `add_dummy`: pass a `list` of names for the dummies and add the corresponding conditions one by one. Here, the columns `retailInsee` and `batBdf` contain `NA` until 1999 and 2009Q4 respectively:

```r
fr_x <- fr_x %>%
  add_dummy(
    names = list("dummy_retailInsee", "dummy_batBdf"),
    (date < "1999-01-01"),
    (date < "2009-04-01")
  )
```
It is also possible to convert column values to growth rates using `to_growth_rate`.
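For intuition, this kind of transformation can be reproduced by hand in base R (the vector below is a made-up toy; see the package documentation for the exact arguments of `to_growth_rate`):

```r
# Toy level series (made up): period-on-period growth rate.
x <- c(100, 102, 101.5)
growth <- (x[-1] - x[-length(x)]) / x[-length(x)]
round(growth, 4)
#> [1]  0.0200 -0.0049
```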
Make sure that `fr_x` and `fr_y` start at the same date, and replace `NA` with some value (here 0):

```r
fr_y <- fr_y[-1, ]
fr_x <- fr_x[-c(1, 2), ]
fr_x[is.na(fr_x)] <- 0
```
During a forecast exercise, it is common that the forecast is performed using information provided by series released at a higher frequency than the target. These series are released with various delays, so the forecast is conditioned on the sample of series that are known at the time the estimation is performed. In order to take the non-synchronicity of data publications into account (and thus properly assess the performance of a model), forecast accuracy is assessed on the basis of a pseudo real-time experiment. This kind of evaluation aims at replicating the timeliness of the releases of the series by taking their publication lags into account. In this framework, the series are truncated so as to consider only those values that would have been available on the date on which the forecasts were computed.
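The truncation logic can be sketched as follows (a toy helper, not part of etalonnage): given the forecast date and a series' publication delay in months, compute the last month for which data would have been available, and drop later observations.

```r
# Last month observable at `forecast_date` for a series published
# with a delay of `delay_months` months (0 = released within the month).
last_available <- function(forecast_date, delay_months) {
  seq(forecast_date, by = "-1 month", length.out = delay_months + 1)[delay_months + 1]
}

last_available(as.Date("2000-06-01"), 0)  # Insee surveys: 2000-06-01
last_available(as.Date("2000-06-01"), 1)  # Banque de France surveys: 2000-05-01
```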
When data are not i.i.d., the validation scheme has to take the time-dependent structure of the data into consideration to avoid creating non-independent training and test sets. To assess a model's performance, the `etalonnage` package implements "rolling-origin-update evaluation" (ROUE), meaning that the forecast origin rolls ahead in time. At each step, ROUE increments the training set by one observation of the test set. Here, ROUE is implemented in such a way that the size of the training set increases at each iteration (expanding window) rather than remaining constant (fixed window). In doing so, all the available information is used, but equal importance is given to all observations of the training set, regardless of their "distance" to the forecast origin. At the end of the validation, a set of forecasts is available, making it possible to compare models.
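This expanding-window loop can be illustrated in base R with a plain linear model (toy data and names; etalonnage runs the analogous loop with the chosen regressor):

```r
# Toy data: two predictors and a noisy linear target.
set.seed(1)
n  <- 40
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 0.5 * df$x1 - 0.3 * df$x2 + rnorm(n, sd = 0.1)

origin <- 30                                  # forecast origin (index)
preds  <- numeric(n - origin)
for (i in origin:(n - 1)) {
  # Expanding window: the training set grows by one observation each step.
  fit <- lm(y ~ x1 + x2, data = df[seq_len(i), ])
  preds[i - origin + 1] <- predict(fit, newdata = df[i + 1, ])
}
rmse <- sqrt(mean((preds - df$y[(origin + 1):n])^2))
```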
All that remains is to choose a forecast origin and fit the models:
```r
rf <- etalonnage(
  name = "Random Forest",
  X = fr_x,
  y = fr_y$target,
  regressor = "randomForest",
  forecast_origin = "2014-10-01",
  scale = "none",
  mtry = 15,
  ntree = 500,
  nodesize = 3,
  importance = TRUE
)

xgb <- etalonnage(
  name = "XGBoost",
  X = fr_x,
  y = fr_y$target,
  regressor = "xgboost",
  forecast_origin = "2014-10-01",
  scale = "none",
  nrounds = 1500,
  eta = 0.05,
  max_depth = 6,
  verbose = FALSE
)
```
For each of the two models, this is what is done by `etalonnage`:

- `date` is removed from the regressors,
- the data are split at `forecast_origin`,
- if `scale != "none"`, process the data,
- the `regressor` is fitted on the data,
- a forecast is produced for `forecast_origin` + 1 quarter,
- the training set is expanded and the steps above are repeated until the end of `fr_x` is reached.

Other arguments like `mtry` or `eta` come directly from the packages used to fit the models, i.e.:

- `randomForest` for `regressor = "randomForest"`,
- `xgboost` for `regressor = "xgboost"`,
- `glmnet` for `regressor = "glmnet"`.

Plot a model's predictions using the `graph` method:

```r
graph(rf, annotation_y = -0.01, annotation_x = 200)
```
Directly access the predicted values using `rf$predicted_values`, or evaluate the models through their attributes:

```r
rf$test_rmse
rf$test_mae
rf$test_mda

xgb$test_rmse
xgb$test_mae
xgb$test_mda
```
Graph the two models using `graph_models`:

```r
graph_models(rf, xgb, start_graph = "2000-01-01")
```