options(digits = 3) library(readr) library(timetk) library(forecast) library(rsample) library(purrr) library(tidyr) library(sweep) library(dplyr) library(ggplot2) library(zoo)
"Demo Week: Tidy Forecasting with sweep
" is an excellent article that uses tidy methods with time series. This article uses their analysis with rsample
to get performance estimates for future observations using rolling forecast origin resampling.
The data are sales of alcoholic beverages and can be found at the Federal Reserve Back of St. Louis website. From this page, download the csv
file. readr
is used to bring in the data:
col_spec <- cols( DATE = col_date(format = ""), S4248SM144NCEN = col_double() ) library(readr) drinks <- read_csv("S4248SM144NCEN.csv", col_types = col_spec) str(drinks, give.att = FALSE)
Each row is a month of sales (in millions of US dollars).
Suppose that predictions for one year ahead were needed and the model should use the most recent data from the last 20 years. To setup this resampling scheme:
library(rsample) roll_rs <- rolling_origin( drinks, initial = 12 * 20, assess = 12, cumulative = FALSE ) nrow(roll_rs) roll_rs
Each split
element contains the information about that resample:
roll_rs$splits[[1]]
For plotting, let's index each split by the first day of the assessment set:
get_date <- function(x) min(assessment(x)$DATE) start_date <- map(roll_rs$splits, get_date) roll_rs$start_date <- do.call("c", start_date) head(roll_rs$start_date)
This resampling scheme has r nrow(roll_rs)
splits of the data so that there will be r nrow(roll_rs)
ARIMA models that are fit. To create the models, the auto.arima
function from the forecast
package is used. The functions analysis
and assessment
return the data frame, so another step converts the data in to a ts
object called mod_dat
using a function in the timetk
package.
library(forecast) # for `auto.arima` library(timetk) # for `tk_ts` library(zoo) # for `as.yearmon` fit_model <- function(x, ...) { # suggested by Matt Dancho: x %>% analysis() %>% # Since the first day changes over resamples, adjust it # based on the first date value in the data frame tk_ts(start = .$DATE[[1]] %>% as.yearmon(), freq = 12, silent = TRUE) %>% auto.arima(...) }
Each model is saved in a new column:
library(purrr) roll_rs$arima <- map(roll_rs$splits, fit_model) # For example: roll_rs$arima[[1]]
(There are some warnings produced by these first regarding extra columns in the data that can be ignored)
Using the model fits, performance will be measured in two ways:
In each case, the mean absolute percent error (MAPE) is the statistic used to characterize the model fits. The interpolation error can be computed from the Arima
object. to make things easy, the sweep
package's sw_glance
function is used:
library(sweep) roll_rs$interpolation <- map_dbl( roll_rs$arima, function(x) sw_glance(x)[["MAPE"]] ) summary(roll_rs$interpolation)
For the extrapolation error, the model and split objects are required. Using these:
library(dplyr) get_extrap <- function(split, mod) { n <- nrow(assessment(split)) # Get assessment data pred_dat <- assessment(split) %>% mutate( pred = as.vector(forecast(mod, h = n)$mean), pct_error = ( S4248SM144NCEN - pred ) / S4248SM144NCEN * 100 ) mean(abs(pred_dat$pct_error)) } roll_rs$extrapolation <- map2_dbl(roll_rs$splits, roll_rs$arima, get_extrap) summary(roll_rs$extrapolation)
What do these error estimates look like over time?
library(ggplot2) library(tidyr) roll_rs %>% select(interpolation, extrapolation, start_date) %>% as.data.frame %>% gather(error, MAPE, -start_date) %>% ggplot(aes(x = start_date, y = MAPE, col = error)) + geom_point() + geom_line() + theme_bw() + theme(legend.position = "top")
It is likely that the interpolation error is an underestimate to some degree.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.