boost_arima: General Interface for Boosted ARIMA Regression Models

View source: R/parsnip-arima_boost.R

boost_arima R Documentation

Description

boost_arima() generates a specification of a time series model that fits an ARIMA model and then uses boosting to model its errors (residuals) on exogenous regressors. It works with both "automated" ARIMA (auto.arima) and standard ARIMA (arima). The main algorithms are:

  • Auto ARIMA + Catboost Errors (engine = auto_arima_catboost, default)

  • ARIMA + Catboost Errors (engine = arima_catboost)

  • Auto ARIMA + LightGBM Errors (engine = auto_arima_lightgbm)

  • ARIMA + LightGBM Errors (engine = arima_lightgbm)
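The two-stage idea behind these engines can be sketched in base R. This is a minimal illustration under stated assumptions, not boostime's implementation: it assumes the ARIMA model is fitted to the series alone and a second model is then fitted to its residuals, and it uses stats::arima() with a plain lm() as a stand-in for the Catboost / LightGBM step. The AirPassengers data and the month regressor are chosen only for illustration.

```r
# Stage 1: seasonal ARIMA fitted to the series alone (no regressors).
y <- log(AirPassengers)
arima_fit <- arima(
  y,
  order    = c(0, 1, 1),
  seasonal = list(order = c(0, 1, 1), period = 12)
)
arima_resid  <- as.numeric(arima_fit$residuals)
arima_fitted <- as.numeric(y) - arima_resid

# Exogenous regressor for stage 2: month of the year as a factor.
stage2_data <- data.frame(
  resid = arima_resid,
  month = factor(cycle(y))
)

# Stage 2: model the ARIMA residuals on the regressor.
# ASSUMPTION: lm() is only a stand-in for the boosted tree model.
resid_fit <- lm(resid ~ month, data = stage2_data)

# Combined in-sample fit = ARIMA fit + predicted residual.
combined_fitted <- arima_fitted + as.numeric(fitted(resid_fit))

# In-sample squared error before and after the residual model.
c(
  arima_only = sum(arima_resid^2),
  combined   = sum((as.numeric(y) - combined_fitted)^2)
)
```

The in-sample error of the combined fit cannot exceed that of the ARIMA fit alone here, because the residual model includes the "predict zero" solution as a special case. Out-of-sample behavior is a separate question.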

Usage

boost_arima(
  mode = "regression",
  seasonal_period = NULL,
  non_seasonal_ar = NULL,
  non_seasonal_differences = NULL,
  non_seasonal_ma = NULL,
  seasonal_ar = NULL,
  seasonal_differences = NULL,
  seasonal_ma = NULL,
  tree_depth = NULL,
  learn_rate = NULL,
  mtry = NULL,
  trees = NULL,
  min_n = NULL,
  sample_size = NULL,
  loss_reduction = NULL
)

Arguments

mode

A single character string for the type of model. The only possible value for this model is "regression".

seasonal_period

A seasonal frequency. Uses "auto" by default. Either "auto" or a time-based phrase such as "2 weeks" can be used if a date or date-time variable is provided. See Fit Details below.

non_seasonal_ar

The order of the non-seasonal auto-regressive (AR) terms. Often denoted "p" in pdq-notation.

non_seasonal_differences

The order of integration for non-seasonal differencing. Often denoted "d" in pdq-notation.

non_seasonal_ma

The order of the non-seasonal moving average (MA) terms. Often denoted "q" in pdq-notation.

seasonal_ar

The order of the seasonal auto-regressive (SAR) terms. Often denoted "P" in PDQ-notation.

seasonal_differences

The order of integration for seasonal differencing. Often denoted "D" in PDQ-notation.

seasonal_ma

The order of the seasonal moving average (SMA) terms. Often denoted "Q" in PDQ-notation.

tree_depth

The maximum depth of the tree (i.e. number of splits).

learn_rate

The rate at which the boosting algorithm adapts from iteration-to-iteration.

mtry

The number of predictors that will be randomly sampled at each split when creating the tree models.

trees

The number of trees contained in the ensemble.

min_n

The minimum number of data points in a node that is required for the node to be split further.

sample_size

The amount of data exposed to the fitting routine.

loss_reduction

The reduction in the loss function required to split further.

Details

The data given to the function are not saved and are only used to determine the mode of the model. For boost_arima(), the mode will always be "regression".

The model can be created using the fit() function using the following engines:

  • "auto_arima_catboost" (default) - Connects to forecast::auto.arima() and catboost::catboost.train()

  • "arima_catboost" - Connects to forecast::Arima() and catboost::catboost.train()

  • "auto_arima_lightgbm" - Connects to forecast::auto.arima() and lightgbm::lgb.train()

  • "arima_lightgbm" - Connects to forecast::Arima() and lightgbm::lgb.train()

Main Arguments

The main arguments (tuning parameters) for the ARIMA model are:

  • seasonal_period: The periodic nature of the seasonality. Uses "auto" by default.

  • non_seasonal_ar: The order of the non-seasonal auto-regressive (AR) terms.

  • non_seasonal_differences: The order of integration for non-seasonal differencing.

  • non_seasonal_ma: The order of the non-seasonal moving average (MA) terms.

  • seasonal_ar: The order of the seasonal auto-regressive (SAR) terms.

  • seasonal_differences: The order of integration for seasonal differencing.

  • seasonal_ma: The order of the seasonal moving average (SMA) terms.

The main arguments (tuning parameters) for the Catboost / LightGBM model are:

  • tree_depth: The maximum depth of the tree (i.e. number of splits).

  • learn_rate: The rate at which the boosting algorithm adapts from iteration-to-iteration.

  • mtry: The number of predictors that will be randomly sampled at each split when creating the tree models.

  • trees: The number of trees contained in the ensemble.

  • min_n: The minimum number of data points in a node that is required for the node to be split further.

  • sample_size: The amount of data exposed to the fitting routine.

  • loss_reduction: The reduction in the loss function required to split further.

These arguments are converted to their specific names at the time that the model is fit.

Other options and arguments can be set using set_engine() (see Engine Details below).

If parameters need to be modified, update() can be used in lieu of recreating the object from scratch.
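A short sketch of choosing an engine and then changing a parameter with update(). It assumes boostime and its engine dependencies (catboost, lightgbm) are installed, and that update() accepts the same main arguments as boost_arima(); the specific values are illustrative only.

```r
library(parsnip)
library(boostime)

# Specification using the LightGBM-based automatic ARIMA engine.
spec <- boost_arima(
  seasonal_period = 12,
  tree_depth      = 6,
  learn_rate      = 0.1
) %>%
  set_engine("auto_arima_lightgbm")

# Change one tuning parameter without rebuilding the specification.
# ASSUMPTION: update() takes the same argument names as boost_arima().
spec_slow <- update(spec, learn_rate = 0.01)

spec_slow
```

Nothing is fitted at this point. Both objects are specifications, and the arguments are only translated to engine names when fit() is called.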

Engine Details

The standardized parameter names in boostime can be mapped to their original names in each engine:

Model 1: ARIMA:

boostime                                                    forecast::auto.arima          forecast::Arima
seasonal_period                                             ts(frequency)                 ts(frequency)
non_seasonal_ar, non_seasonal_differences, non_seasonal_ma  max.p(5), max.d(2), max.q(5)  order = c(p(0), d(0), q(0))
seasonal_ar, seasonal_differences, seasonal_ma              max.P(2), max.D(1), max.Q(2)  seasonal = c(P(0), D(0), Q(0))

Model 2: Catboost / LightGBM:

boostime        catboost::catboost.train  lightgbm::lgb.train
tree_depth      depth                     max_depth
learn_rate      learning_rate             learning_rate
mtry            rsm                       feature_fraction
trees           iterations                num_iterations
min_n           min_data_in_leaf          min_data_in_leaf
loss_reduction  None                      min_gain_to_split
sample_size     subsample                 bagging_fraction

Other options can be set using set_engine().
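The Model 2 name mapping can be written down as a small lookup. The sketch below is illustrative only: translate_args() is a hypothetical helper, not a boostime function, and it only renames arguments. Any value conversion the package performs (for example, turning a predictor count into a fraction) is not modeled here.

```r
# Name mapping copied from the Model 2 table (NA = no equivalent listed).
arg_map <- data.frame(
  boostime = c("tree_depth", "learn_rate", "mtry", "trees",
               "min_n", "loss_reduction", "sample_size"),
  catboost = c("depth", "learning_rate", "rsm", "iterations",
               "min_data_in_leaf", NA, "subsample"),
  lightgbm = c("max_depth", "learning_rate", "feature_fraction", "num_iterations",
               "min_data_in_leaf", "min_gain_to_split", "bagging_fraction"),
  stringsAsFactors = FALSE
)

# HYPOTHETICAL helper: rename standardized arguments to engine names.
# Arguments with no engine equivalent are dropped.
translate_args <- function(args, engine = c("catboost", "lightgbm")) {
  engine <- match.arg(engine)
  idx <- match(names(args), arg_map$boostime)
  if (anyNA(idx)) {
    stop("Unknown argument: ", paste(names(args)[is.na(idx)], collapse = ", "))
  }
  new_names <- arg_map[[engine]][idx]
  keep <- !is.na(new_names)
  stats::setNames(args[keep], new_names[keep])
}

args <- list(tree_depth = 6, learn_rate = 0.1, loss_reduction = 0)
names(translate_args(args, "catboost"))  # "depth" "learning_rate"
names(translate_args(args, "lightgbm"))  # "max_depth" "learning_rate" "min_gain_to_split"
```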

auto_arima_catboost (default engine)

Model 1: Auto ARIMA (forecast::auto.arima):

## function (y, d = NA, D = NA, max.p = 5, max.q = 5, max.P = 2, max.Q = 2, 
##     max.order = 5, max.d = 2, max.D = 1, start.p = 2, start.q = 2, start.P = 1, 
##     start.Q = 1, stationary = FALSE, seasonal = TRUE, ic = c("aicc", "aic", 
##         "bic"), stepwise = TRUE, nmodels = 94, trace = FALSE, approximation = (length(x) > 
##         150 | frequency(x) > 12), method = NULL, truncate = NULL, xreg = NULL, 
##     test = c("kpss", "adf", "pp"), test.args = list(), seasonal.test = c("seas", 
##         "ocsb", "hegy", "ch"), seasonal.test.args = list(), allowdrift = TRUE, 
##     allowmean = TRUE, lambda = NULL, biasadj = FALSE, parallel = FALSE, 
##     num.cores = 2, x = y, ...)

Parameter Notes:

  • All values of non-seasonal pdq and seasonal PDQ are maximums. auto.arima() selects each order using these values as upper limits.

  • xreg - This should not be used, since Catboost will be doing the regression.
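To illustrate the "maximums" note, the sketch below calls forecast::auto.arima() directly with upper limits on the orders and then inspects what was selected. It assumes the forecast package is installed; the AirPassengers data and the chosen limits are for illustration only, and this is not how boostime itself invokes the function.

```r
library(forecast)

# Upper limits on the orders; auto.arima() searches at or below them.
fit <- auto.arima(
  AirPassengers,
  max.p = 2, max.d = 1, max.q = 2,
  max.P = 1, max.D = 1, max.Q = 1
)

# Orders that were actually selected (named vector: p, d, q, ...).
selected <- arimaorder(fit)
selected
```

The selected orders can be lower than the limits, which is why a boost_arima() specification with an auto_arima engine treats the pdq and PDQ arguments as search bounds rather than fixed values.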

Model 2: Catboost (catboost::catboost.train):

## function (learn_pool, test_pool = NULL, params = list())

Parameter Notes:

  • Catboost uses params = list() to capture its parameters. Parsnip / boostime automatically sends any arguments provided as ... inside set_engine() to params = list(...).

Fit Details

Date and Date-Time Variable

A date or date-time variable is required as a predictor. The fit() interface accepts date and date-time features and handles them internally.

  • fit(y ~ date)

Seasonal Period Specification

The period can be non-seasonal (seasonal_period = 1) or seasonal (e.g. seasonal_period = 12 or seasonal_period = "12 months"). There are 3 ways to specify:

  1. seasonal_period = "auto": A period is selected based on the periodicity of the data (e.g. 12 if monthly)

  2. seasonal_period = 12: A numeric frequency. For example, 12 is common for monthly data

  3. seasonal_period = "1 year": A time-based phrase. For example, "1 year" would convert to 12 for monthly data.
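The conversion from a time-based phrase to a numeric frequency can be approximated in base R. This is a rough sketch of the idea only: period_from_phrase() and the unit_days table are hypothetical, and the conversion boostime actually performs may differ in detail.

```r
# HYPOTHETICAL lookup: approximate number of days in each time unit.
unit_days <- c(day = 1, week = 7, month = 365.25 / 12,
               quarter = 365.25 / 4, year = 365.25)

# HYPOTHETICAL helper: observations per phrase, given the data's spacing.
period_from_phrase <- function(dates, phrase) {
  parts <- strsplit(trimws(phrase), "\\s+")[[1]]
  n     <- as.numeric(parts[1])
  unit  <- sub("s$", "", tolower(parts[2]))  # "weeks" -> "week"
  stopifnot(unit %in% names(unit_days))

  # Typical spacing between observations, in days.
  step_days <- median(as.numeric(diff(dates)))
  as.integer(round(n * unit_days[[unit]] / step_days))
}

monthly <- seq(as.Date("2020-01-01"), by = "month", length.out = 36)
daily   <- seq(as.Date("2020-01-01"), by = "day",   length.out = 60)

period_from_phrase(monthly, "1 year")   # 12
period_from_phrase(daily,   "2 weeks")  # 14
```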

Univariate (No xregs, Exogenous Regressors):

For univariate analysis, you must include a date or date-time feature. Simply use:

  • Formula Interface (recommended): fit(y ~ date) will ignore xreg's.

Multivariate (xregs, Exogenous Regressors)

The xreg parameter is populated using the fit() or fit_xy() function:

  • Only factor, ordered factor, and numeric data will be used as xregs.

  • Date and Date-time variables are not used as xregs

  • character data should be converted to factor.

Xreg Example: Suppose you have 3 features:

  1. y (target)

  2. date (time stamp),

  3. month.lbl (labeled month as an ordered factor).

The month.lbl is an exogenous regressor that can be passed to boost_arima() using fit():

  • fit(y ~ date + month.lbl) will pass month.lbl on as an exogenous regressor.

  • fit_xy(data[,c("date", "month.lbl")], y = data$y) will pass x, where x is a data frame containing month.lbl and the date feature. Only month.lbl will be used as an exogenous regressor.

Note that date or date-time class values are excluded from xreg.
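The xreg rules above (keep numeric and factor columns, drop date and date-time columns, convert character to factor) can be sketched in base R. select_xregs() is a hypothetical helper written for this illustration, not a boostime function, and the promo column is invented to show the character case.

```r
# HYPOTHETICAL helper: keep only columns usable as exogenous regressors.
select_xregs <- function(x) {
  # Character columns are converted to factor first.
  is_chr <- vapply(x, is.character, logical(1))
  if (any(is_chr)) {
    x[is_chr] <- lapply(x[is_chr], factor)
  }
  # Date and date-time columns are neither numeric nor factor, so they drop out.
  keep <- vapply(x, function(col) is.numeric(col) || is.factor(col), logical(1))
  x[, keep, drop = FALSE]
}

predictors <- data.frame(
  date      = seq(as.Date("2020-01-01"), by = "month", length.out = 6),
  month.lbl = factor(month.abb[1:6], levels = month.abb, ordered = TRUE),
  promo     = c("yes", "no", "no", "yes", "no", "no"),  # illustrative column
  stringsAsFactors = FALSE
)

xregs <- select_xregs(predictors)
names(xregs)           # "month.lbl" "promo"
class(xregs$promo)     # "factor"
```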

See Also

fit.model_spec(), set_engine()

Examples

library(tidyverse)
library(lubridate)
library(parsnip)
library(rsample)
library(timetk)
library(boostime)


# Data
m750 <- m4_monthly %>% filter(id == "M750")

# Split Data 90/10
splits <- initial_time_split(m750, prop = 0.9)

# MODEL SPEC ----

# Set engine and boosting parameters
model_spec <- boost_arima(

    # ARIMA args
    seasonal_period = 12,
    non_seasonal_ar = 0,
    non_seasonal_differences = 1,
    non_seasonal_ma = 1,
    seasonal_ar     = 0,
    seasonal_differences = 1,
    seasonal_ma     = 1,

    # Catboost Args
    tree_depth = 6,
    learn_rate = 0.1
) %>%
    set_engine(engine = "arima_catboost")

# FIT ----
model_fit_boosted <- model_spec %>%
    fit(value ~ date + as.numeric(date) + month(date, label = TRUE),
        data = training(splits))

model_fit_boosted



AlbertoAlmuinha/boostime documentation built on Aug. 13, 2022, 1:46 p.m.