# The Family of Autoregressive Moving Average Models

```r
library(simts)
library(ggplot2)
library(robcor)
```

In this chapter we introduce a class of time series models that is considerably flexible and among the most commonly used to describe stationary time series. This class is represented by the Seasonal AutoRegressive Integrated Moving Average (SARIMA) models which, among others, combine and include the autoregressive and moving average models seen in the previous chapter. To introduce this class of models, we start by describing a sub-class called AutoRegressive Moving Average (ARMA) models which represents the backbone on which the SARIMA class is built. The importance of ARMA models resides in their flexibility as well as their capacity to describe (or closely approximate) almost all the features of a stationary time series. The autoregressive parts of these models describe how consecutive observations influence each other, while the moving average parts capture possible unobserved shocks, thereby allowing us to model phenomena observed in fields ranging from biology to finance.

With this premise, the first part of this chapter introduces and explains the class of ARMA models in the following manner. We first discuss the class of linear processes, to which ARMA models belong, and then proceed to a detailed description of autoregressive models in which we review their definition, explain their properties, introduce the main estimation methods for their parameters and highlight the diagnostic tools which can help us understand whether the estimated models appear appropriate or sufficient to describe the observed time series. Once this is done, we use most of the results given for the autoregressive models to further describe and discuss moving average models, for which we underline the property of invertibility, and finally the ARMA models. Indeed, the properties and estimation methods for the latter class are directly inherited from the discussions on the autoregressive and moving average models.

The second part of this chapter introduces the general class of SARIMA models, passing through the class of ARIMA models. These models allow us to apply the ARMA modeling framework also to time series with particular non-stationary components such as, for example, linear and/or seasonal trends. Extending ARMA modeling to these cases makes SARIMA models an extremely flexible class that can be used to describe a wide range of phenomena.

## Linear Processes

Before considering autoregressive models, we introduce the class of linear processes, i.e. processes that can be written as (possibly infinite) linear combinations of white noise. This class is important because the (causal) ARMA models discussed in this chapter can be represented in this form, and expressing a model as a linear process gives a direct way to derive its mean, its autocovariance function and properties such as stationarity and causality.

```{definition, label="lp", name="Linear Process"}
A time series, $(X_t)$, is defined to be a linear process if it can be expressed as a linear combination of white noise as follows:

\[{X_t} = \mu + \sum\limits_{j = -\infty}^{\infty} \psi_j W_{t-j}\]

where $W_t \sim WN(0, \sigma^2)$ and $\sum\limits_{j = -\infty}^{\infty} \left| \psi_j \right| < \infty$.
```

Note, the latter assumption is required to ensure that the series has a limit. 
Furthermore, the set of coefficients $\{\psi_j\}_{j = -\infty, \cdots, \infty}$
can be viewed as a linear filter. These coefficients do not have to be all equal
nor symmetric, as later examples will show. Generally, the properties of a linear
process related to its mean and autocovariance function are given by:

\[\begin{aligned}
\mu_{X} &= \mu \\
\gamma_{X}(h) &= \sigma _W^2\sum\limits_{j =  - \infty }^\infty  {{\psi _j}{\psi _{h + j}}}
\end{aligned}\]

The latter is derived from 

\[\begin{aligned}
  \gamma \left( h \right) &= Cov\left( {{x_t},{x_{t + h}}} \right) \\
   &= Cov\left( {\mu  + \sum\limits_{j =  - \infty }^\infty  {{\psi _j}{w_{t - j}}} ,\mu  + \sum\limits_{j =  - \infty }^\infty  {{\psi _j}{w_{t + h - j}}} } \right) \\
   &= Cov\left( {\sum\limits_{j =  - \infty }^\infty  {{\psi _j}{w_{t - j}}} ,\sum\limits_{j =  - \infty }^\infty  {{\psi _j}{w_{t + h - j}}} } \right) \\
   &= \sum\limits_{j =  - \infty }^\infty  {{\psi _j}{\psi _{j + h}}Cov\left( {{w_{t - j}},{w_{t - j}}} \right)}  \\
   &= \sigma _w^2\sum\limits_{j =  - \infty }^\infty  {{\psi _j}{\psi _{j + h}}}  \\ 
\end{aligned} \]

Within the above derivation, the key is to realize that the white noise terms are uncorrelated, i.e.
$Cov\left( {{w_{t - i}},{w_{t + h - j}}} \right) = 0$ unless $t - i = t + h - j$, so that only the terms with $j = i + h$ survive.
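As a quick numerical sanity check, the autocovariance formula above can be evaluated directly for a finite set of weights. The sketch below assumes the illustrative weights $\psi_0 = 1$, $\psi_1 = 0.4$ (i.e. an MA(1) with $\theta = 0.4$) and $\sigma_W^2 = 2$, and reproduces the known values $\gamma(0) = \sigma_W^2(1 + \theta^2)$ and $\gamma(1) = \sigma_W^2 \theta$.

```r
# Numerical check of gamma(h) = sigma_W^2 * sum_j psi_j psi_{j+h}
# (illustrative weights: psi = (1, 0.4), sigma_W^2 = 2)
sigma2.w = 2
psi = c(1, 0.4)
gamma0 = sigma2.w * sum(psi * psi)                    # sigma_W^2 * (1 + theta^2) = 2.32
gamma1 = sigma2.w * sum(psi[-length(psi)] * psi[-1])  # sigma_W^2 * theta = 0.8
c(gamma0, gamma1)
```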

Lastly, a convenient feature of a linear process is that it can be represented 
in a compact form using the **backshift operator**:

\[{X_t} = \psi \left( B \right){W_t},\]

where $\psi(B) = \sum\limits_{j = -\infty}^{\infty} \psi_j B^j$ (taking $\mu = 0$ for simplicity).

```{example, label="lpwn", name="Linear Process of White Noise"}

The white noise process $\left\{X_t\right\}$, defined in \@ref(wn),
can be expressed as a linear process under:

\[\psi _j = \begin{cases}
      1 , &\mbox{ if } j = 0\\
      0 , &\mbox{ if } |j| \ge 1
\end{cases}.\]

and $\mu = 0$.

Therefore, $X_t = W_t$, where $W_t \sim WN(0, \sigma^2_W)$.
```

```{example, label="lpma1", name="Linear Process of Moving Average Order 1"}

Similarly, consider $\left\{X_t\right\}$ to be an MA(1) process, as given in \@ref(ma1). The process can be expressed as a linear process under:

\[\psi _j = \begin{cases}
      1, &\mbox{ if } j = 0\\
      \theta, &\mbox{ if } j = 1\\
      0, &\mbox{ otherwise}
\end{cases}.\]

and $\mu = 0$.

Thus, we have: $X_t = W_t + \theta W_{t-1}$.
```

```{example, label="lpsma", name="Linear Process and Symmetric Moving Average"}

Consider a symmetric moving average given by:

\[{X_t} = \frac{1}{{2q + 1}}\sum\limits_{j =  - q}^q {{W_{t + j}}} \]

Thus, $\left\{X_t\right\}$ is defined for $q + 1 \le t \le n-q$. The above process
would be a linear process since:

\[\psi _j = \begin{cases}
      \frac{1}{{2q + 1}} , &\mbox{ if } -q \le j \le q\\
      0 , &\mbox{ if } |j| > q
\end{cases}.\]

and $\mu = 0$.

In practice, if $q = 1$, we would have:

\[{X_t} = \frac{1}{3}\left( {{W_{t - 1}} + {W_t} + {W_{t + 1}}} \right)\]
```

```{example, label="lpar1", name="Linear Process of Autoregressive Process of Order 1"} A more intensive application of a linear process is $\left{X_t\right}$ as an AR1 process, defined in \@ref(ar1). The intensity comes from the necessity to define the weights with respect to the time lag.

\[\psi _j = \begin{cases}
      \phi^j, &\mbox{ if } j \ge 0\\
      0, &\mbox{ if } j < 0
\end{cases}.\]

and $\mu = 0$.

Under the condition that $\left| \phi \right| < 1$, this linear process corresponds to the usual AR(1) representation $X_t = \phi X_{t-1} + W_t$. In other words, a (causal) AR(1) process can also be viewed as an MA($\infty$) process!
```
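To see this connection numerically, the function ARMAtoMA() from the stats package returns the weights $\psi_j$ of the MA($\infty$) representation; a minimal sketch with the illustrative value $\phi = 0.6$ is given below, and the output matches $\psi_j = \phi^j$.

```r
# MA(infinity) weights of an AR(1) with phi = 0.6 (illustrative value)
ARMAtoMA(ar = 0.6, lag.max = 8)
0.6^(1:8)   # psi_j = phi^j, as stated above
```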

## Linear Operators


## Autoregressive Models (AR(p))

The class of autoregressive models is based on the idea that previous values in the time series are needed to explain current values in the series. For this class of models, we assume that the $p$ previous observations are needed for this purpose and we therefore denote this class as AR($p$). In the previous chapter, the model we introduced was an AR(1), in which only the immediately previous observation is needed to explain the following one, and which is therefore a special case of the more general class of AR($p$) models.

```{definition, label="ar_p","Autoregressive Models of Order P"}
The AR(p) models can be formally represented as follows
$${X_t} = {\phi_1}{Y_{t - 1}} + ... + {\phi_p}{X_{t - p}} + {W_t},$$
where $\phi_p \neq 0$ and $W_t$ is a (Gaussian) white noise process with variance $\sigma^2$. 

In general, we will assume that the expectation of the process $({X_t})$, as well as that of the following ones in this chapter, is zero. The reason for this simplification is that if $\mathbb{E} [ X_t ] = \mu$, we can define an AR process around $\mu$ as follows:

$$X_t - \mu = \sum_{i = 1}^p \phi_i \left( X_{t-i} - \mu \right) + W_t,$$

which is equivalent to

$$X_t = \mu^{\star} + \sum_{i = 1}^p \phi_i X_{t-i} + W_t,$$

where $\mu^{\star} = \mu (1 - \sum_{i = 1}^p \phi_i)$. Therefore, to simplify the notation we will generally consider only zero mean processes, since adding means (as well as other deterministic trends) is easy.
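As a brief aside on software conventions, note that stats::arima() reports the estimated mean $\mu$ of the process (labelled "intercept") rather than $\mu^{\star}$. The following minimal sketch, with illustrative values, shows this behaviour.

```r
# A nonzero-mean AR(1): X_t - 10 = 0.5 (X_{t-1} - 10) + W_t (illustrative values)
set.seed(12)
xt = 10 + arima.sim(n = 500, list(ar = 0.5))

# The "intercept" reported below is the estimated mean mu (close to 10),
# not mu* = mu * (1 - phi)
arima(xt, order = c(1, 0, 0))
```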

A useful way of representing AR processes is through the backshift operator introduced in the previous section, as follows

\[\begin{aligned}
{X_t} &= {\phi_1}{X_{t - 1}} + ... + {\phi_p}{X_{t - p}} + {W_t} \\
&= {\phi_1}B{X_t} + ... + {\phi_p}B^p{X_t} + {W_t} \\
&= ({\phi_1}B + ... + {\phi_p}B^p){X_t} + {W_t},
\end{aligned}\]

which finally yields

$$(1 - {\phi _1}B - ... - {\phi_p}B^p){X_t} = {W_t},$$

which, in abbreviated form, can be expressed as

$$\phi(B){X_t} = W_t.$$

We will see that $\phi(B)$ is important to establish the stationarity of these processes and is called the autoregressive operator. Moreover, this quantity is closely related to another important property of AR processes called causality. Before formally defining this new property, we consider the following example, which provides an intuitive illustration of its importance.

Example: Consider a classical AR(1) model with $|\phi| > 1$. Such a model could be expressed as

$$X_t = \phi^{-1} X_{t+1} - \phi^{-1} W_{t+1} = \phi^{-k} X_{t+k} - \sum_{i = 1}^{k} \phi^{-i} W_{t+i}.$$

Since $|\phi| > 1$, letting $k \to \infty$ we obtain

$$X_t = - \sum_{j = 1}^{\infty} \phi^{-j} W_{t+j},$$

which is a linear process and therefore is stationary. Unfortunately, such a model is of no practical use because we would need future values of the noise to explain the present, and such processes are called non-causal.

### Properties of AR(p) Models

In this section we describe the main property of AR(p) models, which has already been hinted at in the previous paragraphs: causality. Let us now introduce this property in a more formal manner.

Definition: An AR(p) model is causal if the time series $\{ X_t \}_{t = -\infty}^{\infty}$ can be written as a one-sided linear process:
\begin{equation}
X_t = \sum_{j = 0}^{\infty} \psi_j W_{t-j} = \frac{1}{\phi(B)} W_t = \psi(B) W_t,
(\#eq:causal)
\end{equation}
where $\psi(B) = \sum_{j = 0}^{\infty} \psi_j B^j$, $\sum_{j=0}^{\infty}|\psi_j| < \infty$ and we set $\psi_0 = 1$.

As discussed earlier, this condition implies that only past values of the time series can explain its future values, and not vice versa. However, showing the causality of an AR(p) process by using the above definition directly can be difficult, so the following property is useful in practice.

Property: Causality. If an AR(p) model is causal, then the coefficients of the one-sided linear process given in (\@ref(eq:causal)) can be obtained by solving
\begin{equation}
\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{1}{\phi(z)}, \mbox{ } |z| \leq 1,
\end{equation}
where $\phi(z) = 1 - \phi_1 z - ... - \phi_p z^p$ is the autoregressive polynomial.

It can be seen that there is no solution to the above equation if $\phi(z) = 0$ for some $|z| \leq 1$, and therefore an AR(p) is causal if and only if $\phi(z) \neq 0$ for $|z| \leq 1$. Equivalently, the process is causal if and only if the roots of $\phi(z) = 0$ lie outside the unit circle.
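In practice, this condition is easy to verify numerically by computing the roots of the autoregressive polynomial. The short sketch below uses the illustrative AR(2) parameters $\phi_1 = 0.5$ and $\phi_2 = 0.25$ (the same values used in the simulations later in this section) and checks that all roots of $\phi(z) = 1 - \phi_1 z - \phi_2 z^2$ have modulus larger than one.

```r
# Causality check for an AR(2) with phi = (0.5, 0.25) (illustrative values)
phi = c(0.5, 0.25)
roots = polyroot(c(1, -phi))  # roots of phi(z) = 1 - 0.5 z - 0.25 z^2
Mod(roots)                    # moduli of the roots
all(Mod(roots) > 1)           # TRUE: all roots outside the unit circle, hence causal
```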

### Estimation of AR(p) Models

Given the above properties of AR(p) models, we now discuss how these models can be estimated, more specifically how the $p+1$ parameters can be obtained from an observed time series. Indeed, a reliable estimation of these models is necessary in order to interpret and describe different natural phenomena and/or forecast possible future values of the time series.

A first approach builds upon the earlier definition of the AR(p) as a linear process. Recall that
\begin{equation}
X_t = \sum_{j = 1}^{p} \phi_j X_{t-j} + W_t,
\end{equation}
which delivers the following autocovariance function
\begin{equation}
\gamma(h) = \text{cov}(X_{t+h}, X_t) = \text{cov}\left(\sum_{j = 1}^{p} \phi_j X_{t+h-j} + W_{t+h}, X_t\right) = \sum_{j = 1}^{p} \phi_j \gamma(h-j), \mbox{ } h \geq 1.
\end{equation}
Rearranging the above expression we obtain the following general equations
\begin{equation}
\gamma(h) - \sum_{j = 1}^{p} \phi_j \gamma(h-j) = 0, \mbox{ } h \geq 1,
\end{equation}
and, recalling that $\gamma(h) = \gamma(-h)$,
\begin{equation}
\gamma(0) - \sum_{j = 1}^{p} \phi_j \gamma(j) = \sigma_w^2.
\end{equation}
We can now define the Yule-Walker equations.

Definition: The Yule-Walker equations are given by
\begin{equation}
\gamma(h) = \phi_1 \gamma(h-1) + ... + \phi_p \gamma(h-p), \mbox{ } h = 1,...,p,
\end{equation}
and
\begin{equation}
\sigma_w^2 = \gamma(0) - \phi_1 \gamma(1) - ... - \phi_p \gamma(p),
\end{equation}
which in matrix notation can be expressed as
\begin{equation}
\Gamma_p \mathbf{\phi} = \mathbf{\gamma}_p \;\;\; \text{and} \;\;\; \sigma_w^2 = \gamma(0) - \mathbf{\phi}'\mathbf{\gamma}_p,
\end{equation}
where $\Gamma_p$ is the $p\times p$ matrix containing the autocovariances $\gamma(k-j), \, j,k = 1, ...,p$, while $\mathbf{\phi} = (\phi_1,...,\phi_p)'$ and $\mathbf{\gamma}_p = (\gamma(1),...,\gamma(p))'$ are $p\times 1$ vectors.

Considering the Yule-Walker equations, it is possible to use a method of moments approach and simply replace the theoretical quantities given in the previous definition with their empirical (estimated) counterparts that we saw in the previous chapter. This gives us the following Yule-Walker estimators
\begin{equation}
\hat{\mathbf{\phi}} = \hat{\Gamma}_p^{-1}\hat{\mathbf{\gamma}}_p \;\;\; \text{and} \;\;\; \hat{\sigma}_w^2 = \hat{\gamma}(0) - \hat{\mathbf{\gamma}}_p'\hat{\Gamma}_p^{-1}\hat{\mathbf{\gamma}}_p.
\end{equation}
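As a concrete illustration, the sketch below computes the Yule-Walker estimates of an AR(2) "by hand" from the sample autocovariances and compares them with the built-in Yule-Walker solver stats::ar(). The simulated series and the parameter values ($\phi_1 = 0.5$, $\phi_2 = 0.25$, $\sigma_w^2 = 1$) are purely illustrative; note that both acf() and ar() remove the sample mean by default, so the two sets of estimates should essentially coincide.

```r
# Yule-Walker estimation of an AR(2) (illustrative simulated data)
set.seed(1830)
Xt = gen_gts(500, AR(phi = c(0.5, 0.25), sigma2 = 1))

# Method of moments "by hand": plug sample autocovariances into the YW equations
g = acf(Xt, lag.max = 2, type = "covariance", plot = FALSE)$acf
Gamma.p = matrix(c(g[1], g[2], g[2], g[1]), 2, 2)   # estimate of Gamma_p
gamma.p = g[2:3]                                    # estimate of gamma_p
phi.hat = solve(Gamma.p, gamma.p)                   # Yule-Walker estimate of phi
sig2.hat = g[1] - crossprod(phi.hat, gamma.p)       # estimate of sigma_w^2
c(phi.hat, sig2.hat)

# The same estimates via the built-in Yule-Walker solver
ar(as.numeric(Xt), order.max = 2, aic = FALSE, method = "yule-walker")
```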

These estimators have the following asymptotic properties.

Property: Consistency and Asymptotic Normality of Yule-Walker Estimators. The Yule-Walker estimators for a causal AR(p) model have the following asymptotic properties:

\begin{equation}
\sqrt{T}(\hat{\mathbf{\phi}}- \mathbf{\phi}) \xrightarrow{\mathcal{D}} \mathcal{N}(\mathbf{0},\sigma_w^2\Gamma_p^{-1}) \;\;\; \text{and} \;\;\; \hat{\sigma}_w^2 \xrightarrow{\mathcal{P}} \sigma_w^2.
\end{equation}

Therefore the Yule-Walker estimators have an asymptotically normal distribution and the estimator of the innovation variance is consistent. Moreover, these estimators are also optimal for AR(p) models, meaning that they are efficient. However, there exists another method which allows us to achieve this efficiency also for general ARMA models, namely the maximum likelihood method. Considering an AR(1) model as an example, and assuming without loss of generality that the expectation is zero, we have the following representation of the AR(1) model
\begin{equation}
X_t = \phi X_{t-1} + w_t,
\end{equation}
where $|\phi|<1$ and $w_t \overset{iid}{\sim} \mathcal{N}(0,\sigma_w^2)$. Supposing we have observations $(x_t)_{t=1,...,T}$ issued from this model, the likelihood function for this setting is given by
\begin{equation}
L(\phi,\sigma_w^2) = f(x_1,...,x_T|\phi,\sigma_w^2),
\end{equation}
which, for an AR(1) model, can be rewritten as follows
\begin{equation}
L(\phi,\sigma_w^2) = f(x_1)f(x_2|x_1)\cdot \cdot \cdot f(x_T|x_{T-1}).
\end{equation}
If we define $\Omega_t^p$ as the information contained in the $p$ observations previous to time $t$, the above can be generalized for an AR(p) model as follows
\begin{equation}
L(\phi,\sigma_w^2) = f(x_1,...,x_p)f(x_{p+1}|\Omega_{p+1}^p)\cdot \cdot \cdot f(x_T|\Omega_{T}^p),
\end{equation}
where $f(x_1,...,x_p)$ is the joint probability distribution of the first $p$ observations. A discussion on how to find $f(x_1,...,x_p)$ will be presented in the following paragraphs based on the approach used to find $f(x_1)$ in the AR(1) likelihood. Going back to the latter, we know that $x_t|x_{t-1} \sim \mathcal{N}(\phi x_{t-1},\sigma_w^2)$ and therefore we have that
\begin{equation}
f(x_t|x_{t-1}) = f_w(x_t - \phi x_{t-1}),
\end{equation}
where $f_w(\cdot)$ is the distribution of $w_t$. This rearranges the likelihood function as follows
\begin{equation}
L(\phi,\sigma_w^2) = f(x_1)\prod_{t=2}^T f_w(x_t - \phi x_{t-1}),
\end{equation}
where $f(x_1)$ can be found through the causal representation
\begin{equation}
x_1 = \sum_{j=0}^{\infty} \phi^j w_{1-j},
\end{equation}
which implies that $x_1$ follows a normal distribution with zero expectation and variance $\frac{\sigma_w^2}{(1-\phi^2)}$. Based on this, the likelihood function of an AR(1) finally becomes
\begin{equation}
L(\phi,\sigma_w^2) = (2\pi \sigma_w^2)^{-\frac{T}{2}} (1 - \phi^2)^{\frac{1}{2}} \exp \left(-\frac{S(\phi)}{2 \sigma_w^2}\right),
\end{equation}
with $S(\phi) = (1-\phi^2) x_1^2 + \sum_{t=2}^T (x_t -\phi x_{t-1})^2$. Taking the logarithm of this likelihood, the minimization of its negative is usually done numerically. However, if we condition on the initial values, the AR(p) models become linear and, for example, we can then define the conditional likelihood of an AR(1) as
\begin{equation}
L(\phi,\sigma_w^2|x_1) = (2\pi \sigma_w^2)^{-\frac{T-1}{2}} \exp \left(-\frac{S_c(\phi)}{2 \sigma_w^2}\right),
\end{equation}
where
\begin{equation}
S_c(\phi) = \sum_{t=2}^T (x_t -\phi x_{t-1})^2 .
\end{equation}
The latter is called the conditional sum of squares and $\phi$ can be estimated as a straightforward linear regression problem. Once an estimate $\hat{\phi}$ is obtained, it can be used to obtain the conditional maximum likelihood estimate of $\sigma_w^2$
\begin{equation}
\hat{\sigma}_w^2 = \frac{S_c(\hat{\phi})}{T-1} .
\end{equation}
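Both variants of the likelihood are available in the stats::arima() function through its method argument ("ML" for the exact likelihood and "CSS" for the conditional sum of squares). The sketch below, using an illustrative simulated AR(1), simply compares the two; with a moderately long series the estimates should be very close.

```r
# Exact ML versus conditional sum of squares (CSS) for an AR(1) (illustrative data)
set.seed(2241)
xt = gen_gts(200, AR1(phi = 0.8, sigma2 = 1))

arima(xt, order = c(1, 0, 0), include.mean = FALSE, method = "ML")   # exact likelihood
arima(xt, order = c(1, 0, 0), include.mean = FALSE, method = "CSS")  # conditional likelihood
```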

The estimation methods presented so far are the standard ones for this kind of model. Nevertheless, if the data suffer from some form of contamination, these methods can become highly biased. For this reason, some robust estimators are available to limit this problem if there are indeed outliers in the observed time series. A first solution is given by the estimator proposed in Kunsch (1984), who underlines that the MLE score function of an AR(p) is given by
\begin{equation}
\kappa(\mathbf{\theta}|x_j,...,x_{j+p}) = \frac{\partial}{\partial \mathbf{\theta}} \left(x_{j+p} - \sum_{k=1}^p \phi_k x_{j+p-k}\right)^2,
\end{equation}
where $\theta$ is the parameter vector containing, in the case of an AR(1) model, the two parameters $\phi$ and $\sigma_w^2$ (i.e. $\theta = [\phi \,\, \sigma_w^2]$). This delivers the estimating equation
\begin{equation}
\sum_{j=1}^{n-p} \kappa (\hat{\mathbf{\theta}}|x_j,...,x_{j+p}) = 0 .
\end{equation}
The score function $\kappa(\cdot)$ is clearly not bounded, in the sense that if we arbitrarily move a value of $(x_t)$ towards infinity then the score function also goes to infinity, thereby delivering a biased estimation procedure. To avoid outlying observations biasing the estimation excessively, a bounded score function can be used to deliver an M-estimator given by
\begin{equation}
\sum_{j=1}^{n-p} \psi (\hat{\mathbf{\theta}}|x_j,...,x_{j+p}) = 0,
\end{equation}
where $\psi(\cdot)$ is a bounded function. When conditioning on the first $p$ observations, this problem can be brought back to a linear regression problem which can be solved in a robust manner using the robust regression tools available in R such as rlm(...) or lmrob(...). Another available tool in R, which can be applied directly without conditioning and also for general ARMA models, is the gmwm(...) function with the option robust = TRUE. This function makes use of a quantity called the wavelet variance (denoted $\nu$) which is estimated robustly and then used to retrieve the parameters $\theta$ of the time series model. The robust estimate is obtained by solving the following minimization problem
\begin{equation}
\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \boldsymbol{\Theta}}{\text{argmin}} \, (\hat{\boldsymbol{\nu}} - \boldsymbol{\nu}(\boldsymbol{\theta}))^T\boldsymbol{\Omega}(\hat{\boldsymbol{\nu}} - \boldsymbol{\nu}({\boldsymbol{\theta}})),
\end{equation}
where $\hat{\boldsymbol{\nu}}$ is the robustly estimated wavelet variance, $\boldsymbol{\nu}(\boldsymbol{\theta})$ is the theoretical wavelet variance and $\boldsymbol{\Omega}$ is a positive definite weighting matrix. Below we show some simulation studies where we present the results of the above estimation procedures in the absence and in the presence of contamination in the data.

In particular, we simulate three different processes $X_t$, $Y_t$ and $Z_t$ from
\[X_t = 0.5 X_{t-1} + 0.25 X_{t-2} + W_t,\]
with $W_t \sim WN(0, 1)$. Two of the processes are then contaminated, one process-wise and one point-wise. For the process-wise contamination, $Y_t$ has a portion of the original process $X_t$ replaced with concurrent samples from
\[U_t = 0.90 U_{t-1} - 0.40 U_{t-2} + V_t,\]
where $V_t \sim WN(0, 9)$, whereas for the point-wise contamination, $Z_t$ has points selected at random from $X_t$ replaced with $N_t \sim WN(0, 9)$.

``` {r simuAR2data, cache = TRUE}
n = 1000                    # Sample size
gamma = 0.05                # Level of contamination
dispersal = round(gamma*n)  # Amount of samples contaminated

set.seed(19)                # Set seed for reproducibility

# Generate data
Xt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))
Yt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))
Zt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))

# Contaminate a portion of Y_t with a process
Yt[101:135,] = gen_gts(35, AR(phi = c(0.9, -0.4), sigma2 = 9))

# Contaminate Z_t at random with noise
Zt[sample(n, dispersal, replace = FALSE),] = gen_gts(dispersal, WN(sigma2 = 9))
```

```r
# Graph points
par(mfrow = c(3,1))
plot(Xt, main = expression(paste("Simulated process ",X[t])))
plot(Yt, main = expression(paste("Simulated process ",Y[t])))
plot(Zt, main = expression(paste("Simulated process ",Z[t])))
```

As every process requires the same estimation methodology, we first define a computational engine that computes the parameter estimates across all of the different methods in the simulation study. Doing so provides separate, reusable code that reduces the chance of error compared to having multiple copies of the same code scattered about. In addition, we define a means to visualize the results obtained from the simulation study through boxplots. The procedure for creating the graphs is detailed after the simulation engine.

```r
sim_engine = function(Xt){
  # Estimate ARIMA parameters using MLE
  mod = arima(Xt, order = c(2, 0, 0), include.mean  = FALSE)

  # Extract MLE Parameters (including sigma)
  res.MLE = c(mod$coef, mod$sigma2)

  # Calculate ACF
  autocorr = as.numeric(acf(Xt, lag.max = 2, plot = FALSE)$acf)
  X = matrix(1,2,2)
  X[1,2] = X[2,1] = autocorr[2]
  y = autocorr[2:3]

  # Compute Least Squares on ACF
  svmat = solve(X)
  phi.LS = svmat%*%y
  sig2.LS = var(Xt)*(1 - t(y)%*%svmat%*%y)
  res.LS = c(phi.LS, sig2.LS)

  # Calculate RACF
  rob.ccf = as.numeric(robcor::robacf(Xt, plot=FALSE, type = "covariance")$acf)
  X[1,2] = X[2,1] = rob.ccf[2]/rob.ccf[1]
  y = rob.ccf[2:3]/rob.ccf[1]

  # Compute Least Squares on RACF
  svmat = solve(X)
  phi.RR = svmat%*%y
  sig2.RR = rob.ccf[1]*(1 - t(y)%*%svmat%*%y)
  res.RR = c(phi.RR, sig2.RR)

  # Compute the GMWM Estimator
  res.GMWM = gmwm2::gmwm(ARMA(2,0), Xt)$estimate
  res.RGMWM = gmwm2::gmwm(ARMA(2,0), Xt, robust = TRUE)$estimate

  # Return results
  list(res.MLE = res.MLE, res.LS = res.LS, res.RR = res.RR,
       res.GMWM = res.GMWM, res.RGMWM = res.RGMWM)
} 
sim_study_graph = function(res.MLE, res.LS, res.RR, res.GMWM, res.RGMWM,
                           Truth = c(0.5, 0.25, 1)){
  # Converts Simulation Matrices to Long Form for use in ggplot2
  d = bmisc::study_df(res.MLE, res.LS, res.RR, res.GMWM, res.RGMWM,
                         data_names = c("MLE","LS","RLS","GMWM","RGMWM"))

  # Reorder factors
  d$Draw = factor(d$Draw, levels = c(1,2,3),
                  labels = expression(phi[1], phi[2], sigma^2))
  d$Type = factor(d$Type, levels = c("MLE","LS","RLS","GMWM","RGMWM"))

  # Add in horizontal lines
  oracle = data.frame(Draw = 1:3, Truth = Truth)
  oracle$Draw = factor(oracle$Draw, levels = c(1,2,3), 
                       labels = expression(phi[1], phi[2], sigma^2))

  # Plot
  ggplot(d, aes(x = Type, y = Values, fill = Type)) + 
    geom_boxplot() + 
    stat_boxplot(geom ='errorbar', width = 0.5) + 
    geom_hline(data = oracle, aes(yintercept = Truth), color = "red") +
    facet_wrap(~Draw, labeller = label_parsed, scales = "free_y") +
    theme_bw() + 
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 50, hjust = 1))
}
```

With the means to compute and visualize parameter estimates for each of the given methods, the bootstrap simulation study may now commence. Using the simulation engine, we can evaluate all three processes within one loop iteration. Please take into consideration that the following simulation study is computationally intensive and may require the number of iterations $B$ to be decreased on a personal computer to complete in a reasonable time.

``` {r simuAR2study, cache = TRUE, fig.height = 4.5, fig.width = 9}
# Number of bootstrap iterations
B = 250

# Simulation storage
res.xt.MLE = res.xt.LS = res.xt.RR = res.xt.GMWM = res.xt.RGMWM = matrix(NA, B, 3)
res.yt.MLE = res.yt.LS = res.yt.RR = res.yt.GMWM = res.yt.RGMWM = matrix(NA, B, 3)
res.zt.MLE = res.zt.LS = res.zt.RR = res.zt.GMWM = res.zt.RGMWM = matrix(NA, B, 3)

# Begin bootstrap
for (i in seq_len(B)){

  # Set seed for reproducibility
  set.seed(i)

  # Generate processes
  Xt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))
  Yt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))
  Zt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))

  # Generate Ut contamination process that replaces a portion of the original signal
  Yt[101:135] = gen_gts(35, AR(phi = c(0.9, -0.4), sigma2 = 9))

  # Generate Nt contamination that injects noise at random
  Zt[sample(n, dispersal, replace = FALSE)] = gen_gts(dispersal, WN(sigma2 = 9))

  # Compute estimates and store them in the appropriate matrices
  res = sim_engine(Xt)
  res.xt.MLE[i,] = res$res.MLE; res.xt.LS[i,] = res$res.LS; res.xt.RR[i,] = res$res.RR
  res.xt.GMWM[i,] = res$res.GMWM; res.xt.RGMWM[i,] = res$res.RGMWM

  res = sim_engine(Yt)
  res.yt.MLE[i,] = res$res.MLE; res.yt.LS[i,] = res$res.LS; res.yt.RR[i,] = res$res.RR
  res.yt.GMWM[i,] = res$res.GMWM; res.yt.RGMWM[i,] = res$res.RGMWM

  res = sim_engine(Zt)
  res.zt.MLE[i,] = res$res.MLE; res.zt.LS[i,] = res$res.LS; res.zt.RR[i,] = res$res.RR
  res.zt.GMWM[i,] = res$res.GMWM; res.zt.RGMWM[i,] = res$res.RGMWM
}
```

```r
sim_study_graph(res.xt.MLE, res.xt.LS, res.xt.RR, res.xt.GMWM, res.xt.RGMWM) 
sim_study_graph(res.yt.MLE, res.yt.LS, res.yt.RR, res.yt.GMWM, res.yt.RGMWM)
sim_study_graph(res.zt.MLE, res.zt.LS, res.zt.RR, res.zt.GMWM, res.zt.RGMWM)
```

After performing the estimation, we visually assess how well each methodology performed. We begin by observing the baseline case of the uncontaminated sample $X_t$ given in Figure \@ref(fig:simuAR2XtResults). Within this context, the estimation methods appear to perform equally well across all parameters. Note, it may appear as if $\phi_2$ has a noticeable departure from the truth; however, the difference is minor considering the y-axis of the graph and the parameter recovery elsewhere. Turning our attention to Figure \@ref(fig:simuAR2YtResults), which displays the results for $Y_t$, the process with partial contamination from another process, the robust methodologies (RLS, RGMWM) were able to handle the contamination effectively, whereas the classical approaches (LS, MLE, GMWM) missed the mark greatly on several estimates. From the simulations, there is evidence to suggest that additional study should be performed to compare the performance of RLS and RGMWM. Lastly, we consider the results of estimating $Z_t$, the case of random noise injected into the process, given in Figure \@ref(fig:simAR2ZtResults). The noise has a troubling impact on the estimation of $\phi_1$ and $\sigma^2$. However, the estimation procedures appear to reasonably capture $\phi_2$. As shown previously, the robust techniques provide better estimates than their classical counterparts.

For all the above methods, it is necessary to understand how "precise" their estimates are. To do so we need to obtain confidence intervals for these estimates, which can be done mainly in two ways:

The first approach consists in using the asymptotic distribution of the estimators presented earlier to deliver approximate confidence intervals, which become more accurate as the length of the observed time series increases. Hence, for example, if we wanted to find a 95% confidence interval for the parameter $\phi$, we would use the quantiles of the normal distribution (given that all methods presented earlier have this asymptotic distribution). However, this approach can present some drawbacks, one of which is its behaviour when the parameters are close to the boundaries of the parameter space. Suppose we consider a realization of length $n = 100$ of the following AR(1) model:

\[X_t = 0.96 X_{t-1} + W_t, \;\;\;\; W_t \sim \mathcal{N}(0,1),\]

which is depicted in Figure \@ref(fig:simAR1ci).

``` {r simAR1ci, cache = TRUE, fig.height = 4.5, fig.width = 9, fig.cap = "AR(1) with $\phi$ close to parameter bound"}
set.seed(7)
x = gen_gts(n = 100, AR1(0.96, 1))
plot(x)
```

It can be seen that the parameter $\phi = 0.96$ respects the condition for 
stationarity, $\left| \phi \right| < 1$, but is very close to its boundary.
Using the MLE and the GMWM estimators, we first estimate the parameters and
then compute confidence intervals for $\phi$ using the asymptotic 
normal distribution.

```r
# Compute the parameter estimates using MLE
fit.ML = arima(x, order = c(1,0,0), include.mean = FALSE)
c("phi" = fit.ML$coef, "se" = sqrt(fit.ML$var.coef))

# Construct asymptotic confidence interval for phi
fit.ML$coef + c(-1,1)*1.96*as.numeric(sqrt(fit.ML$var.coef))

# Compute the parameter estimates with inference using GMWM
fit.gmwm = gmwm2::gmwm(AR1(), x)
summary(fit.gmwm, inference = TRUE)$estimate
```

From the estimation summary, we note that both confidence intervals contain values that would make the AR(1) non-stationary (i.e. values of $\phi$ larger than 1). This can be understood by looking at the estimated (asymptotic) distributions of $\hat{\phi}$ displayed in Figure \@ref(fig:asymIC).

``` {r asymIC, fig.cap = "Estimated asymptotic distributions of $\hat{\phi}$ for MLE and GMWM parameter estimates. The dashed vertical line represents the true value of $\phi$, the solid line denotes the upper bound of the parameter space for $\phi$ and the vertical ticks represent the limits of the 95% confidence intervals for both methods."}
# Define colors
gg_color_hue = function(n, alpha = 1) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100, alpha = alpha)[1:n]
}

colors = gg_color_hue(6, alpha = 0.2)
colors2 = gg_color_hue(6, alpha = 1)
phi.sd.gmwm = summary(fit.gmwm)$estimate[,"SE"][1]

par(mfrow = c(1,2))
xx = seq(from = 0, to = 1.2, length.out = 10^3)

# MLE panel
yy.ML = dnorm(xx, fit.ML$coef, sqrt(fit.ML$var.coef))
plot(NA, xlim = c(0.8,1.04), ylim = range(yy.ML), xlab = expression(phi), ylab = "Density", main = "MLE")
grid()
abline(v = 1, lwd = 2, col = "grey60")
abline(v = 0.96, lwd = 2, lty = 2, col = "grey60")
polygon(c(xx,rev(xx)), c(rep(0,length(yy.ML)), rev(yy.ML)), border = NA, col = colors[1])
points(qnorm(0.975, fit.ML$coef, sqrt(fit.ML$var.coef)), 0, cex = 3, col = colors2[1], pch = "|")
points(qnorm(0.025, fit.ML$coef, sqrt(fit.ML$var.coef)), 0, cex = 3, col = colors2[1], pch = "|")

# GMWM panel
yy.gmwm = dnorm(xx, fit.gmwm$estimate[1], phi.sd.gmwm)
plot(NA, xlim = c(0.7,1.2), ylim = range(yy.gmwm), xlab = expression(phi), ylab = "Density", main = "GMWM")
grid()
abline(v = 1, lwd = 2, col = "grey60")
abline(v = 0.96, lwd = 2, lty = 2, col = "grey60")
polygon(c(xx,rev(xx)), c(rep(0,length(yy.gmwm)), rev(yy.gmwm)), border = NA, col = colors[2])
points(qnorm(0.975, fit.gmwm$estimate[1], phi.sd.gmwm), 0, cex = 3, col = colors2[2], pch = "|")
points(qnorm(0.025, fit.gmwm$estimate[1], phi.sd.gmwm), 0, cex = 3, col = colors2[2], pch = "|")
```

In this case, the asymptotic confidence intervals are not realistic. One viable solution is to employ a parametric bootstrap, detailed in Theorem \@ref(thm:parabootstrap). Indeed, the parametric bootstrap takes the estimated parameter values and uses them to simulate from an AR(1) process. For each simulation, the parameters are estimated again and stored. Finally, the empirical quantiles at $\alpha/2$ and $1-\alpha/2$ of the saved parameter estimates are taken to obtain a confidence interval which does not suffer from boundary problems. The code below gives an example of how this confidence interval is built based on the same estimation procedures but using the parametric bootstrap (with $B = 10000$ bootstrap replicates).


``` {r paraAR1ciest, cache = TRUE, warning = FALSE, fig.height= 6, fig.width= 9}
# Number of Iterations
B = 10000

# Set up storage for results
est.phi.gmwm = rep(NA,B)
est.phi.ML = rep(NA,B)

# Model generation statements
model.gmwm = AR1(fit.gmwm$estimate[1], fit.gmwm$estimate[2])
model.mle = AR1(fit.ML$coef, fit.ML$sigma2)

# Begin bootstrap
for(i in seq_len(B)){

  # Set seed for reproducibility
  set.seed(B + i)

  # Generate process under MLE parameter estimate
  x.star = gen_gts(100, model.mle)

  # Attempt to estimate phi by employing a try
  est.phi.ML[i] = tryCatch(arima(x.star, order = c(1,0,0), include.mean = FALSE)$coef, 
                           error = function(e) NA)

  # Generate process under GMWM parameter estimate
  x.star = gen_gts(100, model.gmwm)
  est.phi.gmwm[i] = gmwm2::gmwm(model.gmwm, x.star)$estimate[1]
}
```

``` {r paraAR1cigraphs, cache = TRUE, warning = FALSE, echo = FALSE, fig.height = 6, fig.width = 9, fig.cap = "Estimated parametric distributions of $\hat{\phi}$ for MLE and GMWM parameter estimates. The bar plots represent the results from the parametric bootstrap, the densities represent the estimated asymptotic distributions, the vertical line represents the true value of $\phi$, the solid lines denote the upper bound of the parameter space for $\phi$ and the vertical ticks represent the limits of the 95% confidence intervals under both approaches for the estimation methods."}
par(mfrow = c(1,2))

# MLE panel
hist.phi.ML = hist(est.phi.ML, plot = FALSE, breaks = 20)
hist.phi.ML$counts = hist.phi.ML$counts/sum(hist.phi.ML$counts)/diff(hist.phi.ML$mids)[1]
plot(NA, xlim = c(0.65,1.04), ylim = range(yy.ML), xlab = expression(phi), ylab = "Density", main = "MLE")
grid()
abline(v = 1, lwd = 2, col = "grey60")
abline(v = 0.96, lwd = 2, lty = 2, col = "grey60")
plot(hist.phi.ML, xlim = c(0.6,1.1), ylim = range(yy.ML), add = T, col = colors[3], border = "grey60")
polygon(c(xx,rev(xx)), c(rep(0,length(yy.ML)), rev(yy.ML)), border = NA, col = colors[1])

points(qnorm(0.975, fit.ML$coef, sqrt(fit.ML$var.coef)), 0, cex = 3, col = colors2[1], pch = "|")
points(qnorm(0.025, fit.ML$coef, sqrt(fit.ML$var.coef)), 0, cex = 3, col = colors2[1], pch = "|")

points(quantile(est.phi.ML, 0.975, na.rm = T), 0, cex = 3, col = colors2[3], pch = "|")
points(quantile(est.phi.ML, 0.025, na.rm = T), 0, cex = 3, col = colors2[3], pch = "|")

# GMWM panel
hist.phi.GMWM = hist(na.omit(est.phi.gmwm), plot = FALSE, breaks = 20)
hist.phi.GMWM$counts = hist.phi.GMWM$counts/sum(hist.phi.GMWM$counts)/diff(hist.phi.GMWM$mids)[1]
plot(NA, xlim = c(0.53,1.2), ylim = range(hist.phi.GMWM$counts), xlab = expression(phi), ylab = "Density", main = "GMWM")
grid()
abline(v = 1, lwd = 2, col = "grey60")
abline(v = 0.96, lwd = 2, lty = 2, col = "grey60")
plot(hist.phi.GMWM, xlim = c(0.53,1.2), ylim = range(hist.phi.GMWM$counts), add = T, col = colors[5], border = "grey60")
polygon(c(xx,rev(xx)), c(rep(0,length(yy.gmwm)), rev(yy.gmwm)), border = NA, col = colors[2])

points(qnorm(0.975, fit.gmwm$estimate[1], phi.sd.gmwm), 0, cex = 3, col = colors2[2], pch = "|")
points(qnorm(0.025, fit.gmwm$estimate[1], phi.sd.gmwm), 0, cex = 3, col = colors2[2], pch = "|")

points(quantile(est.phi.gmwm, 0.975, na.rm = T), 0, cex = 3, col = colors2[5], pch = "|")
points(quantile(est.phi.gmwm, 0.025, na.rm = T), 0, cex = 3, col = colors2[5], pch = "|")
```

In Figure \@ref(fig:paraAR1cigraphs), we compare the estimated densities
for $\hat{\phi}$ using the asymptotic results and the bootstrap technique for the MLE 
and the GMWM estimator. It can be observed that the issue of unrealistic confidence 
intervals that arose previously has been resolved, as the interval regions
now lie entirely within the boundaries of the parameter space.

To emphasize the effectiveness of the parametric bootstrap, let us
consider one additional example based on an AR(2) process. For this example, the discussion
will focus on the behavior exhibited solely by the MLE. Having said
this, the AR(2) process is defined as:

\begin{equation}
{X_t} = {1.98}{X_{t - 1}} - {0.99}{X_{t - 2}} + {W_t}
(\#eq:ciar2)
\end{equation}

where $W_t \sim WN(0, 1)$. The generated process is displayed in Figure \@ref(fig:CIAR2data). 

```r
set.seed(432)
Xt = gen_gts(500, AR(phi = c(1.98, -0.99), sigma2 = 1))
plot(Xt)
mod = arima(Xt, c(2,0,0), include.mean = FALSE)
mod
```

With the model's coefficients readily available, the parametric bootstrap can be constructed. As before, the bootstrap implementation mirrors the prior versions with one caveat: the storage for $\phi$ is now two dimensional and thus requires a matrix to store the results instead of a traditional atomic vector in R.

```r
B = 10000
est.phi.ML = matrix(NA, B, 2)
model = AR(phi = mod$coef, sigma2 = mod$sigma2)

for(i in seq_len(B)) {
  set.seed(B + i)              # Set seed for reproducibility

  x.star = gen_gts(500, model) # Simulate the process 

  # Obtain parameter estimate with protection
  # that defaults to NA if unable to estimate
  est.phi.ML[i,] = tryCatch(arima(x.star, order = c(2,0,0),
                                  include.mean = FALSE)$coef,
                            error = function(e) NA)
}
```

``` {r, fig.cap = "Estimated distribution of $\hat{\phi_1}$ and $\hat{\phi_2}$ of the MLE using asymptotic and parametric bootstrap techniques. The colored contours represent the density of the distribution and the dark grey lines represent the boundary constraints of $\left|\phi_2\right|<1$ and $\phi_2 = 1 - \phi_1$."}
# Graphing requires MASS and RColorBrewer
library("MASS")
library("RColorBrewer")

est.phi.ML = na.omit(est.phi.ML)
z = kde2d(est.phi.ML[,1], est.phi.ML[,2], n = 50)
k = 11
my.cols = rev(brewer.pal(k, "RdYlBu"))

bivn = mvrnorm(100000, mu = mod$coef, Sigma = matrix(c(mod$var.coef), 2))
bivn.kde = kde2d(bivn[,1], bivn[,2], n = 50)

par(mfrow = c(1,2))
plot(NA, xlim = c(1.96,2), ylim = c(-1.02,-0.97), xlab = expression(phi[1]), ylab = expression(phi[2]), cex.lab = 1.5, main = "Asymptotic")
grid()

# Adding boundary of constraint |phi_2| < 1
abline(h = c(-1,1), lty = 2, col = "darkgrey")

# Adding boundary of constraint phi_2 = 1 - phi_1
phi1 = seq(from = -2, to = 2, length.out = 10^3)
phi2.c1 = 1 - phi1
lines(phi1, phi2.c1, lty = 2, col = "darkgrey")

# Adding boundary of constraint phi_2 = 1 + phi_1
phi2.c2 = 1 + phi1
lines(phi1, phi2.c2, lty = 2, col = "darkgrey")
contour(bivn.kde, drawlabels = FALSE, nlevels = k, col = my.cols, add = TRUE)

plot(NA, xlim = c(1.96,2), ylim = c(-1.02,-0.97), xlab = expression(phi[1]), ylab = expression(phi[2]), cex.lab = 1.5, main = "Bootstrap")
grid()

# Adding boundary of constraint |phi_2| < 1
abline(h = c(-1,1), lty = 2, col = "darkgrey")

# Adding boundary of constraint phi_2 = 1 - phi_1
lines(phi1, phi2.c1, lty = 2, col = "darkgrey")

# Adding boundary of constraint phi_2 = 1 + phi_1
lines(phi1, phi2.c2, lty = 2, col = "darkgrey")
contour(z, drawlabels = FALSE, nlevels = k, col = my.cols, add = TRUE)
```

The cost of this approach is the assumption that the estimated model captures the dependency structure present within the time series, that is, that we are able to regenerate new time series which follow the appropriate distribution at each sampling phase. However, if we are not confident that the model we have selected is a valid approximation of the truth, then using a parametric bootstrap with the estimated model is highly problematic as it does not represent the dependency between observations. To complicate matters further, the traditional bootstrap that consists of simple random sampling with replacement would also, in the presence of dependency, be highly inaccurate and downright reckless to use. Therefore, to preserve the dependency structure of the original data one has to use _block bootstrapping_, which is a form of non-parametric bootstrap. There are many different kinds of block bootstraps for time series, which are descendants of the Moving Block Bootstrap (MBB) presented in Theorem \@ref(thm:mbb).

```{theorem, label="mbb", "Moving Block Bootstrap"}
Suppose that $\left(X_t\right)$ is a weakly stationary time series with $N$ 
observations. 

1. Divide the time series into overlapping blocks $\left\{S_1, \ldots, S_M\right\}$ of
   length $\ell$, $1 \le \ell < N$, resulting in $M = N - \ell + 1$ blocks structured as: 

$$\begin{aligned}   
  {S_1} &= ({X_1}, {X_2}, \cdots, {X_\ell}) \\
  {S_2} &= ({X_2}, {X_3}, \cdots, {X_{\ell + 1}}) \\
  &\;\;\vdots \\
  {S_M} &= ({X_{N-\ell+1}}, {X_{N-\ell+2}}, \cdots, {X_{N}})
\end{aligned}$$

2. Draw $\left\lfloor {\frac{N}{\ell}} \right\rfloor$ blocks with replacement
   from these $\left\{S_1, \ldots, S_M\right\}$ blocks and place them in order to form
   $(X_t^*)$. 
3. Compute the statistic of interest on the simulated 
   sample $(X_t^*)$.
4. Repeat Steps 2 and 3 $B$ times where $B$ is sufficiently "large" 
   (typically $100 \leq B \leq 10000$).
5. Compute the empirical mean and variance of the statistic of interest based on 
   the $B$ independent replications. 
```

The approach taken by the MBB ensures that the dependency between observations is preserved within each block. One particular issue, however, is that some inaccuracy is introduced because successive blocks are potentially independent from each other. In reality, this is one of the trade-offs of the MBB approach and it can be mitigated by selecting an appropriate $\ell$; choosing $\ell$ of order $N^{1/3}$ as $N \to \infty$ is optimal in this sense (proof left as an exercise to the reader). An earlier variant of the MBB is the nonoverlapping block bootstrap (NBB), which prohibits blocks from sharing observations and is described in Theorem \@ref(thm:nbb). A generic single MBB resample is sketched below.
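The following is a minimal sketch (not the implementation used later in this chapter) of how a single MBB resample can be generated: overlapping blocks of length $\ell$ are drawn with replacement and concatenated until the original sample size is reached.

```r
# A single Moving Block Bootstrap (MBB) resample (illustrative sketch)
mbb_resample = function(Xt, l = 10) {
  Xt = as.numeric(Xt)
  N = length(Xt)
  M = N - l + 1                                     # number of available overlapping blocks
  starts = sample(M, ceiling(N/l), replace = TRUE)  # random block starting points
  x_star = unlist(lapply(starts, function(s) Xt[s:(s + l - 1)]))
  x_star[1:N]                                       # trim to the original sample size
}

# Example usage on a simulated series: x_star = mbb_resample(Xt, l = 10)
```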

```{theorem, label="nbb", "Nonoverlapping Block Bootstrap"} Suppose that $\left(X_t\right)$ is weakly stationary time series with $N$ observations.

  1. Divide time series into nonoverlapping blocks $\left{S_1, \ldots, S_M\right}$ of length $\ell$, $1 \le \ell < N$, resulting in $M = \left\lfloor {\frac{N}{\ell}} \right\rfloor$ blocks structured as:

$$\begin{aligned}
{S_1}& = & ({X_1}, {X_2}, \cdots , {X_\ell})& & && \ {S_2}& = & &( {X_{\ell+1}}, {X_{\ell+2}}, \cdots , {X_{2\ell}}) & && \ & \cdots & & {} & \cdots && \ {S_K} & = & & & {} &&( {X_{\left({}\right)}}, {X_{N-\ell+2}}, \cdots , {X_{N}}) \end{aligned}$$

  1. Draw $M$ blocks with replacement from these $\left{S_1, \ldots, S_M\right}$ blocks and place them in order to form $(X_t^*)$.
  2. Compute the statistic of interest on the simulated sample $(X_t^*)$.
  3. Repeat Steps 2 and 3 $B$ times where $B$ is sufficiently "large" (typically $100 \leq B \leq 10000$).
  4. Compute the empirical mean and variance on the statistic of interest based on the $B$ independent replications.
Alternatively, depending on the case one can also use modification of MBB that seeks to change how the beginning and end of the time series is weighted such as a circular block-bootstrap (CBB), that seeks improve dependency by wrapping observations, or a stationary bootstrap (SB), that randomizes block length by under a geometric distribution of mean $\ell$. Regardless, the outcomes from using MBB on time series is considerably better than just resampling, which is also possible if we set $\ell = 1$ since the length of $(X_t^*)$ is $M\times\ell$.

Having said this, the implementation of Theorem \@ref(thm:nbb) follows a
similar mold of generating data, estimating and returning a value. The most notable implementation detail is the use of an indexing trick to indicate the start of each block, which allows only one copy of the data to be held in memory instead of $m$ blocks of length $\ell$ in addition to the initial copy.

```r
ar1_blockboot = function(Xt, block_len = 10, B = 500) {

  n = length(Xt)            # Length of Time Series
  res = rep(NA, B)          # Bootstrapped Statistics
  m = floor(n/block_len)    # Amount of Blocks

  for (i in seq_len(B)) {   # Begin block bootstrap

    set.seed(i + 1199)      # Set seed for reproducibility
    x_star = rep(NA, n)     # Setup storage for new TS

    for (j in seq_len(m)) { # Simulate new time series 

      index = sample(m, 1)  # Randomize block starting points

      # Place block into time series
      x_star[(block_len*(j - 1) + 1):(block_len*j)] = 
            Xt[(block_len*(index - 1) + 1):(block_len*index)]
    }

    # Calculate parameter with protection
    res[i] = tryCatch(arima(x_star, order = c(1,0,0), include.mean = FALSE)$coef,
                      error = function(e) NA)
  }

  na.omit(res)              # Release bootstrap result
}
```

With the block bootstrapping technique in hand, let us consider a scenario where the model's assumption that the residuals will be Gaussian is violated. Consider two AR(1) processes with the same coefficient but different noise generation procedures:

$$\begin{aligned}
\mathcal{M}_1: &{}& X_t &= 0.5 X_{t-1} + W_t, &{}& W_t \sim \mathcal{N}(0,1) \\
\mathcal{M}_2: &{}& X_t &= 0.5 X_{t-1} + V_t, &{}& V_t \sim t_4
\end{aligned}$$

The generation procedure for $\mathcal{M}_1$ is straightforward using gen_gts(). However, the underlying generating mechanism of gen_gts() relies upon the noise being Gaussian. Therefore, to generate $\mathcal{M}_2$, one must use R's arima.sim() function with a custom noise generator defined to sample from a $t$ distribution.

```r
set.seed(1)                                        # Set seed for reproducibility
xt_m1 = gen_gts(500, AR1(phi = 0.5, sigma2 = 1))   # Gaussian noise only
xt_m2 = arima.sim(n = 500, list(ar = 0.5, ma = 0), # Custom noise generator supported
                  rand.gen = function(n, ...) rt(n, df = 4))
par(mfrow = c(1,2))
plot(xt_m1, main = "Model 1")
plot(xt_m2, main = "Model 2")
```

From Figure \@ref(fig:mbbdatavis), the time series look remarkably different as a result of the noise process being altered slightly, even though the $\phi$ coefficient was held constant. Principally, the difference is due to the variance of the $t$ distribution, which is $\frac{\nu}{\nu-2}$ for $\nu > 2$ and $\infty$ for $\nu \le 2$. Therefore, the i.i.d. white noise processes generated have variances $\sigma^2_{\mathcal{M}_1} = 1$ and $\sigma^2_{\mathcal{M}_2} = 2$, respectively. With this being said, both the nonparametric and the parametric bootstrap estimation procedures are the same for the two models. The implementation of the blocking approach for an AR(1) was discussed previously and the parametric approach is as follows:

```r
ar1_paraboot = function(model, B = 10000) {
  est.phi = rep(NA,B)    # Define a storage vector

  for(i in seq_len(B)) { # Perform bootstrap

    set.seed(B + i)      # Set seed for reproducibility

    # Simulate time series underneath the estimated model
    x.star = arima.sim(n = 500, list(ar = model$coef, ma = 0),
                       sd = sqrt(model$sigma2))

    # Attempt to estimate parameters with recovery
    est.phi[i] = tryCatch(arima(x.star, order = c(1,0,0),
                                   include.mean = FALSE)$coef, 
                          error = function(e) NA)

  }

  na.omit(est.phi)       # Return estimated phis.
}
```

With both functions in hand, the procedure reduces to calling them on the different sets of data and models.

```r
B = 10000 

# Model 1
fit_m1_mle = arima(xt_m1, order = c(1,0,0), include.mean = FALSE)
para_m1_phi  = ar1_paraboot(fit_m1_mle, B = B)
block_m1_phi = ar1_blockboot(xt_m1, block_len = 25, B = B)

# Model 2
fit_m2_mle = arima(xt_m2, order = c(1,0,0), include.mean = FALSE)
para_m2_phi  = ar1_paraboot(fit_m2_mle, B = B)
block_m2_phi = ar1_blockboot(xt_m2, block_len = 25, B = B) 
```

``` {r blockbootmodels, fig.cap = "Estimated distributions of $\hat{\phi}$ for MLE parameter estimates. The histogram bars represent the empirical results from the bootstraps with the green representing the parametric bootstrap and the red representing the block bootstrap approach. The dashed vertical line represents the true value of $\phi$ and the vertical ticks correspond to the limits of the 95% confidence intervals for both estimation techniques."}
# Rewrite as a function later...
# Has a dependency on previously written code.
par(mfrow = c(1,2))

# Model 1
hist.phi.ML = hist(para_m1_phi, plot = FALSE, breaks = 25)
hist.phi.ML$counts = hist.phi.ML$counts/sum(hist.phi.ML$counts)/diff(hist.phi.ML$mids)[1]

hist.phi.ML.bboot = hist(block_m1_phi, plot = FALSE, breaks = 25)
hist.phi.ML.bboot$counts = hist.phi.ML.bboot$counts/sum(hist.phi.ML.bboot$counts)/diff(hist.phi.ML.bboot$mids)[1]

plot(NA, xlim = c(0.28,0.6), ylim = range(yy.ML), xlab = expression(phi), ylab = "Density", main = "Model 1")
grid()
plot(hist.phi.ML, xlim = c(0.28,0.6), ylim = range(yy.ML), add = T, col = colors[3], border = "grey60")
plot(hist.phi.ML.bboot, xlim = c(0.28,0.6), ylim = range(yy.ML), add = T, col = colors[1], border = "grey60")

points(quantile(para_m1_phi, 0.975), 0, cex = 3, col = colors[3], pch = "|")
points(quantile(para_m1_phi, 0.025), 0, cex = 3, col = colors[3], pch = "|")

points(quantile(block_m1_phi, 0.975), 0, cex = 3, col = colors[1], pch = "|")
points(quantile(block_m1_phi, 0.025), 0, cex = 3, col = colors[1], pch = "|")

abline(v = 0.5, lwd = 2, lty = 2, col = "grey60")

# Model 2
hist.phi.ML = hist(para_m2_phi, plot = FALSE, breaks = 25)
hist.phi.ML$counts = hist.phi.ML$counts/sum(hist.phi.ML$counts)/diff(hist.phi.ML$mids)[1]

hist.phi.ML.bboot = hist(block_m2_phi, plot = FALSE, breaks = 25)
hist.phi.ML.bboot$counts = hist.phi.ML.bboot$counts/sum(hist.phi.ML.bboot$counts)/diff(hist.phi.ML.bboot$mids)[1]

plot(NA, xlim = c(0.28,0.6), ylim = range(yy.ML), xlab = expression(phi), ylab = "Density", main = "Model 2")
grid()
plot(hist.phi.ML, xlim = c(0.28,0.6), ylim = range(yy.ML), add = T, col = colors[3], border = "grey60")
plot(hist.phi.ML.bboot, xlim = c(0.28,0.6), ylim = range(yy.ML), add = T, col = colors[1], border = "grey60")

points(quantile(para_m2_phi, 0.975), 0, cex = 3, col = colors[3], pch = "|")
points(quantile(para_m2_phi, 0.025), 0, cex = 3, col = colors[3], pch = "|")

points(quantile(block_m2_phi, 0.975), 0, cex = 3, col = colors[1], pch = "|")
points(quantile(block_m2_phi, 0.025), 0, cex = 3, col = colors[1], pch = "|")

abline(v = 0.5, lwd = 2, lty = 2, col = "grey60")
```

The results from the parametric and nonparametric (block) bootstrapping techniques
are displayed in Figure \@ref(fig:blockbootmodels). In both instances, we can 
see that the block bootstrap had a narrower distribution. However, there was a
considerable bias that led the distribution to be off-centered. The origin of this
bias is the independence between blocks as they are placed
into $(X^*_t)$. 


### Forecasting AR(p) Models

One of the most interesting tasks in time series analysis is to predict future unobserved values based on the values observed up to the present. However, this is not possible if the underlying model is unknown, and thus in this section we assume the time series $X_t$ is stationary and its model is known. In particular, we denote forecasts by $X^{n}_{n+m}$, where $n$ represents the number of data points collected (e.g. $\bm{X} = \{X_{1}, X_{2}, \cdots , X_{n-1}, X_n\}$) and $m$ represents the number of points in the future we wish to predict. So, $X^{n}_{n+1}$ represents the one-step-ahead prediction of $X_{n+1}$ given the data $\{X_{1}, X_{2}, \cdots, X_{n-1}, X_{n}\}$.

To obtain forecasts, we rely upon the best linear prediction (BLP). Recall that the BLP was given in Definition \@ref(def:blp) when we calculated the PACF of an AR model. In general, the best approach to finding the BLP is to use the Projection Theorem \@ref(thm:projtheo).

```{theorem, label="projtheo", name="Projection Theorem"}
Let $\mathcal{M} \subset \mathcal{L}_2$ be a closed linear subspace of a Hilbert space. 
For every $y \in \mathcal{L}_2$, there exists a unique element $\hat{y} \in \mathcal{M}$ that
minimizes $||y - l||^2$ over $l \in \mathcal{M}$. This element is uniquely 
determined by the requirements

1. $\hat{y} \in \mathcal{M}$ and 
2. $(y - \hat{y}) \perp \mathcal{M}$.
```

The Projection Theorem naturally leads to an equivalent way to find the best linear predictor by solving the prediction equations,
\[ \mathbb{E}\left[X_{t+h} - \hat{X}_{t+h}\right] = 0 \mbox{ (which is trivial for a zero-mean time series)}\]
and
\[ \mathbb{E} \left[(X_{t+h} - \hat{X}_{t+h})X_j \right] = 0, \mbox{ for } j = 1, \dots, t.\]

Writing $\text{Cov}(X_{i}, X_{j})$ as $\gamma(|i - j|)$, the prediction equations can be represented in the following form

\begin{equation}
\begin{aligned}
\begin{pmatrix}
\gamma(0) & \gamma(1) & \cdots & \gamma(N-1) \\
\gamma(1) & \gamma(0) & \cdots & \gamma(N-2) \\
\vdots & \vdots & \ddots & \vdots \\
\gamma(N-1) & \gamma(N-2) & \cdots & \gamma(0)
\end{pmatrix}_{N \times N}
\begin{pmatrix}
\alpha_1 \\ \vdots \\ \alpha_N
\end{pmatrix}_{N \times 1}
&=
\begin{pmatrix}
\gamma(N+h-1) \\ \vdots \\ \gamma(h)
\end{pmatrix}_{N \times 1} \\
\Gamma_N \bm{\alpha}_N &= \bm{\gamma}_N
\end{aligned}
\end{equation}

Assuming that $\Gamma_N$ is non-singular, then the values of $\bm{\alpha}_N$ are given as:

$$\bm{\alpha}_N = \Gamma^{-1}_N\bm{\gamma}_N$$
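As a small numerical illustration (with illustrative parameter values, not tied to any dataset in this chapter), the sketch below solves the prediction equations for the one-step-ahead BLP of an AR(2) with $\phi = (0.5, 0.25)$ using its theoretical autocorrelations. As expected, only the weights on the two most recent observations are non-zero and they equal $\phi_2$ and $\phi_1$.

```r
# One-step-ahead BLP weights for an AR(2) with phi = (0.5, 0.25) (illustrative)
N = 5                                          # number of past observations used
rho = ARMAacf(ar = c(0.5, 0.25), lag.max = N)  # rho(0), rho(1), ..., rho(N)
Gamma_N = toeplitz(rho[1:N])                   # Gamma_N / gamma(0)
gamma_N = rev(rho[2:(N + 1)])                  # (rho(N), ..., rho(1)), i.e. h = 1
alpha_N = solve(Gamma_N, gamma_N)              # BLP weights on (X_1, ..., X_N)
round(alpha_N, 10)                             # last two weights equal phi_2 and phi_1
```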

For AR(p) models, the BLP based on the entire observed past coincides with the conditional expectation of the future value given the observations, which we denote by

\[X_{t + j}^t = E\left[ {{X_{t + j}}|{X_t}, \cdots ,{X_1}} \right] = {E_t}\left[ {{X_{t + j}}} \right], \;\; j > 0.\]

Then
\[E\left[ {{{\left( {{X_{t + j}} - m\left( {{X_1}, \cdots ,{X_t}} \right)} \right)}^2}} \right] \ge E\left[ {{{\left( {{X_{t + j}} - X_{t + j}^t} \right)}^2}} \right]\]
for any function $m(\cdot)$.
Consider:

$\widetilde{\beta}^{*} = \mathop {\arg \min }\limits_\beta  \underbrace {E\left[ {{{\left( {{X_{t + h}} - m\left( {{X_{t + 1}}, \cdots ,{X_{t + h}}} \right)} \right)}^2}} \right]}_{MSE\left( \beta  \right)}$

and

$\widetilde{\beta} = \mathop {\arg \min }\limits_\beta  \underbrace {E\left[ {{{\left( {{X_{t + h}} - \sum\limits_{j = 1}^{h - 1} {{\beta _j}{X_{t + j}}} } \right)}^2}} \right]}_{MS{E_L}\left( \beta  \right)}$


It is clear that:

\[MS{E_L}\left( {\tilde \beta } \right) \ge MSE\left( {{{\tilde \beta }^*}} \right)\]

Let's now consider 

\[{\left( {{X_t} - m} \right)^2} = {\left[ {\left( {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]} \right) + \left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)} \right]^2}\]

where we have: $m = m\left( {{X_{t + 1}}, \cdots ,{X_{t + h}}} \right)$ and ${\Omega _t} = \left( {{X_{t + 1}}, \cdots ,{X_{t + h}}} \right)$.

Therefore, 

\[{\left( {{X_t} - m} \right)^2} = {\left( {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]} \right)^2} + {\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)^2} + 2\left( {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]} \right)\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)\]


Focusing on only the last term, we have that:

\[\underbrace {\left( {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]} \right)}_{ = {\varepsilon _t}}\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)\]

\[E\left[ {{\varepsilon _t}|{\Omega _t}} \right] = E\left[ {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]|{\Omega _t}} \right] = 0\]

So, by the law of iterated expectations, we have that:

\[E\left[ {{\varepsilon _t}\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)} \right] = E\left[ {E\left[ {{\varepsilon _t}\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)|{\Omega _t}} \right]} \right] = E\left[ {\underbrace {E\left[ {{\varepsilon _t}|{\Omega _t}} \right]}_{ = 0}\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)} \right] = 0\]

By the previous discussion, we have that

\[{\left( {{X_t} - m} \right)^2} = {\left( {{X_t} - E\left[ {{X_t}|{\Omega _t}} \right]} \right)^2} + {\left( {E\left[ {{X_t}|{\Omega _t}} \right] - m} \right)^2}\]

is therefore minimized for $m = E\left[ {{X_t}|{\Omega _t}} \right]$.

```{example, label="ar1forecast", name="Forecasting with an AR(1)"} We begin the forecasting examples by considering a first order AR(1) process.

$${X_t} = \phi {X_{t - 1}} + {W_t}$$

where $W_t \sim WN(0, \sigma^2_W)$.

From here, the conditional expected mean and variance are given as:

\[\begin{aligned}
{E_t}\left[ {{X_{t + j}}} \right] &= {\phi ^j}{X_t} \\
\operatorname{Var}_t\left[ {{X_{t + j}}} \right] &= \left( {1 + {\phi ^2} + {\phi ^4} + \cdots + {\phi ^{2\left( {j - 1} \right)}}} \right){\sigma ^2}
\end{aligned}\]

Within this derivation, it is important to remember that:

\[\begin{aligned}
  \mathop {\lim }\limits_{j \to \infty } {E_t}\left[ {{X_{t + j}}} \right] &= 0 = E\left[ {{X_t}} \right] \\
  \mathop {\lim }\limits_{j \to \infty } {\operatorname{Var}_t}\left[ {{X_{t + j}}} \right] &= \frac{{{\sigma ^2}}}{{1 - {\phi ^2}}} = \operatorname{var} \left( {{X_t}} \right)
\end{aligned}\]
```
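As a quick numerical illustration, the sketch below evaluates these forecast formulas for an AR(1); the values $\phi = 0.9$, $\sigma^2 = 1$ and the sample size are assumptions chosen purely for illustration.

```r
# Forecast mean and variance of an AR(1) as a function of the horizon j
library(simts)

set.seed(42)
phi    = 0.9
sigma2 = 1
Xt = gen_gts(n = 500, AR1(phi = phi, sigma2 = sigma2))

j      = 1:20
x_last = Xt[500]
pred_mean = phi^j * x_last                            # E_t[X_{t+j}] = phi^j X_t
pred_var  = sigma2 * (1 - phi^(2 * j)) / (1 - phi^2)  # = sigma2 (1 + phi^2 + ... + phi^(2(j-1)))

# As j grows, the mean reverts to 0 and the variance converges to sigma2 / (1 - phi^2)
cbind(j, pred_mean, pred_var)
```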

```{example, label="ar2forecast", name="Forecasting with an AR(2)"}
Consider an AR(2) process with the following parametrization:

$${X_t} = {\phi _1}{X_{t - 1}} + {\phi _2}{X_{t - 2}} + {W_t}$$

where $W_t \sim WN(0, \sigma^2_W)$.

Then, we are able to find the BLP for each step by:

\[\begin{aligned}
  {E_t}\left[ {{X_{t + 1}}} \right] &= {\phi _1}{X_t} + {\phi _2}{X_{t - 1}} \\
  {E_t}\left[ {{X_{t + 2}}} \right] &= {E_t}\left[ {{\phi _1}{X_{t + 1}} + {\phi _2}{X_t}} \right] = {\phi _1}{E_t}\left[ {{X_{t + 1}}} \right] + {\phi _2}{X_t} \\
   &= {\phi _1}\left( {{\phi _1}{X_t} + {\phi _2}{X_{t - 1}}} \right) + {\phi _2}{X_t} \\
   &= \left( {\phi _1^2 + {\phi _2}} \right){X_t} + {\phi _1}{\phi _2}{X_{t - 1}} \\ 
\end{aligned} \]
```

```{example, label="arpforecast", name="Forecasting with an AR(P)"} Consider AR(P) process given as:

$${X_t} = {\phi _1}{X_{t - 1}} + {\phi _2}{X_{t - 2}} +  \cdots  + {\phi _p}{X_{t - p}} + {W_t}$$

where $W_t \sim WN(0, \sigma^2_W)$.

The process can be rearranged into matrix form denoted by:

\[\begin{aligned}
\underbrace {\left[ {\begin{array}{*{20}{c}}
  {{X_t}} \\ 
  \vdots \\ 
  \vdots \\ 
  {{X_{t - p + 1}}} 
\end{array}} \right]}_{{Y_t}} &= \underbrace {\left[ {\begin{array}{*{20}{c}}
  {{\phi _1}}& \cdots &{}&{{\phi _p}} \\ 
  {}&{}&{}&0 \\ 
  {}&{{I_{p - 1}}}&{}& \vdots \\ 
  {}&{}&{}&0 
\end{array}} \right]}_A\underbrace {\left[ {\begin{array}{*{20}{c}}
  {{X_{t - 1}}} \\ 
  \vdots \\ 
  \vdots \\ 
  {{X_{t - p}}} 
\end{array}} \right]}_{{Y_{t - 1}}} + \underbrace {\left[ {\begin{array}{*{20}{c}}
  1 \\ 
  0 \\ 
  \vdots \\ 
  0 
\end{array}} \right]}_C{W_t} \\
  {Y_t} &= A{Y_{t - 1}} + C{W_t} 
\end{aligned}\]

From here, the conditional mean and variance are obtainable via:

\[\begin{aligned}
  {E_t}\left[ {{Y_{t + j}}} \right] &= {E_t}\left[ {A{Y_{t + j - 1}} + C{W_{t + j}}} \right] = {E_t}\left[ {A{Y_{t + j - 1}}} \right] + \underbrace {{E_t}\left[ {C{W_{t + j}}} \right]}_{ = 0} \\
   &= {E_t}\left[ {A\left( {A{Y_{t + j - 2}} + C{W_{t + j - 1}}} \right)} \right] = {E_t}\left[ {{A^2}{Y_{t + j - 2}}} \right] =  \cdots  = {A^j}{Y_t} \\
  {\operatorname{var}_t}\left( {{Y_{t + j}}} \right) &= {\operatorname{var}_t}\left( {A{Y_{t + j - 1}} + C{W_{t + j}}} \right) \\
   &= {\sigma ^2}C{C^T} + {\operatorname{var}_t}\left( {A{Y_{t + j - 1}}} \right) = {\sigma ^2}C{C^T} + A{\operatorname{var}_t}\left( {{Y_{t + j - 1}}} \right){A^T} \\
   &= {\sigma ^2}C{C^T} + {\sigma ^2}AC{C^T}{A^T} + {A^2}{\operatorname{var}_t}\left( {{Y_{t + j - 2}}} \right){\left( {{A^2}} \right)^T} \\
   &= {\sigma ^2}\sum\limits_{k = 0}^{j - 1} {{A^k}C{C^T}{{\left( {{A^k}} \right)}^T}} 
\end{aligned}\]

By the recursive properties evident within both the conditional mean and variance, the evaluation can be written to take advantage of these features through the following recursive formulation:

\[\begin{aligned}
  {E_t}\left[ {{Y_{t + j}}} \right] &= A{E_t}\left[ {{Y_{t + j - 1}}} \right] \\
  {\operatorname{var}_t}\left( {{Y_{t + j}}} \right) &= {\sigma ^2}C{C^T} + A{\operatorname{var}_t}\left( {{Y_{t + j - 1}}} \right){A^T}
\end{aligned}\]
```
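A minimal R sketch of these recursions is given below for an AR(2) in companion form; the parameter values and the conditioning observations are assumptions made purely for illustration.

```r
# Recursive forecast computations for an AR(p) in companion form (AR(2) shown here)
phi    = c(0.75, 0.2)
sigma2 = 1

A = rbind(phi, c(1, 0))               # companion matrix
C = matrix(c(1, 0), ncol = 1)

Y_t = matrix(c(1.5, 0.7), ncol = 1)   # assumed last two observations (X_t, X_{t-1})'

h = 10
pred_mean = matrix(NA, nrow = 2, ncol = h)
pred_var  = vector("list", h)

Ey = Y_t
Vy = matrix(0, 2, 2)
for (j in 1:h) {
  Ey = A %*% Ey                                  # E_t[Y_{t+j}] = A E_t[Y_{t+j-1}]
  Vy = sigma2 * C %*% t(C) + A %*% Vy %*% t(A)   # var_t(Y_{t+j}) recursion
  pred_mean[, j] = Ey
  pred_var[[j]]  = Vy
}

pred_mean[1, ]                        # E_t[X_{t+j}] for j = 1, ..., h
sapply(pred_var, function(V) V[1, 1]) # corresponding forecast variances
```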

```{example, label="rear2forecast", name="Forecasting with an AR(2) in Matrix Form"}
With the newfound knowledge regarding the recursive form of matrix predictions in hand,
we return to the previous AR(2) forecasting example.

\[\begin{aligned} \underbrace {\left[ {\begin{array}{*{20}{c}}
  {{X_t}} \\ 
  {{X_{t - 1}}} 
\end{array}} \right]}_{{Y_t}} &= \underbrace {\left[ {\begin{array}{*{20}{c}}
  {{\phi _1}}&{{\phi _2}} \\ 
  1&0 
\end{array}} \right]}_A\underbrace {\left[ {\begin{array}{*{20}{c}}
  {{X_{t - 1}}} \\ 
  {{X_{t - 2}}} 
\end{array}} \right]}_{{Y_{t - 1}}} + \underbrace {\left[ {\begin{array}{*{20}{c}}
  1 \\ 
  0 
\end{array}} \right]}_C{W_t} \\
  {Y_t} &= A{Y_{t - 1}} + C{W_t} 
\end{aligned}\]

Then, we are able to calculate the BLP as:

\[\begin{aligned}
  {E_t}\left[ {{Y_{t + 2}}} \right] &= {A^2}{Y_t} = \left[ {\begin{array}{*{20}{c}}
  {{\phi _1}}&{{\phi _2}} \\ 
  1&0 
\end{array}} \right]\left[ {\begin{array}{*{20}{c}}
  {{\phi _1}}&{{\phi _2}} \\ 
  1&0 
\end{array}} \right]\left[ {\begin{array}{*{20}{c}}
  {{X_t}} \\ 
  {{X_{t - 1}}} 
\end{array}} \right] \hfill \\
   &= \left[ {\begin{array}{*{20}{c}}
  {\phi _1^2 + {\phi _2}}&{{\phi _1}{\phi _2}} \\ 
  {{\phi _1}}&{{\phi _2}} 
\end{array}} \right]\left[ {\begin{array}{*{20}{c}}
  {{X_t}} \\ 
  {{X_{t - 1}}} 
\end{array}} \right] = \left[ {\begin{array}{*{20}{c}}
  {\left( {\phi _1^2 + {\phi _2}} \right){X_t} + {\phi _1}{\phi _2}{X_{t - 1}}} \\ 
  {{\phi _1}{X_t} + {\phi _2}{X_{t - 1}}} 
\end{array}} \right] \hfill \\ 
\end{aligned} \]
```

Under the recursive formulation, we can see through a simulation study how the predictions for an AR(2) process with $\phi _1 = 0.75$ and $\phi _2 = 0.2$ vary across time in Figure \@ref(fig:predplot).

```{r predplot, fig.cap = "Simulated forecast paths and empirical prediction quantiles for an AR(2) process with phi1 = 0.75 and phi2 = 0.2"}
# Colour helpers mimicking ggplot2's default hues
gg_color_hue <- function(n, alpha = 1) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100, alpha = alpha)[1:n]
}

color = gg_color_hue(5, alpha = 1)
color2 = gg_color_hue(3, alpha = 0.01)

# Simulate an AR(2) process to forecast from
set.seed(2)
Xt = gen_gts(n = 200, AR(phi = c(0.75, 0.2), sigma2 = 1))

# Empty canvas leaving room for the 50-step forecast horizon
plot(NA, xlim = c(0,250), ylim  = c(-10,15), ylab="Prediction", xlab="Time")
grid()
lines(1:200, Xt, col = color[4])

# Simulate B future trajectories conditional on the last two observations
B = 5000
mat = matrix(0,B,52)
mat[,1] = rep(Xt[199], B)
mat[,2] = rep(Xt[200], B)

for (i in 1:B){
  for (j in 3:52){
    mat[i,j] = 0.75*mat[i,(j-1)] + 0.2*mat[i,(j-2)] + rnorm(1)
  }
  lines(200:250, mat[i,2:52], col = color2[3])
}

# Empirical 5%, 25%, 50%, 75% and 95% prediction quantiles at each horizon
quant = matrix(NA, 51,5)
quant[1,] = rep(Xt[200],1)

for (i in 2:51){
  quant[i,] = quantile(mat[,i+1], c(0.95,0.75,0.5,0.25,0.05))
}

# Outer (90%) band, middle (50%) band and the median forecast path
lines(200:250, quant[,1], col = color[1], lty = 1)
lines(200:250, quant[,5], col = color[1], lty = 1)

lines(200:250, quant[,2], col = color[3], lty = 1)
lines(200:250, quant[,4], col = color[3], lty = 1)

lines(200:250, quant[,3], col = color[5], lty = 1)
```

### Measuring Forecast Accuracy

Within time series analysis, it is often difficult to apply the traditional methodology surrounding "training" and "test" data sets. The reason these concepts do not carry over directly is the temporal dependence of the data. For instance, if we randomly sampled points within the time series, then there would be missing values that would need to be imputed. Therefore, to obtain an out-of-sample test, the only viable option is to use a window of observations, say $t = 1, \cdots, N - 1$, to predict $t = N$. That is to say, the training sample runs from the beginning up until "today" and the hold-out data is then "today" or "tomorrow".

Thus, time series models are validated using the notion of a "rolling forecasting origin." Depending on the context, this may also be referred to as "walk forward optimization", "rolling horizon" or simply "moving origin". The traditional rule of thumb of using about 2/3rds of the data for training and about 1/3rd for testing still holds.

There are different measures that can be used to assess the forecasting performance over the test window and, subsequently, the model. Two prevalent measures will be emphasized within this text: the Median Absolute Prediction Error (MAPE) and the Mean Squared Prediction Error (MSPE). Both are defined as follows:

\begin{align}
MAPE &= \mathop {\operatorname{median} }\limits_{t = m, \cdots ,N - 1} \left| {{{\hat E}_t}\left[ {{X_{t + 1}}} \right] - {X_{t + 1}}} \right| = \mathop {\operatorname{median} }\limits_{t = m, \cdots ,N - 1} \left| {{r_t}} \right| \\
MSPE &= \frac{1}{{N - m}}\sum\limits_{t = m}^{N - 1} {{{\left( {{{\hat E}_t}\left[ {{X_{t + 1}}} \right] - {X_{t + 1}}} \right)}^2}}  = \frac{1}{{N - m}}\sum\limits_{t = m}^{N - 1} {r_t^2}
\end{align}

where $m$ denotes the size of the initial training set used in the algorithm below and ${r_t} = {\hat E_t}\left[ {{X_{t + 1}}} \right] - {X_{t + 1}}$.

```{proposition, name="Rolling Forecasting Origin"} The rolling forecasting origin algorithm can be described as follows:

  1. Divide the data into a "training" set ranging over $t = 1, \ldots, m$, where $m = \left \lfloor \frac{2N}{3} \right \rfloor$, and a "test" set ranging over $t = m+1, \ldots, N$.
  2. Compute $\hat{\theta}$ on $X_t$ for $t = 1, \ldots, m$.
  3. Using $\hat{\theta}$, compute ${\hat E_m}\left[ {{X_{m + 1}}} \right] = {\hat \phi _1}{X_m} + \cdots + {\hat \phi _p}{X_{m + 1 - p}}$.
  4. Compute the prediction error ${r_m} = {\hat E_m}\left[ {{X_{m + 1}}} \right] - {X_{m + 1}}$.
  5. Set $m = m + 1$ and return to Step 2; once $m = N - 1$, proceed to Step 6.
  6. Compute the desired forecast accuracy measure over the collected prediction errors.
```
Given all of the above discussion, there is one caveat that would allow the data to be separated completely: one can take distinct temporal subsets, such as year 1, year 2, and so on, and fit an individual model to each time period. The stability of the parameter estimates can then be compared across years. A minimal R sketch of the rolling forecasting origin procedure is given below.
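The sketch below illustrates the algorithm, assuming a simulated AR(2) series and one-step-ahead forecasts produced with `arima()` and `predict()`; it is an illustration under these assumptions rather than a definitive implementation.

```r
# Rolling forecasting origin with one-step-ahead forecasts from an AR(2) fit
library(simts)

set.seed(5)
x = as.numeric(gen_gts(n = 300, AR(phi = c(0.75, 0.2), sigma2 = 1)))
N = length(x)
m = floor(2 * N / 3)          # initial training size
p = 2                         # assumed autoregressive order

r = numeric(N - m)            # one-step-ahead prediction errors
for (i in seq_along(r)) {
  origin = m + i - 1
  fit  = arima(x[1:origin], order = c(p, 0, 0), include.mean = FALSE)
  pred = predict(fit, n.ahead = 1)$pred[1]
  r[i] = pred - x[origin + 1]
}

MAPE = median(abs(r))         # median absolute prediction error
MSPE = mean(r^2)              # mean squared prediction error
c(MAPE = MAPE, MSPE = MSPE)
```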


### Model Diagnostics

Once a model has been obtained, the validity of the model must be assessed.
Indeed, one of the cornerstones of statistics is assessing how well a model
fits the data. The diagnostic process associated with verifying that the model
provides an appropriate approximation of the true process relies upon an
analysis of the residuals from the fit.

Now, the diagnostic analysis is somewhat related to the concept of model 
selection addressed within the next subsection \@ref(model-selection). However, 
they are distinctly different topics. For instance, the best model selected may 
lead to residuals that do not conform to the appropriate assumptions made by the
time series model. 

With this in mind, we begin by stating the assumptions that time series models
make about their residuals:

1. Between all time lags the residuals have no dependency.
2. There is a constant variance among residuals.
3. The mean of the residuals is zero.
4. The residuals are normally distributed.

Typically, within time series diagnostics, violations occur around points 1 and 2.
Generally, if point 1 is violated, this indicates that a better model can be found.
If point 2 or 3 is violated, this indicates that a trend may still be present
within the time series. The last point is not strictly required, but it is a
convenient property for constructing prediction intervals and conducting inference.

For addressing time series diagnostics, we have opted to create a package
called `exts` that houses the majority of time series specific diagnostic and
exploratory functions. These functions greatly augment what is currently
available in base R, with convenient syntax. More details can be found at
<http://smac-group.com/docs/exts>.

The motivating process that we will use to explore the diagnostic information
will be an AR(3) process defined as:

$${X_t} = 0.8{X_{t - 3}} + {W_t}$$

where $W_t \sim WN(0, \sigma^2_W)$.

```r
# Set seed for reproducibility
set.seed(9123)

# Generate AR(3)
Xt = gen_gts(300, AR(phi = c(0, 0, 0.8), sigma2 = 1))

# View time series
plot(Xt)
```

With the data in hand, we will use our a priori knowledge to fit an AR(3) model.

```r
# Fit an AR(3) model (the estimated mean should be near zero here)
model = arima(Xt, order = c(3, 0, 0), include.mean = TRUE)
model
```

Now, there are two ways to view the residuals from the model: raw or standardized. The latter is the preferred method and will be used throughout this section.

To address the first point regarding residual dependency across time lags, we turn to the ACF. In particular, we are interested in observing both the classical and robust versions of the ACF of the residuals, in order to detect remaining dependency as well as possible outlying values that could distort the classical estimate.

```r
# TO MODIFY WITH SIMTS FUNCTIONS

# Standardize the residuals
#stdres = model$residuals/sd(model$residuals)

# Graph both R/ACF
#autoplot(compare_acf(stdres))
```

However, by solely observing a graph, we neglect the efficient portmanteau tests discussed within Section \@ref(portmanteau-test). In particular, these tests provide inference as to whether or not serial dependency is present at a given lag. Most prominently, there are two forms used to assess the appearance of white noise within residuals: Ljung-Box and Box-Pierce. Their results can be obtained by:

```r
# TO MODIFY WITH SIMTS FUNCTIONS
#autoplot(diag_ljungbox(model, stdres = T))
#autoplot(diag_boxpierce(model, stdres = T))
```
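If the package-specific diagnostic functions above are not available, base R's `Box.test()` computes the same Ljung-Box and Box-Pierce statistics. The sketch below applies it to the standardized residuals, with `fitdf = 3` accounting for the three estimated AR coefficients; the choice of `lag` is an arbitrary assumption for illustration.

```r
# Base R portmanteau tests on the standardized residuals (the statistics are
# invariant to the standardization)
stdres = model$residuals / sd(model$residuals)

Box.test(stdres, lag = 10, type = "Ljung-Box",  fitdf = 3)
Box.test(stdres, lag = 10, type = "Box-Pierce", fitdf = 3)
```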

Within both graphs, note that at the first test point there is evidence to suggest that serial correlation is present. However, the majority of points indicate that the model provides a good fit. Thus, without knowing the true model parametrization, one may wish to fit a higher-order model to verify that the residuals have no remaining dependency.

After having assessed the serial correlation of the residuals, attention should now be turned to the normality and level of skewness of the residuals. Both of these traits are akin to what is traditionally assessed for a linear regression model. That is, one seeks to inspect a histogram or quantile-quantile (QQ) plot of the residuals.

To assist in this assessment, the traditional histogram has been modified to have two different density curves. The blue curve matches the theoretical normal density while the grey curve is a kernel approximation of the density associated with the empirical residuals. Generally, both of these lines should be quite similar. In addition, the empirical results must possess the appearance of a normal bell-shaped distribution.

For the QQ plot, the main principle is that the residuals should adhere to the QQ line, which passes through the first and third quartiles of the distribution. If the residuals in either tail (or both) fall away from the QQ line, then there is likely an issue with the underlying distribution. The issue can range from light or heavy tails to right or left skewness, or even bimodality.

```r
# TO MODIFY WITH SIMTS FUNCTIONS
#autoplot(diag_resid(model, std = TRUE))
#autoplot(diag_qq(model, std = TRUE))
```
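As a base R stand-in for these plots, the sketch below draws a histogram of the standardized residuals with the theoretical normal density overlaid, together with a normal QQ plot and its QQ line.

```r
# Histogram with N(0,1) density overlaid, followed by a normal QQ plot whose
# reference line passes through the first and third quartiles
stdres = model$residuals / sd(model$residuals)

par(mfrow = c(1, 2))
hist(stdres, breaks = 20, freq = FALSE, main = "Standardized residuals", xlab = "")
curve(dnorm(x), add = TRUE, col = "blue")
qqnorm(stdres)
qqline(stdres)
par(mfrow = c(1, 1))
```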

Now, each of these graphs taken independently is only one part of the larger picture. Often, it is better to view the graphs in aggregate alongside other diagnostic information. Therefore, the ideal function to use is `diag_ts()`, which embeds all of the above graphs within the same display to enable a quick scan of the model information.

```r
# TO MODIFY WITH SIMTS FUNCTIONS
#autoplot(diag_ts(model, Xt, std = TRUE))
```

## Moving Average Models

A moving average model can be interpreted in a similar way to an AR(p) model, except that in this case the time series is the result of a linear operation on the innovation process rather than on the time series itself.

```{definition, label="maq", name="Moving Average of Order Q"} A Moving Average of Order Q or MA(q) model is defined as follows:

$${X_t} = \theta_1 W_{t-1} + ... + \theta_q W_{t-q} + W_t$$

where $W_t \sim WN(0, \sigma^2_W)$.
```

```{definition, label="maqo", name="Moving Average Operator"}
Following the same logic as for the AR(p) models and using the backshift operator
as before, we can re-express these moving average processes as follows
$$X_t = \theta(B)W_t$$
where $\theta(B) = 1 + \theta_1B + ... + \theta_qB^q$.
```

These processes are always stationary, no matter the values that $\theta_1, \ldots, \theta_q$ take. However, MA(q) processes may not be identifiable through their autocovariance functions. By the latter we mean that different parameter values for an MA(q) model of the same order can deliver exactly the same autocovariance function, making it impossible to retrieve the parameters of the model by only looking at the autocovariance function.

```{example, name="Writing a Moving Average in Operator Form"} Consider the following MA(q) process:

$${X_t} = \theta_1 W_{t-1} + ... + \theta_q W_{t-q} + W_t$$

With a little bit of work this can be condensed to:

$${X_t} = {W_t} + \sum\limits_{i = 1}^q {{\theta _i}{W_{t - i}}}$$

The operator form then follows from:

\[{X_t} = {W_t} + \sum\limits_{i = 1}^q {{\theta _i}{B^i}{W_t}}  \Leftrightarrow {X_t} = \left( {1 + \sum\limits_{i = 1}^q {{\theta _i}{B^i}} } \right){W_t} \Leftrightarrow {X_t} = \theta \left( B \right){W_t}\]
```

```{example, name="Stationarity of an MA(Q) Process"}
The stationarity of an MA(q) process can be shown assuming that $\sum\limits_{j = 1}^q {\theta _j^2}  < \infty$.

First, the expectation is readily derived to be $E\left[ {{X_t}} \right] = 0$.

Prior to proceeding to the computation of the autocovariance, note that if we set $\theta_0 = 1$, then 

$${X_t} = {W_t} + \sum\limits_{i = 1}^q {{\theta _i}{W_{t - i}}}$$

may be written succinctly as $X_t = \sum\limits_{i = 0}^q {{\theta _i}{W_{t - i}}}$.

Therefore, taking $h \geq 0$ without loss of generality (the autocovariance is symmetric in $h$), the autocovariance is obtainable by:

\begin{align}
\cov \left( {{X_{t + h}},{X_t}} \right) &= \cov \left( {\sum\limits_{j = 0}^q {{\theta _j}{W_{t + h - j}}} ,\sum\limits_{i = 0}^q {{\theta _i}{W_{t - i}}} } \right) \\
&= \sum\limits_{j = 0}^q {\sum\limits_{i = 0}^q {{\theta _j}{\theta _i}\cov \left( {{W_{t + h - j}},{W_{t - i}}} \right)} }  \\
&= \underbrace {\sum\limits_{j = 0}^q {\sum\limits_{i = 0}^q {{\theta _j}{\theta _i}\underbrace {\cov \left( {{W_{t + h - j}},{W_{t - i}}} \right)}_{ = 0}} } }_{j \ne i + h} + {1_{\left\{ {\left| h \right| \leqslant q} \right\}}}\sum\limits_{j = 0}^{q - \left| h \right|} {{\theta _{j + \left| h \right|}}{\theta _j}\cov \left( {{W_t},{W_t}} \right)} \\
&= {1_{\left\{ {\left| h \right| \leqslant q} \right\}}}{\sigma ^2}\sum\limits_{j = 0}^{q - \left| h \right|} {{\theta _{j + \left| h \right|}}{\theta _j}}
\end{align}

As a result, we have:

\[{1_{\left\{ {\left| h \right| \leqslant q} \right\}}}{\sigma ^2}\sum\limits_{j = 0}^{q - \left| h \right|} {{\theta _{j + \left| h \right|}}{\theta _j}}  \leqslant {\sigma ^2}\sum\limits_{j = 0}^q {\theta _j^2}  < \infty \]

Hence, the process is stationary.
```
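The autocovariance formula above can be checked numerically; the sketch below compares it with the autocorrelations returned by base R's `ARMAacf()` for an assumed MA(2) with $\theta = (0.8, -0.3)$ and $\sigma^2 = 1$, values chosen purely for illustration.

```r
# Check the MA(q) autocovariance formula against ARMAacf() for an assumed MA(2)
theta  = c(0.8, -0.3)
sigma2 = 1
q   = length(theta)
psi = c(1, theta)                     # theta_0 = 1

# gamma(h) = sigma2 * sum_j theta_{j + |h|} theta_j for |h| <= q (0 otherwise)
gamma_ma = sapply(0:q, function(h) sigma2 * sum(psi[1:(q + 1 - h)] * psi[(1 + h):(q + 1)]))

# Compare the implied autocorrelations with those returned by ARMAacf()
rbind(formula = gamma_ma / gamma_ma[1],
      ARMAacf = ARMAacf(ma = theta, lag.max = q))
```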

```{example, name="Non-uniqueness of MA models"} One particular issue of MA models is the fact that they are not unique. In essence, one is not able to correctly tell if the process is of one model or another. Consider the following two models:

$$\begin{aligned}
\mathcal{M}_1:&{}& X_t &= W_t + \frac{1}{\theta}W_{t-1},&{}& W_t \sim \mathcal{N} (0, \sigma^2\theta^2) \\
\mathcal{M}_2:&{}& Y_t &= V_t + \theta V_{t-1},&{}& V_t \sim \mathcal{N} (0,\sigma^2)
\end{aligned}$$

By observation, one can note that the models share the same expectation:

\[E\left[ {{X_t}} \right] = E\left[ {{Y_t}} \right] = 0\]

However, deriving the autocovariance requires a bit more effort.

\begin{align}
\cov \left( {{X_t},{X_{t + h}}} \right) &= \cov \left( {{W_t} + \frac{1}{\theta }{W_{t - 1}},{W_{t + h}} + \frac{1}{\theta }{W_{t + h - 1}}} \right) = {1_{\left\{ {h = 0} \right\}}}\left( {{\sigma ^2}{\theta ^2} + {\sigma ^2}} \right) + {1_{\left\{ {\left| h \right| = 1} \right\}}}\frac{{{\sigma ^2}{\theta ^2}}}{\theta } = {\sigma ^2}\left( {{1_{\left\{ {h = 0} \right\}}}\left( {{\theta ^2} + 1} \right) + {1_{\left\{ {\left| h \right| = 1} \right\}}}\theta } \right) \\
\cov \left( {{Y_t},{Y_{t + h}}} \right) &= \cov \left( {{V_t} + \theta {V_{t - 1}},{V_{t + h}} + \theta {V_{t + h - 1}}} \right) = {1_{\left\{ {h = 0} \right\}}}\left( {{\sigma ^2} + {\sigma ^2}{\theta ^2}} \right) + {1_{\left\{ {\left| h \right| = 1} \right\}}}{\sigma ^2}\theta  = {\sigma ^2}\left( {{1_{\left\{ {h = 0} \right\}}}\left( {{\theta ^2} + 1} \right) + {1_{\left\{ {\left| h \right| = 1} \right\}}}\theta } \right)
\end{align}

Therefore, $\cov \left( {{X_t},{X_{t + h}}} \right) = \cov \left( {{Y_t},{Y_{t + h}}} \right)$! Moreover, since $W_t$ and $V_t$ are Gaussian, the two processes have exactly the same distribution and, thus, the models cannot be distinguished from the data. ```

The implication of the last example is rather profound. In particular, consider

\[\vec X = \left[ {\begin{array}{*{20}{c}} {{X_1}} \\ \vdots \\ {{X_N}} \end{array}} \right],\quad \vec Y = \left[ {\begin{array}{*{20}{c}} {{Y_1}} \\ \vdots \\ {{Y_N}} \end{array}} \right]\]

Thus, the covariance matrix is given by:

\[\cov \left( {\vec X} \right) = {\sigma ^2}\left[ {\begin{array}{*{20}{c}} {\left( {1 + {\theta ^2}} \right)}&\theta &0& \cdots &0 \\ \theta &{\left( {1 + {\theta ^2}} \right)}&\theta &{}& \vdots \\ 0&\theta &{\left( {1 + {\theta ^2}} \right)}&{}&{} \\ \vdots &{}&{}& \ddots &{} \\ 0& \cdots &{}&{}&{\left( {1 + {\theta ^2}} \right)} \end{array}} \right] = \cov \left( {\vec Y} \right) = \Omega \]

Now, let $\vec \beta$ denote the vector of parameters to be estimated and suppose that estimation is carried out by maximizing the (Gaussian) likelihood:

\[L\left( {\vec \beta |\vec X} \right) = {\left( {2\pi } \right)^{ - \frac{N}{2}}}{\left| \Omega  \right|^{ - \frac{1}{2}}}\exp \left( { - \frac{1}{2}{{\vec X}^T}{\Omega ^{ - 1}}\vec X} \right)\]

If, for the two models, we consider the parameter vectors ${{\vec \beta }_1} = \left[ {\begin{array}{*{20}{c}} \theta  \\ {{\sigma ^2}} \end{array}} \right]$ and ${{\vec \beta }_2} = \left[ {\begin{array}{*{20}{c}} {\frac{1}{\theta }} \\ {{\sigma ^2}{\theta ^2}} \end{array}} \right]$, then

\[L\left( {{{\vec \beta }_1}|\vec X} \right) = L\left( {{{\vec \beta }_2}|\vec X} \right)\]

Hence, the likelihood cannot distinguish between the two parametrizations, creating a serious identifiability problem for the parameter values. To ensure that this problem does not arise in practice, we impose the requirement of invertibility, i.e. the ability to rewrite an MA(q) process as an AR($\infty$) process.
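The identifiability issue can also be seen numerically; the short sketch below (with $\theta = 2$ and $\sigma^2 = 1$ chosen purely for illustration) shows that the two parametrizations imply identical autocovariances.

```r
# Two parametrizations of an MA(1) with identical autocovariance functions
theta  = 2
sigma2 = 1

acvf_ma1 = function(theta, sigma2) {
  c(gamma0 = sigma2 * (1 + theta^2), gamma1 = sigma2 * theta)
}

acvf_ma1(theta, sigma2)                 # (theta, sigma2) parametrization
acvf_ma1(1 / theta, sigma2 * theta^2)   # (1/theta, sigma2 * theta^2) parametrization
# Both return gamma(0) = 5 and gamma(1) = 2, so the two likelihoods coincide as well.
```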


