# Autocorrelation and Stationarity

```r
library(simts)
library(ggplot2)
library(wv)
```

"One of the first things taught in introductory statistics textbooks is that correlation is not causation. It is also one of the first things forgotten." – Thomas Sowell

In this chapter we will discuss and formalize how knowledge about $X_{t-1}$ (or more generally about all the information from the past, $\Omega_t$) can provide us with some information about the properties of $X_t$. In particular, we will consider the correlation (or covariance) of $X_t$ at different times such as $\text{corr} \left(X_t, X_{t+h}\right)$. This "form" of correlation (covariance) is called the autocorrelation (autocovariance) and is a very useful tool in time series analysis. However, if we do not assume that a time series is characterized by a certain form of "stability", it would be rather difficult to estimate $\text{corr} \left(X_t, X_{t+h}\right)$ as this quantity would depend on both $t$ and $h$, leading to more parameters to estimate than observations available. Therefore, the concept of stationarity is convenient in this context as it allows us (among other things) to assume that

\[\text{corr} \left(X_t, X_{t+h}\right) = \text{corr} \left(X_{t+j}, X_{t+h+j}\right), \;\;\; \text{for all $j$},\]

implying that the autocorrelation (or autocovariance) is only a function of the lag between observations, rather than time itself. These two concepts (i.e. autocorrelation and stationarity) will be discussed in this chapter. Before moving on, it is helpful to remember that correlation (or autocorrelation) is only appropriate to measure a very specific kind of dependence, i.e. the linear dependence. There are many other forms of dependence as illustrated in the bottom panels of the graph below, which all have a (true) zero correlation:

```r
knitr::include_graphics("images/corr_example.png")
```

Several other metrics have been introduced in the literature to assess the degree of "dependence" of two random variables, however this goes beyond the material discussed in this chapter.

## The Autocorrelation and Autocovariance Functions

```{definition, label="acvf"}
The *autocovariance function* of a series $(X_t)$ is defined as

\[{\gamma_x}\left( {t,t+h} \right) = \text{cov} \left( {{X_t},{X_{t+h}}} \right),\]

where the definition of covariance is given by:

  \[
    \text{cov} \left( {{X_t},{X_{t+h}}} \right) = \mathbb{E}\left[ {{X_t}{X_{t+h}}} \right] - \mathbb{E}\left[ {{X_t}} \right]\mathbb{E}\left[ {{X_{t+h}}} \right].
    \]

Similarly, the above expectations are defined to be:

  \[\begin{aligned}
     \mathbb{E}\left[ {{X_t}} \right] &= \int\limits_{ - \infty }^\infty  {x \cdot {f_t}\left( x \right)dx},  \\
     \mathbb{E}\left[ {{X_t}{X_{t+h}}} \right] &= \int\limits_{ - \infty }^\infty  {\int\limits_{ - \infty }^\infty  {{x_1}{x_2} \cdot f_{t,t+h}\left( {{x_1},{x_2}} \right)d{x_1}d{x_2}} } ,
     \end{aligned} \]

where ${f_t}\left( x \right)$ and $f_{t,t+h}\left( {{x_1},{x_2}} \right)$ denote,
respectively, the density of $X_t$ and the joint density of the 
pair $(X_t, X_{t+h})$. From this notation, it should be clear that $X_t$ is assumed to be a continuous random variable. Since we generally consider stochastic processes with
constant zero mean, we often have

\[{\gamma_x}\left( {t,t+h} \right) = \mathbb{E}\left[X_t X_{t+h} \right]. \]

In addition, we normally drop the subscript referring to the time
series (i.e. $x$ in this case) if it is clear from the context which time
series the autocovariance refers to. For example, we generally 
use ${\gamma}\left( {t,t+h} \right)$ instead
of ${\gamma_x}\left( {t,t+h} \right)$. Moreover, the notation is even further
simplified when the covariance of $X_t$ and $X_{t+h}$ is the same as that
of $X_{t+j}$ and $X_{t+h+j}$ (for all $j$), i.e. the covariance depends only 
on the time between observations and not on the specific time $t$. This is an 
important property called *stationarity*, which will be discussed in the next
section. In this case, we simply use the following notation:
  \[\gamma \left( {h} \right) = \text{cov} \left( X_t , X_{t+h} \right). \]

This notation will generally be used throughout the text and implies 
certain properties (i.e. stationarity) on the process $(X_t)$. 
```

Several remarks can be made on the autocovariance:

  1. The autocovariance function is *symmetric*. 
That is, ${\gamma}\left( {h} \right) = {\gamma}\left( -h \right)$ 
  since $\text{cov} \left( {{X_t},{X_{t+h}}} \right) = \text{cov} \left( X_{t+h},X_{t} \right)$.
2. The autocovariance function "contains" the variance of the process 
as $\text{var} \left( X_{t} \right) = {\gamma}\left( 0 \right)$.
3. We have that $|\gamma(h)| \leq \gamma(0)$ for all $h$. The proof of this 
inequality is direct and follows from the Cauchy-Schwarz inequality, i.e.
\[ \begin{aligned}
  \left(|\gamma(h)| \right)^2 &= \gamma(h)^2 = \left(\mathbb{E}\left[\left(X_t - \mathbb{E}[X_t] \right)\left(X_{t+h} - \mathbb{E}[X_{t+h}] \right)\right]\right)^2\\
  &\leq \mathbb{E}\left[\left(X_t - \mathbb{E}[X_t] \right)^2 \right] \mathbb{E}\left[\left(X_{t+h} - \mathbb{E}[X_{t+h}] \right)^2 \right] =  \gamma(0)^2. 
  \end{aligned}
  \] 
4. Just as any covariance, ${\gamma}\left( {h} \right)$ is "scale dependent"
since ${\gamma}\left( {h} \right) \in \mathbb{R}$, 
or $-\infty \le {\gamma}\left( {h} \right) \le +\infty$. We therefore have:
  - if $\left| {\gamma}\left( {h} \right) \right|$ is "close" to zero, 
then $X_t$ and $X_{t+h}$ are "weakly" (linearly) dependent;
- if $\left| {\gamma}\left( {h} \right) \right|$ is "far" from zero, 
then the two random variables present a "strong" (linear) dependence. 
However, it is generally difficult to assess what "close" and "far" from 
zero mean in this case. 
5. ${\gamma}\left( {h} \right)=0$ does not imply that $X_t$ and $X_{t+h}$ are 
independent but simply that $X_t$ and $X_{t+h}$ are uncorrelated. 
Independence is implied by ${\gamma}\left( {h} \right)=0$ only in
the jointly Gaussian case, as illustrated in the short simulation below.
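
To make this last remark concrete, the following minimal sketch (using only base R and standard normal draws) produces two variables that are strongly dependent and yet have a correlation close to zero:

```r
# X and Y = X^2 are uncorrelated (cov(X, X^2) = E[X^3] = 0 for a standard
# normal X) but clearly dependent, since Y is a deterministic function of X.
set.seed(123)
x = rnorm(10^5)
y = x^2
cor(x, y)    # close to 0
cor(x^2, y)  # equal to 1, revealing the (non-linear) dependence
```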

As hinted in the introduction, an important related statistic is the correlation
of $X_t$ with $X_{t+h}$ or *autocorrelation*, which is defined as

$$\rho \left(  h \right) = \text{corr}\left( {{X_t},{X_{t + h}}} \right) = \frac{{\text{cov}\left( {{X_t},{X_{t + h}}} \right)}}{{{\sigma _{{X_t}}}{\sigma _{{X_{t + h}}}}}} = \frac{\gamma(h) }{\gamma(0)}.$$

Similarly to $\gamma(h)$, it is important to note that the above notation 
implies that the autocorrelation function is only a function of the 
lag $h$ between observations. Thus, autocovariances and autocorrelations are one
possible way to describe the joint distribution of a time series. Indeed, the 
correlation of $X_t$ with $X_{t+1}$ is an obvious measure of how *persistent* a
time series is. 

Remember that just as with any correlation:

1. $\rho \left( h \right)$ is "scale free" so it is much easier to interpret 
than $\gamma(h)$.
2. $|\rho \left( h \right)| \leq 1$ since $|\gamma(h)| \leq \gamma(0)$.
3. **Causation and correlation are two very different things!**

### A Fundamental Representation

Autocovariances and autocorrelations also turn out to be very useful tools as
they are one of the *fundamental representations* of time series. Indeed, if we
consider a zero mean normally distributed process, it is clear that its joint
distribution is fully characterized by the 
autocovariances $\mathbb{E}[X_t X_{t+h}]$ (since the joint probability density
only depends on these covariances). Once we know the autocovariances we
know *everything* there is to know about the process and therefore:

*if two processes have the same autocovariance function, then they are the 
same process.*

### Admissible Autocorrelation Functions

Since the autocorrelation is related to a fundamental representation of time
series, it implies that one might be able to define a stochastic process by 
picking a set of autocorrelation values (assuming, for example,
that $\text{var}(X_t) = 1$). However, it turns out that not every collection of
numbers, say $\{\rho_1, \rho_2, ...\}$, can represent the autocorrelation of a
process. Indeed, two conditions are required to ensure the validity of an
autocorrelation sequence:

1. $\operatorname{max}_j \; | \rho_j| \leq 1$.
2. $\text{var} \left[\sum_{j = 0}^\infty \alpha_j X_{t-j} \right] \geq 0 \;$ for all $\{\alpha_0, \alpha_1, ...\}$.

The first condition is obvious and simply reflects the fact 
that $|\rho \left( h \right)| \leq 1$ but the second is far more difficult to
verify. To further our understanding of the latter we let $\alpha_j = 0$ for $j > 1$,
then condition two implies that

\[\text{var} \left[ \alpha_0 X_{t} + \alpha_1 X_{t-1}  \right] = \gamma_0 \begin{bmatrix}
   \alpha_0 & \alpha_1
   \end{bmatrix}   \begin{bmatrix}
   1 & \rho_1\\
   \rho_1 & 1
   \end{bmatrix} \begin{bmatrix}
   \alpha_0 \\
   \alpha_1
   \end{bmatrix} \geq 0. \]

Thus, the matrix 

\[ \boldsymbol{A}_1 = \begin{bmatrix}
  1 & \rho_1\\
  \rho_1 & 1
  \end{bmatrix} \]

must be positive semi-definite. Taking the determinant we have 

\[\operatorname{det} \left(\boldsymbol{A}_1\right) = 1 - \rho_1^2 \]

implying that the condition $|\rho_1| \leq 1$ must be respected. 
Now, let $\alpha_j = 0$ for $j > 2$, then we must verify that:

  \[\text{var} \left[ \alpha_0 X_{t} + \alpha_1 X_{t-1}  + \alpha_2 X_{t-2} \right] = \gamma_0 \begin{bmatrix}
     \alpha_0 & \alpha_1 &\alpha_2
     \end{bmatrix}   \begin{bmatrix}
     1 & \rho_1 & \rho_2\\
     \rho_1 & 1 & \rho_1 \\
     \rho_2 & \rho_1 & 1
     \end{bmatrix} \begin{bmatrix}
     \alpha_0 \\
     \alpha_1 \\
     \alpha_2
     \end{bmatrix} \geq 0. \]

Again, this implies that the matrix

\[ \boldsymbol{A}_2 = \begin{bmatrix}
  1 & \rho_1 & \rho_2\\
  \rho_1 & 1 & \rho_1 \\
  \rho_2 & \rho_1 & 1
  \end{bmatrix} \]

must be positive semi-definite and it is easy to verify that

\[\operatorname{det} \left(\boldsymbol{A}_2\right) = \left(1 - \rho_2 \right)\left(- 2 \rho_1^2 + \rho_2 + 1\right). \]

Thus, this implies that 

\[\begin{aligned} &- 2 \rho_1^2 + \rho_2 + 1 \geq 0 \Rightarrow 1 \geq \rho_2 \geq 2 \rho_1^2 - 1 \\
   &\Rightarrow 1 - \rho_1^2 \geq \rho_2 - \rho_1^2 \geq -(1 - \rho_1^2)\\
   &\Rightarrow 1 \geq \frac{\rho_2 - \rho_1^2 }{1 - \rho_1^2} \geq -1.
   \end{aligned}\]

Therefore, $\rho_1$ and $\rho_2$ must lie in a parabolic shaped region defined
by the above inequalities as illustrated in Figure \@ref(fig:admissibility).

```r
plot(NA, xlim = c(-1.1,1.1), ylim = c(-1.1,1.1), xlab = expression(rho[1]),
     ylab = expression(rho[2]), cex.lab = 1.5)
grid()

# Adding boundary of constraint |rho_1| < 1
abline(v = c(-1,1), lty = 2, col = "darkgrey")

# Adding boundary of constraint |rho_2| < 1
abline(h = c(-1,1), lty = 2, col = "darkgrey")

# Adding boundary of non-linear constraint
rho1 = seq(from = -1, to = 1, length.out = 10^3)
rho2 = (rho1^2 - 1) + rho1^2 
lines(rho1, rho2, lty = 2, col = "darkgrey")

# Adding admissible region
polygon(c(rho1,rev(rho1)),c(rho2,rep(1,10^3)),
        border = NA, col= rgb(0,0,1, alpha = 0.1))

# Adding text
text(0,0, c("Admissible Region"))
```

From our derivation, it is clear that the restrictions on the autocorrelation are very complicated, thereby justifying the need for other forms of fundamental representation which we will explore later in this text. Before moving on to the estimation of the autocorrelation and autocovariance functions, we must first discuss the stationarity of $(X_t)$, which will provide a convenient framework in which $\gamma(h)$ and $\rho(h)$ can be used (rather than $\gamma(t,t+h)$ for example) and (easily) estimated.

## Stationarity {#stationary}

There are two kinds of stationarity that are commonly used. They are defined as follows:

```{definition, label="strongstationarity"}
A process $(X_t)$ is *strongly stationary* or *strictly stationary* if the joint probability distribution of $(X_{t-h}, ..., X_t, ..., X_{t+h})$ is independent of $t$ for all $h$.
```

```{definition, label="weakstationarity"}
A process $(X_t)$ is *weakly stationary*, *covariance stationary* or *second order stationary* if $\mathbb{E}[X_t]$, $\mathbb{E}[X_t^2]$ are finite and $\mathbb{E}[X_t X_{t-h}]$ depends only on $h$ and not on $t$.
```

These types of stationarity are not equivalent and the presence of one kind of stationarity does not imply the other. That is, a time series can be strongly stationary but not weakly stationary and vice versa. In some cases, a time series can be both strongly and weakly stationary and this occurs, for example, in the (jointly) Gaussian case. Stationarity of $(X_t)$ matters because it provides the framework in which averaging dependent data makes sense, thereby allowing us to easily obtain estimates for certain quantities such as autocorrelation.

Several remarks and comments can be made on these definitions:

- Strong stationarity *does not imply* weak stationarity, as shown in the following example. 

```{example, label="strongnotweak"}
An $iid$ Cauchy process is strongly but not weakly stationary since its second moment (and indeed its mean) does not exist.
```

- Weak stationarity *does not imply* strong stationarity, as the next example illustrates. 

```{example, label="weaksplit"}
Consider the following weak white noise process:
  \begin{equation*}
X_t = \begin{cases}
U_{t}      & \quad \text{if } t \in \{2k:\, k\in \mathbb{Z} \}, \\
V_{t}      & \quad \text{if } t \in \{2k+1:\, k\in \mathbb{Z} \},\\
\end{cases}
\end{equation*}
where ${U_t} \mathop \sim \limits^{iid} N\left( {1,1} \right)$ and ${V_t}\mathop \sim \limits^{iid} \mathcal{E}\left( 1 \right)$. This process is weakly stationary but is *not* strongly stationary since the marginal distribution of $X_t$ changes with the parity of $t$.
```
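
To help visualize this example, the following minimal sketch (using only base R) generates a long realization of the above process and verifies that its first two moments do not depend on the parity of $t$, even though the marginal distributions do:

```r
# Simulate the process: U_t ~ N(1,1) at even times, V_t ~ Exp(1) at odd times.
set.seed(1982)
n = 10^4
Xt = numeric(n)
even = seq_len(n) %% 2 == 0
Xt[even]  = rnorm(sum(even), mean = 1, sd = 1)  # U_t
Xt[!even] = rexp(sum(!even), rate = 1)          # V_t

# Same mean and variance at even and odd times (weak stationarity)...
c(mean(Xt[even]), mean(Xt[!even]))
c(var(Xt[even]), var(Xt[!even]))

# ...but different marginal distributions (no strong stationarity)
par(mfrow = c(1, 2))
hist(Xt[even], main = "Even t (Gaussian)", xlab = " ")
hist(Xt[!even], main = "Odd t (Exponential)", xlab = " ")
```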

## Assessing Weak Stationarity of Time Series Models

It is important to understand how to verify if a postulated model is (weakly) stationary. In order to do so, we must ensure that our model satisfies the following three properties:

  1. $\mathbb{E}\left[X_t \right] = \mu_t = \mu < \infty$,
  2. $\text{var}\left[X_t \right] = \sigma^2_t = \sigma^2 < \infty$,
  3. $\text{cov}\left(X_t, X_{t+h} \right) = \gamma \left(h\right)$.

In the following examples, we evaluate the stationarity of the processes introduced in Section \@ref(basicmodels).

```{example, label="gwn", name = "Gaussian White Noise"}
It is easy to verify that this process is stationary. Indeed, we have:

  1. $\mathbb{E}\left[ {{X_t}} \right] = 0$,
  2. $\gamma(0) = \sigma^2 < \infty$,
  3. $\gamma(h) = 0$ for $|h| > 0$.
```

```{example, label="srw", name = "Random Walk"}
To evaluate the stationarity of this process, we first derive its properties:

1. We begin by calculating the expectation of the process:
\[
  \mathbb{E}\left[ {{X_t}} \right] = \mathbb{E}\left[ {{X_{t - 1}} + {W_t}} \right]
  = \mathbb{E}\left[ {\sum\limits_{i = 1}^t {{W_i}}  + {X_0}} \right] 
  = \mathbb{E}\left[ {\sum\limits_{i = 1}^t {{W_i}} } \right] + {c} 
  = c.  \] 

Observe that the mean obtained is constant since it depends only on the value
of the first term in the sequence.

2. Next, after finding the mean to be constant, we calculate the variance to check stationarity:
  \[\begin{aligned}
     \text{var}\left( {{X_t}} \right) &= \text{var}\left( {\sum\limits_{i = 1}^t {{W_i}}  + {X_0}} \right) 
     = \text{var}\left( {\sum\limits_{i = 1}^t {{W_i}} } \right) + \underbrace {\text{var}\left( {{X_0}} \right)}_{= 0} \\
     &= \sum\limits_{i = 1}^t {\text{var}\left( {{W_i}} \right)} 
     = t \sigma_w^2,
     \end{aligned}\] 
where $\sigma_w^2 = \text{var}(W_t)$. Therefore, the variance depends on time $t$, contradicting our
second property. Moreover, we have:
  \[\mathop {\lim }\limits_{t \to \infty } \; \text{var}\left(X_t\right) = \infty.\]
This process is therefore not weakly stationary.

3. Regarding the autocovariance of a random walk, we have:
  \[\begin{aligned}
     \gamma \left( h \right) &= \text{cov}\left( {{X_t},{X_{t + h}}} \right) 
     = \text{cov}\left( {\sum\limits_{i = 1}^t {{W_i}} ,\sum\limits_{j = 1}^{t + h} {{W_j}} } \right) 
     = \text{cov}\left( {\sum\limits_{i = 1}^t {{W_i}} ,\sum\limits_{j = 1}^t {{W_j}} } \right)\\ 
     &= \min \left( {t,t + h} \right)\sigma _w^2
     = \left( {t + \min \left( {0,h} \right)} \right)\sigma _w^2,
     \end{aligned} \]
which further illustrates the non-stationarity of this process.

Moreover, the autocorrelation of this process is given by

\[\rho (h) = \frac{t + \min \left( {0,h} \right)}{\sqrt{t}\sqrt{t+h}},\]

implying (for a fixed $h$) that

\[\mathop {\lim }\limits_{t \to \infty } \; \rho(h) = 1.\]

In the following simulated example, we illustrate the non-stationary feature of such a process:
```

```r
# Number of simulated processes
B = 200

# Length of random walks
n = 1000

# Output matrix
out = matrix(NA,B,n)

# Set seed for reproducibility
set.seed(6182)

# Simulate Data
for (i in seq_len(B)){
  # Simulate random walk
  Xt = gen_gts(n, RW(gamma = 1))

  # Store process
  out[i,] = Xt
}

# Plot random walks
plot(NA, xlim = c(1,n), ylim = range(out), xlab = "Time", ylab = " ")
grid()
color = sample(topo.colors(B, alpha = 0.5))
grid()
for (i in seq_len(B)){
  lines(out[i,], col = color[i])
}

# Add 95% confidence region
lines(1:n, 1.96*sqrt(1:n), col = 2, lwd = 2, lty = 2)
lines(1:n, -1.96*sqrt(1:n), col = 2, lwd = 2, lty = 2)
```

In the plot above, two hundred simulated random walks are plotted along with the theoretical 95% confidence intervals (red dashed lines). The relationship between time and variance can clearly be observed (i.e. the variance of the process increases with time).

```{example, label="exma1", name = "Moving Average of Order 1"}

Similarly to our previous examples, we attempt to verify the stationary properties for the MA(1) model defined in Section \@ref(ma1):

  1. \[ \mathbb{E}\left[ {{X_t}} \right] = \mathbb{E}\left[ {{\theta_1}{W_{t - 1}} + {W_t}} \right] = {\theta_1} \mathbb{E} \left[ {{W_{t - 1}}} \right] + \mathbb{E}\left[ {{W_t}} \right] = 0. \]
  2. \[\text{var} \left( {{X_t}} \right) = \theta_1^2 \text{var} \left( W_{t - 1}\right) + \text{var} \left( W_{t}\right) = \left(1 + \theta_1^2 \right) \sigma^2_w.\]
  3. Regarding the autocovariance, we have
  \[\begin{aligned} 
  \text{cov}\left( X_t, X_{t + h} \right) &= \mathbb{E}\left[ \left( X_t - \mathbb{E}\left[ X_t \right] \right)\left( X_{t + h} - \mathbb{E}\left[ X_{t + h} \right] \right) \right] = \mathbb{E}\left[ X_t X_{t + h} \right] \\
  &= \mathbb{E}\left[ \left( \theta_1 W_{t - 1} + W_t \right)\left( \theta_1 W_{t + h - 1} + W_{t + h} \right) \right] \\
  &= \mathbb{E}\left[ \theta_1^2 W_{t - 1} W_{t + h - 1} + \theta_1 W_t W_{t + h - 1} + \theta_1 W_{t - 1} W_{t + h} + W_t W_{t + h} \right].
  \end{aligned}\]
  It is easy to see that $\mathbb{E}\left[ W_t W_{t + h} \right] = \boldsymbol{1}_{\left\{ h = 0 \right\}}\sigma _w^2$ and therefore we obtain
  \[\text{cov} \left( X_t, X_{t + h} \right) = \left( \theta_1^2 \, \boldsymbol{1}_{\left\{ h = 0 \right\}} + \theta_1 \, \boldsymbol{1}_{\left\{ h = 1 \right\}} + \theta_1 \, \boldsymbol{1}_{\left\{ h = -1 \right\}} + \boldsymbol{1}_{\left\{ h = 0 \right\}} \right)\sigma _w^2,\]
  implying the following autocovariance function:
  \[\gamma \left( h \right) = \begin{cases} \left( \theta_1^2 + 1 \right)\sigma _w^2 & h = 0, \\ \theta_1 \sigma _w^2 & \left| h \right| = 1, \\ 0 & \left| h \right| > 1. \end{cases}\]
  Therefore, an MA(1) process is weakly stationary since both the mean and variance are constant over time and its autocovariance is only a function of the lag $h$. Finally, we can easily obtain the autocorrelation for this process, which is given by
  \[\rho \left( h \right) = \begin{cases} 1 & h = 0, \\ \frac{\theta_1 \sigma _w^2}{\left( \theta_1^2 + 1 \right)\sigma _w^2} = \frac{\theta_1}{\theta_1^2 + 1} & \left| h \right| = 1, \\ 0 & \left| h \right| > 1. \end{cases}\]
  Interestingly, we can note that $|\rho(1)| \leq 0.5$.
```

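As a quick numerical check of this derivation, the sketch below simulates a long MA(1) and compares the empirical lag-one autocorrelation with $\theta_1/(1 + \theta_1^2)$; the `MA1()` constructor is assumed to be available from the simts package loaded at the start of this chapter.

```r
# Compare the empirical rho(1) of a simulated MA(1) with its theoretical value.
library(simts)
set.seed(18)
theta = 0.9
Xt = gen_gts(10^4, MA1(theta = theta, sigma2 = 1))
rho1_hat  = acf(as.numeric(Xt), plot = FALSE)$acf[2]  # empirical rho(1)
rho1_theo = theta / (1 + theta^2)                     # theoretical rho(1)
c(rho1_hat, rho1_theo)
```
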
```{example, label="exar1", name = "Autoregressive of Order 1"}
  As another example, we shall verify the stationary properties for the AR(1)
  model defined in Section \@ref(ar1).

  Using the *backsubstitution* technique, we can rearrange an AR(1) process so 
  that it is written in a more compact form, i.e.

  \[\begin{aligned}
     {X_t} & =  {\phi }{X_{t - 1}} + {W_t} = \phi \left[ {\phi {X_{t - 2}} + {W_{t - 1}}} \right] + {W_t} 
     =  {\phi ^2}{X_{t - 2}} + \phi {W_{t - 1}} + {W_t}  \\
     &  \vdots  \\
     & =  {\phi ^k}{X_{t-k}} + \sum\limits_{j = 0}^{k - 1} {{\phi ^j}{W_{t - j}}} .
     \end{aligned} \]

  By taking the limit in $k$ (which is perfectly valid as we
  assume $t \in \mathbb{Z}$) and assuming $|\phi|<1$, we obtain

  \[\begin{aligned}
     X_t = \mathop {\lim }\limits_{k \to \infty} \; {X_t}  =  \sum\limits_{j = 0}^{\infty} {{\phi ^j}{W_{t - j}}} 
     \end{aligned} \]

  and therefore such process can be interpreted as a linear combination of the 
  white noise $(W_t)$ and corresponds (as we will observe later on) to an MA($\infty$). 
  In addition, the requirement $\left| \phi  \right| < 1$ turns out to be 
  extremely useful as the above formula is related to a **geometric series**, which
  would diverge if $|\phi| \geq 1$. Indeed, remember that
  an infinite (converging) geometric series is given by

  \[\sum\limits_{k = 0}^\infty  \, a{{r^k}}  = \frac{a}{{1 - r}}, \; {\text{ if }}\left| r \right| < 1.\]

  <!--
    The origin of the requirement comes from needing to ensure that the characteristic polynomial solution for an AR1 lies outside of the unit circle. Subsequently, stability enables the process to be stationary. If $\phi  \ge 1$, the process would not converge. Under the requirement, the process can represented as a
  -->

With this setup, we demonstrate how crucial this property is by calculating each of the requirements of a stationary process.

  1. First, we will check if the mean is stationary. 
  In this case, we opt to use limits to derive the expectation
  \[\begin{aligned}
     \mathbb{E}\left[ {{X_t}} \right] &= \mathop {\lim }\limits_{k \to \infty } \mathbb{E}\left[ {{\phi^k}{X_{t-k}} + \sum\limits_{j = 0}^{k - 1} {\phi^j{W_{t - j}}} } \right] \\
     &= \mathop {\lim }\limits_{k \to \infty } \, \underbrace {{\phi ^k}{\mathbb{E}[X_{t-k}]}}_{= 0} + \mathop {\lim }\limits_{k \to \infty } \, \sum\limits_{j = 0}^{k - 1} {\phi^j\underbrace {\mathbb{E}\left[ {{W_{t - j}}} \right]}_{ = 0}}
     = 0.
     \end{aligned} \] As expected, the mean is zero and, hence, the first criteria for weak stationarity is satisfied. 
  2. Next, we opt to determine the variance of the process
  \[\begin{aligned}
     \text{var}\left( {{X_t}} \right) &= \mathop {\lim }\limits_{k \to \infty } \text{var}\left( {{\phi^k}{X_{t-k}} + \sum\limits_{j = 0}^{k - 1} {\phi^j{W_{t - j}}} } \right)
     = \mathop {\lim }\limits_{k \to \infty } \sum\limits_{j = 0}^{k - 1} {\phi ^{2j} \text{var}\left( {{W_{t - j}}} \right)}  \\
     &= \mathop {\lim }\limits_{k \to \infty } \sum\limits_{j = 0}^{k - 1} \sigma _W^2 \, {\phi ^{2j}}  =  
       \underbrace {\frac{\sigma _W^2}{{1 - {\phi ^2}}}}_{\text{geometric series}}.
     \end{aligned} \] Once again, the above result only holds because we are able to use the geometric series convergence as a result of $\left| \phi  \right| < 1$.
  3. Finally, we consider the autocovariance of an AR(1). For $h > 0$, we have
  \[\gamma \left( h \right) =  \text{cov}\left( {{X_t},{X_{t + h}}} \right) = \phi \text{cov}\left( {{X_t},{X_{t + h - 1}}} \right) = \phi \, \gamma \left( h-1 \right).\]
  Therefore, using the symmetry of autocovariance, we find that
  \[\gamma \left( h \right) = \phi^{|h|} \, \gamma(0).\]

  Both the mean and variance do not depend on time. In addition, the autocovariance function can be viewed as a function dependent entirely on lags and, thus, the AR(1) process is weakly stationary if $\left| \phi  \right| < 1$.  Lastly, we can obtain the autocorrelation for this process. Indeed, for $h > 0$, we have

  \[\rho \left( h \right) = \frac{{\gamma \left( h \right)}}{{\gamma \left( 0 \right)}} = \frac{{\phi \gamma \left( {h - 1} \right)}}{{\gamma \left( 0 \right)}} = \phi \rho \left( {h - 1} \right).\]

  After fully simplifying, we obtain

  \[\rho \left( h \right) = {\phi^{|h|}}.\]

  Thus, the autocorrelation function of an AR(1) exhibits a _geometric decay_,
  meaning that the smaller $|\phi|$ is, the faster the autocorrelation reaches zero,
  while if $|\phi|$ is close to 1, the decay is slower.
```
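
This geometric decay can also be checked numerically. The following sketch (again assuming the `AR1()` constructor from simts) compares the empirical autocorrelations of a simulated AR(1) with the theoretical values $\phi^{|h|}$:

```r
# Empirical versus theoretical ACF for an AR(1) with phi = 0.8.
library(simts)
set.seed(27)
phi = 0.8
Xt = gen_gts(10^4, AR1(phi = phi, sigma2 = 1))
h = 0:10
rho_hat  = acf(as.numeric(Xt), lag.max = 10, plot = FALSE)$acf[, 1, 1]
rho_theo = phi^h
round(cbind(h, rho_hat, rho_theo), 3)
```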

## Estimation of Moments of Stationary Processes

In this section, we discuss how the moments and related quantities of a stationary process can be estimated. Informally speaking, the use of "averages" is meaningful for such processes, suggesting that classical moment estimators can be employed. Indeed, suppose that one is interested in estimating $\alpha \equiv \mathbb{E}[m (X_t)]$, where $m(\cdot)$ is a known function of $X_t$. If $X_t$ is a strongly stationary process, we have

\[\alpha = \int m(x) \, f(x) \, dx,\]

where $f(x)$ denotes the density of $X_t, \; \forall t$. Replacing $f(x)$ by $f_n(x)$, the empirical density, we obtain the following estimator

\[\hat{\alpha} = \frac{1}{n} \sum_{i = 1}^n m\left(x_i\right).\]
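
As a minimal illustration of this idea (using the AR(1) model and simts functions introduced earlier), the sketch below estimates the second moment $\beta_2 = \mathbb{E}[X_t^2]$ by a time average and compares it with the stationary variance $\sigma_w^2/(1 - \phi^2)$ derived in the previous section:

```r
# Time-average estimate of E[X_t^2] for a zero-mean AR(1).
library(simts)
set.seed(55)
phi = 0.5
Xt = gen_gts(10^4, AR1(phi = phi, sigma2 = 1))
beta2_hat  = mean(as.numeric(Xt)^2)  # empirical time average
beta2_theo = 1 / (1 - phi^2)         # stationary variance with sigma_w^2 = 1
c(beta2_hat, beta2_theo)
```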

In the next subsection, we examine how this simple idea can be used to estimate the mean, autocovariance and autocorrelation functions. Moreover, we discuss some of the properties of these estimators.

### Estimation of the Mean Function

If a time series is stationary, the mean function is constant and a possible estimator of this quantity is, as discussed above, given by

\[\bar{X} = \frac{1}{n}\sum\limits_{t = 1}^n {{X_t}}.\]

Naturally, the $k$-th moment, say $\beta_k \equiv \mathbb{E}[X_t^k]$ can be estimated by

\[\hat{\beta}_k = \frac{1}{n}\sum\limits_{t = 1}^n X_t^k, \;\; k \in \left\{x \in \mathbb{N} : \, 0 < x < \infty \right\}.\]

The variance of such an estimator can be derived as follows:

\begin{equation} 
\begin{aligned} 
\text{var} \left( \hat{\beta}_k \right) &= \text{var} \left( \frac{1}{n}\sum\limits_{t = 1}^n X_t^k \right) \\
&= \frac{1}{n^2}\,\text{var} \left( \begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}_{1 \times n} \begin{bmatrix} X_1^k \\ \vdots \\ X_n^k \end{bmatrix}_{n \times 1} \right) \\
&= \frac{1}{n^2} \begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}_{1 \times n} \boldsymbol{\Sigma}(k) \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}_{n \times 1}, 
\end{aligned} 
(\#eq:chap2VarMoment)
\end{equation}

where $\boldsymbol{\Sigma}(k) \in \mathbb{R}^{n \times n}$ and its $i$-th, $j$-th element is given by

\[ \left(\boldsymbol{\Sigma}(k)\right)_{i,j} = \text{cov} \left(X_i^k, X_j^k\right).\]

In the case $k = 1$, \@ref(eq:chap2VarMoment) can easily be further simplified. Indeed, we have

\[\begin{aligned} 
\text{var} \left( {\bar X} \right) &= \text{var} \left( \frac{1}{n}\sum\limits_{t = 1}^n X_t \right) \\
&= \frac{1}{n^2} \begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}_{1 \times n} \begin{bmatrix} \gamma \left( 0 \right) & \gamma \left( 1 \right) & \cdots & \gamma \left( {n - 1} \right) \\ \gamma \left( 1 \right) & \gamma \left( 0 \right) & & \vdots \\ \vdots & & \ddots & \vdots \\ \gamma \left( {n - 1} \right) & \cdots & \cdots & \gamma \left( 0 \right) \end{bmatrix}_{n \times n} \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}_{n \times 1} \\
&= \frac{1}{n^2}\left( n\gamma \left( 0 \right) + 2\left( n - 1 \right)\gamma \left( 1 \right) + 2\left( n - 2 \right)\gamma \left( 2 \right) + \cdots + 2\gamma \left( n - 1 \right) \right) \\
&= \frac{1}{n}\sum\limits_{h =  - n}^n \left( 1 - \frac{\left| h \right|}{n} \right)\gamma \left( h \right). 
\end{aligned}\]

Obviously, when $X_t$ is a white noise process, the above formula reduces to the usual $\text{var} \left( {\bar X} \right) = \sigma^2_w/n$. In the following example, we consider the case of an AR(1) process and discuss how $\text{var} \left( {\bar X} \right)$ can be obtained or estimated.

```{example, label="exactvbootstrap"}
For an AR(1), we have $\gamma(h) = \phi^{|h|} \sigma_w^2 \left(1 - \phi^2\right)^{-1}$. Therefore, we obtain (after some computations):

\begin{equation} 
\text{var} \left( {\bar X} \right) = \frac{\sigma_w^2 \left( n - 2\phi - n \phi^2 + 2 \phi^{n + 1}\right)}{n^2\left(1-\phi^2\right)\left(1-\phi\right)^2}. 
(\#eq:chap2_exAR1)
\end{equation}

Unfortunately, deriving such an exact formula is often difficult when considering more complex models. However, asymptotic approximations are often employed to simplify the calculation. For example, in our case we have

\[\mathop {\lim }\limits_{n \to \infty } \; n \, \text{var} \left( {\bar X} \right) = \frac{\sigma_w^2}{\left(1-\phi\right)^2},\]

providing the following approximate formula:

\[\text{var} \left( {\bar X} \right) \approx \frac{\sigma_w^2}{n \left(1-\phi\right)^2}.\]

Alternatively, simulation methods can also be employed. For example, a possible strategy would be the parametric bootstrap described below.
```

```{theorem, label="parabootstrap", name = "Parametric Bootstrap"}
1. Simulate a new sample under the postulated model, 
    i.e. $X_t^* \sim F_{\boldsymbol{\theta}}$ (*note:* if $\boldsymbol{\theta}$ is unknown it can be
    replaced by $\hat{\boldsymbol{\theta}}$, a suitable estimator).
2. Compute the statistics of interest on the simulated 
    sample $(X_t^*)$.
3. Repeat Steps 1 and 2 $B$ times where $B$ is sufficiently "large" 
    (typically $100 \leq B \leq 10000$).
4. Compute the empirical variance of the statistics of interest based on 
    the $B$ independent replications. 
```

In our example, the statistic of interest computed on each simulated sample $(X_t^*)$ is the sample mean $\bar{X}^*$ and we seek to obtain:

\[\hat{\sigma}^2_B = \frac{1}{B-1} \sum_{i = 1}^B \left(\bar{X}^*_i - \bar{X}^* \right)^2, \;\;\; \text{where} \;\;\; \bar{X}^* = \frac{1}{B} \sum_{i=1}^B \bar{X}^*_i,\]

where $\bar{X}^*_i$ denotes the value of the mean estimated on the $i$-th simulated sample.
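
As an illustration, a minimal sketch of these four steps for a single AR(1) setting (with $\phi$ and $\sigma_w^2$ assumed known) could look as follows:

```r
# Parametric bootstrap estimate of var(Xbar) for an AR(1) with n = 10.
library(simts)
set.seed(1776)
n = 10; phi = 0.5; B = 1000
xbar.star = rep(NA, B)
for (b in seq_len(B)){
  # Step 1: simulate a new sample under the postulated model
  Xt.star = gen_gts(n, AR1(phi = phi, sigma2 = 1))
  # Step 2: compute the statistic of interest
  xbar.star[b] = mean(as.numeric(Xt.star))
}
# Steps 3 and 4: B replications and their empirical variance
var(xbar.star)
```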

The figure below, generated by the following code, compares these three methods for $n = 10$, $B = 5000$, $\sigma_w^2 = 1$ and a grid of values for $\phi$ going from $-0.95$ to $0.95$:

```r
# Define sample size
n = 10

# Number of Monte-Carlo replications
B = 5000

# Define grid of values for phi
phi = seq(from = 0.95, to = -0.95, length.out = 30)

# Define result matrix
result = matrix(NA,B,length(phi))

# Start simulation
for (i in seq_along(phi)){
  # Define model
  model = AR1(phi = phi[i], sigma2 = 1)

  # Monte-Carlo
  for (j in seq_len(B)){
    # Simulate AR(1)
    Xt = gen_gts(n, model)

    # Estimate Xbar
    result[j,i] = mean(Xt)
  }
}

# Estimate variance of Xbar
var.Xbar = apply(result,2,var)

# Compute theoretical variance
var.theo = (n - 2*phi - n*phi^2 + 2*phi^(n+1))/(n^2*(1-phi^2)*(1-phi)^2)

# Compute (approximate) variance
var.approx = 1/(n*(1-phi)^2)

# Compare variance estimations
plot(NA, xlim = c(-1,1), ylim = range(var.approx), log = "y", 
     ylab = expression(paste("var(", bar(X), ")")),
     xlab= expression(phi), cex.lab = 1)
grid()
lines(phi,var.theo, col = "deepskyblue4")
lines(phi, var.Xbar, col = "firebrick3")
lines(phi,var.approx, col = "springgreen4")
legend("topleft",c("Theoretical variance","Bootstrap variance","Approximate variance"), 
       col = c("deepskyblue4","firebrick3","springgreen4"), lty = 1,
       bty = "n",bg = "white", box.col = "white", cex = 1.2)
```

It can be observed that the variance of $\bar{X}$ typically increases with $\phi$. As expected, when $\phi = 0$ we have $\text{var}(\bar{X}) = 1/n$ since in this case the process is a white noise. Moreover, the bootstrap approach appears to approximate well the curve of (\@ref(eq:chap2_exAR1)), while the asymptotic formula provides a reasonable approximation for $\phi$ between -0.5 and 0.5. Naturally, the quality of this approximation would be far better for a larger sample size (here we consider $n = 10$, which is a little "extreme").

### Sample Autocovariance and Autocorrelation Functions

A natural estimator of the autocovariance function is given by:

\[\hat \gamma \left( h \right) = \frac{1}{T}\sum\limits_{t = 1}^{T - h} {\left( {{X_t} - \bar X} \right)\left( {{X_{t + h}} - \bar X} \right)}, \]

leading to the following "plug-in" estimator of the autocorrelation function:

\[\hat \rho \left( h \right) = \frac{{\hat \gamma \left( h \right)}}{{\hat \gamma \left( 0 \right)}}.\]

A graphical representation of the autocorrelation function is often the first step for any time series analysis (again assuming the process to be stationary). Consider the following simulated example:

```r
# Set seed for reproducibility
set.seed(2241)

# Simulate 100 observation from a Gaussian white noise
Xt = gen_gts(100, WN(sigma2 = 1))

# Compute autocorrelation
acf_Xt = ACF(Xt)

# Plot autocorrelation
plot(acf_Xt, show.ci = FALSE)
```

In this example, the true autocorrelation is equal to zero at any lag $h \neq 0$, but obviously the estimated autocorrelations are random variables and are not equal to their true values. It would therefore be useful to have some knowledge about the variability of the sample autocorrelations (under some conditions) to assess whether the data comes from a completely random series or presents some significant correlation at certain lags. The following result provides an asymptotic solution to this problem:

```{theorem, label="approxnormal"}
If $X_t$ is a strong white noise with finite fourth moment, then $\hat{\rho}(h)$ is approximately normally distributed with mean $0$ and variance $n^{-1}$ for all fixed $h$.
```

The proof of this Theorem is given in Appendix \@ref(appendixa).

Using this result, we now have an approximate method to assess whether peaks in
the sample autocorrelation are significant by determining whether the observed
peak lies outside the interval $\pm 2/\sqrt{T}$ (i.e. an approximate 95%
confidence interval). Returning to our previous example and adding confidence
bands to the previous graph, we obtain:

```r
# Plot autocorrelation with confidence bands 
plot(acf_Xt)
```

It can now be observed that most peaks lie within the interval $\pm 2/\sqrt{T}$ suggesting that the true data generating process is uncorrelated.

```{example, label = "acffeatures"}
To illustrate how the autocorrelation function can be used to reveal some "features" of a time series, we download the level of the Standard & Poor's 500 index, often abbreviated as the S&P 500. This financial index is based on the market capitalization of 500 large companies having common stock listed on the New York Stock Exchange or the NASDAQ Stock Market. The graph below shows the index level and daily returns from 1990.
```

```r
# Load package
library(quantmod)

# Download S&P index
getSymbols("^GSPC", from="1990-01-01", to = Sys.Date())

# Compute returns
GSPC.ret = ClCl(GSPC)

# Plot index level and returns
par(mfrow = c(1,2))
plot(GSPC, main = " ", ylab = "Index level")
plot(GSPC.ret, main = " ", ylab = "Daily returns")
```

From these graphs, it is clear that the returns are not identically distributed as the variance seems to vary with time, and clusters with either high or low volatility can be observed. These characteristics of financial time series are well known and, in Chapter 5, we will discuss how the variance of such a process can be approximated. Nevertheless, we compute the empirical autocorrelation function of the S&P 500 returns to evaluate the degree of "linear" dependence between observations. The graph below presents the empirical autocorrelation.

```r
sp500 = na.omit(GSPC.ret)
names(sp500) = paste("S&P 500 (1990-01-01 - ",Sys.Date(),")", sep = "")
plot(ACF(sp500))
```

As expected, the autocorrelation is small but it might be reasonable to believe that this sequence is not purely uncorrelated.

Unfortunately, Theorem 1 is based on an asymptotic argument and, since the confidence bands constructed are also asymptotic, there are no "exact" tools that can be used in this case. To study the validity of these results when $n$ is "small", we performed a simulation in which we generated processes from a Gaussian white noise and examined the empirical distribution of $\hat{\rho}(3)$ for different sample sizes (i.e. $n$ set to 5, 10, 30 and 300). Intuitively, the "quality" of the approximation provided by Theorem 1 should increase with the sample size $n$. The code below performs such a simulation and compares the empirical distribution of $\sqrt{n} \hat{\rho}(3)$ with a normal distribution with mean 0 and variance 1 (its asymptotic distribution), which is depicted using a red line.

```r
# Number of Monte Carlo replications
B = 10000

# Define considered lag
h = 3

# Sample size considered
N = c(5, 10, 30, 300)

# Initialisation
result = matrix(NA,B,length(N))

# Set seed
set.seed(1)

# Start Monte Carlo
for (i in seq_len(B)){
  for (j in seq_along(N)){
    # Simulate process
    Xt = rnorm(N[j])

    # Save autocorrelation at lag h
    result[i,j] = acf(Xt, plot = FALSE)$acf[h+1]
  }
}

# Plot results
par(mfrow = c(2,length(N)/2))
for (i in seq_along(N)){
  # Estimated empirical distribution
  hist(sqrt(N[i])*result[,i], col = "royalblue1", 
       main = paste("Sample size n =",N[i]), probability = TRUE,
       xlim = c(-4,4), xlab = " ")

  # Asymptotic distribution
  xx = seq(from = -10, to = 10, length.out = 10^3)
  yy = dnorm(xx,0,1)
  lines(xx,yy, col = "red", lwd = 2)
}
```

As expected, it can clearly be observed that the asymptotic approximation is quite poor when $n = 5$ but as the sample size increases the approximation improves and is very close when, for example, $n = 300$. This simulation could suggest that Theorem 1 provides a relatively "close" approximation of the distribution of $\hat{\rho}(h)$.

### Robustness Issues

The data generating process delivers a theoretical autocorrelation (autocovariance) function that, as explained in the previous section, can then be estimated through the sample autocorrelation (autocovariance) functions. However, in practice, the sample is often issued from a data generating process that is "close" to the true one, meaning that the sample suffers from some form of small contamination. This contamination is typically represented by a small number of extreme observations, called "outliers", which come from a process that is different from the true data generating process.

The fact that the sample can suffer from outliers implies that the standard estimation of the autocorrelation (autocovariance) functions through the sample functions could be highly biased. The standard estimators presented in the previous section are therefore not "robust" and can behave badly when the sample suffers from contamination. To illustrate this limitation of a classical estimator, we consider the following two processes:

\[ \begin{aligned} 
X_t &= \phi X_{t-1} + W_t, \;\;\; W_t \sim \mathcal{N}(0,\sigma_w^2),\\ 
Y_t &= \begin{cases} X_t & \quad \text{with probability } 1 - \epsilon,\\ U_t & \quad \text{with probability } \epsilon,\\ \end{cases} \;\;\; U_t \sim \mathcal{N}(0,\sigma_u^2), 
\end{aligned} \]

when $\epsilon$ is "small" and $\sigma_u^2 \gg \sigma_w^2$, the process $(Y_t)$ can be interpreted as a "contaminated" version of $(X_t)$. The figure below represents one realization of the processes $(X_t)$ and $(Y_t)$ using the following setting: $n = 100$, $\sigma_u^2 = 10$, $\phi = 0.5$, $\sigma_w^2 = 1$ as well as $\epsilon = 0.05$.
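
The code producing this figure is not shown here; a minimal sketch generating one such realization (using base R's `arima.sim()` rather than the simts constructors) could be the following:

```r
# One realization of X_t (AR(1)) and its contaminated version Y_t.
set.seed(987)
n = 100; phi = 0.5; sigma2_w = 1; sigma2_u = 10; eps = 0.05

# Uncontaminated AR(1)
Xt = arima.sim(n = n, model = list(ar = phi), sd = sqrt(sigma2_w))

# Replace each observation by U_t ~ N(0, sigma2_u) with probability eps
contaminated = runif(n) < eps
Yt = Xt
Yt[contaminated] = rnorm(sum(contaminated), mean = 0, sd = sqrt(sigma2_u))

# Plot both processes
plot(as.numeric(Yt), type = "l", col = "red", ylim = range(Xt, Yt),
     ylab = " ", xlab = "Time")
lines(as.numeric(Xt))
legend("topright", c(expression(X[t]), expression(Y[t])),
       col = c("black", "red"), lty = 1, bty = "n")
```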


Next, we consider a simulated example to highlight how the performance of a "classical" autocorrelation can deteriorate if the sample is contaminated (i.e. what is the impact of using $Y_t$ instead of $X_t$, the "uncontaminated" process). In this simulation, we will use the setting presented above and consider $B = 10^3$ bootstrap replications.


The boxplots in each figure show how the standard autocorrelation estimator is centered around the true value (red line) when the sample is not contaminated (left boxplot) while it is considerably biased when the sample is contaminated (right boxplot), especially at the smallest lags. Indeed, it can be seen how the boxplots under contamination are often close to zero indicating that it does not detect much dependence in the data although it should. This is a known result in robustness, more specifically that outliers in the data can break the dependence structure and make it more difficult for the latter to be detected.

In order to limit this problem, different robust estimators exist for time series problems which are designed to reduce the influence of contamination on the estimation procedure. Among these estimators, there are a few that estimate the autocorrelation (autocovariance) functions in a robust manner. One of these estimators is provided in the robacf() function in the "robcor" package. The following simulated example shows how it limits the bias arising from contamination. Unlike in the previous simulation, we shall only consider data issued from the contaminated model, $Y_t$, and compare the performance of the two estimators (i.e. classical and robust autocorrelation estimators):
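
For example, a minimal sketch of such a comparison on a single contaminated sample could look as follows, assuming that `robacf()` accepts the same basic arguments as the base `acf()` function:

```r
# Classical versus robust ACF estimates on one contaminated AR(1) sample.
library(robcor)
set.seed(987)
n = 100; phi = 0.5; eps = 0.05
Xt = arima.sim(n = n, model = list(ar = phi), sd = 1)
Yt = Xt
contaminated = runif(n) < eps
Yt[contaminated] = rnorm(sum(contaminated), sd = sqrt(10))

par(mfrow = c(1, 2))
acf(as.numeric(Yt))     # classical estimator
robacf(as.numeric(Yt))  # robust estimator
```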


The robust estimator remains close to the true value represented by the red line in the boxplots as opposed to the standard estimator. It can also be observed that, to reduce the bias induced by contamination in the sample, robust estimators pay a certain price in terms of efficiency as highlighted by the boxplots that show more variability compared to those of the standard estimator. To assess how much is "lost" by the robust estimator compared to the classical one in terms of efficiency, we consider one last simulation where we examine the performance of the two estimators on data issued from the uncontaminated model, i.e. $(X_t)$. Therefore, the only difference between this simulation and the previous one is the value of $\epsilon$, set equal to $0$; the code shall thus be omitted and the results are depicted below:

# TO DO

It can be observed that both estimators provide extremely similar results, although the robust estimator is slightly more variable.

Next, we consider the issue of robustness on the real data set coming from the domain of hydrology presented in Section \@ref(eda). This data concerns monthly precipitation (in mm) over a certain period of time (1907 to 1972). Let us compare the standard and robust estimators of the autocorrelation functions:

# TO DO

It can be seen that, under certain assumptions (e.g. linear dependence), the standard estimator does not detect any significant autocorrelation between lags since the estimations all lie within the asymptotic confidence intervals. However, many of the robust estimations lie outside these confidence intervals at different lags indicating that there could be dependence within the data. If one were only to rely on the standard estimator in this case, there may be erroneous conclusions drawn on this data. Robustness issues therefore need to be considered for any time series analysis, not only when estimating the autocorrelation (autocovariance) functions.

Finally, we return to S&P 500 returns and compare the classical and robust autocorrelation estimators, which are presented in the figure below.

# TO DO

It can be observed that both estimators are very similar. Nevertheless, some small discrepancies can be observed. In particular, the robust estimators seem to indicate an absence of linear dependence while a slightly different interpretation might be achieved with the classical estimator.

## Joint Stationarity

The notion of joint stationarity implies that the time series under investigation is bivariate in nature, e.g. $\left(X_t \right)$ and $\left(Y_t\right)$ are two distinct time series with matching time points. In order to fulfill bivariate stationarity, both processes must be weakly stationary (with constant mean and autocovariance dependent only on the lag $h$) and there must exist a cross-covariance function between them that depends only on the lag $h$. The ideas discussed next can be extended beyond the bivariate case to a general multivariate setting.

```{definition, label="crosscov"}
The *cross-covariance function* of two jointly stationary processes $\left\{X_t\right\}$ and $\left\{Y_t\right\}$ is given by:

\[\gamma_{XY}\left( {t + h,t} \right) = \text{cov}\left( {{X_{t+h}},{Y_{t}}} \right) = \mathbb{E}\left[\left( {{X_{t + h}} - {\mu _X}} \right)\left( {{Y_t} - {\mu _Y}} \right)\right] = \gamma_{XY}\left( h \right).\]
```

Unlike the symmetry found in the autocovariance function of a stationary
process $\left\{X_t\right\}$, e.g. $\gamma_X(h) = \gamma_X(-h)$, the cross-covariance
function instead satisfies $\gamma_{XY}(h) = \gamma_{YX}(-h)$. Notice the
switch in indices together with the change of sign of the lag.

```{definition, label="crosscorr"}
The *cross-correlation* function for two jointly stationary time series
$\left\{X_t\right\}$ and $\left\{Y_t\right\}$ can be expressed as:

  \[{\rho _{XY}}\left( {t+h,t} \right) = \frac{{{\gamma _{XY}}\left( {t + h,t} \right)}}{{\sqrt {{\gamma _X}\left( 0 \right)} \sqrt {{\gamma _Y}\left( 0 \right)} }} = \frac{{{\gamma _{XY}}\left( h \right)}}{{{\sigma _{{X_{t+h}}}}{\sigma _{{Y_{t}}}}}} = {\rho _{XY}}\left( h \right)\]
```

Due to the previously discussed symmetry of the cross-covariance function, note that the cross-correlation function satisfies $\rho_{XY}(h) = \rho_{YX}(-h)$, so that in general $\rho_{XY}(h) \neq \rho_{XY}(-h)$.

```{example, label="ccfjs"}
Consider two time series processes given by:

\[\begin{aligned} 
{X_t} &= {W_t} - {W_{t-1}}, \\ 
{Y_t} &= {W_t} + {W_{t-1}}, 
\end{aligned}\]

where $W_t \sim WN(0, \sigma^2_W)$.

We first check whether $\left\{X_t\right\}$ and $\left\{Y_t\right\}$ are weakly stationary.

The means of both time series are given by

\[\mathbb{E}\left[ {{X_t}} \right] = \mathbb{E}\left[ {{Y_t}} \right] = 0.\]

The autocovariance for $\left\{X_t\right\}$ is given by:

\[\begin{aligned} 
\gamma_X\left( h \right) &= \text{cov} \left( {{X_{t + h}},{X_t}} \right) 
= \text{cov} \left( {{W_{t + h}} - {W_{t + h - 1}},{W_t} - {W_{t - 1}}} \right) \\ 
&= \boldsymbol{1}_{\left\{ h = 0 \right\}} 2\sigma_W^2 - \boldsymbol{1}_{\left\{ h = -1 \right\}} \sigma_W^2 - \boldsymbol{1}_{\left\{ h = 1 \right\}} \sigma_W^2 \\ 
&= \begin{cases} 2\sigma^2_W, &\text{if } h = 0, \\ -\sigma^2_W, &\text{if } \left|h\right| = 1, \\ 0, &\text{if } \left|h\right| > 1. \end{cases} 
\end{aligned}\]

Similarly, the autocovariance for $\left\{Y_t\right\}$ is calculated to be:

\[\begin{aligned} 
\gamma_Y\left( h \right) &= \text{cov} \left( {{Y_{t + h}},{Y_t}} \right) 
= \text{cov} \left( {{W_{t + h}} + {W_{t + h - 1}},{W_t} + {W_{t - 1}}} \right) \\ 
&= \boldsymbol{1}_{\left\{ h = 0 \right\}} 2\sigma_W^2 + \boldsymbol{1}_{\left\{ h = -1 \right\}} \sigma_W^2 + \boldsymbol{1}_{\left\{ h = 1 \right\}} \sigma_W^2 \\ 
&= \begin{cases} 2\sigma^2_W, &\text{if } h = 0, \\ \sigma^2_W, &\text{if } \left|h\right| = 1, \\ 0, &\text{if } \left|h\right| > 1. \end{cases} 
\end{aligned}\]

Next, the cross-covariance of $\left\{X_t\right\}$ and $\left\{Y_t\right\}$ is:

\[\begin{aligned} 
\gamma_{XY}\left( h \right) &= \text{cov} \left( {{X_{t + h}},{Y_t}} \right) 
= \text{cov} \left( {{W_{t + h}} - {W_{t + h - 1}},{W_t} + {W_{t-1}}} \right) \\ 
&= \begin{cases} 0, &\text{if } h = 0, \\ -\sigma^2_W, &\text{if } h = 1, \\ \sigma^2_W, &\text{if } h = -1, \\ 0, &\text{if } \left|h\right| \ge 2. \end{cases} 
\end{aligned}\]

Therefore, having established the weak stationarity of each process and obtained a cross-covariance function that depends only on the lag $h$, the two processes are jointly stationary.

Furthermore, we also have the cross-correlation function:

\[\rho_{XY}\left( h \right) = \frac{\gamma_{XY}\left( h \right)}{\sqrt {\gamma_X\left( 0 \right)} \sqrt {\gamma_Y\left( 0 \right)} } = \frac{\gamma_{XY}\left( h \right)}{2\sigma_W^2} = \begin{cases} 0, &\text{if } h = 0, \\ -\frac{1}{2}, &\text{if } h = 1, \\ \frac{1}{2}, &\text{if } h = -1, \\ 0, &\text{if } \left|h\right| \ge 2. \end{cases}\]
```

### Sample Cross-Covariance and Cross-Correlation Functions

A natural estimator of the *cross-covariance function* is given by:

  \[{{\hat \gamma }_{XY}}\left( h \right) = \frac{1}{T}\sum\limits_{t = 1}^{T - h} {\left( {{X_{t + h}} - \bar X} \right)\left( {{Y_t} - \bar Y} \right)} \]

With this in mind, the "plug-in" estimator for the *cross-correlation function*
  follows:

  \[{{\hat \rho }_{XY}}\left( h \right) = \frac{{{{\hat \gamma }_{XY}}\left( h \right)}}{{\sqrt {{{\hat \gamma }_X}\left( 0 \right)} \sqrt {{{\hat \gamma }_Y}\left( 0 \right)} }}\]

As with their theoretical counterparts, these estimators are not symmetric in $h$; instead they satisfy $\hat{\gamma}_{XY}(h) = \hat{\gamma}_{YX}(-h)$ and $\hat{\rho}_{XY}(h) = \hat{\rho}_{YX}(-h)$.
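
As a brief illustration, the sketch below estimates the cross-correlation function of the two processes $X_t = W_t - W_{t-1}$ and $Y_t = W_t + W_{t-1}$ from the example above using the base R function `ccf()`, which estimates $\text{corr}(X_{t+h}, Y_t)$; the estimates at lags $1$ and $-1$ should be close to $-1/2$ and $1/2$ respectively.

```r
# Sample cross-correlation of X_t = W_t - W_{t-1} and Y_t = W_t + W_{t-1}.
set.seed(42)
n  = 10^4
Wt = rnorm(n + 1)
Xt = Wt[-1] - Wt[-(n + 1)]  # X_t = W_t - W_{t-1}
Yt = Wt[-1] + Wt[-(n + 1)]  # Y_t = W_t + W_{t-1}
ccf(Xt, Yt, lag.max = 5)
```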


## Portmanteau test

In this section we give a brief introduction to Portmanteau tests used in time
series analysis. In linear regression, we always need to do a diagnostic test for
residuals after building our model, to check whether our assumptions are 
satisfied. If there is no evidence to reject any of the assumptions, we can say
that the linear model we built is adequate. Otherwise, if the linear model is not
adequate, some modifications or transformations need to be done either for the
previous model or for the data. This rule also applies to time series modeling.
In time series analysis, a wide variety of Portmanteau tests can be used to 
check the white noise residuals assumption. We will introduce two of them as 
follows, which are based on the ACF of residuals, in order to illustrate some 
of the ideas behind these kinds of tests.

Dating back to 1970, Box and Pierce proposed the well-known Box-Pierce test
statistic with the following form:

  $$Q_{BP} = n\sum_{h =1}^m \hat{\rho}_h^2,$$

  where the empirical autocorrelation of residuals at lag $h$ is defined
as $\hat{\rho}_h = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}$. It is obvious that
under the alternative hypothesis, $\hat{\rho}_h$ would deviate from $0$, thus a
large $Q_{BP}$ gives us the evidence to reject the null. Under the null
hypothesis that the residuals are white noise (or equivalently the time series 
are adequate), it can be shown that when $n \to \infty$, we have

$$Q_{BP} \overset{\mathcal{D}}{\to} \chi^2_{m - m^{\prime}},$$
where $m^{\prime}$ is the number of parameters in the time series model.

## Proof
We can write the autocorrelations as a vector $\boldsymbol{\rho} = \left(\rho(1), \rho(2), \dots, \rho(m)\right)^T$ and the sample autocorrelations as $\hat{\boldsymbol{\rho}} = \left(\hat{\rho}(1), \hat{\rho}(2), \dots, \hat{\rho}(m)\right)^T$.

According to Theorem A.7 in the textbook, under the data generating process $X_t = W_t + \mu$, where $\mu < \infty$ and $(W_t)$ is a strong white noise process with variance $\sigma^2$ and finite fourth moment (i.e. $\mathbb{E} [W_t^4] < \infty$), we have 
$$\sqrt{n} \left(\hat{\boldsymbol{\rho}} - \boldsymbol{\rho}\right) \overset{\mathcal{D}}{\to} \mathcal{N}\left(\mathbf{0}, V\right),$$
  where the $(p,q)^{th}$ element of $V$ is
$$v_{pq} = \sum_{u=1}^{\infty}\left[ \rho(u+p) + \rho(u-p) - 2\rho(p)\rho(u) \right] \left[ \rho(u+q) + \rho(u-q) - 2\rho(q)\rho(u) \right].$$
  Since $V$ is a positive definite covariance matrix, it follows that
$$\sqrt{n} \, V^{-\frac{1}{2}}\left(\hat{\boldsymbol{\rho}} - \boldsymbol{\rho}\right) \overset{\mathcal{D}}{\to} \mathcal{N}(\mathbf{0}, I),$$
  and also
$$n \left(\hat{\boldsymbol{\rho}} - \boldsymbol{\rho}\right)^T V^{-1} \left(\hat{\boldsymbol{\rho}} - \boldsymbol{\rho}\right) \overset{\mathcal{D}}{\to} \chi^2_m.$$
  Under the null, $\boldsymbol{\rho} = \mathbf{0}$, so we have 
$$v_{pq} = 
  \begin{cases}
0, \; p \neq q\\
1, \; p = q
\end{cases}
.$$

  Thus, since $V = I$ under the null, the Box-Pierce test statistic 
$$Q_{BP} = n\sum_{h =1}^m \hat{\rho}_h^2 = n \, \hat{\boldsymbol{\rho}}^T \hat{\boldsymbol{\rho}}$$
would, under this data generating process, asymptotically follow a $\chi^2$ distribution 
with $m$ degrees of freedom.

Before, we supposed that we knew $\mu$ in our data generating process exactly,
and hence no unknown parameter needed to be estimated. Now, suppose $\mu$ is
unknown and is described by a model containing $m^{\prime}$ unknown parameters which we have 
to estimate. If our model for $\mu$ is correctly specified and
those $m^{\prime}$ unknown parameters are appropriately estimated, then the 
underlying model of the residuals of the fitted model would be $W_t$ as before,
but with the $m^{\prime}$ true parameters replaced by their estimators.
Then $Q_{BP}$ still asymptotically follows a $\chi^2$ distribution, but
with $m - m^{\prime}$ degrees of freedom. $\;\;\;\;\;\;\;\; \blacksquare$


Then, in 1978, Ljung and Box improved the Box-Pierce test by standardizing
each $\hat{\rho}_h$ by its asymptotic variance. The Ljung-Box test
statistic is

$$Q_{LB} = n\left(n+2\right)\sum_{h =1}^m \frac{\hat{\rho}_h^2}{n-h}.$$

It can also be shown that $Q_{LB} \overset{\mathcal{D}}{\to} \chi^2_{m - m^{\prime}}$ under
the null. However, compared to $Q_{BP}$, the distribution of $Q_{LB}$ under the
null is closer to $\chi^2_{m - m^{\prime}}$ when $n$ is finite.
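
In practice, both statistics are available through the base R function `Box.test()`. A minimal usage sketch (with hypothetical parameter values) fits an AR(1) with `arima()` and applies the Ljung-Box test to its residuals, setting `fitdf` to the number of estimated parameters:

```r
# Ljung-Box test on the residuals of a fitted AR(1).
set.seed(330)
Xt  = arima.sim(n = 500, model = list(ar = 0.5))
fit = arima(Xt, order = c(1, 0, 0))
# fitdf = 1 since one AR parameter was estimated (m' = 1), so the reference
# distribution is chi-squared with m - 1 degrees of freedom.
Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
```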

In the above two tests, the statistic contains a user-specified
parameter $m$, and for different $m$ the power of the test would be different.
Thus, much work has gone into either finding an approach for selecting the optimal $m$ 
or finding a new test statistic without user-specified parameters. Moreover, testing 
for white noise can also be done by checking the PACF, or by checking the spectral density in the
frequency domain. These differing approaches have led to many different 
Portmanteau tests.

Take, for example, the following use of a Portmanteau test to show the 
distribution of the test statistics under the null:

```r

# set seed
set.seed(1345)

# Specify models
model = WN(sigma2 = 1) # WN

B = 1000 # number of parametric bootstrap
BP.obs = rep(NA, B)
LB.obs = rep(NA, B)

for (j in seq_len(B)){
  x = gen_gts(1000, model)
  BP.obs[j] = Box.test(x, lag = 10, "Box-Pierce", fitdf = 0)$statistic
  LB.obs[j] = Box.test(x, lag = 10, "Ljung-Box", fitdf = 0)$statistic
}

sim_results = data.frame(sim = c(BP.obs, LB.obs),
                         simtype = c(rep("Box-Pierce",B), rep("Ljung-Box",B)))

ggplot(data = sim_results, aes(x = sim)) + 
  geom_histogram(aes(y = ..density.., fill = simtype),
                 binwidth = 1, color = "black") +
  stat_function(fun = dchisq, args = list(df = 10)) +
  facet_wrap( ~ simtype) + ylim(0, 0.12) + 
  labs(fill = "Statistic", title = "Histogram of the Observed Test Statistics",
       y = "Density", x = "Observed Test Statistic")
```

From the histogram, we can see that under the null the distributions of both $Q_{BP}$ and $Q_{LB}$ are close to a $\chi^2$ distribution, but the approximation is slightly better for $Q_{LB}$.

Consider the following simulation that shows how the test depends on specified $m$ and displays the distribution of P-values under different alternatives.

```r
# Set seed
set.seed(1234)

# Specify models
model1 = WN(sigma2 = 1)                         # WN
model2 = AR(phi = 0.3, sigma2 = 1)              # AR(1)
model3 = AR(phi = c(rep(0,9), 0.3), sigma2 = 1) # seasonal AR(10)

B = 1000 # number of parametric bootstrap
LB.pvalue = matrix(NA, nrow = B, ncol = 6)

for (i in 1:3){
  for (j in seq_len(B)){
    x = gen_gts(1000, get(paste0("model", i)))
    LB.pvalue[j,2*i-1] = Box.test(x, lag = 1, "Ljung-Box", fitdf = 0)$p.value
    LB.pvalue[j,2*i] = Box.test(x, lag = 10, "Ljung-Box", fitdf = 0)$p.value
  }
}

para_1 = data.frame(lag = 1, LB.pvalue[, c(1,3,5)])
para_2 = data.frame(lag = 2, LB.pvalue[, c(2,4,6)])
para = rbind(para_1, para_2)

colnames(para)[2:4] = c("WN", "AR(1)", "Seasonal AR(10)")

library(reshape2)
para.melt = melt(para, id.vars = "lag")

ggplot(data = para.melt, aes(x = variable, y = value)) + 
  geom_boxplot(aes(fill = factor(lag))) + facet_wrap( ~ variable, scales = "free") +
  ggtitle("Simulated P-value") +
  scale_fill_hue(name = "Specified m", breaks = c(1,2), labels = c("m=1", "m=10"))
```

