VARshrink 0.3: Shrinkage Estimation Methods for Vector Autoregressive Models (A Brief Version)"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction {#sec:intro}

Let $\mathbf{y}t = (y{t1},y_{t2},\ldots,y_{tK})^\top$ denote a $K\times 1$ vector of endogenous variables. A vector autoregressive (VAR) model of order $p$ can be expressed as $$ \mathbf{y}t = \mathbf{A}_1 \mathbf{y}{t-1} + \cdots + \mathbf{A}p \mathbf{y}{t-p} + \mathbf{C} \mathbf{d}_t + \boldsymbol\epsilon_t, \qquad\qquad\qquad (1) $$\label{VAReqnA} where $\mathbf{d}_t$ is an $L\times 1$ vector of deterministic regressors, $\boldsymbol\epsilon_t$ is a $K\times 1$ noise vector, and $\mathbf{A}_1,\ldots,\mathbf{A}_p$ and $\mathbf{C}$ are coefficient matrices [@Hamilton94; @Tsay05].

Backgrounds

Recently, several R packages have been developed for parameter estimation and forecasting using stochastic time series models. The forecast package provides various methods and tools for univariate time series models such as the ARIMA and ETS models [@Hyndman18]. The MTS package has been developed for a wide variety of multivariate linear time series models and multivariate volatility models such as the VARMA, multivariate EWMA, and low-dimensional BEKK models [@Tsay18]. The vars package provides methods for multivariate time series analysis using the VAR, SVAR, and SVEC models [@Pfaff18].

In this study, we focus on shrinkage estimation of VAR model parameters. Shrinkage estimation methods have been playing crucial roles in high-dimensional statistical modeling; see, e.g., @Beltra2013shrinkageEEG, @Bohm2009shrinkage, @Fiecas2011aoas and @LedoitWolf2004shrinkcov. For VAR models, several shrinkage estimation methods have been suggested such as a Stein-type nonparametric shrinkage estimation method [@Rhein07c], Bayesian VARs using informative priors [@Banbura10;@Doan84;@Koop10BayesianVAR;@Litterman86], Bayesian VARs using noninformative priors [@Ni05;@Sun04], and a semiparametric Bayesian approach adopting a modified $K$-fold cross validation [@LeeChoiKim2016].

Due to its popularity in macroeconomic time series analysis, several Bayesian VAR methods have been implemented in R packages. For example, the function BVAR() in MTS implements a Bayesian VAR method using an informative prior [@Tsay18]; the package bvarsv implements Bayesian VAR models with stochastic volatility and time-varying parameters [@Krueger2015bvarsv;@Koop10BayesianVAR;@Primiceri2005timevarying].

On the other hand, we note that other types of shrinkage methods which have been implemented for other purposes than multivariate time series analysis can be applied to shrinkage estimation of VAR parameters. For instance, the function cov.shrink() in the package corpcor has been implemented to compute shrinkage estimates of covariances, but it has been applied to estimate VAR coefficients in @Rhein07c [@Schafer17]. Moreover, we note that VAR models can be reformulated into multivariate regression problems, so that penalized least squares methods can be used for shrinkage estimation of VAR parameters, e.g., the functions lm.gls() for generalized least squares and lm.ridge() for ridge regression in the package MASS [@Ripley18]; the function glmnet() for Lasso and Elastic-Net regularized generalized linear models in the package glmnet() [@Frideman18]; the function linearRidge() for ridge regression in the package ridge [@Moritz18].

Main Purpose

While Bayesian approaches have been widely used in the literature, we note that nonparametric and semiparametric approaches have advantages in the case of high-dimensional VARs with more than several hundreds of time series variables due to their relatively low computational costs [@Rhein07c]. Despite of their relatively high computational costs, Bayesian approaches can impose proper assumptions on the multivariate time series data flexibly, such as VAR roots near unity and correlations between noise processes [@LeeChoiKim2016]. In this sense, a semiparametric approach can be a trade-off between nonparametric and parametric approaches [@LeeChoiKim2016].

In this study, we developed an integrative R package, VARshrink, for implementing nonparametric, parametric, and semiparametric approaches for shrinkage estimation of VAR model parameters. By providing a simple interface function, VARshrink(), for running various types of shrinkage estimation methods, the performance of the methods can be easily compared. We note that the package vars implemented an ordinary least squares method for VAR models by the function VAR(). The function VARshrink() was built to extend VAR() to shrinkage estimation methods, so that the VARshrink package takes advantage of the tools and methods available in the vars package. For example, we demonstrate the use of model selection criteria such as AIC and BIC in this paper, where the methods AIC(), BIC(), and logLik() can handle multivariate t-distribution and the effective number of parameters in VARshrink.

This paper is a brief version of the original manuscript. This paper is organized as follows. In Section 2, we explain the formulation of VAR models in a multivariate regression problem, which simplifies implementation of the package. In Section 3, we describe the common interface function and the four shrinkage estimation methods included in the package. We clearly present closed form expressions for the shrinkage estimators inferred by the shrinkage methods, so that we can indicate the role of the shrinkage intensity parameters in each method. In addition, we explain how the effective number of parameters can be computed for shrinkage estimators. In Section 4, we present numerical experiments using benchmark data and simulated data briefly for comparing performances of the shrinkage estimation methods. Discussion and conclusions are provided in Section 5.


Models {#sec:models}

In general, we can rewrite the model equation Eq. (1) in the form of a multivariate regression as $$ \mathbf{y}t = \mathbf{\Psi}^\top \mathbf{x}_t + \boldsymbol{\epsilon}_t, \qquad\qquad\qquad (5) $$\label{VAReqnPsi} where $\mathbf{\Psi} = (\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_p, \mathbf{C})^\top$ is a $(Kp + L) \times K$ matrix of coefficients, $\mathbf{x}_t = (\mathbf{y}{t-1}^\top, \ldots, \mathbf{y}{t-p}^\top, \mathbf{d}_t^\top)^\top$ is a $(Kp + L)\times 1$ vector of regressors. For estimation of VAR parameters from the observed time series data $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_T$, we define data matrices as \begin{equation} \begin{split} \mathbf{Y} &= ( \mathbf{y}{p+1}, \mathbf{y}{p+2}, \ldots, \mathbf{y}{T} )^\top \in \mathbb{R}^{N \times K}, \ \mathbf{X} &= ( \mathbf{x}{p+1}, \mathbf{x}{p+2}, \ldots, \mathbf{x}{T} )^\top \in \mathbb{R}^{N \times (Kp + L)}, \end{split} \end{equation} with $N = T-p$. Then, we can rewrite Eq. (5) in a matrix form as $$ \mathbf{Y} = \mathbf{X} \mathbf{\Psi} + \mathbf{E} \in \mathbb{R}^{N \times K}, \qquad\qquad\qquad (6) $$\label{VAReqnMat} with $\mathbf{E} = (\boldsymbol\epsilon{p+1}, \ldots, \boldsymbol\epsilon_T)^\top$ and $N = T-p$.


Shrinkage Estimation Methods {#sec:methods}

In this section, we will describe the shrinkage estimation methods implemented in this package. The methods are used for estimating the VAR coefficient matrix $\mathbf{\Psi}$ alone or both of the $\mathbf{\Psi}$ and $\mathbf{\Sigma}$ in Eq. (6).

We provide a common R function interface VARshrink() for running the estimation methods, which is defined by

VARshrink(y, p = 1, type = c("const", "trend", "both", "none"),
  season = NULL, exogen = NULL, method = c("ridge", "ns", "fbayes",
  "sbayes", "kcv"), lambda = NULL, lambda_var = NULL, dof = Inf, ...)

The input arguments are described as follows.

The output value is an object of class "varshrinkest", which inherits the class "varest" in the package vars, and an object of class "varest" can be obtained by the function VAR() in the package vars. The input arguments and the output value indicate that VARshrink() is an extension of VAR() in the vars package. As a result, almost all the methods and functions included in the package vars are available for the package VARshrink, such as fevd(), Phi(), plot(). The methods and functions available in the VARshrink package are summarized in Table 1. The names of the functions for class "varshrinkest" has suffix "_sh" in order to distinguish them from the functions for class "varest". We remark that the classes "varshrinkest", "varshirf", and "varshsum" inherit the classes "varest", "varirf", and "varsum" of the vars package, respectively.

text_tbl <- data.frame(
FM = c("VAR",                  "fevd",    "irf",              "predict", "summary",        "arch.test_sh", "normality.test_sh", "serial.test_sh", "stability_sh"), 
CL = c("varshrinkest, varest", "varfevd", "varshirf, varirf", "varprd",  "varshsum, varsum",    "varcheck",      "varcheck",          "varcheck",       "varstabil"),
MC = c("coef, fevd, fitted, irf, logLik, Phi, plot, predict, print, Psi, resid, summary",                              "plot, print", "plot, print",  "plot, print", "print",                "plot, print",   "plot, print",       "plot, print",    "plot, print"), 
FC = c("Acoef_sh, arch.test_sh, Bcoef_sh, BQ_sh, causality_sh, normality.test_sh, restrict_sh, roots_sh, serial.test_sh, stability_sh", " ", " ", "fanchart", " ", " ", " ", " ", " ")
)
colnames(text_tbl) <- c("Function or method", "Class", "Methods for class", "Functions for class")
kableExtra::column_spec(
  knitr::kable(text_tbl,
       caption = "Table 1. Structure of the package VARshrink."),
  1:4, width = "7em", border_left = FALSE, border_right = FALSE)

Multivariate Ridge Regression

The ridge regression method is a kind of penalized least squares (PLS) method, which produces a biased estimate of the VAR coefficient [@Hoerl70]. Formally speaking, the ridge regression estimator of $\mathbf{\Psi}$ can be obtained by minimizing the penalized sum of squared prediction errors (SSPE) as \begin{equation} \widehat{ \mathbf{\Psi} }^{\text{R}} (\lambda) = \arg\min_{\mathbf{\Psi}} \ \frac{1}{N} \left\|\mathbf{Y} - \mathbf{X} \mathbf{\Psi} \right\|F^2 + \lambda \left\| \mathbf{\Psi} \right\|_F^2, \end{equation} where $\|\mathbf{A}\|_F = \sqrt{ \sum{i} \sum_{j} a_{ij}^2 }$ is the Frobenius norm of a matrix $\mathbf{A}$, $N = T-p$, and $\lambda \geq 0$ is called the regularization parameter or the shrinkage parameter. The ridge regression estimator $\widehat{ \mathbf{\Psi} }^\text{R} (\lambda)$ can be expressed in the closed form \begin{equation} \widehat{ \mathbf{\Psi} }^\text{R} (\lambda) = \left( \mathbf{X}^\top \mathbf{X} + N \lambda \mathbf{I} \right)^{-1} \mathbf{X}^\top \mathbf{Y}, \qquad \lambda \geq 0. \end{equation}

The shrinkage parameter $\lambda$ for the ridge regression can be automatically determined by using the generalized cross-validation (GCV) [@Golub79]. The GCV selects the value of $\lambda$ where the GCV score given below is minimized: \begin{equation} GCV(\lambda) = \frac{1}{N} \left\| (\mathbf{I} - \mathbf{H}(\lambda) \mathbf{Y} ) \right\|^2_\text{F} \left/ \left[ \frac{1}{N} Trace(\mathbf{I} - \mathbf{H}(\lambda) ) \right]^2 \right. , \end{equation} where $\mathbf{H}(\lambda) = \mathbf{X}^\top \left(\mathbf{X}^\top \mathbf{X} + N \lambda \mathbf{I} \right)^{-1} \mathbf{X}^\top$.

In this package, the interface to the shrinkage estimation methods is provided by the function VARshrink(method = "ridge", ...). If the input argument lambda is set at a value or a vector of values, then corresponding GCV score is computed automatically for each $\lambda$ value, and the VAR coefficients with the smallest GCV score is selected. If lambda = NULL, then the default value of c(0.0001, 0.0005, 0.001, 0.005, ..., 10, 50) is used.

For example, simulated time series data of length $T=100$ were generated based on a multivariate normal distribution for noise and a VAR model with $p=1$, $K=2$, $\mathbf{A}_1 = 0.5\mathbf{I}_2$, $\mathbf{C}=(0.2, 0.7)^\top$, and $\mathbf{\Sigma} = 0.1^2\mathbf{I}_2$ as follows:

library(VARshrink)
set.seed(1000)
myCoef <- list(A = list(matrix(c(0.5, 0, 0, 0.5), 2, 2)), c = c(0.2, 0.7))
myModel <- list(Coef = myCoef, Sigma = diag(0.1^2, 2), dof = Inf)
Y <- simVARmodel(numT = 100, model = myModel, burnin = 10)
resu_estim <- list()

Then, the multivariate ridge regression can be carried out for VAR models as follows. The result printed on the screen shows all the lambda values considered and the corresponding GCV values. The VAR parameters are estimated using the lambda value with the minimum GCV value.

resu_estim$`Ridge regression` <-
  VARshrink(Y, p = 1, type = "const", method = "ridge", lambda = NULL)
resu_estim$`Ridge regression`

The method summary() is available for objects of class "varshrinkest" as follows:

summary(resu_estim$`Ridge regression`)

Nonparametric Shrinkage (NS)

The nonparametric shrinkage (NS) estimation method for VAR models, proposed by @Rhein07c, produces an estimate of $\mathbf{\Psi}$ based on a James-Stein type shrinkage of sample covariance matrices [@Rhein07a;@Schafer05b].

We will briefly describe the NS method. In the NS method, the data matrices $\mathbf{X}$ and $\mathbf{Y}$ are mean-corrected so that each column has the mean of zero. Let $\mathbf{Z} = [\mathbf{X}, \mathbf{Y}] \in \mathbb{R}^{N \times (Kp + K + L)}$ be a combined data matrix. The sample covariance matrix of $\mathbf{Z}$ is partitioned as \begin{equation} \mathbf{S}\text{ZZ} = \frac{1}{N - 1} \mathbf{Z}^\top \mathbf{Z} = \begin{bmatrix} \mathbf{S}\text{XX} & \mathbf{S}\text{XY} \ \mathbf{S}\text{XY}^\top & \mathbf{S}\text{YY} \end{bmatrix} \in \mathbb{R}^{(Kp + K + L) \times (Kp + K + L)}, \end{equation} where $\mathbf{S}\text{XX} = (N - 1)^{-1} \mathbf{X}^\top \mathbf{X}$, $\mathbf{S}\text{XY} = (N - 1)^{-1} \mathbf{X}^\top \mathbf{Y}$, and $\mathbf{S}\text{YY} = (N - 1)^{-1} \mathbf{Y}^\top \mathbf{Y}$. The matrix $\mathbf{S}\text{ZZ}$ can be decomposed as \begin{equation} \mathbf{S}\text{ZZ} = \mathbf{D}\text{Z}^{1/2} \mathbf{R}\text{ZZ} \mathbf{D}\text{Z}^{1/2}, \end{equation} where $\mathbf{R}\text{ZZ}$ is the sample correlation matrix and $\mathbf{D}\text{Z} = \text{diag}(s{11}, s_{22}, \ldots, s_{Kp + K + L, Kp + K + L})$ is a diagonal matrix with diagonal elements of sample variances. Shrinkage estimates of the correlation matrix and the variances can be written as \begin{equation} \widehat{\mathbf{R}}\text{ZZ} = (1-\lambda) \mathbf{R}\text{ZZ} + \lambda \mathbf{I} \quad \text{and} \quad \widehat{\mathbf{D}}\text{Z} = \text{diag}(\hat{s}{11}, \hat{s}{22}, \ldots, \hat{s}{Kp+K+L,Kp+K+L}) \end{equation} with \begin{equation} \hat{s}{ii} = (1-\lambda_v) s{ii} + \lambda_v s_\text{med}, \quad i = 1, 2, \ldots, Kp + K + L, \end{equation} where $s_\text{med}$ is a median of all the sample variances $s_{ii}$, and $0\leq \lambda, \lambda_v \leq 1$ are shrinkage parameters. The estimated covariance matrix can be computed by \begin{equation} \widehat{\mathbf{S}}\text{ZZ}(\lambda, \lambda_v) = \widehat{\mathbf{D}}\text{Z}^{1/2} \widehat{\mathbf{R}}\text{ZZ} \widehat{\mathbf{D}}\text{Z}^{1/2}. \end{equation} The values of the shrinkage parameters $\lambda$ and $\lambda_v$ are determined by the James-Stein type shrinkage method, which we call the NS method described in [@Rhein07a;@Schafer05b].

The ordinary least squares estimate $\widehat{\mathbf{\Psi}}^{\text{OLS}}$ of $\mathbf{\Psi}$ is given by $\widehat{\mathbf{\Psi}}^{\text{OLS}} = \mathbf{S}\text{XX}^{-1} \mathbf{S}\text{XY}$. We define the NS estimate of $\mathbf{\Psi}$ as \begin{equation} \widehat{ \mathbf{\Psi} }^\text{N} (\lambda, \lambda_v) = \widehat{\mathbf{S}}\text{XX}^{-1} \widehat{\mathbf{S}}\text{XY}, \qquad 0\leq \lambda, \lambda_v \leq 1. \end{equation} where $\widehat{\mathbf{S}}\text{XX}$ and $\widehat{\mathbf{S}}\text{XY}$ are parts of the estimated covariance matrix, \begin{equation} \widehat{\mathbf{S}}\text{ZZ} (\lambda, \lambda_v) = \begin{bmatrix} \widehat{\mathbf{S}}\text{XX} & \widehat{\mathbf{S}}\text{XY} \ \widehat{\mathbf{S}}\text{XY}^\top & \widehat{\mathbf{S}}_\text{YY} \end{bmatrix}. \end{equation}

In the package VARshrink, the function VARshrink(method = "ns", ...) provides an interface with the NS method. In specific, the package corpcor [@Schafer17] includes the R function cov.shrink(), which can determine $\lambda$ and $\lambda_v$ and estimate the covariance matrix $\widehat{\mathbf{S}}\text{ZZ}(\lambda, \lambda_v)$. The function VARshrink() in the VARshrink package infers the NS estimates of VAR coefficients, $\widehat{\mathbf{\Psi}}^\text{N} (\lambda, \lambda_v)$, by using the covariance matrix $\widehat{\mathbf{S}}\text{ZZ}(\lambda, \lambda_v)$. If the input arguments lambda and lambda_var are set as lambda = NULL and lambda_var = NULL, then $\lambda$ and $\lambda_v$ are determined automatically. We can find the estimated $\lambda$ and $\lambda_v$ values on the printed screen. For example,

resu_estim$`Nonparametric shrinkage` <-
  VARshrink(Y, p = 1, type = "const", method = "ns",
                    lambda = NULL, lambda_var = NULL)
resu_estim$`Nonparametric shrinkage`

The summary() shows statistical inference on the estimated VAR coefficients together with the estimated scale matrix $\mathbf{\Sigma}$ of multivariate t-distribution for noise.

summary(resu_estim$`Nonparametric shrinkage`)

Full Bayesian Shrinkage

@Ni05 and @Sun04 studied Bayesian estimation methods using noninformative priors for VAR models, where they used Markov Chain Monte Carlo (MCMC) methods for estimating coefficient matrices, noise covariance matrices, and other hyperparameters in Bayesian VAR models. In @Ni05, various Bayesian estimators were compared using several types of loss functions, noninformative priors, and multivariate normal and t-distributions. Among them, Bayesian estimators using a certain type of noninformative priors showed higher accuracies than the other Bayesian estimators in simulated experiments. The noninformative prior for coefficient matrices was called the shrinkage prior, and the prior for the noise covariance matrix was called the reference prior.

In the package VARshrink, we can obtain Bayesian estimators using the shrinkage prior and the reference prior, which are the noninformative priors that yielded the highest accuracies in the simulation results [@Ni05]. As a Bayesian estimator of VAR parameters, the minimizer of the posterior expectation of quadratic loss is computed, which is the mean of the posterior distribution. Additionally, the minimizer of the posterior expectation of LINEX loss is also computed [@Ni05;@Zellner86]. In this section, we will explain the model assumptions in more detail.

First, the noise vectors are independent and identically distributed from multivariate t-distributions with the degree of freedom $\nu$, i.e., $\boldsymbol\epsilon_t \sim t_\nu (\mathbf{0}, \mathbf{\Sigma})$. It can be expressed as a hierarchical model as \begin{equation} \begin{split} ( \boldsymbol\epsilon_t | q_t ) & \sim \text{N}_K (\mathbf{0}, q_t^{-1} \mathbf{\Sigma}). \qquad\qquad\qquad (7) \ q_t & \sim \text{Gamma}(\nu/2, \nu/2). \end{split} \end{equation}\label{noise_tdist_hierarchy}

Second, we denote the vectorized version of $\mathbf{\Psi} \in\mathbb{R}^{(J/K)\times K}$ by $\boldsymbol\psi = \text{vec}(\mathbf{\Psi}) \in\mathbb{R}^{J}$. Here $J = K (Kp + L)$. The shrinkage prior $\pi_\text{S}(\boldsymbol\psi)$ for $\boldsymbol\psi \in\mathbb{R}^{J}$ is taken as $\pi_\text{S}(\boldsymbol\psi) \propto \left\| \boldsymbol\psi \right\|^{ -(J-2) }.$ By introducing a latent variable $\lambda>0$, the shrinkage prior can also be expressed as a scale mixture of multivariate normals as\footnote{\cite{Ni05} used the letter $\delta$ to denote $\delta = \lambda^{-1}$.} \begin{equation} \begin{split} (\boldsymbol\psi | \lambda) & \sim \text{N}_{J} (\mathbf{0}, \lambda^{-1} \mathbf{I}_J), \qquad\qquad\qquad (8) \ \pi (\lambda) & \propto 1. \end{split} \end{equation}\label{psi_normal_const_prior}

Third, the reference prior $\pi_\text{R}(\mathbf{\Sigma})$ for $\mathbf{\Sigma}$ is taken as \begin{equation} \pi_\text{R}(\mathbf{\Sigma}) \propto \left| \mathbf{\Sigma} \right|^{-1} \prod_{1\leq i\leq j\leq K} (\lambda_i - \lambda_j)^{-1}, \end{equation} where $\lambda_1 > \lambda_2 > \cdots > \lambda_K$ are eigenvalues of $\mathbf{\Sigma}$.

Note that no hyperparameters are involved in the shrinkage prior and the reference prior, since they are noninformative priors. The Gibbs MCMC method makes use of the hierarchical expression of $\boldsymbol\epsilon_t$. The Gibbs MCMC samples the parameters $(\boldsymbol\psi, \lambda, \mathbf{\Sigma}, \mathbf{Q}, \nu)$ with $\mathbf{Q} = \text{diag}(q_{p+1}, \ldots, q_T)$ from conditional posterior distributions [@Ni05]. We remark that the mean of the conditional posterior distribution, $\pi(\boldsymbol\psi | \lambda, \mathbf{\Sigma}, \mathbf{Q}, \nu; \mathbf{Y})$ of $\boldsymbol\psi$ is given by \begin{equation} \widehat{\boldsymbol\psi}^{F} (\lambda) = \left[\left(\mathbf{\Sigma}^{-1} \otimes \left( \mathbf{X}^\top \mathbf{Q} \mathbf{X} \right) \right) + \lambda \mathbf{I}J \right]^{-1} \text{vec}\left( \mathbf{X}^\top \mathbf{Q} \mathbf{Y} \mathbf{\Sigma}^{-1} \right), \qquad \lambda > 0. \qquad\qquad\qquad (9) \end{equation}\label{expr_fbayes_psihat} Note that if $\mathbf{\Sigma} = \mathbf{I}_K$ and $q{p+1}=\cdots=q_T = 1$, then the estimator becomes the ridge regression estimator. That is, Bayesian shrinkage estimators have more flexible and abundant expressions, even though more computational effort is required to estimate more parameters.

In the package VARshrink, the function VARshrink(method = "fbayes", ...) plays the role as an interface with the full Bayesian shrinkage method with the shrinkage prior and the reference prior. The shrinkage parameter $\lambda$ is not set at a fixed value, i.e., the arguments lambda and lambda_var are of no use here. Instead, the mean of the posterior distribution, $\hat{\delta} = \mathbb{E}\left[ \lambda^{-1} | \mathbf{Y} \right]$ is estimated during the MCMC process [@Ni05], and we define $\hat{\lambda} = \hat{\delta}^{-1}$. There are several arguments to be specified as follows.

For example, we run the full Bayesian shrinkage method with a fixed $\nu = 6$ as follows.

resu_estim$`Full Bayes (fixed dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "fbayes", dof = 6,
                     burnincycle = 1000, mcmccycle = 2000)
resu_estim$`Full Bayes (fixed dof)`

By using summary(), we can find the estimated scale matrix $\mathbf{\Sigma}$ of multivariate t-distribution for noise.

summary(resu_estim$`Full Bayes (fixed dof)`)

Instead, we can estimate the degree of freedom, $\nu$, of multivariate t-distribution by the argument dof = NULL as follows:

resu_estim$`Full Bayes (estim dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "fbayes", dof = NULL,
            burnincycle = 1000, mcmccycle = 2000)
resu_estim$`Full Bayes (estim dof)`

Semiparametric Bayesian Shrinkage {#sec:sbayes}

Whereas full Bayesian shrinkage methods estimate all the hyperparameters including latent variables via MCMC methods, semiparametric Bayesian methods estimate some of the hyperparameters by a certain nonparametric method and estimate the rest in a Bayesian framework. The semiparametric approach is advantageous especially when the dimensionality of the model is so high that standard MCMC methods are not computationally tractable.

@LeeChoiKim2016 assumed scale mixtures of multivariate normal distributions for noise vectors as in Eq. (7). The prior distribution for the model coefficients, $\boldsymbol\psi = \text{vec}(\mathbf{\Psi})$, was set as multivariate normal distributions, similarly to Eq. (8). Here, we can choose either the conjugate prior (CJ) and non-conjugate (NCJ) prior as follows:

(i) The conjugate prior for $\boldsymbol\psi \in\mathbb{R}^J$ is expressed by \begin{equation} (\boldsymbol\psi | \lambda, \mathbf{\Sigma}) \sim \text{N}_{J} \left(\mathbf{0}, \frac{1-\lambda}{(N - 1) \lambda} \mathbf{\Sigma} \otimes \mathbf{I} \right), \qquad 0<\lambda < 1. \end{equation}

(ii) The non-conjugate prior for $\boldsymbol\psi \in\mathbb{R}^J$ is expressed by \begin{equation} (\boldsymbol\psi | \lambda) \sim \text{N}_{J} \left(\mathbf{0}, \frac{1-\lambda}{(N - 1) \lambda} \mathbf{I}_J \right), \qquad 0<\lambda < 1. \end{equation}

The scale matrix $\mathbf{\Sigma}$ for noise is included in the equation for conjugate prior distribution but not for the non-conjugate prior distribution. The non-conjugate prior is quite similar to the full Bayesian formulation in Eq. (8). However, the main difference is that, in the semiparametric Bayesian approach, the shrinkage parameter $\lambda$ should be estimated explicitly via a nonparametric method, but in the full Bayes approach, it is a latent variable which should be sampled and estimated implicitly via an MCMC method.

The prior distribution for $\mathbf\Sigma$ was set as an inverse Wishart distribution as \begin{equation} (\mathbf{\Sigma} | \mathbf{L}_0, m_0) \sim \text{InvWishart}( \mathbf{L}_0, m_0), \qquad \mathbf{L}_0 \succ \mathbf{0}, m_0 > K - 1. \end{equation} where $\mathbf{L}_0 \succ \mathbf{0}$ means that $\mathbf{L}_0$ is positive definite.

Once the shrinkage parameter $\lambda$ is set at a fixed value, the other parameters, $\boldsymbol\psi, \mathbf{\Sigma}$, and $\mathbf{Q}$ can be estimated iteratively in a Bayesian framework efficiently. Briefly speaking, we consider estimating the parameters $\boldsymbol\psi$ and $\mathbf{\Sigma}$ at the maximum point (i.e., the mode) of the marginal posterior density function $\pi (\boldsymbol\psi, \mathbf{\Sigma} | \lambda; \mathbf{Y})$. In the case of the non-conjugate prior, the mode, $(\widehat{\boldsymbol\psi}^\text{S} (\lambda), \widehat{\mathbf{\Sigma}}^\text{S} (\lambda))$, is expressed as \begin{equation} \widehat{\boldsymbol\psi}^{S} (\lambda) = \left[\left( (\widehat{\mathbf{\Sigma}}^\text{S})^{-1} \otimes \left( \mathbf{X}^\top \widehat{\mathbf{Q}}^\text{S} \mathbf{X} \right) \right) + \frac{ (N - 1) \lambda}{1-\lambda} \mathbf{I}J \right]^{-1} \text{vec}\left( \mathbf{X}^\top \widehat{\mathbf{Q}}^\text{S} \mathbf{Y} ( \widehat{\mathbf{\Sigma}}^\text{S})^{-1} \right), \qquad\qquad\qquad (10) \end{equation}\label{expr_sbayes_psihat} and
\begin{equation} \widehat{\mathbf{\Sigma}}^\text{S} (\lambda) = \frac{1}{m_0 + T + K + 1} \left( \mathbf{L}_0 + \mathbf{Y}^\top \widehat{\mathbf{Q}} \mathbf{Y} - \mathbf{Y}^\top \widehat{\mathbf{Q}} \mathbf{X} \widehat{\mathbf{\Psi}}^\text{S} (\lambda) \right), \end{equation} for $0< \lambda < 1$, where $\widehat{\mathbf{Q}}^\text{S} = \text{diag}(\hat{q}
{p+1}, \ldots, \hat{q}_T)$ is obtained in an iterative manner.\footnote{We note that the expression in Eq. (10) is equivalent to that in Eq. (9) in a full Bayesian framework.} The values $\widehat{\mathbf{Q}}^\text{S}$ is also expressed as \begin{equation} \hat{q}_t = h \left( \left\| (\widehat{\mathbf{\Sigma}}^\text{S})^{-1/2} \hat{\boldsymbol\epsilon}_t \right\|^2 \right), \qquad t = p+1, \ldots, T, \end{equation} where $h(x)$ is defined depending on the noise distribution and $\hat{\boldsymbol\epsilon}_t$ is the residual given by $\hat{\boldsymbol\epsilon}_t = \mathbf{y}_t - \widehat{\mathbf{\Psi}}^{\text{S}\top} \mathbf{x}_t$ [@LeeChoiKim2016].

The shrinkage parameter $\lambda$ is determined via an internal nonparametric method, which is called the parameterized cross validation (PCV). This algorithm can be considered as a modified $K$-fold cross validation, especially for estimating the shrinkage parameter of VAR models; see @LeeChoiKim2016 for a detailed explanation.

In addition, the semiparametric Bayesian shrinkage method adopts the idea of shrinking both the correlations and variances from the NS method. As a result, there are two shrinkage parameters $\lambda$ and $\lambda_v$, where $0 \leq \lambda \leq 1$ is used for the shrinkage estimation of the VAR coefficient matrix $\Psi$ and noise covariance matrix $\mathbf{\Sigma}$ while $0 \leq \lambda_v \leq 1$ is used for the shrinkage estimation of the variances of the variables $y_{tj}, j = 1, \ldots, K$.

In the package VARshrink, the function VARshrink(method = "sbayes", ...) is for the semiparametric Bayesian shrinkage method. There are several input arguments to these functions as follows:

For example, we can run the semiparametric shrinkage method as follows.

resu_estim$`Semi Bayes (fixed dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "sbayes", dof = 6,
            lambda = NULL, lambda_var = NULL, prior_type = "NCJ",
            num_folds = 5, m0 = ncol(Y))
resu_estim$`Semi Bayes (fixed dof)`
summary(resu_estim$`Semi Bayes (fixed dof)`)

We can also let the software package to choose the degree of freedom parameter $\nu$ automatically by setting dof = NULL. In this case, the package uses simply a $K$-fold cross validation to find an optimal value of $\nu$.

resu_estim$`Semi Bayes (estim dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "sbayes", dof = NULL,
            lambda = NULL, lambda_var = NULL, prior_type = "NCJ",
            num_folds = 5, m0 = ncol(Y))
resu_estim$`Semi Bayes (estim dof)`

K-fold Cross Validation for Semiparametric Shrinkage

The VARshrink includes an implementation of the K-fold cross validation (CV) method for selecting shrinkage parameters. In the current version of the package, the K-fold CV method can select the $\lambda$ and $\lambda_v$ values for the semiparametric Bayesian shrinkage estimator described in Section 3.4. Note the the semiparametric shrinkage method in the previous section selects a $\lambda$ value by using the PCV method and selects a $\lambda_v$ value by a Stein-type nonparametric shrinkage method.

The K-fold CV method can be run as follows. The arguments to VARshrink() are same to those for the semiparametric Bayesian shrinkage method except for the argument method = "kcv".

resu_estim$`K-fold CV (fixed dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "kcv", dof = 6,
            lambda = NULL, lambda_var = NULL, prior_type = "NCJ",
            num_folds = 5, m0 = ncol(Y))
resu_estim$`K-fold CV (fixed dof)`

Degree of freedom of multivariate t-distribution for noise can be selected automatically by setting the argument dof = Inf as follows.

resu_estim$`K-fold CV (estim dof)` <-
  VARshrink(Y, p = 1, type = "const", method = "kcv", dof = NULL,
            lambda = NULL, lambda_var = NULL, prior_type = "NCJ",
            num_folds = 5, m0 = ncol(Y))

After all, the function calcSSE_Acoef() computes the sum of squared errors (SSEs) between two VAR model parameters, ${\mathbf{A}k^{(1)}}$ and ${\mathbf{A}_k^{(2)}}$, as $SSE = \sum{k=1}^p \sum_{i,j} (\mathbf{A}^{(1)}{kij} - \mathbf{A}^{(2)}{kij})^2$. For example, Table 2 shows the SSEs of the estimated VAR coefficients.

resu_sse <- data.frame(SSE = sapply(resu_estim,
  function(x) calcSSE_Acoef(Acoef_sh(x), myCoef$A)))
knitr::kable(round(resu_sse, 3),
             caption = "Table 2. Sum of squared errors of VAR coefficients estimated by the shrinkage methods.")

Numerical Experiments {#sec:numer}

In this section, we apply the shrinkage estimation methods in the package VARshrink to a benchmark data set. Using the benchmark data set, we demonstrate the use of information criteria such as Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) for comparison of VAR models. In this case, the effective number of parameters for an shrinkage estimate has to be re-calculated based on the shrinkage intensity parameter value.

Benchmark Data

The Canada data set is a benchmark macroeconomic data included in the package vars. It contains four time series variables representing employment(e), labor productivity(prod), real wage(rw), and unemployment rate(U), with $84$ observations. We took difference on the data to remove trend, yielding $T = 83$. Figure 1 shows the differencing result.

data(Canada, package = "vars")
Y = diff(Canada)
plot(Y, cex.lab = 1.3)

The shrinkage methods are run with the option const = "const" since the mean of the data has not been corrected. We set the lag order $p$ at p = 1, 2, 3 to compare several VAR models. For the semiparametric Bayesian method (method = "sbayes"), the degree of freedom $\nu$ was automatically selected. In addition, we ran the K-fold CV method (method = "kcv") for choosing $\lambda$ and $\lambda_v$ values of the semiparametric Bayesian shrinkage estimator.

set.seed(1000)
resu_model <- array(NA, dim = c(5, 2, 3),
  dimnames = list(c("Ridge regression", "Nonparametric shrinkage",
                    "Full Bayes", "Semi Bayes", "K-fold CV"),
                  c("AIC", "BIC"), c("p=1", "p=2", "p=3")))
for (p in 1:3) {
  EstimRidge <- VARshrink(Y, p = p, type = "const", method = "ridge")
  resu_model["Ridge regression", , p] <- c(AIC(EstimRidge), BIC(EstimRidge))

  EstimNS <- VARshrink(Y, p = p, type = "const", method = "ns")
  resu_model["Nonparametric shrinkage", , p] <-
    c(AIC(EstimNS), BIC(EstimNS))

  EstimFB <- VARshrink(Y, p = p, type = "const", method = "fbayes", dof = NULL)
  resu_model["Full Bayes", , p] <- c(AIC(EstimFB), BIC(EstimFB))

  EstimSB <- VARshrink(Y, p = p, type = "const", method = "sbayes",
                       dof = NULL, prior_type = "NCJ")
  resu_model["Semi Bayes", , p] <- c(AIC(EstimSB), BIC(EstimSB))

  EstimKCV <- VARshrink(Y, p = p, type = "const", method = "kcv",
                          dof = NULL, prior_type = "NCJ")
  resu_model["K-fold CV", , p] <- c(AIC(EstimKCV), BIC(EstimKCV))
}

We can compare several models by computing their AIC and BIC. The following results in Table 3 indicate that the NS method produced better VAR coefficients than those of the other methods, and that the AIC took the minimum at $p=3$ while the BIC took the minimum at $p=2$.

load("table3_modelcomp.RData")
knitr::kable(round(resu_model, 1),
             caption = "Table 3. Information criteria (AIC, BIC) for model comparison.")

The estimated parameters by the NS method with $p=2$ can be analyzed further by using the methods and functions in Table 1. For example, we can perform time series forecasting as in Figure 2.

plot(predict(VARshrink(Y, p = 2, type = "const", method = "ns")), names = "U")

Conclusions {#sec:concl}

We wrote an R software package VARshrink for shrinkage estimation of VAR model parameters. The shrinkage methods included in this package are the multivariate ridge regression [@Hoerl70; @Golub79], a nonparametric shrinkage method [@Rhein07c], a full Bayesian shrinkage method [@Ni05], and a semiparametric Bayesian shrinkage method [@LeeChoiKim2016]. An advantage of this package is the integration of the nonparametric, parametric, and semiparametric methods in one frame via a common interface function VARshrink(), which makes it easy and convenient to use various types of shrinkage estimation methods. Moreover, we note that the shrinkage estimation methods implemented in the current version have not been widely implemented in R packages in the context of VAR models.

We demonstrated the use of model selection criteria as AIC and BIC by using benchmark time series data. We note that computation of the log-likelihood of an estimated model must consider the actual distribution assumption of each method. We explained that the multivariate normal distribution is not the only choice for a distribution of noise, but another distributions such as the multivariate t-distribution can be chosen. Moreover, an effective number of parameters must be calculated for computing the AIC and BIC values. In this case, a large number of shrinkage parameter value leads to a small value of the effective number of parameters. In the package VARshrink, effective number of parameters is computed automatically based on the selected shrinkage parameter value.

Shrinkage methods are quite crucial especially for high dimensional VAR models. Bayesian approaches have been developed widely for VAR models in the literature for various purposes. However, the computational time for carrying out MCMC processes may be intractably high for high dimensional VAR models. For this reason, it is important to use nonparametric and semiparametric shrinkage estimation methods together to produce computationally feasible estimates of model parameters. The R package VARshrink is the first step to an integrative and general types of software packages for VAR models. Moreover, this package can be extended to include other shrinkage methods such as Bayesian shrinkage methods using several types of different prior distributions.

References {-}



Try the VARshrink package in your browser

Any scripts or data that you put into this service are public.

VARshrink documentation built on Oct. 9, 2019, 5:06 p.m.