knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

HDGCvar

CRAN_Version_Badge CRAN_Downloads_Badge License_GPLv3_Badge

HDGCvar allows for testing Granger causality in High Dimensional Vector Autoregressive Models (VARs). Granger causality can be tested between time series that are stationary (HDGC_VAR_I0), non stationary (unit root), cointegrated, or all the above (HDGC_VAR). Bivariate as well as multivariate (i.e. blocks) causality can be considered by specifying the name(s) of the variable(s) of interest in GCpair (or GCpairs) and networks can be plotted to visualize the causal structure among several variables using Plot_GC_all.

A specific part of HDGCvar is dedicated to Realized Volatilities (RV), thus using the Heterogeneous VAR (HVARs). It gives the possibiity of building RV spillover networks (HDGC_HVAR_all) as well as conditioning RV on Realized Correlations (HDGC_HVAR_RV_RCoV_all).

Installation

You can install the released version of HDGCvar from CRAN with:

install.packages("HDGCvar")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("Marga8/HDGCvar")

All the functions in HDGCvar are based on the following two papers:

  1. A. Hecq, L. Margaritella, S.Smeekes, "Granger Causality Testing in High Dimensional VARs: a Post Double Selection Procedure"(2019, https://arxiv.org/abs/1902.10991 )
  2. A. Hecq, L. Margaritella, S.Smeekes, "Inference in Non Stationary High Dimensional VARs" (2020, check the latest version at https://sites.google.com/view/luca-margaritella ).

Details

If the addition of variable X to the given information set Ω alters the conditional distribution of another variable Y, and both X and Ω are observed prior to Y, then X improves predictability of Y, and is said to Granger cause Y with respect to Ω.

Statistically assessing the predictability among two (or blocks of) time series turns out to be a fundamental concept in modern time series analysis. Its applications range over from macroeconomics, finance, network theory and even the neurosciences.

With the increased availability of larger datasets, these causality concepts have been extended to the high-dimensional setting where they can benefit from the inclusion of many more series within the available information set Ω. Note how including as much information as possible is particularly tending to the original idea of Granger (1969) who envisioned Ω to be "all the information in the universe", in order to avoid spurious causality.

Conditioning on so many variables comes at a cost, i.e. quickly running into problems of high-dimensionality: sky-rocketing variance, overfitting and in general invalidity of standard statistical techniques.

A. Hecq, L. Margaritella, S.Smeekes (2019), designed a Granger causality LM test for high-dimensional stationary vector autoregressive models (VAR) which combine dimensionality reduction techniques based on penalized regressions such as the lasso of Tibshirani (1996), with the post-double-selection procedure of Belloni et al. (2011) to select the set of relevant covariates in the system. The double selection step allows to substantially reduce the omitted variable bias and thereby allowing for valid post-selection inference on the parameters.

If you are sure that your time series are stationary, you can go ahead and use one of the following functions from HDGCvar:

  1. HDGC_VAR_I0 which tests for Granger causality in High Dimensional Stationary VARs
  2. HDGC_VAR_multiple_I0 or HDGC_VAR_multiple_pairs_I0 which test multiple combinations Granger causality in High Dimensional Stationary VARs
  3. HDGC_VAR_all_I0 which allows you to test all the bivariate combinations in you dataset and hence to build a Granger causality network.

All these functions ask you as inputs:

A. Hecq, L. Margaritella, S.Smeekes (2020) extended the Post-double-selection Granger causality LM test of Hecq et al. (2019) to the case in which your system might contain time series integrated of arbitrary orders and possibly even cointegrated. To accomplish this is necessary to simply augment the lags of the variables of interest by the maximum order of integration we suspect the series having. This allows to completely avoid any biased pre-test of unit-root or cointegration and directly perform high-dimensional inference on the desired parameters.

Therefore: if you are NOT sure whether your time series are stationary, non-stationary, cointegrated, or an appropriate mix of the above or you do not trust the biased unit root and cointegration tests out there, we got you covered! you can go ahead and use one of the following functions from HDGCvar:

  1. HDGC_VAR which tests for Granger causality in High Dimensional Stationary/Non-Stationary/cointegrated or a mix of the above, VARs
  2. HDGC_VAR_multiple or HDGC_VAR_multiple_pairs which test multiple combinations Granger causality in High Dimensional Stationary/Non-Stationary/cointegrated or a mix of the above, VARs
  3. HDGC_VAR_all which allows you to test all the bivariate combinations in you dataset and hence to build a Granger causality network.

All these functions ask you the same inputs as reported above for HDGC_VAR_I0 with the addition of the following:

Back to A. Hecq, L. Margaritella, S.Smeekes (2019), in their empirical application they considered a specific case of stationary time series, namely they investigate the volatility transmission in stock return prices using the daily realized variances. These are very attractive measures among practitioners and academics for modelling time varying volatilities and monitoring financial risk. Given the time series of realized volatilities, a multivariate version of the heterogeneous autoregressive model (HVAR) of Corsi (2009) is employed to model their joint behavior.

Therefore: if you are interested in testing Granger causality among realized volatilities and potentially build volatility spillover networks you should use one of the following function from HDGCvar:

  1. HDGC_HVAR which tests for Granger causality in High Dimensional stationary HVARs.
  2. HDGC_HVAR_multiple or HDGC_VAR_multiple_pairs which test multiple combinations Granger causality in High Dimensional stationary HVARs.
  3. HDGC_HVAR_all which allows you to test all the bivariate combinations in you dataset and hence to build a Granger causality network.

All these functions ask you the same inputs as reported above for HDGC_VAR_I0 with the inclusion of log=TRUE to log-transform the series (recommended) and the exclusion of the lag-length parameter p as in a HVAR this is already pre-specified to be equal to 3, namely the daily, weekly, monthly aggregates of realized volatilities (for details see Corsi (2009) and Hecq et al. (2019)).

Note: to account for the potential heteroskedasticity in the data, all these functions returns the Asymptotic χ$^2$ version of the LM test ("Asymp"), the small sample correction F tests ("FS_cor") and the heteroskedasticity-robust Asymptotic χ$^2$ (Asymp_Robust).

It is not all: if you whish to condition your volatility spillover network on realized correlations you can also do so by using one of the following function from HDGCvar:

  1. HDGC_HVAR_RV_RCoV_all, compute Network of Realized Volatilities conditional on the set of Realized Correlations
  2. HDGC_HVAR_RVCOV, Test Granger causality for Realized Volatilities in High Dimensional Stationary Heterogeneous VARs conditioning on Realized Correlations
  3. HDGC_HVAR_multiple_RVCOV, Test multiple combinations Granger causality for realized volatilities in High Dimensional Stationary HVARs

All these functions ask you the following inputs:

In order to plot networks created via: HDGC_VAR_all_I0, HDGC_VAR_all, HDGC_HVAR_all, HDGC_HVAR_RV_RCoV_all you can use the function

This function takes as inputs:
Comb, stands for the result of one of the functions: HDGC_VAR_all_I0, HDGC_VAR_all, HDGC_HVAR_all, HDGC_HVAR_RV_RCoV_all
Stat_type="FS_cor", the default is "FS_cor" namely the small sample correction F test. Alternatively one can use "Asymp" for the asymptotic χ$^2$ test or "Asymp_Robust" for the heteroskedasticity-robust asymptotic χ$^2$ test.
alpha=0.01, the desired probability of type one error, default is 0.01.
multip_corr, a list A list: first element is logical, if TRUE a multiple testing correction using stats::p.adjust() is used. The second element of the list define the p.adjust.method=c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY","fdr", "none")). If the second element gets the name "APF_FDR" then APFr::apf_fdr which uses empirical Bayes is called and a third and fourth elements in the mutip_corr list are required: gamm=c(,,) requires a min, max and step length values to be set for the threshold on the p_values,fdr.apf=c(,) requires one or two values: either (NULL,value) or (value,NULL) if one wants to have specified amount of average power (fdr) no matter fdr (average power). If both (value,value) are given, the calculated threshold will find the closest combination to both apf and fdr desired. The last element of the list is logical: verbose=TRUE if one wants to know how much apf/fdr the testing has.
..., all parameters for the network plot: see example and igraph documentation.
cluster, a list: first element is logical, if TRUE a cluster plot using igraph::cluster_edge_betweenness() is plotted. Other elements are respectively: vertex.size, vertex.label.color,vertex.label.cex, vertex.label.dist, edge.curved (see igraph for details).

Data

In HDGCvar the user can find three simulated datasets, namely: sample_dataset_I0, sample_dataset_I1, and sample_RV. For details on sample_dataset_I0, sample_dataset_I1 look at the next section. In a nutshell, a stationary VAR(1) in first-differences, with 30 variables for 200 observations, is simulated in sample_dataset_I0 and it is then reverted into a levels VAR(2) taking the inverse-differences in sample_dataset_I1. sample_RV contains 30 Realized Volatility time series of sample T=200 obtined by simulating 30 random instances from a Heterogeneous Autoregressive (HAR) model with daily, weekly and monthly lags. The simulations are obtained using the "HARSimulate" function from the package "HARModel". The value for the constant and the daily, weekly and monthly lags coefficients are respectively 0.01, 0.36 ,0.28 , 0.28 and the standard deviation of the error term is 0.001.

Examples

Here is a simulated examples which show you how to use some of the features in HDGCvar.

library(HDGCvar)
library(igraph)

#First, let us simulate some time series
#The SimulVAR below simulates a (non-sparse) stationary VAR(1) with sample size T_ and number of variables g
SimulVAR=function(T_,g){
  coef1<-matrix(NA,nrow=g,ncol=g)
  for (i in 1:g) {
    for (j in 1:g) {
      coef1[i,j]<-((-1)^(abs(i-j)))*(0.4^(abs(i-j)+1))
    }
  }
  presample<-1
  T_new<-T_+presample
  eps1<-rnorm(ncol(coef1)*T_new,0,1)
  eps<-matrix(eps1,nrow=ncol(coef1))
  X <- matrix(nrow=ncol(coef1),ncol=T_new)
  X[,1] <- eps[,1]
  for (t in 2:T_new) {
    X[,t] <- (coef1)%*%X[,t-1]+eps[,t]
  }
  finseries<- X[,(1+presample):T_new]
  return(t(finseries))
}

#You can check the stationarity of the VAR(1) by simply observing:
g=20
coef<-matrix(NA,nrow=g,ncol=g)
  for (i in 1:g) {
    for (j in 1:g) {
      coef[i,j]<-((-1)^(abs(i-j)))*(0.4^(abs(i-j)+1))
    }
  }
max(eigen(coef)$values) #<1

#Let us fix some seed and simulate 20 non-stationary time series with sample size 200 from a VAR(2) by taking the inverse difference of the stationary series simulated.
set.seed(123)
dataset_I0<-as.matrix(SimulVAR(100,20))
dataset<-as.matrix(diffinv(dataset_I0))
colnames(dataset)<-c(paste(rep("Var",20),1:20))
#Note: we set T=100 and g=20 for simplicity of exposition. The functions in HDGCvar are able to handle much larger samples ("big data") and also high-dimensional settings where g>T.

#Let us plot the time series to get a feeling of what kind of data we are dealing with
ts.plot(dataset)

#Select the lag-length of the VAR using HDGCvar::lags_upbound_BIC
selected_lag<-lags_upbound_BIC(dataset,p_max=10)
print(selected_lag)
#Note: hooray! the selected lag is p=2 which it is correct given our simulated data! Of course in practice we could not know this.

#Suppose we are interested in testing whether variable name "Var5" Granger-causes variable name "Var1" given all other variables in your dataset.
interest_variables=list("GCto"="Var 1","GCfrom"="Var 5")

#By visual diagnostics on the series in our dataset from the previous plot we definitely have some time series with unit #roots and probably cointegration. Actually: we definitely know this as we created the process in such a way that all #the variables have a unit root, in practice obviously this would not be the case, we will come back to this point later.

#However, the nice feature of HDGCvar is that you can completely avoid any pre-tests of integration/cointegration of #your series. Letting d=1 be the augmentation needed, we can run:

HDGC_VAR(GCpair=interest_variables, data=dataset, p = selected_lag, d = 2, bound = 0.5 * nrow(dataset),
                     parallel = T, n_cores = 2) 

# What we did so far is correct but there is a caveat: we have simulated the stationary time series using SimulVAR and taken the inverse-differences to make them non-stationary, hence by construction we know for sure these series are all I(1) and no I(2) are present in our dataset. 

# It goes without saying that this would not be the case in practice: i.e. we would not know with probability 1 that our series are all maximum I(1), although in many empirical exercises this is the case nonetheless.

#However, we are not sure of any further roots of the system being close to one. As we simulated the data, we could easily see this (even though, again, in practice this would not be feasible):

coef<-matrix(NA,nrow=g,ncol=g)
for (i in 1:g) {
  for (j in 1:g) {
    coef[i,j]<-((-1)^(abs(i-j)))*(0.4^(abs(i-j)+1))
  }
}
diag(coef)<-1.4

coef1<-matrix(NA,nrow=g,ncol=g)
for (i in 1:g) {
  for (j in 1:g) {
    coef1[i,j]<-((-1)^(abs(i-j)))*(0.4^(abs(i-j)+1))
  }
}
coef1<-(-coef1)

lefthand<-rbind(coef, diag(1,nrow = nrow(coef), ncol = ncol(coef)) )
righthand<-rbind(coef1, diag(0,nrow = nrow(coef1), ncol = ncol(coef1)) )
companion<-cbind(lefthand,righthand)
eigen(companion)$values

#To safeguard us from I(2), or near-I(2) in the present case, we should put d=2.
#In fact, in the paper A. Hecq, L. Margaritella, S.Smeekes (2020) the use of d=2 is suggested as the standard since if some near unit root is present in the dataset, we risk that the augmentation d=1 takes care of only the pure I(1) therefore resulting in unsatisfoctory finite sample performaces.

#Now, suppose we are interested in testing multiple Granger causality relations
mult_interest_variables<-list(list("GCto"="Var 7", "GCfrom"="Var 19"),list("GCto"="Var 4", "GCfrom"="Var 16"))
HDGC_VAR_multiple(dataset, GCpairs=mult_interest_variables, p= selected_lag, d=2, bound = 0.5 * nrow(dataset),
                     parallel = T, n_cores = 2)

#Let us now estimate the full network of causality among the 20 series in dataset

network<-HDGC_VAR_all(dataset, p = selected_lag, d = 2, bound = 0.5 * nrow(dataset), 
                         parallel = TRUE, n_cores = 2)

#Now let us plot the estimated network.
#You can even do clustering of the connected series using the option cluster=list(T,5,"black",0.8,1,0).
#This uses Newman-Girvan (2002) algorithm implemented in igraph and based on edge-betweenness.
set.seed(888)
Plot_GC_all(network, Stat_type="FS_cor",alpha=0.01, multip_corr=list(F),directed=T, layout=layout.circle, main="Network",edge.arrow.size=.2,vertex.size=5, vertex.color=c("lightblue"), vertex.frame.color="blue",vertex.label.size=2,vertex.label.color="black",vertex.label.cex=0.6, vertex.label.dist=1, edge.curved=0,cluster=list(T,5,"black",0.8,1,0)) 

# After running HDGC_VAR with d=2, p=2, we got a warning which reads as follows: "To avoid spurious regression problems in the post-double-selection steps, unless you are certain that your series are maximum I(1), you might want to increase the lag-length p to be larger than d".

#Note that in this case we know that no spurious regression occurred as the series are maximum I(1) by construction and the estimated lag with lags_upbound_BIC() is p=2 so the use of d=2 is purely for statistical reasons.

#To understand this last point, it is perhaps easier to make a small toy-example: consider a trivariate system with variables y_t, x_t, z_t and suppose we are interested in testing Granger causality from x_t to y_t. 

#Suppose also that we would believe the time series in our dataset to be maximum I(1) and we ignore the recommendation of setting d=2 and hence we would set d=1. 
#Then, suppose further that either estimating the lag-length using lags_upbound_BIC() or by following a fishy oracle we would get/inpute p=1.

#The post-double-selection procedure will consist of p+1 steps of lasso regressions:
# 1) y_t on y_{t-1}+x_{t-1}+z_{t-1}
# 2) x_{t-1} on y_{t-1}+z_{t-1}
# Now, clearly step 2) is problematic as x_{t-1} is a non-stationary time series and on the right-hand-side there is not sufficient lags of itself to make it in first-differences.

#Now suppose we increase the lag to p=2, then
# 1) y_t on y_{t-1}+x_{t-1}+z_{t-1} + y_{t-2}+x_{t-2}+z_{t-2} 
# 2) x_{t-1} on y_{t-1}+z_{t-1}+ y_{t-2}+x_{t-2}+z_{t-2}
# 3) x_{t-2} on y_{t-1}+x_{t-1}+z_{t-1} + y_{t-2}+z_{t-2} 
# and now step 2 and 3 are free of spuriousness as either x_{t-2} or x_{t-1} can be used to make them in first-difference.

#However, the warning needs to be taken "with a grain of salt": if p=2, then d=2 would be "enough" and actually pretty good if the variables are actually I(1) (or I(0) and I(1)), where the good would come from the fact that we would be safeguarded from near-I(2) behaviors. 
#On the other hand, it would not be "enough" if some variables would actually be I(2) (and we leave to you to think through if this is something likely in your field) as in that case we would need two lags on the right hand side of the post-double-selection (lasso) regressions to make the left hand side in second-differences.


#Now as a humbling final exercise we could use p=3 and d=2 to compare what happens to the selection
HDGC_VAR(GCpair=interest_variables, data=dataset, p = 3, d = 2, bound = 0.5 * nrow(dataset),
                     parallel = T, n_cores = 2) 
# It is clear upon inspection of the selctions that the modifications of p does not come without efficiency loss for the statistical tests. 

References



Marga8/HDGCvar documentation built on May 25, 2024, 11:12 a.m.