mvar: Estimating mixed Vector Autoregressive Model (mVAR)

View source: R/mvar.R

mvarR Documentation

Estimating mixed Vector Autoregressive Model (mVAR)

Description

Estimates mixed Vector Autoregressive Model (mVAR) via elastic-net regularized Generalized Linear Models

Usage

mvar(data, type, level, lambdaSeq, lambdaSel, lambdaFolds,
     lambdaGam, alphaSeq, alphaSel, alphaFolds, alphaGam, lags,
     consec, beepvar, dayvar, weights, threshold, method, binarySign, 
     scale, verbatim, pbar, warnings, saveModels, saveData, 
     overparameterize, thresholdCat, signInfo, ...)

Arguments

data

n x p data matrix.

type

p vector indicating the type of variable for each column in data. "g" for Gaussian, "p" for Poisson, "c" for categorical.

level

p vector indicating the number of categories of each variable. For continuous variables set to 1.

lambdaSeq

A sequence of lambdas that should be searched (see also lambdaSel). Defaults to NULL, which uses the glmnet default to select a lambda candidate sequence (recommended). See ?glmnet for details.

lambdaSel

Specifies the procedure for selecting the tuning parameter controlling the Lq-penalization. The two options are cross validation "CV" and the Extended Bayesian Information Criterion (EBIC) "EBIC". The EBIC performs well in selecting sparse graphs (see Barber and Drton, 2010 and Foygel and Drton, 2014). Note that when also searching the alpha parameter in the elastic net penalty, cross validation should be preferred, as the parameter vector will not necessarily be sparse anymore. The EBIC tends to be a bit more conservative than CV (see Haslbeck and Waldorp, 2016). CV can sometimes not be performed with categorical variables, because glmnet requires at least 2 events of each category of each categorical variable in each training-fold. Defaults to lambdaSel = "CV".

lambdaFolds

Number of folds in cross validation if lambdaSel = "CV".

lambdaGam

Hyperparameter gamma in the EBIC if lambdaSel = "EBIC". Defaults to lambdaGam = .25.

alphaSeq

A sequence of alpha parameters for the elastic net penality in [0,1] that should be searched (see also alphaSel). Defaults to alphaSeq = 1, which means that the lasso is being used. alphaSeq = 0 corresponds to an L2-penalty (Ridge regression). For details see Friedman, Hastie and Tibshirani (2010).

alphaSel

Specifies the procedure for selecting the alpha parameter in the elastic net penalty. The two options are cross validation "CV" and the Extended Bayesian Information Criterion (EBIC) "EBIC". The EBIC performs well in selecting sparse graphs (see Barber and Drton, 2010 and Foygel and Drton, 2014). Note that when also searching the alpha parameter in the elastic net penalty, cross validation should be preferred, as the parameter vector will not necessarily be sparse anymore. The EBIC tends to be a bit more conservative than CV (see Haslbeck and Waldorp, 2016). CV can sometimes not be performed with categorical variables, because glmnet requires at least 2 events of each category of each categorical variable in each training-fold. Defaults to alphaSel = "CV".

alphaFolds

Number of folds in cross validation if alphaSel = "CV"

alphaGam

Hyperparameter gamma in the EBIC if alphaSel = "EBIC". Defaults to alphaGam = .25.

lags

Vector of positive integers indicating the lags included in the mVAR model (e.g. 1:3 or c(1,3,5))

consec

An integer vector of length n, indicating the consecutiveness of measurement points of the rows in data. This means that rows for which the necessary (defined by the specified VAR model) measurements at previous time points are not available are excluded from the analysis. For instance, for a VAR model with lag 1 a consec vector of consec = c(1,2,3,5) would mean that the fourth row is excluded from the analysis, since no measurement 5-1=4 is available (next to the first row, for which also no previous measurement can be available). This is useful in many applications in which measurements are missing randomly or due to the design of the data collection (for example, respondents only respond during the hours they are awake). The "trimmed" dataset is returned in call$data_lagged if saveData = TRUE. Defaults to consec = NULL, which assumes that all measurements are consecutive, i.e. consec = 1:n. In this case only the first max(lags) lags are excluded to obtain the VAR design matrix.

beepvar

Together with the argument dayvar, this argument is an alternative to the consec argument (see above) to specify the consecutiveness of measurements. This is tailored to experience sampling method (ESM) studies, where the consectutiveness is defined by the number of notification on a given day (beepvar) and the given day (dayvar).

dayvar

See beepvar.

weights

A vector with n - max(lags) entries, indicating the weight for each observation. The mVAR design matrix has with n - max(lags) rows, because the first row must be predictable by the highest lag. The weights have to be on the scale [0, n - max(lags) ].

threshold

A threshold below which edge-weights are put to zero. This is done in order to guarantee a lower bound on the false-positive rate. threshold = "LW" refers to the threshold in Loh and Wainwright (2013), which was used in all previous versions of mgm. threshold = "HW" refers to the threshold in Haslbeck and Waldorp (2016). If threshold = "none" no thresholding is applied. Defaults to threshold = "LW".

method

Estimation method, currently only method = "glm".

binarySign

If binarySign = TRUE, a sign for the interaction within binary nodes and between binary and continuous nodes is provided in the output. Note that in this case the two categories of the binary variables have to be coded in 0,1. This is to ensure that the interpretation of the sign is unambigous: a positive sign of a parameter means that increasing the associated predictor results in a higher probability for category 1.

scale

If scale = TRUE, all Gaussian nodes (specified by "g" in the type argument) are centered and divided by their standard deviation. Scaling is recommended, because otherwise the penalization of a parameter depends on the variance of the associated predictor.

verbatim

If verbatim = TRUE, no warnings and no progress bar is shown. Defaults to verbatim = FALSE.

pbar

If pbar = TRUE, a progress bar is shown. Defaults to pbar = TRUE.

warnings

If warnings = TRUE, no warnigns are returned. Defaults to warnings = FALSE.

saveModels

If saveModels = FALSE, only information about the weighted adjacency matrix, and if d > 1 about the factor graph is provided in the output list. If saveModels = TRUE, all fitted parameters are additionally returned.

saveData

If saveData = TRUE, the data is saved in the output list. Defaults to saveData = FALSE.

overparameterize

If overparameterize = TRUE, mgm() uses over-parameterized design-matrices for each neighborhood regression; this means that a cross-lagged effect between two categorical variables with m and s categories is parameterized by m*s parameters. If overparameterize = FALSE the standard parameterization (in glmnet) with m*(s-1) parameters is used, where the first category of the predicting variable serves as reference category. If all variables are continuous both parameterizations are the same. The default is set to overparameterize = FALSE.

thresholdCat

If thresholdCat = FALSE, the thresholds of categorical variables are set to zero. Defaults to thresholdCat = TRUE for which the thresholds are esimated.

signInfo

If signInfo = TRUE, a message is shown in the console, indicating that the sign of estimates is stored separately. Defaults to signInfo = TRUE.

...

Additional arguments.

Details

See Haslbeck and Waldorp (2018) for details about how the mixed VAR model is estimated.

Value

The function returns a list with the following entries:

call

Contains all provided input arguments. If saveData = TRUE, it also contains the data.

wadj

A p x p x n_lags array, in which rows are predicted by columns, i.e. entry wadj[1, 2, 4] corresponds to the parameter(s) of variable 2 at time point t predicting variable 1 at time point t - z, where z is the fourth specified lag in lags and n_lags is the number of specified lags in lags. For interactions that involve more than two parameters (e.g. always for categorical variables with more than 2 categories), we take the arithmetic mean of the absolute value of all parameters. The full set of estimated parameters is saved in rawlags (see below).

signs

A p x p x n_lags array, specifying the signs corresponding to the entries of wadj (if defined), where n_lags is the number of specified lags in lags. 1/-1 indicate positive and negative relationships, respectively. 0 indicates that no sign is defined, which is the case for interactions that involve a categorical variable where an interaction can have more than one parameter. If binarySign = TRUE, a sign is calculated for interactions between binary variables and binary and continuous variables, where the interaction is still defined by one parameter and hence a sign can be specified. NA indicates that the corresponding parameter in wadj is zero.

edgecolor

A p x p x n_lags array of colors indicating the sign of each parameter. This array contains the same information is signs and is included for convenient plotting.

rawlags

List with entries equal to the number of specified lags in lags. Each entry is a nested list, with each p entries: the first level indicates the predicted variable, the second level the predictor variable. In case of categorical variables, interactions have more than one parameter.

intercepts

A list with p entries, which contain the intercept/thresholds for each node. In case a given node is categorical with m categories, there are m thresholds for this variable.

nodemodels

A list with p glmnet() models, from which all above output is computed. Also contains the coefficients models for the selected lambda and the applied tau threshold tau.

Author(s)

Jonas Haslbeck <jonashaslbeck@gmail.com>

References

Barber, R. F., & Drton, M. (2015). High-dimensional Ising model selection with Bayesian information criteria. Electronic Journal of Statistics, 9(1), 567-607.

Foygel, R., & Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. In Advances in neural information processing systems (pp. 604-612).

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1), 1.

Haslbeck, J. M. B., & Waldorp, L. J. (2020). mgm: Estimating time-varying Mixed Graphical Models in high-dimensional Data. Journal of Statistical Software, 93(8), pp. 1-46. DOI: 10.18637/jss.v093.i08

Loh, P. L., & Wainwright, M. J. (2012, December). Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. In NIPS (pp. 2096-2104).

Yang, E., Baker, Y., Ravikumar, P., Allen, G. I., & Liu, Z. (2014, April). Mixed Graphical Models via Exponential Families. In AISTATS (Vol. 2012, pp. 1042-1050).

Examples


## Not run: 


## We generate data from a mixed VAR model and then recover the model using mvar()

# 1) Define mVAR model
p <- 6 # Six variables
type <- c("c", "c", "c", "c", "g", "g") # 4 categorical, 2 gaussians
level <- c(2, 2, 4, 4, 1, 1) # 2 categoricals with m=2, 2 categoricals with m=4, two continuous
max_level <- max(level)

lags <- c(1, 3, 9) # include lagged effects of order 1, 3, 9
n_lags <- length(lags)

# Specify thresholds
thresholds <- list()
thresholds[[1]] <- rep(0, level[1])
thresholds[[2]] <- rep(0, level[2])
thresholds[[3]] <- rep(0, level[3])
thresholds[[4]] <- rep(0, level[4])
thresholds[[5]] <- rep(0, level[5])
thresholds[[6]] <- rep(0, level[6])

# Specify standard deviations for the Gaussians
sds <- rep(NULL, p)
sds[5:6] <- 1

# Create coefficient array
coefarray <- array(0, dim=c(p, p, max_level, max_level, n_lags))

# a.1) interaction between continuous 5<-6, lag=3
coefarray[5, 6, 1, 1, 2] <- .4
# a.2) interaction between 1<-3, lag=1
m1 <- matrix(0, nrow=level[2], ncol=level[4])
m1[1,1:2] <- 1
m1[2,3:4] <- 1
coefarray[1, 3, 1:level[2], 1:level[4], 1] <- m1
# a.3) interaction between 1<-5, lag=9
coefarray[1, 5, 1:level[1], 1:level[5], 3] <- c(0, 1)


# 2) Sample
set.seed(1)
dlist <- mvarsampler(coefarray = coefarray,
                     lags = lags,
                     thresholds = thresholds,
                     sds = sds,
                     type = type,
                     level = level,
                     N = 200,
                     pbar = TRUE)

# 3) Recover
set.seed(1)
mvar_obj <- mvar(data = dlist$data,
                 type = type,
                 level = level,
                 lambdaSel = "CV",
                 lags = c(1, 3, 9),
                 signInfo = FALSE,
                 overparameterize = F)

# Did we recover the true parameters?
mvar_obj$wadj[5, 6, 2] # cross-lagged effect of 6 on 2 over lag lags[2]
mvar_obj$wadj[1, 3, 1] # cross-lagged effect of 3 on 1 over lag lags[1]
mvar_obj$wadj[1, 5, 3] # cross-lagged effect of 1 on 5 over lag lags[3]

# How to get the exact parameter estimates?
# Example: the full parameters for the crossed-lagged interaction of 2 on 1 over lag lags[1]
mvar_obj$rawlags[[1]][[1]][[2]] 

# 4) Predict / Compute nodewise Error
pred_mvar <- predict.mgm(mvar_obj, dlist$data)

head(pred_mvar$predicted) # first 6 rows of predicted values
pred_mvar$errors # Nodewise errors

# For more examples see https://github.com/jmbh/mgmDocumentation


## End(Not run)


mgm documentation built on Sept. 8, 2023, 6:05 p.m.