covdepGE: Covariate Dependent Graph Estimation

View source: R/main.R

covdepGER Documentation

Covariate Dependent Graph Estimation

Description

Model the conditional dependence structure of X as a function of Z as described in (1)

Usage

covdepGE(
  X,
  Z = NULL,
  hp_method = "hybrid",
  ssq = NULL,
  sbsq = NULL,
  pip = NULL,
  nssq = 5,
  nsbsq = 5,
  npip = 5,
  ssq_mult = 1.5,
  ssq_lower = 1e-05,
  snr_upper = 25,
  sbsq_lower = 1e-05,
  pip_lower = 1e-05,
  pip_upper = NULL,
  tau = NULL,
  norm = 2,
  center_X = TRUE,
  scale_Z = TRUE,
  alpha_tol = 1e-05,
  max_iter_grid = 10,
  max_iter = 100,
  edge_threshold = 0.5,
  sym_method = "mean",
  parallel = FALSE,
  num_workers = NULL,
  prog_bar = TRUE
)

Arguments

X

n \times p numeric matrix; data matrix. For best results, n should be greater than p

Z

NULL OR n \times q numeric matrix; extraneous covariates. If NULL, Z will be treated as constant for all observations, i.e.:

Z <- rep(0, nrow(X))

If Z is constant, the estimated graph will be homogeneous throughout the data. NULL by default

hp_method

character in c("grid_search","model_average","hybrid"); method for selecting hyperparameters from the the hyperparameter grid. The grid will be generated as the Cartesian product of ssq, sbsq, and pip. Fix X_j, the j-th column of X, as the response; then, the hyperparameters will be selected as follows:

  • If "grid_search", the point in the hyperparameter grid that maximizes the total ELBO summed across all n regressions will be selected

  • If "model_average", then all posterior quantities will be an average of the variational estimates resulting from the model fit for each point in the hyperparameter grid. The unnormalized averaging weights for each of the n regressions are the exponentiated ELBO

  • If "hybrid", then models will be averaged over pip as in "model_average", with \sigma^2 and \sigma_\beta^2 chosen for each \pi in pip by maximizing the total ELBO over the grid defined by the Cartesian product of ssq and sbsq as in "grid_search"

"hybrid" by default

ssq

NULL OR numeric vector with positive entries; candidate values of the hyperparameter \sigma^2 (prior residual variance). If NULL, ssq will be generated for each variable X_j fixed as the response as:

ssq <- seq(ssq_lower, ssq_upper, length.out = nssq)

NULL by default

sbsq

NULL OR numeric vector with positive entries; candidate values of the hyperparameter \sigma_\beta^2 (prior slab variance). If NULL, sbsq will be generated for each variable X_j fixed as the response as:

sbsq <- seq(sbsq_lower, sbsq_upper, length.out = nsbsq)

NULL by default

pip

NULL OR numeric vector with entries in (0, 1); candidate values of the hyperparameter \pi (prior inclusion probability). If NULL, pip will be generated for each variable X_j fixed as the response as:

pip <- seq(pip_lower, pi_upper, length.out = npip)

NULL by default

nssq

positive integer; number of points to generate for ssq if ssq is NULL. 5 by default

nsbsq

positive integer; number of points to generate for sbsq if sbsq is NULL. 5 by default

npip

positive integer; number of points to generate for pip if pip is NULL. 5 by default

ssq_mult

positive numeric; if ssq is NULL, then for each variable X_j fixed as the response:

ssq_upper <- ssq_mult * stats::var(X_j)

Then, ssq_upper will be the greatest value in ssq for variable X_j. 1.5 by default

ssq_lower

positive numeric; if ssq is NULL, then ssq_lower will be the least value in ssq. 1e-5 by default

snr_upper

positive numeric; upper bound on the signal-to-noise ratio. If sbsq is NULL, then for each variable X_j fixed as the response:

s2_sum <- sum(apply(X, 2, stats::var))
sbsq_upper <- snr_upper / (pip_upper * s2_sum)

Then, sbsq_upper will be the greatest value in sbsq. 25 by default

sbsq_lower

positive numeric; if sbsq is NULL, then sbsq_lower will be the least value in sbsq. 1e-5 by default

pip_lower

numeric in (0, 1); if pip is NULL, then pip_lower will be the least value in pip. 1e-5 by default

pip_upper

NULL OR numeric in (0, 1); if pip is NULL, then pip_upper will be the greatest value in pip. If sbsq is NULL, pip_upper will be used to calculate sbsq_upper. If NULL, pip_upper will be calculated for each variable X_j fixed as the response as:

lasso <- glmnet::cv.glmnet(X, X_j)
non0 <- sum(glmnet::coef.glmnet(lasso, s = "lambda.1se")[-1] != 0)
non0 <- min(max(non0, 1), p - 1)
pip_upper <- non0 / p

NULL by default

tau

NULL OR positive numeric OR numeric vector of length n with positive entries; bandwidth parameter. Greater values allow for more information to be shared between observations. Allows for global or observation-specific specification. If NULL, use 2-step KDE methodology as described in (2) to calculate observation-specific bandwidths. NULL by default

norm

numeric in [1, \infty]; norm to use when calculating weights. Inf results in infinity norm. 2 by default

center_X

logical; if TRUE, center X column-wise to mean 0. TRUE by default

scale_Z

logical; if TRUE, center and scale Z column-wise to mean 0, standard deviation 1 prior to calculating the weights. TRUE by default

alpha_tol

positive numeric; end CAVI when the Frobenius norm of the change in the alpha matrix is within alpha_tol. 1e-5 by default

max_iter_grid

positive integer; if tolerance criteria has not been met by max_iter_grid iterations during grid search, end CAVI. After grid search has completed, CAVI is performed with the final hyperparameters selected by grid search for at most max_iter iterations. Does not apply to hp_method = "model_average". 10 by default

max_iter

positive integer; if tolerance criteria has not been met by max_iter iterations, end CAVI. 100 by default

edge_threshold

numeric in (0, 1); a graph for each observation will be constructed by including an edge between variable i and variable j if, and only if, the (i, j) entry of the symmetrized posterior inclusion probability matrix corresponding to the observation is greater than edge_threshold. 0.5 by default

sym_method

character in c("mean","max","min"); to symmetrize the posterior inclusion probability matrix for each observation, the (i, j) and (j, i) entries will be post-processed as sym_method applied to the (i, j) and (j, i) entries. "mean" by default

parallel

logical; if TRUE, hyperparameter selection and CAVI for each of the p variables will be performed in parallel using foreach. Parallel backend may be registered prior to making a call to covdepGE. If no active parallel backend can be detected, then parallel backend will be automatically registered using:

doParallel::registerDoParallel(num_workers)

FALSE by default

num_workers

NULL OR positive integer less than or equal to parallel::detectCores(); argument to doParallel::registerDoParallel if parallel = TRUE and no parallel backend is detected. If NULL, then:

num_workers <- floor(parallel::detectCores() / 2)

NULL by default

prog_bar

logical; if TRUE, then a progress bar will be displayed denoting the number of remaining variables to fix as the response and perform CAVI. If parallel, no progress bar will be displayed. TRUE by default

Value

Returns object of class covdepGE with the following values:

graphs

list with the following values:

  • graphs: list of n numeric matrices of dimension p \times p; the l-th matrix is the adjacency matrix for the l-th observation

  • unique_graphs: list; the l-th element is a list containing the l-th unique graph and the indices of the observation(s) corresponding to this graph

  • inclusion_probs_sym: list of n numeric matrices of dimension p \times p; the l-th matrix is the symmetrized posterior inclusion probability matrix for the l-th observation

  • inclusion_probs_asym: list of n numeric matrices of dimension p \times p; the l-th matrix is the posterior inclusion probability matrix for the l-th observation prior to symmetrization

variational_params

list with the following values:

  • alpha: list of p numeric matrices of dimension n \times (p - 1); the (i, j) entry of the k-th matrix is the variational approximation to the posterior inclusion probability of the j-th variable in a weighted regression with variable k fixed as the response, where the weights are taken with respect to observation i

  • mu: list of p numeric matrices of dimension n \times (p - 1); the (i, j) entry of the k-th matrix is the variational approximation to the posterior slab mean for the j-th variable in a weighted regression with variable k fixed as the response, where the weights are taken with respect to observation i

  • ssq_var: list of p numeric matrices of dimension n \times (p - 1); the (i, j) entry of the k-th matrix is the variational approximation to the posterior slab variance for the j-th variable in a weighted regression with variable k fixed as the response, where the weights are taken with respect to observation i

hyperparameters

list of p lists; the j-th list has the following values for variable j fixed as the response:

  • grid: matrix of candidate hyperparameter values, corresponding ELBO, and iterations to converge

  • final: the final hyperparameters chosen by grid search and the ELBO and iterations to converge for these hyperparameters

model_details

list with the following values:

  • elapsed: amount of time to fit the model

  • n: number of observations

  • p: number of variables

  • ELBO: ELBO summed across all observations and variables. If hp_method is "model_average" or "hybrid", this ELBO is averaged across the hyperparameter grid using the model averaging weights for each variable

  • num_unique: number of unique graphs

  • grid_size: number of points in the hyperparameter grid

  • args: list containing all passed arguments of length 1

weights

list with the following values:

  • weights: n\times n numeric matrix. The (i, j) entry is the similarity weight of the i-th observation with respect to the j-th observation using the j-th observation's bandwidth

  • bandwidths: numeric vector of length n. The i-th entry is the bandwidth for the i-th observation

References

(1) Sutanoy Dasgupta, Peng Zhao, Jacob Helwig, Prasenjit Ghosh, Debdeep Pati, and Bani Mallick. An Approximate Bayesian Approach to Covariate-dependent Graphical Modeling. arXiv preprint, 1–64, 2023.

(2) Sutanoy Dasgupta, Debdeep Pati, and Anuj Srivastava. A Two-Step Geometric Framework For Density Modeling. Statistica Sinica, 30(4):2155–2177, 2020.

Examples

## Not run: 
library(ggplot2)

# get the data
set.seed(12)
data <- generateData()
X <- data$X
Z <- data$Z
interval <- data$interval
prec <- data$true_precision

# get overall and within interval sample sizes
n <- nrow(X)
n1 <- sum(interval == 1)
n2 <- sum(interval == 2)
n3 <- sum(interval == 3)

# visualize the distribution of the extraneous covariate
ggplot(data.frame(Z = Z, interval = as.factor(interval))) +
  geom_histogram(aes(Z, fill = interval), color = "black", bins = n %/% 5)

# visualize the true precision matrices in each of the intervals

# interval 1
matViz(prec[[1]], incl_val = TRUE) +
  ggtitle(paste0("True precision matrix, interval 1, observations 1,...,", n1))

# interval 2 (varies continuously with Z)
cat("\nInterval 2, observations ", n1 + 1, ",...,", n1 + n2, sep = "")
int2_mats <- prec[interval == 2]
int2_inds <- c(5, n2 %/% 2, n2 - 5)
lapply(int2_inds, function(j) matViz(int2_mats[[j]], incl_val = TRUE) +
         ggtitle(paste("True precision matrix, interval 2, observation", j + n1)))

# interval 3
matViz(prec[[length(prec)]], incl_val = TRUE) +
  ggtitle(paste0("True precision matrix, interval 3, observations ",
                 n1 + n2 + 1, ",...,", n1 + n2 + n3))

# fit the model and visualize the estimated graphs
(out <- covdepGE(X, Z))
plot(out)

# visualize the posterior inclusion probabilities for variables (1, 3) and (1, 2)
inclusionCurve(out, 1, 2)
inclusionCurve(out, 1, 3)

## End(Not run)

JacobHelwig/covdepGE documentation built on April 11, 2024, 7:22 a.m.