atlasqtl: Flexible sparse regression for hierarchically-related...

View source: R/atlasqtl.R

atlasqtlR Documentation

Flexible sparse regression for hierarchically-related responses using annealed variational inference.

Description

Flexible sparse regression for hierarchically-related responses using annealed variational inference.

Usage

atlasqtl(
  Y,
  X,
  p0,
  anneal = c(1, 2, 10),
  tol = 0.1,
  maxit = 1000,
  user_seed = NULL,
  verbose = 1,
  list_hyper = NULL,
  list_init = NULL,
  save_hyper = FALSE,
  save_init = FALSE,
  full_output = FALSE,
  thinned_elbo_eval = TRUE,
  checkpoint_path = NULL,
  trace_path = NULL,
  add_collinear_back = FALSE
)

Arguments

Y

Response data matrix of dimension n x q, where n is the number of samples and q is the number of response variables.

X

Input matrix of dimension n x p, where p is the number of candidate predictors. X cannot contain NAs. No intercept must be supplied.

p0

Vector of size 2 whose entries are the prior expectation and variance of the number of predictors associated with each response. Must be NULL if list_init and list_hyper are both non-NULL.

anneal

Parameters for annealing scheme. Must be a vector whose first entry is the type of schedule: 1 = geometric spacing (default), 2 = harmonic spacing or 3 = linear spacing, the second entry is the initial temperature (default is 2), and the third entry is the temperature grid size (default is 10). If NULL, no annealing is performed.

tol

Tolerance for the stopping criterion (default is 0.1).

maxit

Maximum number of iterations allowed (default is 1000).

user_seed

Seed set for reproducible default choices of hyperparameters (if list_hyper is NULL) and initial variational parameters (if list_init is NULL). Default is NULL, no seed set.

verbose

Integer specifying the level of verbosity during execution: 0, for no message, 1 for standard verbosity (default), 2 for detailed verbosity.

list_hyper

An object of class "hyper" containing the model hyperparameters. Must be specified using the set_hyper function or set to NULL for default hyperparameters.

list_init

An object of class "init" containing the initial variational parameters. Must be specified using the set_init function or set to NULL for a default initialization.

save_hyper

If TRUE, the hyperparameters used for the model are returned.

save_init

If TRUE, the initial variational parameters used for the inference are returned (note that the size of the resulting objects is likely to be large). Default is FALSE.

full_output

If TRUE, the inferred variational parameters for all parameters are returned.

thinned_elbo_eval

If TRUE, the convergence criterion (variational lower bound on the marginal log-likelihood, or ELBO) is not evaluated at each iteration, in order to save computational resources, but after a given batch of iterations where the batch size is set in an adaptive fashion accounting for the number of iterations roughly expected until convergence.

checkpoint_path

Path where to save temporary checkpoint outputs. Default is NULL, for no checkpointing.

trace_path

Path where to save trace plot for the variance of hotspot propensities. Default is NULL, for no trace saved.

add_collinear_back

Boolean for re-adding the collinear variables X (removed as part of atlasqtl to ensure the stability of inference) to the final posterior summaries. Default is FALSE.

Details

atlasqtl implements a flexible hierarchical regression framework that allows information-sharing across responses and predictors, thereby enhancing the detection of weak effects. atlasqtl is tailored to the detection of hotspot predictors, i.e., predictors associated with many responses: it is based on a fully Bayesian model whereby the hotspot propensity is modelled using the horseshoe shrinkage prior: its global-local formulation shrinks noise globally and hence can accommodate highly sparse settings, while being robust to individual signals, thus leaving the effects of hotspots unshrunk. Inference is carried out using a scalable variational algorithm coupled with a novel simulated annealing procedure, which is applicable to large dimensions, i.e., thousands of response variables, and allows efficient exploration of multimodal distributions.

The columns of the response matrix Y are centered within the atlasqtl call, and the columns of the candidate predictor matrix X are standardized.

Value

An object of class "atlasqtl" containing the following variational estimates and settings:

beta_vb

Estimated effect size matrix of dimension p x q. Entry (s, t) corresponds to the variational posterior mean (mu_beta_vb_st x gam_vb_st) of the regression effect between candidate predictor s and response t.

gam_vb

Posterior inclusion probability matrix of dimension p x q. Entry (s, t) corresponds to the variational posterior probability of association between candidate predictor s and response t.

theta_vb

Vector of length p containing the variational posterior mean of the hotspot propensity. Entry s corresponds to the propensity of candidate predictor s to be associated with many responses.

X_beta_vb

Matrix of dimension n x q equal to the matrix product X beta_vb, where X is the rescaled predictor matrix with possible collinear predictor(s) excluded.

zeta_vb

Vector of length q containing the variational posterior mean of the response importance. Entry t corresponds to the propensity of response t have associated predictors.

converged

A boolean indicating whether the algorithm has converged before reaching maxit iterations.

it

Final number of iterations.

lb_opt

Optimized variational lower bound (ELBO) on the marginal log-likelihood.

diff_lb

Difference in ELBO between the last and penultimate iterations (to be used as a convergence diagnostic when maxit has been reached).

rmvd_cst_x

Vectors containing the indices of constant variables in X removed prior to the analysis.

rmvd_coll_x

Vectors containing the indices of variables in X removed prior to the analysis because collinear to other variables. The entry name indicates the corresponding variable kept in the analysis.

list_hyper, list_init

If save_hyper, resp. save_init, are TRUE, the hyperparameters, resp. initial variational parameters, used for inference are saved as output.

...

If full_output is TRUE all inferred variational parameters are returned.

References

Helene Ruffieux, Anthony C. Davison, Jorg Hager, Jamie Inshaw, Benjamin P. Fairfax, Sylvia Richardson, Leonardo Bottolo, A global-local approach for detecting hotspots in multiple-response regression, arXiv:1811.03334, 2018.

See Also

set_hyper, set_init.

Examples

seed <- 123; set.seed(seed)

###################
## Simulate data ##
###################

# Example with small problem sizes:
#
n <- 200; p <- 50; p_act <- 10; q <- 100; q_act <- 50

# Candidate predictors (subject to selection)
#
# Here example with common genetic variants under Hardy-Weinberg equilibrium
#
X_act <- matrix(rbinom(n * p_act, size = 2, p = 0.25), nrow = n)
X_inact <- matrix(rbinom(n * (p - p_act), size = 2, p = 0.25), nrow = n)

# shuffle indices 
shuff_x_ind <- sample(p)
shuff_y_ind <- sample(q)

X <- cbind(X_act, X_inact)[, shuff_x_ind]

# Association pattern and effect sizes
#
pat <- matrix(FALSE, ncol = q, nrow = p)
bool_x <- shuff_x_ind <= p_act
bool_y <- shuff_y_ind <= q_act

pat_act <- beta_act <- matrix(0, nrow = p_act, ncol = q_act)
pat_act[sample(p_act * q_act, floor(p_act * q_act / 5))] <- 1
beta_act[as.logical(pat_act)] <-  rnorm(sum(pat_act))

pat[bool_x, bool_y] <- pat_act

# Gaussian responses
#
Y_act <- matrix(rnorm(n * q_act, mean = X_act %*% beta_act), nrow = n)
Y_inact <- matrix(rnorm(n * (q - q_act)), nrow = n)

Y <- cbind(Y_act, Y_inact)[, shuff_y_ind]

########################
## Infer associations ##
########################

# Expectation and variance for the prior number of predictors associated with
# each response
#
p0 <- c(mean(colSums(pat)), 10)

res_atlas <- atlasqtl(Y = Y, X = X, p0 = p0, user_seed = seed)


hruffieux/atlasqtl documentation built on April 12, 2025, 12:54 p.m.