locus: Fit sparse multivariate regression models using variational...
In hruffieux/locus: Large-Scale Variational Inference for Combined Selection of Covariate and Response Variables in Regression Models

View source: R/locus.R

locus

R Documentation

Fit sparse multivariate regression models using variational inference.

Description

Variational approximation procedure fitting sparse multivariate regression models for combined selection of predictors and associated responses in high-dimensional set-ups. Dependence across responses linked to the same predictors is modelled through the model hierarchical structure. The responses can be purely continuous, purely binary (logit or probit link fits), or a mix of continuous and binary variables.

Usage

locus(
  Y,
  X,
  p0_av,
  Z = NULL,
  link = "identity",
  ind_bin = NULL,
  list_hyper = NULL,
  list_init = NULL,
  list_cv = NULL,
  list_blocks = NULL,
  list_groups = NULL,
  list_struct = NULL,
  user_seed = NULL,
  tol = 0.1,
  maxit = 1000,
  anneal = NULL,
  save_hyper = FALSE,
  save_init = FALSE,
  full_output = FALSE,
  verbose = TRUE,
  checkpoint_path = NULL
)

Arguments

`Y`	Response data matrix of dimension n x d, where n is the number of samples and d is the number of response variables.
`X`	Input matrix of dimension n x p, where p is the number of candidate predictors. `X` cannot contain NAs. No intercept must be supplied.
`p0_av`	Prior average number of predictors (or groups of predictors if `list_groups` is non-`NULL`) expected to be included in the model. Can also be a vector of length p (resp. of length the number of groups) with entry s corresponding to the prior probability that candidate predictor s (resp. group s) is associated with at least one response. Must be `NULL` if `list_init` and `list_hyper` are both non-`NULL` or if `list_cv` is non-`NULL`.
`Z`	Covariate matrix of dimension n x q, where q is the number of covariates. Variables in `Z` are not subject to selection. `NULL` if no covariate. Factor covariates must be supplied after transformation to dummy coding. No intercept must be supplied.
`link`	Response link. Must be "`identity`" for linear regression, "`logit`" for logistic regression, "`probit`" for probit regression, or "`mix`" for a mix of identity and probit link functions (in this case, the indices of the binary responses must be gathered in argument `ind_bin`, see below).
`ind_bin`	If `link = "mix"`, vector of indices corresponding to the binary variables in `Y`. Must be `NULL` if `link != "mix"`.
`list_hyper`	An object of class "`hyper`" containing the model hyperparameters. Must be filled using the `set_hyper` function or must be `NULL` for default hyperparameters.
`list_init`	An object of class "`init`" containing the initial variational parameters. Must be filled using the `set_init` function or be `NULL` for a default initialization.
`list_cv`	An object of class "`cv`" containing settings for choosing the prior average number of predictors expected to be included in the model, `p0_av`, by cross-validation. Must be filled using the `set_cv` function or must be `NULL` for no cross-validation. If non-`NULL`, `p0_av`, `list_init` and `list_hyper` must all be `NULL`. Cross-validation only available for `link = "identity"`.
`list_blocks`	An object of class "`blocks`" containing settings for parallel inference on a partitioned predictor space. Must be filled using the `set_blocks` function or must be `NULL` for no partitioning.
`list_groups`	An object of class "`groups`" containing settings for group selection of candidate predictors. Must be filled using the `set_groups` function or must be `NULL` for group selection.
`list_struct`	An object of class "`struct`" containing settings for structure sparsity priors. Must be filled using the `set_struct` function or must be `NULL` for structured selection.
`user_seed`	Seed set for reproducible default choices of hyperparameters (if `list_hyper` is `NULL`) and initial variational parameters (if `list_init` is `NULL`). Also used at the cross-validation stage (if `list_cv` is non-`NULL`). Default is `NULL`, no seed set.
`tol`	Tolerance for the stopping criterion.
`maxit`	Maximum number of iterations allowed.
`anneal`	Parameters for annealing scheme. Must be a vector whose first entry is sets the type of ladder: 1 = geometric spacing, 2 = harmonic spacing or 3 = linear spacing, the second entry is the initial temperature, and the third entry is the ladder size. If `NULL` (default), no annealing is performed.
`save_hyper`	If `TRUE`, the hyperparameters used for the model are returned.
`save_init`	If `TRUE`, the initial variational parameters used for the inference are returned (note that the size of the resulting objects is likely to be large). Default is `FALSE`.
`full_output`	If `TRUE`, the inferred variational parameters for all parameters are returned.
`verbose`	If `TRUE`, messages are displayed during execution.
`checkpoint_path`	Path where to save temporary checkpoint outputs. Default is `NULL`, for no checkpointing.

Details

The optimization uses efficient block coordinate ascent schemes, for which convergence is ensured as the objective (elbo) is multiconcave for the selected blocks, i.e., it is concave in each block of parameters whose updates are made simultaneously, see Wu et al. (reference Section below).

The continuous response variables in Y (if any) will be centered before application of the variational algorithm, and the candidate predictors and covariates resp. in X and Z will be standardized. An intercept will be added if link is "logit", "probit" or "mix" (do not supply it in X or Z).

Value

An object of class "vb" containing the following variational estimates and settings:

`gam_vb`	Posterior inclusion probability matrix of dimension p x d. Entry (s, t) corresponds to the posterior probability of association between candidate predictor s and response t.
`alpha_vb`	Matrix of dimension q x d whose entries are the posterior mean regression coefficients for the covariates provided in `Z` (if `link = "logit"`, `link = "logit"` or `link = "mix"` also for the intercept). `NULL` if `Z` is `NULL`.
`om_vb`	Vector of length p containing the posterior mean of omega. Entry s controls the proportion of responses associated with candidate predictor s.
`converged`	A boolean indicating whether the algorithm has converged before reaching `maxit` iterations.
`it`	Final number of iterations.
`lb_opt`	Optimized variational lower bound for the marginal log-likelihood (ELBO).
`diff_lb`	Difference in ELBO between the last and penultimate iterations. This may be a useful diagnostic information when convergence has not been reached before `maxit`.
`p_star`	Vector of length 1 or p defining the applied sparsity control.
`rmvd_cst_x, rmvd_cst_z`	Vectors containing the indices of constant variables in `X` (resp. `Z`) removed prior to the analysis.
`rmvd_coll_x, rmvd_coll_z`	Vectors containing the indices of variables in `X` (resp. `Z`) removed prior to the analysis because collinear to other variables. The entry name indicates the corresponding variable kept in the analysis (i.e., that causing the collinearity for the entry in question).
`list_hyper, list_init`	If `save_hyper`, resp. `save_init`, `TRUE`, hyperparameters, resp. initial variational parameters, used for inference are saved as output.
`group_labels`	If `list_groups` is non-`NULL`, labels of the groups to which the candidate predictor belong (these labels are gathered after removal of constant and collinear predictors, whose indices are stored in `rmvd_cst_x` and `rmvd_coll_x`).
`...`	Other specific outputs are possible depending on the model used.

References

H. Ruffieux, A. C. Davison, J. Hager, I. Irincheeva. Efficient inference for genetic association studies with multiple outcomes. Biostatistics, 2017.

Y. Xu, and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences, 6, pp.1758-1789, 2013.

Examples

seed <- 123; set.seed(seed)

###################
## Simulate data ##
###################

## Examples using small problem sizes:
##
n <- 200; p <- 250; p0 <- 25; d <- 30; d0 <- 25; q <- 3

## Candidate predictors (subject to selection)
##
# Here we simulate common genetic variants (but any type of candidate
# predictors can be supplied).
# 0 = homozygous, major allele, 1 = heterozygous, 2 = homozygous, minor allele
#
X_act <- matrix(rbinom(n * p0, size = 2, p = 0.25), nrow = n)
X_inact <- matrix(rbinom(n * (p - p0), size = 2, p = 0.25), nrow = n)

shuff_x_ind <- sample(p)
X <- cbind(X_act, X_inact)[, shuff_x_ind]

bool_x_act <- shuff_x_ind <= p0

pat_act <- beta <- matrix(0, nrow = p0, ncol = d0)
pat_act[sample(p0*d0, floor(p0*d0/5))] <- 1
beta[as.logical(pat_act)] <-  rnorm(sum(pat_act))

## Covariates (not subject to selection)
##
Z <- matrix(rnorm(n * q), nrow = n)

alpha <-  matrix(rnorm(q * d), nrow = q)

## Gaussian responses
##
Y_act <- matrix(rnorm(n * d0, mean = X_act %*% beta, sd = 0.5), nrow = n)
Y_inact <- matrix(rnorm(n * (d - d0), sd = 0.5), nrow = n)
shuff_y_ind <- sample(d)
Y <- cbind(Y_act, Y_inact)[, shuff_y_ind] + Z %*% alpha

## Binary responses
##
Y_bin <- ifelse(Y > 0, 1, 0)

########################
## Infer associations ##
########################

## Continuous responses
##
# We take p0_av = p0 (known here); this choice may, in some cases, result in
# (too) conservative variable selections. In practice, it is advised to set
# p0_av as a slightly overestimated guess of p0, or perform cross-validation
# using function `set_cv'.

# No covariate
#
vb_g <- locus(Y = Y, X = X, p0_av = p0, link = "identity", user_seed = seed)

# With covariates
#
vb_g_z <- locus(Y = Y, X = X, p0_av = p0,  Z = Z, link = "identity",
                user_seed = seed)

## Binary responses
##
vb_logit <- locus(Y = Y_bin, X = X, p0_av = p0, Z = Z, link = "logit",
                  user_seed = seed)

vb_probit <- locus(Y = Y_bin, X = X, p0_av = p0, Z = Z, link = "probit",
                   user_seed = seed)

## Mix of continuous and binary responses
##
Y_mix <- cbind(Y, Y_bin)
ind_bin <- (d+1):(2*d)

vb_mix <- locus(Y = Y_mix, X = X, p0_av = p0, Z = Z, link = "mix",
                ind_bin = ind_bin, user_seed = seed)

hruffieux/locus documentation built on Jan. 10, 2024, 10:07 p.m.