locus: Fit sparse multivariate regression models using variational...

View source: R/locus.R

locusR Documentation

Fit sparse multivariate regression models using variational inference.

Description

Variational approximation procedure fitting sparse multivariate regression models for combined selection of predictors and associated responses in high-dimensional set-ups. Dependence across responses linked to the same predictors is modelled through the model hierarchical structure. The responses can be purely continuous, purely binary (logit or probit link fits), or a mix of continuous and binary variables.

Usage

locus(
  Y,
  X,
  p0_av,
  Z = NULL,
  link = "identity",
  ind_bin = NULL,
  list_hyper = NULL,
  list_init = NULL,
  list_cv = NULL,
  list_blocks = NULL,
  list_groups = NULL,
  list_struct = NULL,
  user_seed = NULL,
  tol = 0.1,
  maxit = 1000,
  anneal = NULL,
  save_hyper = FALSE,
  save_init = FALSE,
  full_output = FALSE,
  verbose = TRUE,
  checkpoint_path = NULL
)

Arguments

Y

Response data matrix of dimension n x d, where n is the number of samples and d is the number of response variables.

X

Input matrix of dimension n x p, where p is the number of candidate predictors. X cannot contain NAs. No intercept must be supplied.

p0_av

Prior average number of predictors (or groups of predictors if list_groups is non-NULL) expected to be included in the model. Can also be a vector of length p (resp. of length the number of groups) with entry s corresponding to the prior probability that candidate predictor s (resp. group s) is associated with at least one response. Must be NULL if list_init and list_hyper are both non-NULL or if list_cv is non-NULL.

Z

Covariate matrix of dimension n x q, where q is the number of covariates. Variables in Z are not subject to selection. NULL if no covariate. Factor covariates must be supplied after transformation to dummy coding. No intercept must be supplied.

link

Response link. Must be "identity" for linear regression, "logit" for logistic regression, "probit" for probit regression, or "mix" for a mix of identity and probit link functions (in this case, the indices of the binary responses must be gathered in argument ind_bin, see below).

ind_bin

If link = "mix", vector of indices corresponding to the binary variables in Y. Must be NULL if link != "mix".

list_hyper

An object of class "hyper" containing the model hyperparameters. Must be filled using the set_hyper function or must be NULL for default hyperparameters.

list_init

An object of class "init" containing the initial variational parameters. Must be filled using the set_init function or be NULL for a default initialization.

list_cv

An object of class "cv" containing settings for choosing the prior average number of predictors expected to be included in the model, p0_av, by cross-validation. Must be filled using the set_cv function or must be NULL for no cross-validation. If non-NULL, p0_av, list_init and list_hyper must all be NULL. Cross-validation only available for link = "identity".

list_blocks

An object of class "blocks" containing settings for parallel inference on a partitioned predictor space. Must be filled using the set_blocks function or must be NULL for no partitioning.

list_groups

An object of class "groups" containing settings for group selection of candidate predictors. Must be filled using the set_groups function or must be NULL for group selection.

list_struct

An object of class "struct" containing settings for structure sparsity priors. Must be filled using the set_struct function or must be NULL for structured selection.

user_seed

Seed set for reproducible default choices of hyperparameters (if list_hyper is NULL) and initial variational parameters (if list_init is NULL). Also used at the cross-validation stage (if list_cv is non-NULL). Default is NULL, no seed set.

tol

Tolerance for the stopping criterion.

maxit

Maximum number of iterations allowed.

anneal

Parameters for annealing scheme. Must be a vector whose first entry is sets the type of ladder: 1 = geometric spacing, 2 = harmonic spacing or 3 = linear spacing, the second entry is the initial temperature, and the third entry is the ladder size. If NULL (default), no annealing is performed.

save_hyper

If TRUE, the hyperparameters used for the model are returned.

save_init

If TRUE, the initial variational parameters used for the inference are returned (note that the size of the resulting objects is likely to be large). Default is FALSE.

full_output

If TRUE, the inferred variational parameters for all parameters are returned.

verbose

If TRUE, messages are displayed during execution.

checkpoint_path

Path where to save temporary checkpoint outputs. Default is NULL, for no checkpointing.

Details

The optimization uses efficient block coordinate ascent schemes, for which convergence is ensured as the objective (elbo) is multiconcave for the selected blocks, i.e., it is concave in each block of parameters whose updates are made simultaneously, see Wu et al. (reference Section below).

The continuous response variables in Y (if any) will be centered before application of the variational algorithm, and the candidate predictors and covariates resp. in X and Z will be standardized. An intercept will be added if link is "logit", "probit" or "mix" (do not supply it in X or Z).

Value

An object of class "vb" containing the following variational estimates and settings:

gam_vb

Posterior inclusion probability matrix of dimension p x d. Entry (s, t) corresponds to the posterior probability of association between candidate predictor s and response t.

alpha_vb

Matrix of dimension q x d whose entries are the posterior mean regression coefficients for the covariates provided in Z (if link = "logit", link = "logit" or link = "mix" also for the intercept). NULL if Z is NULL.

om_vb

Vector of length p containing the posterior mean of omega. Entry s controls the proportion of responses associated with candidate predictor s.

converged

A boolean indicating whether the algorithm has converged before reaching maxit iterations.

it

Final number of iterations.

lb_opt

Optimized variational lower bound for the marginal log-likelihood (ELBO).

diff_lb

Difference in ELBO between the last and penultimate iterations. This may be a useful diagnostic information when convergence has not been reached before maxit.

p_star

Vector of length 1 or p defining the applied sparsity control.

rmvd_cst_x, rmvd_cst_z

Vectors containing the indices of constant variables in X (resp. Z) removed prior to the analysis.

rmvd_coll_x, rmvd_coll_z

Vectors containing the indices of variables in X (resp. Z) removed prior to the analysis because collinear to other variables. The entry name indicates the corresponding variable kept in the analysis (i.e., that causing the collinearity for the entry in question).

list_hyper, list_init

If save_hyper, resp. save_init, TRUE, hyperparameters, resp. initial variational parameters, used for inference are saved as output.

group_labels

If list_groups is non-NULL, labels of the groups to which the candidate predictor belong (these labels are gathered after removal of constant and collinear predictors, whose indices are stored in rmvd_cst_x and rmvd_coll_x).

...

Other specific outputs are possible depending on the model used.

References

H. Ruffieux, A. C. Davison, J. Hager, I. Irincheeva. Efficient inference for genetic association studies with multiple outcomes. Biostatistics, 2017.

Y. Xu, and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences, 6, pp.1758-1789, 2013.

See Also

set_hyper, set_init, set_cv, set_blocks, set_groups and set_struct.

Examples

seed <- 123; set.seed(seed)

###################
## Simulate data ##
###################

## Examples using small problem sizes:
##
n <- 200; p <- 250; p0 <- 25; d <- 30; d0 <- 25; q <- 3

## Candidate predictors (subject to selection)
##
# Here we simulate common genetic variants (but any type of candidate
# predictors can be supplied).
# 0 = homozygous, major allele, 1 = heterozygous, 2 = homozygous, minor allele
#
X_act <- matrix(rbinom(n * p0, size = 2, p = 0.25), nrow = n)
X_inact <- matrix(rbinom(n * (p - p0), size = 2, p = 0.25), nrow = n)

shuff_x_ind <- sample(p)
X <- cbind(X_act, X_inact)[, shuff_x_ind]

bool_x_act <- shuff_x_ind <= p0

pat_act <- beta <- matrix(0, nrow = p0, ncol = d0)
pat_act[sample(p0*d0, floor(p0*d0/5))] <- 1
beta[as.logical(pat_act)] <-  rnorm(sum(pat_act))

## Covariates (not subject to selection)
##
Z <- matrix(rnorm(n * q), nrow = n)

alpha <-  matrix(rnorm(q * d), nrow = q)

## Gaussian responses
##
Y_act <- matrix(rnorm(n * d0, mean = X_act %*% beta, sd = 0.5), nrow = n)
Y_inact <- matrix(rnorm(n * (d - d0), sd = 0.5), nrow = n)
shuff_y_ind <- sample(d)
Y <- cbind(Y_act, Y_inact)[, shuff_y_ind] + Z %*% alpha

## Binary responses
##
Y_bin <- ifelse(Y > 0, 1, 0)

########################
## Infer associations ##
########################

## Continuous responses
##
# We take p0_av = p0 (known here); this choice may, in some cases, result in
# (too) conservative variable selections. In practice, it is advised to set
# p0_av as a slightly overestimated guess of p0, or perform cross-validation
# using function `set_cv'.

# No covariate
#
vb_g <- locus(Y = Y, X = X, p0_av = p0, link = "identity", user_seed = seed)

# With covariates
#
vb_g_z <- locus(Y = Y, X = X, p0_av = p0,  Z = Z, link = "identity",
                user_seed = seed)

## Binary responses
##
vb_logit <- locus(Y = Y_bin, X = X, p0_av = p0, Z = Z, link = "logit",
                  user_seed = seed)

vb_probit <- locus(Y = Y_bin, X = X, p0_av = p0, Z = Z, link = "probit",
                   user_seed = seed)

## Mix of continuous and binary responses
##
Y_mix <- cbind(Y, Y_bin)
ind_bin <- (d+1):(2*d)

vb_mix <- locus(Y = Y_mix, X = X, p0_av = p0, Z = Z, link = "mix",
                ind_bin = ind_bin, user_seed = seed)


hruffieux/locus documentation built on Jan. 10, 2024, 10:07 p.m.