locus: Fit sparse multivariate regression models using variational...

Description Usage Arguments Details Value References See Also Examples

View source: R/locus.R

Description

Variational approximation procedure fitting sparse multivariate regression models for combined selection of predictors and associated responses in high-dimensional set-ups. Dependence across responses linked to the same predictors is captured through the model hierarchical structure. The responses can be purely continuous, purely binary (logit or probit link fits), or a mix of continuous and binary variables.

Usage

1
2
3
4
locus(Y, X, p0_av, Z = NULL, V = NULL, link = "identity",
  ind_bin = NULL, list_hyper = NULL, list_init = NULL, list_cv = NULL,
  list_blocks = NULL, user_seed = NULL, tol = 0.001, maxit = 1000,
  save_hyper = FALSE, save_init = FALSE, verbose = TRUE)

Arguments

Y

Response data matrix of dimension n x d, where n is the number of observations and d is the number of response variables.

X

Input matrix of dimension n x p, where p is the number of candidate predictors. X cannot contain NAs. No intercept must be supplied.

p0_av

Prior average number of predictors expected to be included in the model. Must be NULL if list_init and list_hyper are both non-NULL or if list_cv is non-NULL. Can also be a vector of length p with entry s corresponding to the prior probability that candidate predictor s is associated with at least one response.

Z

Covariate matrix of dimension n x q, where q is the number of covariates. Variables in Z are not subject to selection. NULL if no covariate. Factor covariates must be supplied after transformation to dummy coding. No intercept must be supplied.

V

Annotation matrix of dimension p x r, where r is the number of variables representing external information on the candidate predictors which may make their selection more or less likely. NULL if no such information.

link

Response link. Must be "identity" for linear regression, "logit" for logistic regression, "probit" for probit regression, or "mix" for a mix of identity and probit link functions (in this case, the indices of the binary responses must be gathered in argument ind_bin, see below).

ind_bin

If link = "mix", vector of indices corresponding to the binary variables in Y. Must be NULL if link != "mix".

list_hyper

An object of class "hyper" containing the model hyperparameters. Must be filled using the set_hyper function or must be NULL for default hyperparameters.

list_init

An object of class "init" containing the initial variational parameters. Must be filled using the set_init function or be NULL for a default initialization.

list_cv

An object of class "cv" containing settings for choosing the prior average number of predictors expected to be included in the model, p0_av, by cross-validation. Must be filled using the set_cv function or must be NULL for no cross-validation. If non-NULL, p0_av, list_init and list_hyper must all be NULL. Cross-validation only available for link = "identity".

list_blocks

An object of class "blocks" containing settings for parallel inference on a partitioned predictor space. Must be filled using the set_blocks function or must be NULL for no partitioning.

user_seed

Seed set for reproducible default choices of hyperparameters (if list_hyper is NULL) and initial variational parameters (if list_init is NULL). Also used at the cross-validation stage (if list_cv is non-NULL). Default is NULL, no seed set.

tol

Tolerance for the stopping criterion.

maxit

Maximum number of iterations allowed.

save_hyper

If TRUE, the hyperparameters used for the model are saved as output.

save_init

If TRUE, the initial variational parameters used for the inference are saved as output.

verbose

If TRUE, messages are displayed during execution.

Details

The optimization is made using efficient block coordinate ascent schemes, for which convergence is ensured as the objective (elbo) is multiconcave for the selected blocks, i.e., it is concave in each block of parameters whose updates are made simultaneously, see Wu et al. (reference Section below).

The continuous response variables in Y (if any) will be centered before application of the variational algorithm, and the candidate predictors and covariates resp. in X and Z will be standardized. An intercept will be added if link is "logit", "probit" or "mix" (do not supply it in X or Z).

Value

An object of class "vb" containing the following variational estimates and settings:

gam_vb

Posterior inclusion probability matrix of dimension p x d. Entry (s, t) corresponds to the posterior probability of association between candidate predictor s and response t.

mu_alpha_vb

Matrix of dimension q x d whose entries are the posterior mean regression coefficients for the covariates provided in Z (if link = "logit", link = "logit" or link = "mix" also for the intercept). NULL if Z is NULL.

mu_c0_vb

Vector of length p whose entries are the posterior intercept coefficients at the level of probabilities of associations. Entry s represents the control of the proportion of responses associated with candidate predictor s which is not due to external annotations. NULL if V is NULL.

mu_c_vb

Matrix of dimension r x d. Entry (l, k) contains the effect of annotation l for response k on the probabilities of associations. NULL if V is NULL.

om_vb

Vector of length p containing the posterior mean of omega. Entry s controls the proportion of responses associated with candidate predictor s. NULL if V is non-NULL.

converged

A boolean indicating whether the algorithm has converged before reaching maxit iterations.

it

Final number of iterations.

lb_opt

Optimized variational lower bound for the marginal log-likelihood (ELBO).

diff_lb

Difference in ELBO between the last and penultimate iterations. This may be a useful diagnostic information when convergence has not been reached before maxit.

p_star

Vector of length 1 or p defining the applied sparsity control.

rmvd_cst_x, rmvd_cst_z

Vectors containing the indices of constant variables in X (resp. Z) removed prior to the analysis.

rmvd_coll_x, rmvd_coll_z

Vectors containing the indices of variables in X (resp. Z) removed prior to the analysis because collinear to other variables. The entry name indicates the corresponding variable kept in the analysis (i.e., that causing the collinearity for the entry in question).

list_hyper, list_init

If save_hyper, resp. save_init, TRUE, hyperparameters, resp. initial variational parameters, used for inference are saved as output.

References

H. Ruffieux, A. C. Davison, J. Hager, I. Irincheeva. Efficient inference for genetic association studies with multiple outcomes. Biostatistics, 2017.

Y. Xu, and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences, 6, pp.1758-1789, 2013.

See Also

set_hyper, set_init, set_cv, set_blocks

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
seed <- 123; set.seed(seed)

###################
## Simulate data ##
###################

## Examples using small problem sizes:
##
n <- 200; p <- 250; p0 <- 25; d <- 30; d0 <- 25; q <- 3; r <- 3

## Candidate predictors (subject to selection)
##
# Here we simulate common genetic variants (but any type of candidate
# predictors can be supplied).
# 0 = homozygous, major allele, 1 = heterozygous, 2 = homozygous, minor allele
#
X_act <- matrix(rbinom(n * p0, size = 2, p = 0.25), nrow = n)
X_inact <- matrix(rbinom(n * (p - p0), size = 2, p = 0.25), nrow = n)

shuff_x_ind <- sample(p)
X <- cbind(X_act, X_inact)[, shuff_x_ind]

bool_x_act <- shuff_x_ind <= p0

pat_act <- beta <- matrix(0, nrow = p0, ncol = d0)
pat_act[sample(p0*d0, floor(p0*d0/5))] <- 1
beta[as.logical(pat_act)] <-  rnorm(sum(pat_act))

## Covariates (not subject to selection)
##
Z <- matrix(rnorm(n * q), nrow = n)

alpha <-  matrix(rnorm(q * d), nrow = q)

## Gaussian responses
##
Y_act <- matrix(rnorm(n * d0, mean = X_act %*% beta, sd = 0.5), nrow = n)
Y_inact <- matrix(rnorm(n * (d - d0), sd = 0.5), nrow = n)
shuff_y_ind <- sample(d)
Y <- cbind(Y_act, Y_inact)[, shuff_y_ind] + Z %*% alpha

## Binary responses
##
Y_bin <- ifelse(Y > 0, 1, 0)

## Informative annotation variables
##
V <- matrix(rnorm(p * r), nrow = p)
V[bool_x_act, ] <- rnorm(p0 * r, mean = 2)


########################
## Infer associations ##
########################

## Continuous responses
##
# We take p0_av = p0 (known here); this choice may, in some cases, result in
# (too) conservative variable selections. In practice, it is advised to set
# p0_av as a slightly overestimated guess of p0, or perform cross-validation
# using function `set_cv'.

# No covariate
#
vb_g <- locus(Y = Y, X = X, p0_av = p0, link = "identity", user_seed = seed)

# With covariates
#
vb_g_z <- locus(Y = Y, X = X, p0_av = p0,  Z = Z, link = "identity",
                user_seed = seed)

# With external annotation variables
#
vb_g_v <- locus(Y = Y, X = X, p0_av = p0, Z = Z, V = V, link = "identity",
                user_seed = seed)

## Binary responses
##
vb_logit <- locus(Y = Y_bin, X = X, p0_av = p0, Z = Z, link = "logit",
                  user_seed = seed)

vb_probit <- locus(Y = Y_bin, X = X, p0_av = p0, Z = Z, link = "probit",
                   user_seed = seed)

## Mix of continuous and binary responses
##
Y_mix <- cbind(Y, Y_bin)
ind_bin <- (d+1):(2*d)

vb_mix <- locus(Y = Y_mix, X = X, p0_av = p0, Z = Z, link = "mix",
                ind_bin = ind_bin, user_seed = seed)

hruffieux/locus documentation built on Oct. 22, 2018, 6:54 a.m.