rgcca: Regularized (or Sparse) Generalized Canonical Correlation...


rgcca R Documentation

Regularized (or Sparse) Generalized Canonical Correlation Analysis (S/RGCCA)

Description

RGCCA is a generalization of regularized canonical correlation analysis to three or more sets of variables. SGCCA extends RGCCA to address the issue of variable selection.

Usage

rgcca(
  blocks,
  method = "rgcca",
  scale = TRUE,
  scale_block = "inertia",
  connection = NULL,
  scheme = "factorial",
  ncomp = 1,
  tau = 1,
  sparsity = 1,
  init = "svd",
  bias = TRUE,
  tol = 1e-08,
  response = NULL,
  superblock = FALSE,
  NA_method = "nipals",
  verbose = FALSE,
  quiet = TRUE,
  n_iter_max = 1000,
  comp_orth = TRUE
)

Arguments

blocks

A list that contains the J blocks of variables X1, X2, ..., XJ. Block Xj is a matrix of dimension n x p_j where n is the number of observations and p_j the number of variables.
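
As a minimal sketch of the expected structure, the following builds a list of three blocks from simulated data and checks that they share the same observations. The object names (blocks_sim, n_rows, p) and the data are hypothetical and only serve the illustration.

```r
# Hypothetical simulated data: n = 20 observations, three blocks
set.seed(1)
n <- 20
blocks_sim <- list(
  block1 = matrix(rnorm(n * 3), nrow = n),  # n x p_1 with p_1 = 3
  block2 = matrix(rnorm(n * 2), nrow = n),  # n x p_2 with p_2 = 2
  block3 = matrix(rnorm(n * 5), nrow = n)   # n x p_3 with p_3 = 5
)

# All blocks must describe the same n observations (same number of rows)
n_rows <- vapply(blocks_sim, nrow, integer(1))
p <- vapply(blocks_sim, ncol, integer(1))
stopifnot(length(unique(n_rows)) == 1)
```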

method

A character string indicating the multi-block component method to consider. See available_methods for the list of the available methods.

scale

Logical value indicating if blocks are standardized.

scale_block

Value indicating if each block is divided by a constant value. If TRUE or "inertia", each block is divided by the sum of eigenvalues of its empirical covariance matrix. If "lambda1", each block is divided by the square root of the highest eigenvalue of its empirical covariance matrix. Otherwise the blocks are not scaled. If standardization is applied (scale = TRUE), the block scaling is applied on the result of the standardization.
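
The two scaling constants named above can be computed by hand with base R. This is only a sketch that follows the wording of this argument description: it uses cov(), which relies on the 1/(n-1) estimator, and the exact normalization applied inside rgcca() may differ. All object names are hypothetical.

```r
set.seed(1)
X <- scale(matrix(rnorm(50 * 4), nrow = 50))  # standardized block, as with scale = TRUE
S <- cov(X)                                   # empirical covariance matrix (1/(n-1) estimator)
eig <- eigen(S, symmetric = TRUE, only.values = TRUE)$values

inertia <- sum(eig)  # sum of the eigenvalues, equal to the trace of S
lambda1 <- max(eig)  # highest eigenvalue

X_inertia <- X / inertia        # "inertia" rule as worded above
X_lambda1 <- X / sqrt(lambda1)  # "lambda1" rule as worded above
```

After the "lambda1" scaling, the highest eigenvalue of the covariance matrix of the block equals 1, so no block dominates the analysis through its leading direction of variance.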

connection

A symmetric matrix (J x J) that describes the relationships between blocks.
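
For illustration, two connection matrices for J = 3 blocks can be built as follows. C_full reproduces the fully connected design used in the Examples (1 - diag(3)); C_resp is a hypothetical design in which blocks 1 and 2 are each linked only to block 3.

```r
J <- 3

# Fully connected design: every block is linked to every other block
C_full <- 1 - diag(J)

# Hypothetical design: blocks 1 and 2 are each connected to block 3 only
C_resp <- matrix(0, nrow = J, ncol = J)
C_resp[3, 1:2] <- 1
C_resp[1:2, 3] <- 1

# A connection matrix must be symmetric
stopifnot(isSymmetric(C_full), isSymmetric(C_resp))
```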

scheme

Character string or a function giving the scheme function for covariance maximization among "horst" (the identity function), "factorial" (the squared values), "centroid" (the absolute values). The scheme function can be any continuously differentiable convex function, and a custom scheme function (e.g. function(x) x^4) can be passed explicitly as the scheme argument of the rgcca function. See (Tenenhaus et al., 2017) for details.
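
The named schemes can be written as plain R functions, which makes their effect on a set of covariances easy to compare. The function names (g_horst, etc.) and the covariance values in x are hypothetical and chosen for illustration only.

```r
g_horst <- function(x) x          # "horst": identity
g_factorial <- function(x) x^2    # "factorial": squared values
g_centroid <- function(x) abs(x)  # "centroid": absolute values
g_custom <- function(x) x^4       # user-defined scheme mentioned above

# Hypothetical covariances between pairs of block components
x <- c(-0.5, 0.2, 0.9)

# Contribution of these covariances to the criterion under each scheme
schemes <- list(horst = g_horst, factorial = g_factorial,
                centroid = g_centroid, custom = g_custom)
contrib <- vapply(schemes, function(g) sum(g(x)), numeric(1))
```

The identity scheme lets negative covariances offset positive ones, whereas the squared and absolute-value schemes reward strong covariances of either sign.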

ncomp

Vector of length J indicating the number of block components for each block.

tau

Either a 1 x J vector or a max(ncomp) x J matrix containing the values of the regularization parameters (default: tau = 1, for each block and each dimension). The regularization parameters vary from 0 (maximizing the correlation) to 1 (maximizing the covariance). If tau = "optimal", the regularization parameters are estimated for each block and each dimension using the Schafer and Strimmer (2005) analytical formula. If tau is a 1 x J vector, tau[j] is identical across the dimensions of block Xj. If tau is a matrix, tau[k, j] is associated with Xjk (kth residual matrix for block j). The regularization parameters can also be estimated using rgcca_permutation or rgcca_cv.
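
The two accepted shapes of tau can be sketched as follows for J = 3 blocks. The numerical values and the object names (tau_vec, tau_mat) are hypothetical.

```r
J <- 3
ncomp <- c(2, 2, 1)

# Vector form: one value per block, reused for every component of that block
tau_vec <- c(1, 0.5, 0)

# Matrix form: row k holds the values used for the k-th residual matrices
tau_mat <- matrix(
  c(1.0, 0.5, 0.0,   # component 1
    0.8, 0.5, 0.0),  # component 2
  nrow = max(ncomp), ncol = J, byrow = TRUE
)

# Regularization parameters must lie between 0 and 1
stopifnot(all(tau_vec >= 0 & tau_vec <= 1), all(tau_mat >= 0 & tau_mat <= 1))
```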

sparsity

Either a 1 x J vector or a max(ncomp) x J matrix encoding the L1 constraints applied to the outer weight vectors. The amount of sparsity varies between 1/sqrt(p_j) and 1 (larger values of sparsity correspond to less penalization). If sparsity is a vector, L1-penalties are the same for all the weights corresponding to the same block but different components:

for all h, |a_{j,h}|_{L_1} ≤ c_1[j] √{p_j},

with p_j the number of variables of X_j. If sparsity is a matrix, each row h defines the constraints applied to the weights corresponding to components h:

for all h, |a_{j,h}|_{L_1} ≤ c_1[h,j] √{p_j}.

It can be estimated by using rgcca_permutation.
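
The admissible range and the resulting L1 bounds can be checked numerically. In this sketch the block sizes, the sparsity values and the weight vector a3 are all hypothetical, and the weight vector is assumed to have unit L2 norm.

```r
p <- c(3, 2, 6)        # hypothetical numbers of variables p_j per block
lower <- 1 / sqrt(p)   # smallest admissible sparsity value for each block
sparsity <- c(0.8, 0.9, 0.6)
stopifnot(all(sparsity >= lower & sparsity <= 1))

# Bound on the L1 norm of each block weight vector: sparsity[j] * sqrt(p_j)
l1_bound <- sparsity * sqrt(p)

# Hypothetical unit-norm weight vector for block 3 with two non-zero entries
a3 <- c(0.8, 0.6, 0, 0, 0, 0)
satisfies_bound <- sum(abs(a3)) <= l1_bound[3]
```

For a weight vector of unit L2 norm, the L1 norm never exceeds sqrt(p_j), so a sparsity value of 1 leaves the constraint inactive, while values close to 1/sqrt(p_j) force most weights to zero.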

init

Character string giving the type of initialization to use in the algorithm. It can be either a Singular Value Decomposition ("svd") or a random initialization ("random") (default: "svd").

bias

A logical value selecting the biased (1/n) or unbiased (1/(n-1)) estimator of the variance/covariance (default: bias = TRUE).
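
The difference between the two estimators can be seen on a single variable. This is a base R sketch with hypothetical data; var() uses the 1/(n-1) estimator.

```r
set.seed(1)
x <- rnorm(10)
n <- length(x)

var_unbiased <- var(x)               # divides by n - 1, as with bias = FALSE
var_biased <- mean((x - mean(x))^2)  # divides by n, as with bias = TRUE

# The two estimators differ by the factor (n - 1) / n
ratio <- var_biased / var_unbiased
```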

tol

The stopping value for the convergence of the algorithm.

response

Numerical value giving the position of the response block. When the response argument is set, the supervised mode is automatically activated.

superblock

Boolean indicating the presence of a superblock (deflation strategy must be adapted when a superblock is used).

NA_method

Character string corresponding to the method used for handling missing values, either "nipals" or "complete" (default: "nipals").

  • "complete" corresponds to performing RGCCA on the fully observed observations (observations with missing values are removed)

  • "nipals" corresponds to performing the RGCCA algorithm on available data (NIPALS-type algorithm)
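
The "complete" strategy can be mimicked by hand with base R: an observation is kept only if it has no missing value in any block. This is a sketch of the idea on hypothetical data, not the package's internal code.

```r
set.seed(1)
blocks_na <- list(
  b1 = matrix(rnorm(12), nrow = 6),  # 6 observations, 2 variables
  b2 = matrix(rnorm(18), nrow = 6)   # 6 observations, 3 variables
)
blocks_na$b1[2, 1] <- NA  # observation 2 is incomplete in block 1
blocks_na$b2[5, 3] <- NA  # observation 5 is incomplete in block 2

# Keep only the observations that are fully observed in every block
keep <- Reduce(`&`, lapply(blocks_na, complete.cases))
blocks_complete <- lapply(blocks_na, function(x) x[keep, , drop = FALSE])
```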

verbose

Logical value indicating if the progress of the algorithm is reported while computing.

quiet

Logical value indicating if warning messages are reported.

n_iter_max

Integer giving the algorithm's maximum number of iterations.

comp_orth

Logical value indicating if the deflation should lead to orthogonal components or orthogonal weights.

Details

Consider J matrices X1, X2, ..., XJ that represent J sets of variables observed on the same set of n individuals. The matrices X1, X2, ..., XJ must have the same number of rows, but may (and usually will) have different numbers of columns. The aim of RGCCA is to study the relationships between these J blocks of variables. It constitutes a general framework for many multi-block data analysis methods (see Tenenhaus and Tenenhaus, 2011; Tenenhaus et al., 2017). It combines the power of multi-block data analysis methods (maximization of well-identified criteria) and the flexibility of PLS path modeling (the researcher decides which blocks are connected and which are not). Hence, the use of RGCCA requires the construction (user specified) of a design matrix C that characterizes the connections between blocks. Elements of the (symmetric) design matrix C = (c_jk) are non-negative (and usually equal to 1 if block j and block k are connected, and 0 otherwise).

The rgcca() function implements a monotonically and globally convergent algorithm, i.e. the bounded criterion to be maximized increases at each step of the iterative procedure and reaches, at convergence, a stationary point of the RGCCA optimization problem. Moreover, depending on the dimensionality of each block Xj, j = 1, ..., J, the primal (when n > p_j) algorithm or the dual (when n < p_j) algorithm is used (see Tenenhaus et al., 2015). Finally, a deflation strategy is used to compute several RGCCA block components (specified by ncomp) for each block. Block components of each block are guaranteed to be orthogonal. The so-called symmetric deflation is implemented (i.e. each block is deflated with respect to its own component). It should be noted that the number of components can differ from one block to another.

SGCCA extends RGCCA to address the issue of variable selection (Tenenhaus et al., 2014). Specifically, RGCCA is combined with an L1-penalty that gives rise to Sparse GCCA (SGCCA). The SGCCA algorithm is very similar to the RGCCA algorithm and keeps the same convergence properties (i.e. the bounded criterion to be maximized increases at each step of the iterative procedure and reaches, at convergence, a stationary point). Moreover, using a deflation strategy, sgcca() enables the computation of several SGCCA orthogonal block components (specified by ncomp) for each block.

The rgcca() function can handle missing values using a NIPALS-type algorithm (non-linear iterative partial least squares algorithm) described in (Tenenhaus et al., 2005). Guidelines describing how to use RGCCA in practice are provided in (Garali et al., 2018).
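
The deflation of a block with respect to its own component, mentioned in the Details, can be sketched in base R as a projection step. This is the standard regression-type deflation that yields orthogonal components; it is an illustration under that assumption and the package's internal implementation may differ. The data and the weight vectors a1 and a2 are hypothetical.

```r
set.seed(1)
X <- scale(matrix(rnorm(30 * 4), nrow = 30), scale = FALSE)  # centered block
a1 <- rnorm(4)
a1 <- a1 / sqrt(sum(a1^2))  # hypothetical first weight vector (unit norm)
y1 <- X %*% a1              # first block component

# Deflate the block with respect to its own component:
# remove from every column of X its projection onto y1
X_defl <- X - (y1 %*% crossprod(y1, X)) / drop(crossprod(y1))

# Any component built from the deflated block is orthogonal to y1
a2 <- rnorm(4)
y2 <- X_defl %*% a2
inner <- drop(crossprod(y1, y2))
```

Because every column of the deflated block is orthogonal to the first component, the second component is orthogonal to it whatever weight vector is chosen at the next stage.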

Value

A fitted rgcca object containing the following elements:

Y

List of J elements. Each element of the list Y is a matrix that contains the RGCCA block components for the corresponding block.

a

List of J elements. Each element of the list a is a matrix of block weight vectors for the corresponding block.

astar

List of J elements. Each column of astar[[j]] is a vector such that Y[[j]][, h] = blocks[[j]] %*% astar[[j]][, h].

tau

Regularization parameters used during the analysis.

crit

List of length max(ncomp) whose elements are vectors. Each vector of the list is related to one specific deflation stage and reports the values of the criterion for this stage across iterations.

primal_dual

A 1 x J vector that contains the formulation ("primal" or "dual") applied to each of the J blocks within the RGCCA algorithm.

AVE

List of numerical values giving the indicators of model quality based on the Average Variance Explained (AVE): AVE (for each block), AVE (outer model), AVE (inner model).

A

List that contains the J blocks of variables X1, X2, ..., XJ. Block Xj is a matrix of dimension n x p_j where p_j is the number of variables in X_j. These blocks are imputed when an imputation strategy is selected.

call

Call of the function.

References

Garali I, Adanyeguh IM, Ichou F, Perlbarg V, Seyer A, Colsch B, Moszer I, Guillemot V, Durr A, Mochel F, Tenenhaus A. (2018) A strategy for multimodal data integration: application to biomarkers identification in spinocerebellar ataxia. Briefings in Bioinformatics. 19(6):1356-1369.

Tenenhaus M., Tenenhaus A. and Groenen P. J. (2017). Regularized generalized canonical correlation analysis: a framework for sequential multiblock component methods. Psychometrika, 82(3), 737-777.

Tenenhaus A., Philippe C. and Frouin, V. (2015). Kernel generalized canonical correlation analysis. Computational Statistics and Data Analysis, 90, 114-131.

Tenenhaus A., Philippe C., Guillemot V., Le Cao K. A., Grill J. and Frouin, V. (2014), Variable selection for generalized canonical correlation analysis, Biostatistics, 15(3), pp. 569-583.

Tenenhaus A. and Tenenhaus M., (2011). Regularized Generalized Canonical Correlation Analysis, Psychometrika, 76(2), pp 257-284.

Schafer J. and Strimmer K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology 4:32.

Arnaud Gloaguen, Vincent Guillemot, Arthur Tenenhaus. An efficient algorithm to satisfy l1 and l2 constraints. 49emes Journees de Statistique, May 2017, Avignon, France. (hal-01630744)

See Also

plot.rgcca, print.rgcca, rgcca_cv, rgcca_permutation, rgcca_predict

Examples

####################
# Example 1: RGCCA #
####################
# Create the dataset
data(Russett)
blocks <- list(
  agriculture = Russett[, seq(3)],
  industry = Russett[, 4:5],
  politic = Russett[, 6:11]
)

# Blocks are fully connected; the factorial scheme and tau = 1 for all
# blocks are used by default
fit.rgcca <- rgcca(
  blocks = blocks, method = "rgcca", connection = 1 - diag(3),
  scheme = "factorial", tau = rep(1, 3)
)
print(fit.rgcca)
plot(fit.rgcca, type = "weight", block = 3)
politic <- as.vector(apply(Russett[, 9:11], 1, which.max))
plot(fit.rgcca,
  type = "sample", block = 1:2,
  comp = rep(1, 2), resp = politic
)

############################################
# Example 2: RGCCA and multiple components #
############################################
fit.rgcca <- rgcca(blocks,
  method = "rgcca",
  connection = 1 - diag(3), superblock = FALSE,
  tau = rep(1, 3), ncomp = c(2, 2, 2),
  scheme = "factorial", verbose = TRUE
)

politic <- as.vector(apply(Russett[, 9:11], 1, which.max))
plot(fit.rgcca,
  type = "sample", block = 1:2,
  comp = rep(1, 2), resp = politic
)

plot(fit.rgcca, type = "ave")
plot(fit.rgcca, type = "weight", block = 1)
plot(fit.rgcca, type = "loadings")
## Not run: 
##################################
# Example 3: Sparse GCCA (SGCCA) #
##################################

# Tune the model to find the best sparsity coefficients (all the blocks are
# connected together)
perm.out <- rgcca_permutation(blocks,
  n_cores = 1,
  par_type = "sparsity", n_perms = 10
)
print(perm.out)
plot(perm.out)

fit.sgcca <- rgcca(blocks, sparsity = perm.out$bestpenalties)
plot(fit.sgcca, type = "ave")

# Select the most significant variables
b <- rgcca_bootstrap(fit.sgcca, n_cores = 1, n_boot = 100)
plot(b, n_cores = 1)

##############################
# Example 4: Supervised mode #
##############################
# Tune the model for explaining the politic block
# (politic connected to the two other blocks)
cv.out <- rgcca_cv(blocks, response = 3, ncomp = 2, n_cores = 1)
print(cv.out)
plot(cv.out)

fit.rgcca <- rgcca(blocks,
  response = 3, ncomp = 2,
  tau = cv.out$bestpenalties
)
plot(fit.rgcca, type = "both")

b <- rgcca_bootstrap(fit.rgcca, n_cores = 1, n_boot = 10)
plot(b, n_cores = 1)

## End(Not run)


Tenenhaus/RGCCA documentation built on March 16, 2023, 2:04 p.m.