# rgcca: Regularized (or Sparse) Generalized Canonical Correlation...

 rgcca R Documentation

## Regularized (or Sparse) Generalized Canonical Correlation Analysis (S/RGCCA)

### Description

RGCCA is a generalization of regularized canonical correlation analysis to three or more sets of variables. SGCCA extends RGCCA to address the issue of variable selection.

### Usage

rgcca(
  blocks,
  method = "rgcca",
  scale = TRUE,
  scale_block = "inertia",
  connection = NULL,
  scheme = "factorial",
  ncomp = 1,
  tau = 1,
  sparsity = 1,
  init = "svd",
  bias = TRUE,
  tol = 1e-08,
  response = NULL,
  superblock = FALSE,
  NA_method = "nipals",
  verbose = FALSE,
  quiet = TRUE,
  n_iter_max = 1000,
  comp_orth = TRUE
)


### Arguments

| Argument | Description |
|---|---|
| blocks | A list that contains the J blocks of variables X1, X2, ..., XJ. Block Xj is a matrix of dimension n x p_j, where n is the number of observations and p_j the number of variables. |
| method | A character string indicating the multi-block component method to consider. See available_methods for the list of available methods. |
| scale | Logical value indicating if blocks are standardized. |
| scale_block | Value indicating if each block is divided by a constant value. If TRUE or "inertia", each block is divided by the sum of eigenvalues of its empirical covariance matrix. If "lambda1", each block is divided by the square root of the highest eigenvalue of its empirical covariance matrix. Otherwise the blocks are not scaled. If standardization is applied (scale = TRUE), the block scaling is applied to the result of the standardization. |
| connection | A symmetric matrix (J x J) that describes the relationships between blocks. |
| scheme | Character string or function giving the scheme function for covariance maximization, among "horst" (the identity function), "factorial" (the squared values) and "centroid" (the absolute values). The scheme function can be any continuously differentiable convex function, and it is possible to pass the scheme function explicitly (e.g. function(x) x^4) as an argument of the rgcca function. See Tenenhaus et al. (2017) for details. |
| ncomp | Vector of length J indicating the number of block components for each block. |
| tau | Either a 1 x J vector or a max(ncomp) x J matrix containing the values of the regularization parameters (default: tau = 1, for each block and each dimension). The regularization parameters vary from 0 (maximizing the correlation) to 1 (maximizing the covariance). If tau = "optimal", the regularization parameters are estimated for each block and each dimension using the Schafer and Strimmer (2005) analytical formula. If tau is a 1 x J vector, tau[j] is identical across the dimensions of block Xj. If tau is a matrix, tau[k, j] is associated with Xjk (kth residual matrix for block j). The regularization parameters can also be estimated using rgcca_permutation or rgcca_cv. |
| sparsity | Either a 1 x J vector or a max(ncomp) x J matrix encoding the L1 constraints applied to the outer weight vectors. The amount of sparsity varies between 1/sqrt(p_j) and 1 (larger values of sparsity correspond to less penalization). If sparsity is a vector, the L1 penalties are the same for all the weights corresponding to the same block but different components: for all h, $\lVert a_{j,h} \rVert_1 \le c_1[j] \sqrt{p_j}$, with p_j the number of variables of X_j. If sparsity is a matrix, row h defines the constraints applied to the weights corresponding to component h: for all h, $\lVert a_{j,h} \rVert_1 \le c_1[h,j] \sqrt{p_j}$. It can be estimated using rgcca_permutation. |
| init | Character string giving the type of initialization to use in the algorithm: either Singular Value Decomposition ("svd") or random initialization ("random") (default: "svd"). |
| bias | A logical value selecting the biased (1/n) or unbiased (1/(n-1)) estimator of the variance/covariance (default: bias = TRUE). |
| tol | The stopping value for the convergence of the algorithm. |
| response | Numerical value giving the position of the response block. When the response argument is filled, the supervised mode is automatically activated. |
| superblock | Boolean indicating the presence of a superblock (the deflation strategy must be adapted when a superblock is used). |
| NA_method | Character string corresponding to the method used for handling missing values ("nipals" or "complete"; default: "nipals"). "complete" performs RGCCA on the fully observed observations (observations with missing values are removed). "nipals" performs the RGCCA algorithm on the available data (NIPALS-type algorithm). |
| verbose | Logical value indicating if the progress of the algorithm is reported while computing. |
| quiet | Logical value indicating if warning messages are reported. |
| n_iter_max | Integer giving the algorithm's maximum number of iterations. |
| comp_orth | Logical value indicating if the deflation should lead to orthogonal components or orthogonal weights. |
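The shapes expected by connection, scheme, tau and sparsity can be easier to see on a small case. The following is a minimal base R sketch, assuming J = 3 blocks with 3, 2 and 6 variables (the block sizes of the Russett example). The object names (C_full, C_resp, g_custom, and so on) and the numeric values are illustrative assumptions, not part of the package.

```r
# Hypothetical setting: J = 3 blocks, the third one acting as a response.
J <- 3
block_names <- c("agriculture", "industry", "politic")

# Fully connected design: every block is linked to every other block.
C_full <- 1 - diag(J)
dimnames(C_full) <- list(block_names, block_names)

# Response-oriented design: blocks 1 and 2 are linked only to block 3.
C_resp <- matrix(0, J, J, dimnames = list(block_names, block_names))
C_resp[3, 1:2] <- 1
C_resp[1:2, 3] <- 1

# A connection matrix is expected to be symmetric.
stopifnot(isSymmetric(C_full), isSymmetric(C_resp))

# The three named schemes written as plain R functions, plus a custom one.
g_horst     <- function(x) x       # "horst": identity
g_factorial <- function(x) x^2     # "factorial": squared values
g_centroid  <- function(x) abs(x)  # "centroid": absolute values
g_custom    <- function(x) x^4     # user-defined convex scheme

# tau as a max(ncomp) x J matrix: row k holds the values for component k.
# The values below are arbitrary illustrations.
ncomp <- c(2, 2, 2)
tau <- matrix(c(1.0, 1.0, 1.0,
                0.5, 0.5, 1.0),
              nrow = max(ncomp), ncol = J, byrow = TRUE)

# sparsity must lie between 1 / sqrt(p_j) and 1 for each block.
p <- c(agriculture = 3, industry = 2, politic = 6)
sparsity_min <- 1 / sqrt(p)
sparsity <- pmax(c(0.8, 0.8, 0.6), sparsity_min)
```

Objects built this way could then be passed as connection = C_resp, scheme = g_custom, tau = tau or sparsity = sparsity; whether a given combination is accepted depends on the chosen method.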

### Details

Consider J matrices X1, X2, ..., XJ that represent J sets of variables observed on the same set of n individuals. The matrices X1, X2, ..., XJ must have the same number of rows, but may (and usually will) have different numbers of columns. The aim of RGCCA is to study the relationships between these J blocks of variables. It constitutes a general framework for many multi-block data analysis methods (see Tenenhaus and Tenenhaus, 2011; Tenenhaus et al., 2017). It combines the power of multi-block data analysis methods (maximization of well-identified criteria) and the flexibility of PLS path modeling (the researcher decides which blocks are connected and which are not). Hence, the use of RGCCA requires the construction (user-specified) of a design matrix C that characterizes the connections between blocks. Elements of the (symmetric) design matrix C = (c_jk) are nonnegative (and usually equal to 1 if block j and block k are connected, and 0 otherwise).

The rgcca() function implements a monotonically and globally convergent algorithm, i.e. the bounded criterion to be maximized increases at each step of the iterative procedure and reaches, at convergence, a stationary point of the RGCCA optimization problem. Moreover, depending on the dimensionality of each block Xj, j = 1, ..., J, the primal (when n > p_j) algorithm or the dual (when n < p_j) algorithm is used (see Tenenhaus et al., 2015).

Finally, a deflation strategy is used to compute several RGCCA block components (specified by ncomp) for each block. Block components of each block are guaranteed to be orthogonal. The so-called symmetric deflation is implemented (i.e. each block is deflated with respect to its own component). Note that the number of components per block can differ from one block to another.

SGCCA extends RGCCA to address the issue of variable selection (Tenenhaus et al., 2014). Specifically, RGCCA is combined with an L1 penalty that gives rise to Sparse GCCA (SGCCA). The SGCCA algorithm is very similar to the RGCCA algorithm and keeps the same convergence properties (i.e. the bounded criterion to be maximized increases at each step of the iterative procedure and reaches, at convergence, a stationary point). Moreover, using a deflation strategy, sgcca() enables the computation of several SGCCA orthogonal block components (specified by ncomp) for each block.

The rgcca() function can handle missing values using a NIPALS-type algorithm (non-linear iterative partial least squares algorithm) described in Tenenhaus et al. (2005). Guidelines describing how to use RGCCA in practice are provided in Garali et al. (2018).
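To make the criterion and the deflation step more concrete, here is a minimal base R sketch on simulated data. It assumes the usual form of the RGCCA criterion from Tenenhaus and Tenenhaus (2011), namely the sum over connected pairs of c_jk g(cov(X_j a_j, X_k a_k)) under the constraint (1 - tau_j) var(X_j a_j) + tau_j ||a_j||^2 = 1. The helper names (rgcca_criterion, constraint_value, deflate) are hypothetical and this is not the package's internal code.

```r
set.seed(1)
n <- 20

# Hypothetical toy blocks observed on the same n individuals (columns centered).
blocks <- list(
  X1 = scale(matrix(rnorm(n * 3), n, 3), scale = FALSE),
  X2 = scale(matrix(rnorm(n * 2), n, 2), scale = FALSE),
  X3 = scale(matrix(rnorm(n * 4), n, 4), scale = FALSE)
)
C <- 1 - diag(3)       # fully connected design matrix
g <- function(x) x^2   # factorial scheme

# Arbitrary (not optimized) weight vectors with unit Euclidean norm.
a <- lapply(blocks, function(X) {
  w <- rep(1, ncol(X))
  w / sqrt(sum(w^2))
})

# Block components y_j = X_j a_j, stored as the columns of an n x J matrix.
Y <- sapply(seq_along(blocks), function(j) blocks[[j]] %*% a[[j]])

# Criterion: sum over (j, k) of c_jk * g(cov(y_j, y_k)), biased (1/n) covariance.
rgcca_criterion <- function(Y, C, g) {
  S <- crossprod(Y) / nrow(Y)   # valid because the columns of Y are centered
  sum(C * g(S))
}
crit <- rgcca_criterion(Y, C, g)

# Left-hand side of the normalization constraint for one block.
constraint_value <- function(X, w, tau) {
  y <- X %*% w
  (1 - tau) * mean(y^2) + tau * sum(w^2)   # biased variance, y is centered
}

# Deflation of a block with respect to its own component y.
deflate <- function(X, y) {
  y <- as.matrix(y)
  X - y %*% (crossprod(y, X) / sum(y^2))
}

# Any component built from the deflated block is orthogonal to the first one.
y1 <- Y[, 1]
X1_res <- deflate(blocks$X1, y1)
w2 <- c(1, -1, 0) / sqrt(2)
y2 <- X1_res %*% w2
orth <- sum(y1 * y2)   # zero up to rounding error
```

The weights here are fixed by hand; the algorithm in rgcca() instead updates them iteratively so that the criterion increases at each step.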

### Value

A fitted rgcca object, with the following elements:

| Element | Description |
|---|---|
| Y | List of J elements. Each element of the list Y is a matrix that contains the RGCCA block components for the corresponding block. |
| a | List of J elements. Each element of the list a is a matrix of block weight vectors for the corresponding block. |
| astar | List of J elements. Each column of astar[[j]] is a vector such that Y[[j]][, h] = blocks[[j]] %*% astar[[j]][, h]. |
| tau | Regularization parameters used during the analysis. |
| crit | List of vectors of length max(ncomp). Each vector of the list is related to one specific deflation stage and reports the values of the criterion for this stage across iterations. |
| primal_dual | A 1 x J vector that contains the formulation ("primal" or "dual") applied to each of the J blocks within the RGCCA algorithm. |
| AVE | List of numerical values giving the indicators of model quality based on the Average Variance Explained (AVE): AVE (for each block), AVE (outer model), AVE (inner model). |
| A | List that contains the J blocks of variables X1, X2, ..., XJ. Block Xj is a matrix of dimension n x p_j, where p_j is the number of variables in X_j. These blocks are imputed when an imputation strategy is selected. |
| call | Call of the function. |
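As an illustration of what the AVE indicators measure, the sketch below computes them in base R on simulated data. It assumes the definitions commonly given for RGCCA (block AVE as the mean squared correlation between a block's variables and its component, outer AVE as the p_j-weighted mean of block AVEs, inner AVE as the C-weighted mean of squared correlations between components). The values stored in the AVE element may be computed differently by the package, and all names here (ave_block, ave_outer, ave_inner) are hypothetical.

```r
set.seed(42)
n <- 30

# Hypothetical toy blocks and first block components (not output of rgcca()).
blocks <- list(
  X1 = matrix(rnorm(n * 3), n, 3),
  X2 = matrix(rnorm(n * 2), n, 2)
)
Y <- lapply(blocks, function(X) X %*% rep(1 / sqrt(ncol(X)), ncol(X)))
C <- 1 - diag(2)   # the two blocks are connected

# AVE of a block: mean squared correlation between its variables
# and its own component.
ave_block <- function(X, y) mean(cor(X, y)^2)
ave_X <- mapply(ave_block, blocks, Y)

# Outer AVE: block AVEs weighted by the number of variables p_j.
p <- sapply(blocks, ncol)
ave_outer <- sum(p * ave_X) / sum(p)

# Inner AVE: squared correlations between connected components,
# weighted by the design matrix (each pair counted once).
Ymat <- do.call(cbind, Y)
R2 <- cor(Ymat)^2
upper <- upper.tri(C)
ave_inner <- sum(C[upper] * R2[upper]) / sum(C[upper])
```

All three indicators lie between 0 and 1, with larger values indicating components that summarize their blocks (outer) or relate to connected blocks (inner) more strongly.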

### References

Garali I., Adanyeguh I. M., Ichou F., Perlbarg V., Seyer A., Colsch B., Moszer I., Guillemot V., Durr A., Mochel F. and Tenenhaus A. (2018). A strategy for multimodal data integration: application to biomarkers identification in spinocerebellar ataxia. Briefings in Bioinformatics, 19(6), 1356-1369.

Tenenhaus M., Tenenhaus A. and Groenen P. J. (2017). Regularized generalized canonical correlation analysis: a framework for sequential multiblock component methods. Psychometrika, 82(3), 737-777.

Tenenhaus A., Philippe C. and Frouin V. (2015). Kernel generalized canonical correlation analysis. Computational Statistics and Data Analysis, 90, 114-131.

Tenenhaus A., Philippe C., Guillemot V., Le Cao K. A., Grill J. and Frouin V. (2014). Variable selection for generalized canonical correlation analysis. Biostatistics, 15(3), 569-583.

Tenenhaus A. and Tenenhaus M. (2011). Regularized Generalized Canonical Correlation Analysis. Psychometrika, 76(2), 257-284.

Schafer J. and Strimmer K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4:32.

Gloaguen A., Guillemot V. and Tenenhaus A. (2017). An efficient algorithm to satisfy l1 and l2 constraints. 49emes Journees de Statistique, May 2017, Avignon, France. (hal-01630744)

### See Also

plot.rgcca, print.rgcca, rgcca_cv, rgcca_permutation, rgcca_predict

### Examples

####################
# Example 1: RGCCA #
####################
# Create the dataset
data(Russett)
blocks <- list(
  agriculture = Russett[, seq(3)],
  industry = Russett[, 4:5],
  politic = Russett[, 6:11]
)

# Blocks are fully connected; the factorial scheme and tau = 1 for all blocks
# are the defaults
fit.rgcca <- rgcca(
  blocks = blocks, method = "rgcca", connection = 1 - diag(3),
  scheme = "factorial", tau = rep(1, 3)
)
print(fit.rgcca)
plot(fit.rgcca, type = "weight", block = 3)
politic <- as.vector(apply(Russett[, 9:11], 1, which.max))
plot(fit.rgcca,
  type = "sample", block = 1:2,
  comp = rep(1, 2), resp = politic
)

############################################
# Example 2: RGCCA and multiple components #
############################################
fit.rgcca <- rgcca(blocks,
  method = "rgcca",
  connection = 1 - diag(3), superblock = FALSE,
  tau = rep(1, 3), ncomp = c(2, 2, 2),
  scheme = "factorial", verbose = TRUE
)

politic <- as.vector(apply(Russett[, 9:11], 1, which.max))
plot(fit.rgcca,
  type = "sample", block = 1:2,
  comp = rep(1, 2), resp = politic
)

plot(fit.rgcca, type = "ave")
plot(fit.rgcca, type = "weight", block = 1)
## Not run:
##################################
# Example 3: Sparse GCCA (SGCCA) #
##################################

# Tune the model to find the best sparsity coefficients (all the blocks are
# connected together)
perm.out <- rgcca_permutation(blocks,
  n_cores = 1,
  par_type = "sparsity", n_perms = 10
)
print(perm.out)
plot(perm.out)

fit.sgcca <- rgcca(blocks, sparsity = perm.out$bestpenalties)

plot(fit.sgcca, type = "ave")

# Select the most significant variables
b <- rgcca_bootstrap(fit.sgcca, n_cores = 1, n_boot = 100)
plot(b, n_cores = 1)

##############################
# Example 4: Supervised mode #
##############################
# Tune the model for explaining the politic block
# (politic connected to the two other blocks)
cv.out <- rgcca_cv(blocks, response = 3, ncomp = 2, n_cores = 1)
print(cv.out)
plot(cv.out)

fit.rgcca <- rgcca(blocks,
  response = 3, ncomp = 2,
  tau = cv.out$bestpenalties
)
plot(fit.rgcca, type = "both")

b <- rgcca_bootstrap(fit.rgcca, n_cores = 1, n_boot = 10)
plot(b, n_cores = 1)

## End(Not run)



Tenenhaus/RGCCA documentation built on March 16, 2023, 2:04 p.m.