# sgcca: Variable Selection For Generalized Canonical Correlation...

In RGCCA: Regularized and Sparse Generalized Canonical Correlation Analysis for Multiblock Data

## Description

SGCCA extends RGCCA to address the issue of variable selection. Specifically, RGCCA is combined with an L1-penalty, which gives rise to Sparse GCCA (SGCCA), implemented in the function sgcca().

Consider J matrices X_1, X_2, ..., X_J that represent J sets of variables observed on the same set of n individuals. The matrices X_1, X_2, ..., X_J must have the same number of rows, but may (and usually will) have different numbers of columns. Blocks are not necessarily fully connected within the SGCCA framework, so SGCCA requires a user-specified design matrix C that characterizes the connections between blocks. Elements of the symmetric design matrix C = (c_{jk}) are equal to 1 if block j and block k are connected, and 0 otherwise.

The SGCCA algorithm is very similar to the RGCCA algorithm and keeps the same monotone convergence properties: the bounded criterion to be maximized increases at each step of the iterative procedure and reaches a stationary point at convergence. Moreover, using a deflation strategy, sgcca() can compute several SGCCA block components (specified by ncomp) for each block. The components of a given block are guaranteed to be orthogonal under this deflation strategy. This implementation uses the so-called symmetric deflation, i.e. each block is deflated with respect to its own component. The number of components may differ from one block to another.
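As a minimal sketch of the design matrix described above, the base R snippet below builds C for a hypothetical three-block setting in which block 3 is connected to blocks 1 and 2, while blocks 1 and 2 are not connected to each other. The block count and connection pattern are assumptions chosen for illustration; only base R functions are used.

```r
# Hypothetical setting: J = 3 blocks; block 3 is connected to blocks 1 and 2,
# and blocks 1 and 2 are not connected to each other.
J <- 3
C <- matrix(0, nrow = J, ncol = J)
C[1, 3] <- C[3, 1] <- 1
C[2, 3] <- C[3, 2] <- 1

# A design matrix should be symmetric with entries equal to 0 or 1.
stopifnot(isSymmetric(C), all(C %in% c(0, 1)))

# The default in the usage signature, 1 - diag(J), connects every pair of
# distinct blocks (complete design).
C_complete <- 1 - diag(J)
```

Setting both C[j, k] and C[k, j] in one statement keeps the matrix symmetric by construction, which is easier to audit than a flat vector passed to matrix().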

## Usage

```r
sgcca(A, C = 1 - diag(length(A)), c1 = rep(1, length(A)),
      ncomp = rep(1, length(A)), scheme = "centroid", scale = TRUE,
      init = "svd", bias = TRUE, tol = .Machine$double.eps,
      verbose = FALSE)
```

## Arguments

- `A`: A list that contains the J blocks of variables X_1, X_2, ..., X_J.
- `C`: A design matrix that describes the relationships between blocks (default: complete design).
- `c1`: Either a 1*J vector or a max(ncomp) * J matrix encoding the L1 constraints applied to the outer weight vectors. Elements of c1 vary between $1/\sqrt{p_j}$ and 1 (larger values of c1 correspond to less penalization). If c1 is a vector, the L1-penalties are the same for all the weights corresponding to the same block but different components: for all h, $\|a_{j,h}\|_{1} \le c_1[j]\,\sqrt{p_j}$, with $p_j$ the number of variables of X_j. If c1 is a matrix, each row h defines the constraints applied to the weights corresponding to component h: for all h, $\|a_{j,h}\|_{1} \le c_1[h,j]\,\sqrt{p_j}$.
- `ncomp`: A 1*J vector that contains the number of components for each block (default: rep(1, length(A)), which means one component per block).
- `scheme`: Either "horst", "factorial" or "centroid" (default: "centroid").
- `scale`: If scale = TRUE, each block is standardized to zero mean and unit variance and then divided by the square root of its number of variables (default: TRUE).
- `init`: Mode of initialization used in the SGCCA algorithm, either by Singular Value Decomposition ("svd") or random ("random") (default: "svd").
- `bias`: A logical value selecting the biased or unbiased estimator of the variances/covariances (default: TRUE).
- `tol`: Stopping value for convergence (default: .Machine$double.eps).
- `verbose`: Reports progress while computing if verbose = TRUE (default: FALSE).

## Value

- `Y`: A list of J elements. Each element of Y is a matrix that contains the SGCCA components for each block.
- `a`: A list of J elements. Each element of a is a matrix that contains the outer weight vectors for each block.
- `astar`: A list of J elements. Each element of astar is a matrix defined as Y[[j]][, h] = A[[j]] %*% astar[[j]][, h].
- `C`: A design matrix that describes the relationships between blocks (user specified).
- `scheme`: The scheme chosen by the user (user specified).
- `c1`: A vector or matrix that contains the value of c1 applied to each block X_j, j = 1, ..., J, and each dimension (user specified).
- `ncomp`: A 1*J vector that contains the number of components for each block (user specified).
- `crit`: A vector that contains the values of the objective function at each iteration.
- `AVE`: Indicators of model quality based on the Average Variance Explained (AVE): AVE (for one block), AVE (outer model), AVE (inner model).

## References

Tenenhaus, A., Philippe, C., Guillemot, V., Le Cao, K. A., Grill, J., and Frouin, V., "Variable selection for generalized canonical correlation analysis," Biostatistics, vol. 15, no. 3, pp. 569-583, 2014.

## Examples

```r
#############
# Example 1 #
#############
## Not run: 
# Download the dataset's package at http://biodev.cea.fr/sgcca/.
# --> gliomaData_0.4.tar.gz

require(gliomaData)
data(ge_cgh_locIGR)

A <- ge_cgh_locIGR$multiblocks
Loc <- factor(ge_cgh_locIGR$y)
levels(Loc) <- colnames(ge_cgh_locIGR$multiblocks$y)
C <- matrix(c(0, 0, 1, 0, 0, 1, 1, 1, 0), 3, 3)
tau = c(1, 1, 0)

# rgcca algorithm using the dual formulation for X1 and X2
# and the dual formulation for X3
A[[3]] = A[[3]][, -3]
result.rgcca = rgcca(A, C, tau, ncomp = c(2, 2, 1),
                     scheme = "factorial", verbose = TRUE)

# sgcca algorithm
result.sgcca = sgcca(A, C, c1 = c(.071, .2, 1), ncomp = c(2, 2, 1),
                     scheme = "centroid", verbose = TRUE)

############################
# plot(y1, y2) for (RGCCA) #
############################
layout(t(1:2))
plot(result.rgcca$Y[[1]][, 1], result.rgcca$Y[[2]][, 1], col = "white",
     xlab = "Y1 (GE)", ylab = "Y2 (CGH)", main = "Factorial plan of RGCCA")
text(result.rgcca$Y[[1]][, 1], result.rgcca$Y[[2]][, 1], Loc,
     col = as.numeric(Loc), cex = .6)
plot(result.rgcca$Y[[1]][, 1], result.rgcca$Y[[1]][, 2], col = "white",
     xlab = "Y1 (GE)", ylab = "Y2 (GE)", main = "Factorial plan of RGCCA")
text(result.rgcca$Y[[1]][, 1], result.rgcca$Y[[1]][, 2], Loc,
     col = as.numeric(Loc), cex = .6)

############################
# plot(y1, y2) for (SGCCA) #
############################
layout(t(1:2))
plot(result.sgcca$Y[[1]][, 1], result.sgcca$Y[[2]][, 1], col = "white",
     xlab = "Y1 (GE)", ylab = "Y2 (CGH)", main = "Factorial plan of SGCCA")
text(result.sgcca$Y[[1]][, 1], result.sgcca$Y[[2]][, 1], Loc,
     col = as.numeric(Loc), cex = .6)
plot(result.sgcca$Y[[1]][, 1], result.sgcca$Y[[1]][, 2], col = "white",
     xlab = "Y1 (GE)", ylab = "Y2 (GE)", main = "Factorial plan of SGCCA")
text(result.sgcca$Y[[1]][, 1], result.sgcca$Y[[1]][, 2], Loc,
     col = as.numeric(Loc), cex = .6)

# sgcca algorithm with multiple components and different L1 penalties
# for each component (-> c1 is a matrix)
init = "random"
result.sgcca = sgcca(A, C,
                     c1 = matrix(c(.071, .2, 1, 0.06, 0.15, 1),
                                 nrow = 2, byrow = TRUE),
                     ncomp = c(2, 2, 1), scheme = "factorial",
                     scale = TRUE, bias = TRUE, init = init, verbose = TRUE)

# number of non zero elements per dimension
apply(result.sgcca$a[[1]], 2, function(x) sum(x != 0))
# (-> 145 non zero elements for a11 and 107 non zero elements for a12)
apply(result.sgcca$a[[2]], 2, function(x) sum(x != 0))
# (-> 85 non zero elements for a21 and 52 non zero elements for a22)

init = "svd"
result.sgcca = sgcca(A, C,
                     c1 = matrix(c(.071, .2, 1, 0.06, 0.15, 1),
                                 nrow = 2, byrow = TRUE),
                     ncomp = c(2, 2, 1), scheme = "factorial",
                     scale = TRUE, bias = TRUE, init = init, verbose = TRUE)

## End(Not run)
```
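The admissible range for c1 given in the argument description (each element between 1/sqrt(p_j) and 1) can be checked before calling sgcca(). The sketch below is one possible way to do so in base R; the block sizes in `p` and the helper `check_c1()` are hypothetical and are not part of the RGCCA package.

```r
# Hypothetical block sizes p_j (number of columns of each block X_j).
p <- c(200, 25, 4)

# Lower bound on c1 for each block: 1 / sqrt(p_j). The upper bound is 1.
c1_min <- 1 / sqrt(p)

# Hypothetical helper (not an RGCCA function): returns TRUE when every
# element of c1 lies in [1 / sqrt(p_j), 1]. c1 may be a vector (one value
# per block) or a matrix (one row per component, one column per block).
check_c1 <- function(c1, p) {
  c1 <- if (is.matrix(c1)) c1 else matrix(c1, nrow = 1)
  stopifnot(ncol(c1) == length(p))
  lower <- matrix(1 / sqrt(p), nrow = nrow(c1), ncol = length(p),
                  byrow = TRUE)
  all(c1 >= lower & c1 <= 1)
}

ok_vector <- check_c1(c(0.1, 0.3, 1), p)
ok_matrix <- check_c1(matrix(c(0.1,  0.3,  1,
                               0.08, 0.25, 1), nrow = 2, byrow = TRUE), p)
too_small <- check_c1(c(0.05, 0.3, 1), p)  # 0.05 is below 1 / sqrt(200)
```

With real data, `p` would typically be obtained as `sapply(A, ncol)` on the list of blocks passed to sgcca().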

RGCCA documentation built on May 2, 2019, 3:39 p.m.