CCA: Perform sparse canonical correlation analysis using the...


CCA R Documentation

Perform sparse canonical correlation analysis using the penalized matrix decomposition.

Description

This function comes from a modified version of the PMA package. Given matrices X and Z, which represent two sets of features on the same set of samples, find sparse u and v such that u'X'Zv is large. For X and Z, the samples are on the rows and the features are on the columns. X and Z must have the same number of rows, but may (and usually will) have different numbers of columns. The columns of X and/or Z can be unordered or ordered. If unordered, then a lasso penalty will be used to obtain the corresponding canonical vector. If ordered, then a fused lasso penalty will be used; this will result in smoothness. In addition, this function reports the component-wise estimated standard deviation of each component of u and v and standardizes the components accordingly. It also computes nonparametric p-values for the components of u and v based on permutations.

Usage

CCA(
  x,
  z,
  typex = c("standard", "ordered"),
  typez = c("standard", "ordered"),
  penaltyx = NULL,
  penaltyz = NULL,
  K = 1,
  niter = 15,
  v = NULL,
  trace = TRUE,
  standardize = TRUE,
  xnames = NULL,
  znames = NULL,
  chromx = NULL,
  chromz = NULL,
  upos = FALSE,
  uneg = FALSE,
  vpos = FALSE,
  vneg = FALSE,
  outcome = NULL,
  y = NULL,
  cens = NULL,
  UVperms = NA,
  allpenaltyxs = NA
)

Arguments

x

Data matrix; samples are rows and columns are features. Cannot contain missing values.

z

Data matrix; samples are rows and columns are features. Cannot contain missing values.

typex

Are the columns of x unordered (typex="standard") or ordered (typex="ordered")? If "standard", then a lasso penalty is applied to u, to enforce sparsity. If "ordered" (generally used for CGH data), then a fused lasso penalty is applied, to enforce both sparsity and smoothness.

typez

Are the columns of z unordered (typez="standard") or ordered (typez="ordered")? If "standard", then a lasso penalty is applied to v, to enforce sparsity. If "ordered" (generally used for CGH data), then a fused lasso penalty is applied, to enforce both sparsity and smoothness.

penaltyx

The penalty to be applied to the matrix x, i.e. the penalty that results in the canonical vector u. If typex is "standard" then the L1 bound on u is penaltyx*sqrt(ncol(x)). In this case penaltyx must be between 0 and 1 (larger L1 bound corresponds to less penalization). If "ordered" then it's the fused lasso penalty lambda, which must be non-negative (larger lambda corresponds to more penalization).

penaltyz

The penalty to be applied to the matrix z, i.e. the penalty that results in the canonical vector v. If typez is "standard" then the L1 bound on v is penaltyz*sqrt(ncol(z)). In this case penaltyz must be between 0 and 1 (larger L1 bound corresponds to less penalization). If "ordered" then it's the fused lasso penalty lambda, which must be non-negative (larger lambda corresponds to more penalization).

K

The number of u's and v's desired; that is, the number of canonical vectors to be obtained.

niter

How many iterations should be performed? Default is 15.

v

The first K columns of the v matrix of the SVD of X'Z. If NULL, then the SVD of X'Z will be computed inside the CCA function. However, if you plan to run this function multiple times, then save a copy of this argument so that it does not need to be re-computed (since that process can be time-consuming if X and Z both have high dimension).

trace

Print out progress?

standardize

Should the columns of x and z be centered (to have mean zero) and scaled (to have standard deviation 1)? Default is TRUE.

xnames

An optional vector of column names for x.

znames

An optional vector of column names for z.

chromx

Used only if typex is "ordered"; allows user to specify a vector of length ncol(x) giving the chromosomal location of each CGH spot. This is so that smoothness will be enforced within each chromosome, but not between chromosomes.

chromz

Used only if typez is "ordered"; allows user to specify a vector of length ncol(z) giving the chromosomal location of each CGH spot. This is so that smoothness will be enforced within each chromosome, but not between chromosomes.

upos

If TRUE, then require elements of u to be positive. FALSE by default. Can only be used if typex is "standard".

uneg

If TRUE, then require elements of u to be negative. FALSE by default. Can only be used if typex is "standard".

vpos

If TRUE, then require elements of v to be positive. FALSE by default. Can only be used if typez is "standard".

vneg

If TRUE, then require elements of v to be negative. FALSE by default. Can only be used if typez is "standard".

outcome

If you would like to incorporate a phenotype into CCA analysis - that is, you wish to find features that are correlated across the two data sets and also correlated with a phenotype - then use one of "survival", "multiclass", or "quantitative" to indicate outcome type. Default is NULL.

y

If outcome is not NULL, then this is a vector of phenotypes - one for each row of x and z. If outcome is "survival" then these are survival times; must be non-negative. If outcome is "multiclass" then these are class labels (1,2,3,...). Default NULL.

cens

If outcome is "survival" then these are censoring statuses for each observation. 1 is complete, 0 is censored. Default NULL.

UVperms

A list of the U and V estimates generated by the CCA.permute function. It must include the estimated U and V for every regularization parameter and every permutation. See the CCA.permute documentation for the format of this list, and the example below.

allpenaltyxs

A vector of all of the x regularization parameters generated by the CCA.permute function. Check the example below for more information.
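
The standardize and v arguments can be illustrated with base R alone. The following is a minimal sketch, assuming that standardization amounts to column-wise centering and scaling and that the initial v is taken from an ordinary SVD of X'Z; the names xs, zs and v_init are made up for this illustration, and the package may compute these quantities differently internally.

```r
# Sketch only: base-R stand-ins for the preprocessing described above.
set.seed(1)
x <- matrix(rnorm(20 * 6), 20, 6)   # 20 samples, 6 features
z <- matrix(rnorm(20 * 4), 20, 4)   # the same 20 samples, 4 features
K <- 2                              # number of canonical vectors wanted

# standardize = TRUE: center each column to mean 0 and scale to SD 1
xs <- scale(x, center = TRUE, scale = TRUE)
zs <- scale(z, center = TRUE, scale = TRUE)

# v: the first K right singular vectors of X'Z (crossprod(xs, zs) is t(xs) %*% zs)
v_init <- svd(crossprod(xs, zs))$v[, 1:K, drop = FALSE]
dim(v_init)   # ncol(z) rows, K columns
```

A matrix like v_init could then be passed as v on repeated calls, so that the SVD is not recomputed each time.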

Details

This function is useful for performing an integrative analysis of two sets of measurements taken on the same set of samples: for instance, gene expression and CGH measurements on the same set of patients. It takes in two data sets, called x and z, each of which has (the same set of) samples on the rows. If z is a matrix of CGH data with ordered CGH spots on the columns, then use typez="ordered". If z consists of unordered columns, then use typez="standard". Similarly for typex.

This function performs the penalized matrix decomposition on the data matrix $X'Z$. Therefore, the results should be the same as running the PMD function on t(x) %*% z. However, using the CCA function is much faster because it avoids computation of $X'Z$.
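
To see why forming $X'Z$ can be avoided, note that $u'X'Zv = (Xu)'(Zv)$. The sketch below uses only base R and arbitrary random vectors u and v (not output of CCA) to check that both routes give the same value. It illustrates the algebra only and makes no claim about how the package organizes its computations.

```r
# Sketch only: two ways to evaluate u'X'Zv for fixed u and v.
set.seed(2)
n <- 30; p <- 200; q <- 150
x <- matrix(rnorm(n * p), n, p)
z <- matrix(rnorm(n * q), n, q)
u <- rnorm(p); u <- u / sqrt(sum(u^2))   # arbitrary unit vector
v <- rnorm(q); v <- v / sqrt(sum(v^2))   # arbitrary unit vector

# Direct route: builds the p x q matrix X'Z first
d_direct <- drop(t(u) %*% (t(x) %*% z) %*% v)

# Cheaper route: only the length-n vectors Xu and Zv are formed
d_fast <- sum((x %*% u) * (z %*% v))

all.equal(d_direct, d_fast)
```

The saving grows with ncol(x) and ncol(z), since the direct route stores and multiplies a matrix with ncol(x) * ncol(z) entries.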

The CCA criterion is as follows: find unit vectors $u$ and $v$ such that $u'X'Zv$ is maximized subject to constraints on $u$ and $v$. If typex="standard" and typez="standard" then the constraints on $u$ and $v$ are lasso ($L_1$). If typex="ordered" then the constraint on $u$ is a fused lasso penalty (promoting sparsity and smoothness). Similarly if typez="ordered".

When typex is "standard": the L1 bound of u is penaltyx*sqrt(ncol(x)).
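
As a small worked example of this bound (base R only, with illustrative values for p and penaltyx): a vector with unit Euclidean length has an L1 norm between 1 (a single nonzero entry) and sqrt(p) (all entries equal in magnitude), so a value of penaltyx near 1 leaves u essentially unconstrained, while smaller values force sparsity.

```r
# Sketch only: the L1 bound implied by penaltyx when typex = "standard".
p <- 100                          # stands in for ncol(x)
penaltyx <- 0.3
l1_bound <- penaltyx * sqrt(p)    # bound on sum(abs(u))

# Two unit-length vectors at the extremes of the L1 norm
dense  <- rep(1 / sqrt(p), p)     # all entries equal
sparse <- c(1, rep(0, p - 1))     # a single nonzero entry

sum(abs(dense))    # equals sqrt(p), the largest possible L1 norm
sum(abs(sparse))   # equals 1, the smallest possible L1 norm
```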

When typex is "ordered": penaltyx controls the amount of sparsity and smoothness in u, via the fused lasso penalty: $\lambda \sum_j |u_j| + \lambda \sum_j |u_j - u_{j-1}|$. If NULL, then it will be chosen adaptively from the data.
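
The fused lasso penalty above is easy to evaluate directly. The helper fused_penalty below is a hypothetical name written for this illustration and is not part of the package. It shows that, for the same number of nonzero entries, a vector whose nonzero entries are contiguous is penalized less than one that alternates.

```r
# Sketch only: evaluate lambda * sum|u_j| + lambda * sum|u_j - u_(j-1)|.
fused_penalty <- function(u, lambda) {
  lambda * sum(abs(u)) + lambda * sum(abs(diff(u)))
}

u_smooth <- c(0, 0, 1, 1, 1, 0, 0)   # three contiguous nonzero entries
u_rough  <- c(0, 1, 0, 1, 0, 1, 0)   # three scattered nonzero entries

fused_penalty(u_smooth, lambda = 0.5)   # 0.5 * 3 + 0.5 * 2 = 2.5
fused_penalty(u_rough,  lambda = 0.5)   # 0.5 * 3 + 0.5 * 6 = 4.5
```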

Value

u

The estimated canonical vectors for x. If you asked for multiple factors then each column of u is a factor. u has dimension ncol(x) x K if you asked for K factors.

v

The estimated canonical vectors for z. If you asked for multiple factors then each column of v is a factor. v has dimension ncol(z) x K if you asked for K factors.

d

A vector of length K, which can alternatively be computed as the diagonal of the matrix $u'X'Zv$.

v.init

The first K factors of the v matrix of the SVD of x'z. This is saved in case this function will be re-run later.

SDu

Standard deviations of the components of U across the permutations, for the given pair of regularization parameters penaltyx and penaltyz.

SDv

Similar to SDu.

standardu

Standardized U, computed from the estimated U and the component-wise standard deviation of U across permutations, SDu. The following message is printed if Inf values are created: The “Inf” in standardized U or V i.e. “standardu” or “standardv” indicates that the estimated U or V for that component is nonzero and that its estimated standard deviation through all permutations is zero. Therefore, that component is the most significant among all. If a component of U or V is estimated to be zero, the associated “standardu” or “standardv” component is zero.

standardv

Similar to standardu.

pvalsu

Nonparametric p-values for the components of u, testing the null hypothesis that each component is zero. If the estimated component is positive, the alternative is that the true component is positive. Under the null hypothesis the component is zero, so in this scenario the nonparametric p-value is the proportion of permutations that resulted in a value greater than that of the estimated component of u. If you need the p-value for the opposite direction, simply subtract this value from 1.

pvalsv

Similar to pvalsu.
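
The permutation-based outputs described above (SDu, standardu, pvalsu) can be mimicked in base R. The sketch below follows only the verbal description in this section; the values in u_hat and the matrix u_perm are invented for illustration, and the exact formulas and data layout used inside the package have not been checked here.

```r
# Sketch only: component-wise SD, standardized values, and one-sided
# permutation p-values, following the descriptions of SDu, standardu, pvalsu.
set.seed(3)
u_hat  <- c(0, 0.8, 0.6, 0)   # hypothetical estimated u (4 components)
nperms <- 200

# Hypothetical matrix of permutation estimates: one row per component,
# one column per permutation
u_perm <- matrix(abs(rnorm(4 * nperms, sd = 0.1)), 4, nperms)
u_perm[1:2, ] <- 0            # components 1 and 2 are zero in every permutation

sd_u <- apply(u_perm, 1, sd)  # SD of each component across permutations

# Zero estimate gives 0; nonzero estimate with zero SD gives Inf
standard_u <- ifelse(u_hat == 0, 0, u_hat / sd_u)

# Proportion of permutations exceeding the estimated component
pvals_u <- rowMeans(u_perm > u_hat)

cbind(u_hat, sd_u, standard_u, pvals_u)
```

Here component 2 has a nonzero estimate and zero permutation SD, so its standardized value is Inf, matching the message quoted under standardu.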

References

Ali Mahzarnia, Alexander Badea (2022). Joint Estimation of Vulnerable Brain Networks and Alzheimer’s Disease Risk Via Novel Extension of Sparse Canonical Correlation. bioRxiv.

See Also

PMD, CCA.permute

Examples


set.seed(3128) # for reproducibility
n=100 # sample size
q=20 # base size
S=100*matrix(rnorm(100),n,q) # base matrix (the 100 draws are recycled to fill n x q)
u=c(rep(0,5),rep(1,3),rep(0,2) ) # true u
v=c(rep(1,5),rep(0,5),rep(1,5) ) # true v
p1=length(u) # length of true u
p2=length(v) # length of true v
U=matrix(rep(u, q),p1,q) # coefficients of base matrix for constructing x
V=matrix(rep(v, q),p2,q) # coefficients of base matrix for constructing z
x=S%*%t(U) # constructing x
x=x+matrix(rnorm(dim(x)[1]*dim(x)[2]),dim(x)[1],dim(x)[2]) # adding noise
z=S%*%t(V) # constructing z
z=z+matrix(rnorm(dim(z)[1]*dim(z)[2]),dim(z)[1],dim(z)[2]) # adding noise
library(PMA2)
# for better estimates, try more permutations,
# such as nperms=1000
perm.out <- CCA.permute(x,z,typex="standard",typez="standard",
                      nperms=10, SD=TRUE, upos = TRUE, vpos = TRUE)
# with SD=TRUE we also estimate the SD of the U and V components.
# upos and vpos restrict the estimates to positive values only,
# but this is not necessary in general
print(perm.out)
out <- CCA(x,z,typex="standard",typez="standard",K=1,
           penaltyx=perm.out$bestpenaltyx,penaltyz=perm.out$bestpenaltyz,
           v=perm.out$v.init, UVperms = perm.out$UVperms, 
           allpenaltyxs = perm.out$penaltyxs , upos = TRUE, vpos = TRUE)
print(out)
# results for u
# columns: true U, estimated U, standard deviations, Z-scores, nonparametric p-values
utable=base::cbind(u,out$u, out$SDu, out$standardu, out$pvalsu) 
colnames(utable)=c("True U", "Estimated U", "SDs", "Zscores", "nonpar-Pvals")
utable
# results for v
# columns: true V, estimated V, standard deviations, Z-scores, nonparametric p-values
vtable=base::cbind(v,out$v, out$SDv, out$standardv, out$pvalsv) 
colnames(vtable)=c("True V", "Estimated V", "SDs", "Zscores", "nonpar-Pvals")
vtable
                                    

PMA2 documentation built on May 12, 2022, 9:06 a.m.
