AclustsCCA is a two-stage framework for testing the association between multiple exposures and multiple outcomes.
The framework consists of two parts: A-clustering on DNA methylation data, followed by SparseCCA on each cluster identified by A-clustering.
You can install the released version of AclustsCCA from CRAN with:
```r
install.packages("AclustsCCA")
```
And the development version from GitHub with:
```r
# install.packages("devtools")
devtools::install_github("jennyjyounglee/AclustsCCA")
devtools::install_github("tamartsi/Aclust")
library(AclustsCCA)
library(Aclust)
```
The input parameters are listed below:

| Parameter | Description |
|---|---|
| `X` | $n$ by $p$ exposure data matrix, where $n$ is the sample size and $p$ is the number of exposures. |
| `Y` | $n$ by $q$ outcome data matrix, where $n$ is the sample size and $q$ is the number of outcomes. |
| `Z` | $n$ by $r$ confounder data matrix, where $n$ is the sample size and $r$ is the number of confounders. If `NULL`, partial residuals are used for SparseCCA analysis. |
| `clusters.list` | A list of clusters of CpG sites obtained by A-clustering; each item is a cluster that contains a set of probes. It can be provided by the user; if `NULL`, A-clustering is run. |
| `annot` | A preloaded annotation `data.table` that includes the columns "IlmnID", "CHR", "Coordinate_37", "Islands_Name", "Relation_to_Island", "UCSC_RefGene_Name". Only needed if `clusters.list` is `NULL`. |
| `dist.type` | The type of similarity distance function. Options are "spearman" (default) and "pearson" (correlation measures), or "euclid". |
| `Aclust.method` | The type of clustering function. Options are "single", "complete", or "average" (default). |
| `dist.thresh` | A similarity distance threshold. Two neighboring clusters are merged into a single cluster if the similarity distance between them is above `dist.thresh`. Corresponds to $\bar{D}$ in the paper; the default is $0.2$. |
| `bp.thresh.clust` | Optional maximum distance between neighboring variables that still permits clustering them together. Corresponds to $\bar{d}_{bp}$ in the paper; the default is $1000$. |
| `bp.merge` | A distance in chromosomal location. Any set of methylation sites within an interval smaller than or equal to `bp.merge` will potentially be merged, depending on the similarity between the sites at the ends of the interval. Corresponds to $\underline{d}_{bp}$ in the paper; the default is $999$. |
| `Xmethod` | The penalty function for the exposure, i.e. the penalty function used when regressing Y onto X. Options are "lasso", "alasso", "gglasso", and "SGL" (default). |
| `Ymethod` | The penalty function for the outcome, i.e. the penalty function used when regressing X onto Y. Options are "lasso", "alasso", "gglasso", "SGL", and "OLS" (default). |
| `init.method` | Initialization method. Options are "lasso", "OLS", and "SVD" (default). |
| `X.groupidx` | A vector of length $p$ that indicates the grouping structure of the exposure $X$. |
| `standardize` | A logical flag for standardizing the exposure $X$ and the outcome $Y$ prior to fitting the model. |
| `max.iter` | The maximum number of iterations of SparseCCA. The default is $100$. |
| `conv` | The convergence tolerance $\epsilon$ of SparseCCA. The default is $0.01$. |
| `maxnum` | The maximum total number of permutations across all clusters. |
| `maxB` | The maximum number of permutations for a single cluster. |
| `permute.tmp.filepath` | File path for saving intermediate permutation results. |
| `permute` | A logical flag for whether to run the permutation test. |
| `nthread` | The number of threads used to parallelize the permutation test and the SparseCCA runs across all clusters. |
| `FDR.thresh` | FDR threshold. The default is $0.05$. |
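As a rough illustration of how an FDR threshold such as `FDR.thresh` is typically applied, the sketch below adjusts a vector of per-cluster p-values with the Benjamini-Hochberg procedure from base R and flags those at or below the threshold. This is an assumption about the general idea, not the package's internal code; the p-values are made up for the example.

```r
# Hypothetical per-cluster p-values (made up for illustration)
p.values <- c(0.001, 0.008, 0.039, 0.041, 0.20, 0.74)
FDR.thresh <- 0.05

# Benjamini-Hochberg adjustment from base R (stats::p.adjust)
p.adjusted <- p.adjust(p.values, method = "BH")

# Clusters declared significant at the chosen FDR level
significant <- p.adjusted <= FDR.thresh
data.frame(p.values, p.adjusted, significant)
```

With these made-up values only the first two clusters remain significant after adjustment, even though four raw p-values are below 0.05.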
The framework of AclustsCCA consists of two parts:

1. Implement A-clustering on DNA methylation data.
2. Implement SparseCCA on each cluster identified by A-clustering.
   - When implementing SparseCCA, partial residuals are used to adjust for potential confounders.
   - For statistical inference, a permutation test is performed.
The entire process can be run with the code below, but I do not recommend this because of its computation time and memory use. Please read the next section for the suggested way of running `AclustsCCA`.
```r
library(data.table)  # provides data.table(), used for the summary table below

# Load annotation file
data(annot)  # row: CpG sites

# Load sample data
data(sample.data)
DATA.X <- sample.data$DATA.X  # row: subjects (n), column: exposures (p)
DATA.Y <- sample.data$DATA.Y  # row: subjects (n), column: CpG sites (q)
DATA.Z <- sample.data$DATA.Z  # row: subjects (n), column: confounders (r)

# Settings for Aclust
dist.type <- "spearman"
Aclust.method <- "average"
dist.thresh <- 0.2
bp.thresh.clust <- 1000
bp.merge <- 999

# Settings for SparseCCA
Xmethod <- "SGL"
Ymethod <- "OLS"
X.groupidx <- c(rep(1,5), rep(2,5), rep(3,5), rep(4,5))
maxB <- 300
nthread <- 2

AclustsCCA.result <- AclustsCCA(X = DATA.X,
                                Y = DATA.Y,
                                Z = DATA.Z,
                                clusters.list = NULL,
                                annot = annot,
                                # parameters for A-clustering
                                dist.type = dist.type,
                                Aclust.method = Aclust.method,
                                dist.thresh = dist.thresh,
                                bp.thresh.clust = bp.thresh.clust,
                                bp.merge = bp.merge,
                                # parameters for SparseCCA
                                Xmethod = Xmethod,
                                Ymethod = Ymethod,
                                X.groupidx = X.groupidx,
                                # parameters for permutation test for AclustsCCA
                                h = hBH,
                                permute = TRUE,
                                maxB = maxB,
                                nthread = nthread,
                                test.stat = "cancors")

TABLE1 <- data.table(summary_AclustsCCA(obj = AclustsCCA.result, annot = annot))

# Are the true clusters selected as significant?
sample.data$TRUE.table$TRUE.Clusters
sort(TABLE1[Significant == "Yes", ClustIdx])
```
If you want to run more permutations, increase either `maxnum` or `maxB` and use the function `AclustsCCA.cont`.
```r
maxB <- maxB * 2
AclustsCCA.result.updated <- AclustsCCA.cont(obj = AclustsCCA.result,
                                             X = AclustsCCA.result$X.resid,
                                             Y = AclustsCCA.result$Y.resid,
                                             maxB = maxB)
summary_AclustsCCA(obj = AclustsCCA.result.updated, annot = annot, n.top = 9)
```
Both stages (A-clustering, then SparseCCA with partial residuals and a permutation test) can be run at once using the `AclustsCCA` function. Of these steps, the permutation test takes by far the most computation time and needs attention when running.
I therefore suggest running each part separately to save computation time and memory.
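To see why the permutation step dominates the run time, here is a minimal, generic sketch of a permutation test on simulated data. It uses the first canonical correlation from base R's `stats::cancor` as the test statistic, which is an assumption made for illustration: the package runs SparseCCA on each cluster instead, so its statistic and permutation scheme may differ. All object names are hypothetical.

```r
set.seed(42)
n <- 100
X <- matrix(rnorm(n * 4), n, 4)   # simulated exposures
Y <- matrix(rnorm(n * 3), n, 3)   # simulated outcomes
Y[, 1] <- Y[, 1] + 0.5 * X[, 1]   # plant one true association

# Test statistic: first canonical correlation (ordinary CCA, not SparseCCA)
first.cancor <- function(X, Y) cancor(X, Y)$cor[1]

observed <- first.cancor(X, Y)

# Each permutation shuffles the rows of X and refits the model,
# so the cost grows linearly with the number of permutations B
B <- 200
permuted <- replicate(B, first.cancor(X[sample(n), , drop = FALSE], Y))

# Permutation p-value with the usual +1 correction
p.value <- (1 + sum(permuted >= observed)) / (B + 1)
p.value
```

Every permutation refits the model once, and this is repeated for every cluster, which is why `maxB`, `maxnum`, and `nthread` matter so much in practice.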
A-clustering is used to identify clusters, and this part has to be run only once. I suggest running it separately and saving the result for future use.
```r
# Implement A-clustering
all.clusters.list <- Aclust::assign.to.clusters(betas = t(DATA.Y),
                                                annot = annot,
                                                dist.type = dist.type,
                                                method = Aclust.method,
                                                dist.thresh = dist.thresh,
                                                bp.thresh.clust = bp.thresh.clust,
                                                bp.merge = bp.merge)

# Summarize the result
summary_Aclustering(all.clusters.list, annot)

# We only need clusters with at least two probes
clusters.list <- all.clusters.list[sapply(all.clusters.list, length) != 1]
```
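One simple way to keep a cluster list for future sessions is base R's `saveRDS()` and `readRDS()`. The sketch below uses a small stand-in list and a temporary file; the file name and probe IDs are placeholders, not output from the package.

```r
# Stand-in for a cluster list (placeholder probe IDs, not real results)
example.clusters <- list(c("cg0001", "cg0002"),
                         c("cg0003", "cg0004", "cg0005"))

# Hypothetical file location; replace with a path of your choice
clusters.file <- file.path(tempdir(), "clusters_list.rds")

# Save once ...
saveRDS(example.clusters, clusters.file)

# ... and reload in a later session instead of re-running A-clustering
restored.clusters <- readRDS(clusters.file)
identical(restored.clusters, example.clusters)
```

The same pattern applies to any other intermediate object that only needs to be computed once.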
When implementing SparseCCA, partial residuals are used to adjust for potential confounders. Again, the partial residuals have to be computed only once. I suggest running this part separately and saving the result for future use.
```r
X.resid <- partial.residual(data = DATA.X, Z = DATA.Z, nthread = 1)
Y.resid <- partial.residual(data = DATA.Y, Z = DATA.Z, nthread = 1)
```
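For intuition about what adjusting for confounders through residuals means, the sketch below regresses each column of a simulated exposure matrix on simulated confounders with ordinary least squares and keeps the residuals. This is an assumption about the general idea only; it is not the implementation of `partial.residual`, and all object names here are hypothetical.

```r
set.seed(1)
n <- 50
Z.sim <- cbind(age = rnorm(n), sex = rbinom(n, 1, 0.5))  # simulated confounders
X.sim <- matrix(rnorm(n * 3), n, 3) + Z.sim[, "age"]     # exposures that depend on age

# lm() accepts a matrix response and fits one regression per column
fit <- lm(X.sim ~ Z.sim)

# n by 3 matrix of residuals: the part of X.sim not explained by Z.sim
X.resid.sketch <- residuals(fit)

# By construction the residuals are (numerically) orthogonal to the confounders
max(abs(crossprod(Z.sim, X.resid.sketch)))
```

Because the residuals depend only on the data and the confounders, they can be computed once and reused across different SparseCCA settings.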
Because the list of clusters is provided and the partial residuals are already computed, `AclustsCCA` will only run SparseCCA on each cluster identified by A-clustering.
```r
AclustsCCA.result <- AclustsCCA(X = X.resid,
                                Y = Y.resid,
                                clusters.list = clusters.list,
                                # parameters for SparseCCA
                                Xmethod = Xmethod,
                                Ymethod = Ymethod,
                                X.groupidx = X.groupidx,
                                # parameters for permutation test for AclustsCCA
                                maxB = maxB,
                                permute = TRUE,
                                nthread = nthread,
                                test.stat = "cancors")
```
If you want to run more permutations, increase either `maxnum` or `maxB` and use the function `AclustsCCA.cont`.
```r
AclustsCCA.result.updated <- AclustsCCA.cont(obj = AclustsCCA.result,
                                             X = X.resid,
                                             Y = Y.resid,
                                             maxB = maxB * 2)
summary_AclustsCCA(obj = AclustsCCA.result.updated, annot = annot)
```