knitr::opts_chunk$set( collapse = TRUE, comment = "#>", out.width = '100%' )
The package TestPMD provides a set of post selection inference and visualization tools to sparse canonical correlation analysis using Penalized Matrix Decomposition (PMD).
By assuming the variance matrix of two datasets are diagonal matrices, PMD provides a version of sparse canonical correlation analysis using low rank approximation of cross-covariance matrix.
TestPMD provides the following features:
Conduct the hypothesis testing on the canonical correlations (singular values of cross-covariance matrix) between two data matrices: PPTvalue()
Conduct the hypothesis testing on the canonical loadings (singular vectors of cross-covariance matrix) between two data matrices: CPTloading()
Determine the how many canonical components (best low rank of cross-covariance matrix) should be used for downstream analysis: PPTrank()
Visualize whether there is a separation between samples from different groups: plotScatter()
Visualize the canonical loadings using barplot: plotBar()
If the groups information of features are available, conduct the fisher exact test to do determine the enriched groups on the significant features: groupFisher()
Conduct literature search on specific features using easyPubMed to learn more about the significant features and interested covariate: getInfo()
You can run the Shiny App:
TestPMD::runExample()
library(PMA) library(TestPMD) library(ggplot2) library(ggpubr) library(cowplot) library(pander)
This is a dataset including metabolome and proteome data of 102 patient diagnosed with COVID-19 in Albany Medical Center, NY from 6 April 2020 to 1 May 2020. The samples are matched already. There are three data frames in covid data list, meta (for samples meta information) and metolites (raw metabolome count table) and protein (proteome raw count table) with samples in rows, features in columns.
For details, refer Overmyer K A, Shishkova E, Miller I J, et al. Large-scale multi-omic analysis of COVID-19 severity J. Cell systems, 2021, 12(1): 23-40. e7.
attach(covid) covid$meta[1:6,1:5] covid$metabolite[1:6, c(21,22,40)] covid$protein[1:6, 1:3]
Before performing the sparse canonical correlation analysis using PMA package, we need to take log to stabilize the data and standardize the data. TestPMD::standsdmu() can help with the standardization.
X <- standsdmu(log(covid$metabolite)) Y <- standsdmu(log(covid$protein + 1))
We conduct the sparse canonical correlation analysis using PMA package. The PMA::CCA function conduct the sparse canonical correlation analysis with the objective being:
$$min_{u,v}u^T\Sigma_{xy}v$$ $$s.t.||u||_1\leq c_1, ||v||_2\leq c_2, ||u||_2\leq 1, ||v||_2\leq 1 $$
This is a simplified version of sparse canonical correlation analysis by assuming the $\Sigma_x$ and $\Sigma_y$ are identity matrices.
The tests in TestPMD allow us to arbitrarily choose penalty parameter, we set it to be 0.9 in this example, which mean $c_1 =0.9sqrt(ncol(X))$, $c_2 =0.9sqrt(ncol(Y))$.
CCA_out <- PMA::CCA(x = X, z = Y, K = 3, typex = "standard", typez = "standard", penaltyx = 0.9, penaltyz = 0.9, standardize = F, trace = F)
Check the scatter plot the first pair of canonical variates to see if there is a separation between samples admitted to ICU or not. This is a common step in canonical correlation analysis, to check if the majoy variation is related to some group factor.
It is clear that in the first canonical component, the patients admitted to ICU (green dots) are separated from the ones who are not (red dots). The canonical correlation is large, around 0.83.
scatter_p <- plotScatter(X = X, Y = Y, CCA_out = CCA_out, groups = covid$meta$icu, K = 3) ggarrange(plotlist = scatter_p, nrow=1, common.legend = T, legend = "bottom")
It is natural to ask whether this canonical correlation is significant or should I include this component into my downstream analysis. This is a post selection inference question. PPTvalue() gives the p-value for this canonical correlation using projection based permutation tests with 1000 permutations.
The p-value for this canonical correlation is 0 which is very significant.
cc_pvalue <- PPTvalue(X = X, Y = Y, l_range = c(1), penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 1000)
With significance of this canonical correlation, sparse canonical correlation analysis actually did feature selection using sparsity. However, since we arbitrarily choose the penalty parameter, this feature selection may not be accurate. CPTloading() provides a correlation based permutation test to test whether each loading is significantly different from zero, indicating whether each feature truly contributes to this canonical component.
Since this is a multiple testing problem, we use FDR control procedure.
set.seed(123) mtb_pvalues <- CPTloading(X = X, Y = Y, side = "X", K = 1, r = 10, penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 1000) protein_pvalues <- CPTloading(X = X, Y = Y, side = "Y", K = 1, r = 10, penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 100) mtb_qvalues <- p.adjust(c(mtb_pvalues, protein_pvalues), method = "fdr")[1:length(mtb_pvalues)] protein_qvalues <- p.adjust(c(mtb_pvalues, protein_pvalues), method = "fdr")[(length(mtb_pvalues) + 1):(length(mtb_pvalues) + length(protein_pvalues))]
To facilitate the visualization, we manually create the groups information for metabolites and proteins. A bar plot of canonical loadings showing significance, values, and groups is used. This can provide an overview of the loadings and some insights on the group enrichment.
mtb_groups <- c(rep(paste("mtbG", 1:3, sep = "-"), rep(37, 3))) bar_p1 <- plotBar(data = X, loadings = CCA_out$u[, 1], feature_groups = mtb_groups, qvalues = mtb_qvalues) protein_groups <- c(rep(paste("proG", 1:5, sep = "-"), rep(86, 5)), rep("proG-6", 87)) bar_p2 <- plotBar(data = Y, loadings = CCA_out$v[, 1], qvalues = protein_qvalues, feature_groups = protein_groups) plot_grid(bar_p1, bar_p2, nrow = 2)
Sometimes, we can have the group information of the features, for instance, some metabolites may be generated from the same set ot cells, sharing functionality and properties, some genes are co-expressed. With the significance of canonical loadings, we can do group enrichment analysis. The intuition is as follows: a small group with large proportion of significant loading is more informative than a large group with small proportion of significant loadings. We identify some enriched groups in the first canonical component, and the functionality of this group may play important role in COVID-19 severity.
Here use the fake groups only for illustration of the function groupFisher(). The qvalues of each group is marked on the figure and the color can help compare the significance between groups.
mtb_fisher <- groupFisher(data = X, loadings = CCA_out$u[,1], qvalues = mtb_qvalues, feature_groups = mtb_groups, alpha = 0.05, draw = T) protein_fisher <- groupFisher(data = Y, loadings = CCA_out$v[,1], qvalues = protein_qvalues, feature_groups = protein_groups, alpha = 0.05, draw = T) plot_grid(mtb_fisher, protein_fisher, nrow = 2)
Sometimes, for statisticians, it is difficult to interpret the results due to lack of biology knowledge. For instance, in this analysis, we are not familar with the significant metabolites Phenylalanine What's the function of it? Is it naturally related to COVID-19 severity? Are there any independent study proving this before?
getInfo() is used to pull some related papers from PubMed. It searches Phenylalanine in PubMed database, and presents the top 5 related papers with or without abstract (depending on user).
infor_df <- getInfo(main = "Phenylalanine COVID", abstract = T) pandoc.table(infor_df, split.table = Inf)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.