knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = '100%'
)

Introduction

The package TestPMD provides a set of post selection inference and visualization tools to sparse canonical correlation analysis using Penalized Matrix Decomposition (PMD).

By assuming the variance matrix of two datasets are diagonal matrices, PMD provides a version of sparse canonical correlation analysis using low rank approximation of cross-covariance matrix.

TestPMD provides the following features:

Shiny app

You can run the Shiny App:

TestPMD::runExample()

Example exploring relationship between proteins and metabolites for patients with COVID-19

1. Load useful package

library(PMA)
library(TestPMD)
library(ggplot2)
library(ggpubr)
library(cowplot)
library(pander)

2. Load covid multi-omics data

This is a dataset including metabolome and proteome data of 102 patient diagnosed with COVID-19 in Albany Medical Center, NY from 6 April 2020 to 1 May 2020. The samples are matched already. There are three data frames in covid data list, meta (for samples meta information) and metolites (raw metabolome count table) and protein (proteome raw count table) with samples in rows, features in columns.

For details, refer Overmyer K A, Shishkova E, Miller I J, et al. Large-scale multi-omic analysis of COVID-19 severity J. Cell systems, 2021, 12(1): 23-40. e7.

attach(covid)
covid$meta[1:6,1:5]
covid$metabolite[1:6, c(21,22,40)]
covid$protein[1:6, 1:3]

3. Pre-processing

Before performing the sparse canonical correlation analysis using PMA package, we need to take log to stabilize the data and standardize the data. TestPMD::standsdmu() can help with the standardization.

X <- standsdmu(log(covid$metabolite))
Y <- standsdmu(log(covid$protein + 1))

4. Sparse canonical correlation analysis

We conduct the sparse canonical correlation analysis using PMA package. The PMA::CCA function conduct the sparse canonical correlation analysis with the objective being:

$$min_{u,v}u^T\Sigma_{xy}v$$ $$s.t.||u||_1\leq c_1, ||v||_2\leq c_2, ||u||_2\leq 1, ||v||_2\leq 1 $$

This is a simplified version of sparse canonical correlation analysis by assuming the $\Sigma_x$ and $\Sigma_y$ are identity matrices.

The tests in TestPMD allow us to arbitrarily choose penalty parameter, we set it to be 0.9 in this example, which mean $c_1 =0.9sqrt(ncol(X))$, $c_2 =0.9sqrt(ncol(Y))$.

CCA_out <- PMA::CCA(x = X, z = Y, K = 3, typex = "standard", typez = "standard", penaltyx = 0.9, penaltyz = 0.9, standardize = F, trace = F)

5. Visualization: scatter plot

Check the scatter plot the first pair of canonical variates to see if there is a separation between samples admitted to ICU or not. This is a common step in canonical correlation analysis, to check if the majoy variation is related to some group factor.

It is clear that in the first canonical component, the patients admitted to ICU (green dots) are separated from the ones who are not (red dots). The canonical correlation is large, around 0.83.

scatter_p <- plotScatter(X = X, Y = Y, CCA_out = CCA_out, groups = covid$meta$icu, K = 3)
ggarrange(plotlist = scatter_p, nrow=1, common.legend = T, legend = "bottom")

6. Inference: is this canonical correlation significant?

It is natural to ask whether this canonical correlation is significant or should I include this component into my downstream analysis. This is a post selection inference question. PPTvalue() gives the p-value for this canonical correlation using projection based permutation tests with 1000 permutations.

The p-value for this canonical correlation is 0 which is very significant.

cc_pvalue <- PPTvalue(X = X, Y = Y, l_range = c(1), penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 1000)

7. Inference: which proteins and metabolites truly contribute to this canonical correlation?

With significance of this canonical correlation, sparse canonical correlation analysis actually did feature selection using sparsity. However, since we arbitrarily choose the penalty parameter, this feature selection may not be accurate. CPTloading() provides a correlation based permutation test to test whether each loading is significantly different from zero, indicating whether each feature truly contributes to this canonical component.

Since this is a multiple testing problem, we use FDR control procedure.

set.seed(123)

mtb_pvalues <- CPTloading(X = X, Y = Y, side = "X", K = 1, r = 10, penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 1000)

protein_pvalues <- CPTloading(X = X, Y = Y, side = "Y", K = 1, r = 10, penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 100)

mtb_qvalues <- p.adjust(c(mtb_pvalues, protein_pvalues), method = "fdr")[1:length(mtb_pvalues)]

protein_qvalues <- p.adjust(c(mtb_pvalues, protein_pvalues), method = "fdr")[(length(mtb_pvalues) + 1):(length(mtb_pvalues) + length(protein_pvalues))]

8. Visualization: significance of canonical loadings

To facilitate the visualization, we manually create the groups information for metabolites and proteins. A bar plot of canonical loadings showing significance, values, and groups is used. This can provide an overview of the loadings and some insights on the group enrichment.

mtb_groups <- c(rep(paste("mtbG", 1:3, sep = "-"), rep(37, 3)))
bar_p1 <- plotBar(data = X, loadings = CCA_out$u[, 1], feature_groups = mtb_groups, qvalues = mtb_qvalues)

protein_groups <- c(rep(paste("proG", 1:5, sep = "-"), rep(86, 5)), rep("proG-6", 87))
bar_p2 <- plotBar(data = Y, loadings = CCA_out$v[, 1], qvalues = protein_qvalues, feature_groups = protein_groups)

plot_grid(bar_p1, bar_p2, nrow = 2)

9. Group enrichment analysis

Sometimes, we can have the group information of the features, for instance, some metabolites may be generated from the same set ot cells, sharing functionality and properties, some genes are co-expressed. With the significance of canonical loadings, we can do group enrichment analysis. The intuition is as follows: a small group with large proportion of significant loading is more informative than a large group with small proportion of significant loadings. We identify some enriched groups in the first canonical component, and the functionality of this group may play important role in COVID-19 severity.

Here use the fake groups only for illustration of the function groupFisher(). The qvalues of each group is marked on the figure and the color can help compare the significance between groups.

mtb_fisher <- groupFisher(data = X, loadings = CCA_out$u[,1], qvalues = mtb_qvalues, feature_groups = mtb_groups, alpha = 0.05, draw = T)
protein_fisher <- groupFisher(data = Y, loadings = CCA_out$v[,1], qvalues = protein_qvalues, feature_groups = protein_groups, alpha = 0.05, draw = T)
plot_grid(mtb_fisher, protein_fisher, nrow = 2)

10. How does the significant metabolite Phenylalanine related to COVID-19, is there any independent study?

Sometimes, for statisticians, it is difficult to interpret the results due to lack of biology knowledge. For instance, in this analysis, we are not familar with the significant metabolites Phenylalanine What's the function of it? Is it naturally related to COVID-19 severity? Are there any independent study proving this before?

getInfo() is used to pull some related papers from PubMed. It searches Phenylalanine in PubMed database, and presents the top 5 related papers with or without abstract (depending on user).

infor_df <- getInfo(main = "Phenylalanine COVID", abstract = T)
pandoc.table(infor_df, split.table = Inf)


YunhuiQi/TestPMD documentation built on May 5, 2022, 8:23 p.m.