The goal of TestPMD is to provide a set of post selection inference tools for the results using Penalized Matrix Decomposition (PMD). Current version only provide the inference tools for sparse canonical correlation analysis using PMD.
You can install the development version of TestPMD from GitHub with:
# install.packages("devtools")
devtools::install_github("YunhuiQi/TestPMD")
Check the vignette at https://yunhuiqi.github.io/TestPMD/ .
You can also run the Shiny App
TestPMD::runExample()
The package TestPMD provides a set of post selection inference and visualization tools to sparse canonical correlation analysis using Penalized Matrix Decomposition (PMD).
By assuming the variance matrix of two datasets are diagonal matrices, PMD provides a version of sparse canonical correlation analysis using low rank approximation of cross-covariance matrix.
TestPMD provides the following features:
Conduct the hypothesis testing on the canonical correlations (singular values of cross-covariance matrix) between two data matrices: PPTvalue()
Conduct the hypothesis testing on the canonical loadings (singular vectors of cross-covariance matrix) between two data matrices: CPTloading()
Determine the how many canonical components (best low rank of cross-covariance matrix) should be used for downstream analysis: PPTrank()
Visualize whether there is a separation between samples from different groups: plotScatter()
Visualize the canonical loadings using barplot: plotBar()
If the groups information of features are available, conduct the fisher exact test to do determine the enriched groups on the significant features: groupFisher()
Conduct literature search on specific features using easyPubMed to learn more about the significant features and interested covariate: getInfo()
library(PMA)
library(TestPMD)
library(ggplot2)
library(ggpubr)
library(cowplot)
#>
#> Attaching package: 'cowplot'
#> The following object is masked from 'package:ggpubr':
#>
#> get_legend
library(pander)
This is a dataset including metabolome and proteome data of 102 patient diagnosed with COVID-19 in Albany Medical Center, NY from 6 April 2020 to 1 May 2020. The samples are matched already. There are three data frames in covid data list, meta (for samples meta information) and metolites (raw metabolome count table) and protein (proteome raw count table) with samples in rows, features in columns.
For details, refer Overmyer K A, Shishkova E, Miller I J, et al. Large-scale multi-omic analysis of COVID-19 severity J. Cell systems, 2021, 12(1): 23-40. e7.
attach(covid)
covid$meta[1:6,1:5]
#> sample_id covid age_less_than_90 gender icu
#> 1 1 1 39 M 0
#> 2 2 1 63 M 0
#> 3 3 1 33 M 0
#> 4 4 1 49 M 0
#> 5 5 1 49 M 0
#> 6 6 1 45 M 0
covid$metabolite[1:6, c(21,22,40)]
#> L.Kynurenine.RT18.688449 L.Leucine.RT8.057321 Salicylic.acid
#> 1 17.54496 19.18914 12.47894
#> 2 16.63684 19.84709 13.24714
#> 3 17.54002 19.56312 18.12925
#> 4 15.29170 18.91955 13.36410
#> 5 16.30920 19.73141 13.99100
#> 6 16.72706 19.60679 13.37770
covid$protein[1:6, 1:3]
#> A0A024R6I7.A0A0G2JRN3 A0A075B6H9 A0A075B6I0
#> 1 2.7403e+11 151670000 329220000
#> 2 1.7555e+11 717910000 555230000
#> 3 1.9816e+11 151440000 519740000
#> 4 2.1915e+11 493920000 773840000
#> 5 2.7176e+11 161000000 408530000
#> 6 1.9302e+11 778020000 639290000
Before performing the sparse canonical correlation analysis using PMA package, we need to take log to stabilize the data and standardize the data. TestPMD::standsdmu() can help with the standardization.
X <- standsdmu(log(covid$metabolite))
Y <- standsdmu(log(covid$protein + 1))
We conduct the sparse canonical correlation analysis using PMA package. The PMA::CCA function conduct the sparse canonical correlation analysis with the objective being:
minu, vuTΣxyv s.t.\|\|u\|\|1 ≤ c1, \|\|v\|\|2 ≤ c2, \|\|u\|\|2 ≤ 1, \|\|v\|\|2 ≤ 1
This is a simplified version of sparse canonical correlation analysis by assuming the Σx and Σy are identity matrices.
The tests in TestPMD allow us to arbitrarily choose penalty parameter, we set it to be 0.9 in this example, which mean c1 = 0.9 * sqrt(ncol(X)), c2 = 0.9 * sqrt(ncol(Y)).
CCA_out <- PMA::CCA(x = X, z = Y, K = 3, typex = "standard", typez = "standard", penaltyx = 0.9, penaltyz = 0.9, standardize = F, trace = F)
Check the scatter plot the first pair of canonical variates to see if there is a separation between samples admitted to ICU or not. This is a common step in canonical correlation analysis, to check if the majoy variation is related to some group factor.
It is clear that in the first canonical component, the patients admitted to ICU (green dots) are separated from the ones who are not (red dots). The canonical correlation is large, around 0.83.
scatter_p <- plotScatter(X = X, Y = Y, CCA_out = CCA_out, groups = covid$meta$icu, K = 3)
ggarrange(plotlist = scatter_p, nrow=1, common.legend = T, legend = "bottom")
It is natural to ask whether this canonical correlation is significant or should I include this component into my downstream analysis. This is a post selection inference question. PPTvalue() gives the p-value for this canonical correlation using projection based permutation tests with 1000 permutations.
The p-value for this canonical correlation is 0 which is very significant.
cc_pvalue <- PPTvalue(X = X, Y = Y, l_range = c(1), penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 1000)
#> [1] "Using provided penalty parameter for PMD"
#> [1] "The provided penalty parameters for X and Y are: 0.9, 0.9"
#> [1] "p-values for canonical correlations at positions 1 are: 0"
With significance of this canonical correlation, sparse canonical correlation analysis actually did feature selection using sparsity. However, since we arbitrarily choose the penalty parameter, this feature selection may not be accurate. CPTloading() provides a correlation based permutation test to test whether each loading is significantly different from zero, indicating whether each feature truly contributes to this canonical component.
Since this is a multiple testing problem, we use FDR control procedure.
set.seed(123)
mtb_pvalues <- CPTloading(X = X, Y = Y, side = "X", K = 1, r = 10, penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 1000)
#> [1] "Using provided penalty parameter for PMD"
#> [1] "The provided penalty parameters for X and Y are: 0.9, 0.9"
protein_pvalues <- CPTloading(X = X, Y = Y, side = "Y", K = 1, r = 10, penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 100)
#> [1] "Using provided penalty parameter for PMD"
#> [1] "The provided penalty parameters for X and Y are: 0.9, 0.9"
mtb_qvalues <- p.adjust(c(mtb_pvalues, protein_pvalues), method = "fdr")[1:length(mtb_pvalues)]
protein_qvalues <- p.adjust(c(mtb_pvalues, protein_pvalues), method = "fdr")[(length(mtb_pvalues) + 1):(length(mtb_pvalues) + length(protein_pvalues))]
To facilitate the visualization, we manually create the groups information for metabolites and proteins. A bar plot of canonical loadings showing significance, values, and groups is used. This can provide an overview of the loadings and some insights on the group enrichment.
mtb_groups <- c(rep(paste("mtbG", 1:3, sep = "-"), rep(37, 3)))
bar_p1 <- plotBar(data = X, loadings = CCA_out$u[, 1], feature_groups = mtb_groups, qvalues = mtb_qvalues)
protein_groups <- c(rep(paste("proG", 1:5, sep = "-"), rep(86, 5)), rep("proG-6", 87))
bar_p2 <- plotBar(data = Y, loadings = CCA_out$v[, 1], qvalues = protein_qvalues, feature_groups = protein_groups)
plot_grid(bar_p1, bar_p2, nrow = 2)
Sometimes, we can have the group information of the features, for instance, some metabolites may be generated from the same set ot cells, sharing functionality and properties, some genes are co-expressed. With the significance of canonical loadings, we can do group enrichment analysis. The intuition is as follows: a small group with large proportion of significant loading is more informative than a large group with small proportion of significant loadings. We identify some enriched groups in the first canonical component, and the functionality of this group may play important role in COVID-19 severity.
Here use the fake groups only for illustration of the function groupFisher(). The qvalues of each group is marked on the figure and the color can help compare the significance between groups.
mtb_fisher <- groupFisher(data = X, loadings = CCA_out$u[,1], qvalues = mtb_qvalues, feature_groups = mtb_groups, alpha = 0.05, draw = T)
#> mtbG-1 mtbG-2 mtbG-3
#> 0.98243470 0.02338315 0.98243470
protein_fisher <- groupFisher(data = Y, loadings = CCA_out$v[,1], qvalues = protein_qvalues, feature_groups = protein_groups, alpha = 0.05, draw = T)
#> proG-1 proG-2 proG-3 proG-4 proG-5 proG-6
#> 0.99999989 0.55160249 0.55160249 0.29285098 0.07418899 0.55160249
plot_grid(mtb_fisher, protein_fisher, nrow = 2)
Sometimes, for statisticians, it is difficult to interpret the results due to lack of biology knowledge. For instance, in this analysis, we are not familar with the significant metabolites Phenylalanine What’s the function of it? Is it naturally related to COVID-19 severity? Are there any independent study proving this before?
getInfo() is used to pull some related papers from PubMed. It searches Phenylalanine in PubMed database, and presents the top 5 related papers with or without abstract (depending on user).
infor_df <- getInfo(main = "Phenylalanine COVID", abstract = T)
[1] “58 records found.”
pandoc.table(infor_df, split.table = Inf)
| Title | Author | Journal | Date | Abstract | |:----------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------:|:----:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| | Phytoconstituents from ten natural herbs as potent inhibitors of main protease enzyme of SARS-COV-2: In silico study. | Nitish Kumar, Atamjit Singh, Harmandeep Kaur Gulati, Kavita Bhagat, Komalpreet Kaur, Jaspreet Kaur, Shilpa Dudhal, Amit Duggal, Puja Gulati, Harbinder Singh, Jatinder Vir Singh, Preet Mohinder Singh Bedi | Phytomedicine Plus : International journal of phytotherapy and phytopharmacology | 2021 | Lack of treatment of novel Coronavirus disease led to the search of specific antivirals that are capable to inhibit the replication of the virus. The plant kingdom has demonstrated to be an important source of new molecules with antiviral potential. | | Inflammatory Bowel Disease and COVID-19: How Microbiomics and Metabolomics Depict Two Sides of the Same Coin. | Gian Mario Cortes, Maria Antonietta Marcialis, Flaminia Bardanzellu, Angelica Corrias, Vassilios Fanos, Michele Mussap | Frontiers in microbiology | 2022 | The present study aims to utilize various computational tools to identify the most eligible drug candidate that have capabilities to halt the replication of SARS-COV-2 virus by inhibiting Main protease (Mpro) enzyme. | | pH- and Calcium-Dependent Aromatic Network in the SARS-CoV-2 Envelope Protein. | João Medeiros-Silva, Noah H Somberg, Harrison K Wang, Matthew J McKay, Venkata S Mandala, Aurelio J Dregni, Mei Hong | Journal of the American Chemical Society | 2022 | We have selected plants whose extracts have inhibitory potential against previously discovered coronaviruses. Their phytoconstituents were surveyed and a library of 100 molecules was prepared. Then, computational tools such as molecular docking, ADMET and molecular dynamic simulations were utilized to screen the compounds and evaluate them against Mpro enzyme. | | Peptide-Induced Fractal Assembly of Silver Nanoparticles for Visual Detection of Disease Biomarkers. | Maurice Retout, Yash Mantri, Zhicheng Jin, Jiajing Zhou, Grégoire Noël, Brian Donovan, Wonjun Yim, Jesse V Jokerst | ACS nano | 2022 | All the phytoconstituents showed good binding affinities towards Mpro enzyme. Among them laurolitsine possesses the highest binding affinity i.e. -294.1533 kcal/mol. On ADMET analysis of best three ligands were simulated for 1.2 ns, then the stable ligand among them was further simulated for 20 ns. Results revealed that no conformational changes were observed in the laurolitsine w.r.t. protein residues and low RMSD value suggested that the Laurolitsine-protein complex was stable for 20 ns. | | Fatal Neurodissemination and SARS-CoV-2 Tropism in K18-hACE2 Mice Is Only Partially Dependent on hACE2 Expression. | Mariano Carossino, Devin Kenney, Aoife K O’Connell, Paige Montanaro, Anna E Tseng, Hans P Gertje, Kyle A Grosz, Maria Ericsson, Bertrand R Huber, Susanna A Kurnick, Saravanan Subramaniam, Thomas A Kirkland, Joel R Walker, Kevin P Francis, Alexander D Klose, Neal Paragas, Markus Bosmann, Mohsan Saeed, Udeni B R Balasuriya, Florian Douam, Nicholas A Crossland | Viruses | 2022 | Laurolitsine, an active constituent of roots of Lindera aggregata, was found to be having good ADMET profile and have capabilities to halt the activity of the enzyme. Therefore, this makes laurolitsine a good drug candidate for the treatment of COVID-19. |
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.