README.md

TestPMD

Codecov test
coverage License:
MIT R-CMD-check

Overview

The goal of TestPMD is to provide a set of post selection inference tools for the results using Penalized Matrix Decomposition (PMD). Current version only provide the inference tools for sparse canonical correlation analysis using PMD.

Installation

You can install the development version of TestPMD from GitHub with:

# install.packages("devtools")
devtools::install_github("YunhuiQi/TestPMD")

Website

Check the vignette at https://yunhuiqi.github.io/TestPMD/ .

Shiny app

You can also run the Shiny App

TestPMD::runExample()

Introduction

The package TestPMD provides a set of post selection inference and visualization tools to sparse canonical correlation analysis using Penalized Matrix Decomposition (PMD).

By assuming the variance matrix of two datasets are diagonal matrices, PMD provides a version of sparse canonical correlation analysis using low rank approximation of cross-covariance matrix.

TestPMD provides the following features:

Example exploring relationship between proteins and metabolites for patients with COVID-19

1. Load useful package

library(PMA)
library(TestPMD)
library(ggplot2)
library(ggpubr)
library(cowplot)
#> 
#> Attaching package: 'cowplot'
#> The following object is masked from 'package:ggpubr':
#> 
#>     get_legend
library(pander)

2. Load covid multi-omics data

This is a dataset including metabolome and proteome data of 102 patient diagnosed with COVID-19 in Albany Medical Center, NY from 6 April 2020 to 1 May 2020. The samples are matched already. There are three data frames in covid data list, meta (for samples meta information) and metolites (raw metabolome count table) and protein (proteome raw count table) with samples in rows, features in columns.

For details, refer Overmyer K A, Shishkova E, Miller I J, et al. Large-scale multi-omic analysis of COVID-19 severity J. Cell systems, 2021, 12(1): 23-40. e7.

attach(covid)
covid$meta[1:6,1:5]
#>   sample_id covid age_less_than_90 gender icu
#> 1         1     1               39      M   0
#> 2         2     1               63      M   0
#> 3         3     1               33      M   0
#> 4         4     1               49      M   0
#> 5         5     1               49      M   0
#> 6         6     1               45      M   0
covid$metabolite[1:6, c(21,22,40)]
#>   L.Kynurenine.RT18.688449 L.Leucine.RT8.057321 Salicylic.acid
#> 1                 17.54496             19.18914       12.47894
#> 2                 16.63684             19.84709       13.24714
#> 3                 17.54002             19.56312       18.12925
#> 4                 15.29170             18.91955       13.36410
#> 5                 16.30920             19.73141       13.99100
#> 6                 16.72706             19.60679       13.37770
covid$protein[1:6, 1:3]
#>   A0A024R6I7.A0A0G2JRN3 A0A075B6H9 A0A075B6I0
#> 1            2.7403e+11  151670000  329220000
#> 2            1.7555e+11  717910000  555230000
#> 3            1.9816e+11  151440000  519740000
#> 4            2.1915e+11  493920000  773840000
#> 5            2.7176e+11  161000000  408530000
#> 6            1.9302e+11  778020000  639290000

3. Pre-processing

Before performing the sparse canonical correlation analysis using PMA package, we need to take log to stabilize the data and standardize the data. TestPMD::standsdmu() can help with the standardization.

X <- standsdmu(log(covid$metabolite))
Y <- standsdmu(log(covid$protein + 1))

4. Sparse canonical correlation analysis

We conduct the sparse canonical correlation analysis using PMA package. The PMA::CCA function conduct the sparse canonical correlation analysis with the objective being:

minu, vuTΣxyv s.t.\|\|u\|\|1 ≤ c1, \|\|v\|\|2 ≤ c2, \|\|u\|\|2 ≤ 1, \|\|v\|\|2 ≤ 1

This is a simplified version of sparse canonical correlation analysis by assuming the Σx and Σy are identity matrices.

The tests in TestPMD allow us to arbitrarily choose penalty parameter, we set it to be 0.9 in this example, which mean c1 = 0.9 * sqrt(ncol(X)), c2 = 0.9 * sqrt(ncol(Y)).

CCA_out <- PMA::CCA(x = X, z = Y, K = 3, typex = "standard", typez = "standard", penaltyx = 0.9, penaltyz = 0.9, standardize = F, trace = F)

5. Visualization: scatter plot

Check the scatter plot the first pair of canonical variates to see if there is a separation between samples admitted to ICU or not. This is a common step in canonical correlation analysis, to check if the majoy variation is related to some group factor.

It is clear that in the first canonical component, the patients admitted to ICU (green dots) are separated from the ones who are not (red dots). The canonical correlation is large, around 0.83.

scatter_p <- plotScatter(X = X, Y = Y, CCA_out = CCA_out, groups = covid$meta$icu, K = 3)
ggarrange(plotlist = scatter_p, nrow=1, common.legend = T, legend = "bottom")

6. Inference: is this canonical correlation significant?

It is natural to ask whether this canonical correlation is significant or should I include this component into my downstream analysis. This is a post selection inference question. PPTvalue() gives the p-value for this canonical correlation using projection based permutation tests with 1000 permutations.

The p-value for this canonical correlation is 0 which is very significant.

cc_pvalue <- PPTvalue(X = X, Y = Y, l_range = c(1), penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 1000)
#> [1] "Using provided penalty parameter for PMD"
#> [1] "The provided penalty parameters for X and Y are: 0.9, 0.9"
#> [1] "p-values for canonical correlations at positions 1 are: 0"

7. Inference: which proteins and metabolites truly contribute to this canonical correlation?

With significance of this canonical correlation, sparse canonical correlation analysis actually did feature selection using sparsity. However, since we arbitrarily choose the penalty parameter, this feature selection may not be accurate. CPTloading() provides a correlation based permutation test to test whether each loading is significantly different from zero, indicating whether each feature truly contributes to this canonical component.

Since this is a multiple testing problem, we use FDR control procedure.

set.seed(123)

mtb_pvalues <- CPTloading(X = X, Y = Y, side = "X", K = 1, r = 10, penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 1000)
#> [1] "Using provided penalty parameter for PMD"
#> [1] "The provided penalty parameters for X and Y are: 0.9, 0.9"

protein_pvalues <- CPTloading(X = X, Y = Y, side = "Y", K = 1, r = 10, penalty = "Fixed", rho_x = 0.9, rho_y = 0.9, permutation_no = 100)
#> [1] "Using provided penalty parameter for PMD"
#> [1] "The provided penalty parameters for X and Y are: 0.9, 0.9"

mtb_qvalues <- p.adjust(c(mtb_pvalues, protein_pvalues), method = "fdr")[1:length(mtb_pvalues)]

protein_qvalues <- p.adjust(c(mtb_pvalues, protein_pvalues), method = "fdr")[(length(mtb_pvalues) + 1):(length(mtb_pvalues) + length(protein_pvalues))]

8. Visualization: significance of canonical loadings

To facilitate the visualization, we manually create the groups information for metabolites and proteins. A bar plot of canonical loadings showing significance, values, and groups is used. This can provide an overview of the loadings and some insights on the group enrichment.

mtb_groups <- c(rep(paste("mtbG", 1:3, sep = "-"), rep(37, 3)))
bar_p1 <- plotBar(data = X, loadings = CCA_out$u[, 1], feature_groups = mtb_groups, qvalues = mtb_qvalues)

protein_groups <- c(rep(paste("proG", 1:5, sep = "-"), rep(86, 5)), rep("proG-6", 87))
bar_p2 <- plotBar(data = Y, loadings = CCA_out$v[, 1], qvalues = protein_qvalues, feature_groups = protein_groups)

plot_grid(bar_p1, bar_p2, nrow = 2)

9. Group enrichment analysis

Sometimes, we can have the group information of the features, for instance, some metabolites may be generated from the same set ot cells, sharing functionality and properties, some genes are co-expressed. With the significance of canonical loadings, we can do group enrichment analysis. The intuition is as follows: a small group with large proportion of significant loading is more informative than a large group with small proportion of significant loadings. We identify some enriched groups in the first canonical component, and the functionality of this group may play important role in COVID-19 severity.

Here use the fake groups only for illustration of the function groupFisher(). The qvalues of each group is marked on the figure and the color can help compare the significance between groups.

mtb_fisher <- groupFisher(data = X, loadings = CCA_out$u[,1], qvalues = mtb_qvalues, feature_groups = mtb_groups, alpha = 0.05, draw = T)
#>     mtbG-1     mtbG-2     mtbG-3 
#> 0.98243470 0.02338315 0.98243470
protein_fisher <- groupFisher(data = Y, loadings = CCA_out$v[,1], qvalues = protein_qvalues, feature_groups = protein_groups, alpha = 0.05, draw = T)
#>     proG-1     proG-2     proG-3     proG-4     proG-5     proG-6 
#> 0.99999989 0.55160249 0.55160249 0.29285098 0.07418899 0.55160249
plot_grid(mtb_fisher, protein_fisher, nrow = 2)

10. How does the significant metabolite Phenylalanine related to COVID-19, is there any independent study?

Sometimes, for statisticians, it is difficult to interpret the results due to lack of biology knowledge. For instance, in this analysis, we are not familar with the significant metabolites Phenylalanine What’s the function of it? Is it naturally related to COVID-19 severity? Are there any independent study proving this before?

getInfo() is used to pull some related papers from PubMed. It searches Phenylalanine in PubMed database, and presents the top 5 related papers with or without abstract (depending on user).

infor_df <- getInfo(main = "Phenylalanine COVID", abstract = T)

[1] “58 records found.”

pandoc.table(infor_df, split.table = Inf)

| Title | Author | Journal | Date | Abstract | |:----------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------:|:----:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| | Phytoconstituents from ten natural herbs as potent inhibitors of main protease enzyme of SARS-COV-2: In silico study. | Nitish Kumar, Atamjit Singh, Harmandeep Kaur Gulati, Kavita Bhagat, Komalpreet Kaur, Jaspreet Kaur, Shilpa Dudhal, Amit Duggal, Puja Gulati, Harbinder Singh, Jatinder Vir Singh, Preet Mohinder Singh Bedi | Phytomedicine Plus : International journal of phytotherapy and phytopharmacology | 2021 | Lack of treatment of novel Coronavirus disease led to the search of specific antivirals that are capable to inhibit the replication of the virus. The plant kingdom has demonstrated to be an important source of new molecules with antiviral potential. | | Inflammatory Bowel Disease and COVID-19: How Microbiomics and Metabolomics Depict Two Sides of the Same Coin. | Gian Mario Cortes, Maria Antonietta Marcialis, Flaminia Bardanzellu, Angelica Corrias, Vassilios Fanos, Michele Mussap | Frontiers in microbiology | 2022 | The present study aims to utilize various computational tools to identify the most eligible drug candidate that have capabilities to halt the replication of SARS-COV-2 virus by inhibiting Main protease (Mpro) enzyme. | | pH- and Calcium-Dependent Aromatic Network in the SARS-CoV-2 Envelope Protein. | João Medeiros-Silva, Noah H Somberg, Harrison K Wang, Matthew J McKay, Venkata S Mandala, Aurelio J Dregni, Mei Hong | Journal of the American Chemical Society | 2022 | We have selected plants whose extracts have inhibitory potential against previously discovered coronaviruses. Their phytoconstituents were surveyed and a library of 100 molecules was prepared. Then, computational tools such as molecular docking, ADMET and molecular dynamic simulations were utilized to screen the compounds and evaluate them against Mpro enzyme. | | Peptide-Induced Fractal Assembly of Silver Nanoparticles for Visual Detection of Disease Biomarkers. | Maurice Retout, Yash Mantri, Zhicheng Jin, Jiajing Zhou, Grégoire Noël, Brian Donovan, Wonjun Yim, Jesse V Jokerst | ACS nano | 2022 | All the phytoconstituents showed good binding affinities towards Mpro enzyme. Among them laurolitsine possesses the highest binding affinity i.e. -294.1533 kcal/mol. On ADMET analysis of best three ligands were simulated for 1.2 ns, then the stable ligand among them was further simulated for 20 ns. Results revealed that no conformational changes were observed in the laurolitsine w.r.t. protein residues and low RMSD value suggested that the Laurolitsine-protein complex was stable for 20 ns. | | Fatal Neurodissemination and SARS-CoV-2 Tropism in K18-hACE2 Mice Is Only Partially Dependent on hACE2 Expression. | Mariano Carossino, Devin Kenney, Aoife K O’Connell, Paige Montanaro, Anna E Tseng, Hans P Gertje, Kyle A Grosz, Maria Ericsson, Bertrand R Huber, Susanna A Kurnick, Saravanan Subramaniam, Thomas A Kirkland, Joel R Walker, Kevin P Francis, Alexander D Klose, Neal Paragas, Markus Bosmann, Mohsan Saeed, Udeni B R Balasuriya, Florian Douam, Nicholas A Crossland | Viruses | 2022 | Laurolitsine, an active constituent of roots of Lindera aggregata, was found to be having good ADMET profile and have capabilities to halt the activity of the enzyme. Therefore, this makes laurolitsine a good drug candidate for the treatment of COVID-19. |



YunhuiQi/TestPMD documentation built on May 5, 2022, 8:23 p.m.