Welcome to the Pathway analysis via network smoothing (PANTS
) R package. In this vignette, I'll illustrate a case study of the package for integrating proteomics and metabolomics data to identify significant pathways. I'll use a small toy dataset.
See README.md.
library(limma) library(ezlimma) library(PANTS)
To illustrate PANTS
, we'll simulate abundance of 4 analytes on 10 samples, with 5 cases and 5 controls, as M
. We simulate the first analyte (a
) as up in cases.
We treat this data as being already processed. For many datasets, there are zeroes or missing values that might need to be imputed; samples need to be normalized to make them comparable and amenable to statistical tests; absent analytes need to be removed; sample outliers need to be assessed to examine whether some experimental variables or batch effects need to be accounted for, or the samples need to be removed or down-weighted; and trends between an analyte's mean expression and its variance should be accounted for, especially in RNA-seq data, with limma
's voom
function.
set.seed(0) M <- matrix(rnorm(n=40), ncol=10, dimnames=list(letters[1:4], paste0("s", 1:10))) M["a", 1:5] <- M["a", 1:5]+5 pheno.v <- rep(c("case", "ctrl"), each=5)
We create a toy interaction network with analyte nodes matching our dataset:
el <- rbind(t(combn(letters[1:4], 2))[-2,], c("a", "d"))
To get an interaction network for your dataset, one option is Pathway Commons, which integrates many public databases. TO download an interaction network from Pathway Commons, select the most recent version from https://www.pathwaycommons.org/archives/PC2/, e.g. v11, and then download https://www.pathwaycommons.org/archives/PC2/v11/PathwayCommons11.All.hgnc.sif.gz, or its equivalent for the most recent version, which holds all interactions in Simple Interaction Format (SIF). Each row of a SIF file is physical_entity_id sif2edgelist
.
PANTS
smooths statistics over the network using a network kernel derived from a graph. So we transform our edge-list into a simplified graph:
gr <- edgelist2graph(el)
We create toy pathways with analytes matching our dataset:
gmt <- list(pwy1=list(name="pwy1", description="pwy1", genes=c("a", "b", "c")), pwy2=list(name="pwy2", description="pwy2", genes=c("b", "c", "d")))
You can obtain predefined pathways, most commonly as GMT files from the same sources as for any gene set analysis. For example, collections of gene set databases are available from the Broad Insitute's MSigDB, Harmonizome or Pathway Commons. For pathways that include both metabolites and proteins, we downloaded a CSV file for proteins and one for metabolites from SMPDB, and then combined these two files. We need the analytes in our predefined pathways to have IDs consistent with those of our interaction network. If they aren't this way, an ID conversion is needed. You can read these GMT files into R with ezlimma::read_gmt
.
We then transform the pathways into a matrix that PANTS
accepts:
G <- gmt2Gmat(gmt)
This matrix is used for smoothing scores across the network, like how heat diffuses. It can be a dense matrix, a sparse matrix from package Matrix
, or NULL
to avoid smoothing.
ker <- graph2kernel(gr)
We execute the main function, pants
, by testing a contrast (or comparison) on this dataset, e.g. case vs. control. We limit the number of permutations to speed up the calculation. In general, if the object
or nperm
is large, this step will be slow.
contr <- c(CASEvsCTRL="case-ctrl") res.lst <- pants(object=M, phenotype=pheno.v, contrast.v=contr, ker=ker, Gmat=G, nperm=50, name="vinny") pwy.stats <- res.lst$pwy.stats
Draw network diagram of driving nodes for pwy1
.
feat.tab <- res.lst$feature.stats PANTS::plot_pwy(feat.tab = feat.tab, gr=gr, ker=ker, Gmat=G, pwy="pwy1", ntop=2, name=NA)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.