Introduction

Welcome to the Pathway analysis via network smoothing (PANTS) R package. In this vignette, I'll illustrate a case study of the package for integrating proteomics and metabolomics data to identify significant pathways. I'll use a small toy dataset.

Install & load

See README.md.

library(limma)
library(ezlimma)
library(PANTS)

Toy dataset

To illustrate PANTS, we'll simulate abundance of 4 analytes on 10 samples, with 5 cases and 5 controls, as M. We simulate the first analyte (a) as up in cases.

We treat this data as being already processed. For many datasets, there are zeroes or missing values that might need to be imputed; samples need to be normalized to make them comparable and amenable to statistical tests; absent analytes need to be removed; sample outliers need to be assessed to examine whether some experimental variables or batch effects need to be accounted for, or the samples need to be removed or down-weighted; and trends between an analyte's mean expression and its variance should be accounted for, especially in RNA-seq data, with limma's voom function.

set.seed(0)
M <- matrix(rnorm(n=40), ncol=10, dimnames=list(letters[1:4], paste0("s", 1:10)))
M["a", 1:5] <- M["a", 1:5]+5
pheno.v <- rep(c("case", "ctrl"), each=5)

Interaction network

We create a toy interaction network with analyte nodes matching our dataset:

el <- rbind(t(combn(letters[1:4], 2))[-2,], c("a", "d"))

To get an interaction network for your dataset, one option is Pathway Commons, which integrates many public databases. TO download an interaction network from Pathway Commons, select the most recent version from https://www.pathwaycommons.org/archives/PC2/, e.g. v11, and then download https://www.pathwaycommons.org/archives/PC2/v11/PathwayCommons11.All.hgnc.sif.gz, or its equivalent for the most recent version, which holds all interactions in Simple Interaction Format (SIF). Each row of a SIF file is physical_entity_id physical_entity_id. The 1st and 3rd columns represent the edges of the network, an edgelist. You can subset by interaction types in the SIF, or to proteins or metabolites (metabolite IDs can be identified as starting with "CHEBI:"). You can transform the SIF into an edgelist, using sif2edgelist.

Graph

PANTS smooths statistics over the network using a network kernel derived from a graph. So we transform our edge-list into a simplified graph:

gr <- edgelist2graph(el)

Predefined pathways

We create toy pathways with analytes matching our dataset:

gmt <- list(pwy1=list(name="pwy1", description="pwy1", genes=c("a", "b", "c")),
            pwy2=list(name="pwy2", description="pwy2", genes=c("b", "c", "d")))

You can obtain predefined pathways, most commonly as GMT files from the same sources as for any gene set analysis. For example, collections of gene set databases are available from the Broad Insitute's MSigDB, Harmonizome or Pathway Commons. For pathways that include both metabolites and proteins, we downloaded a CSV file for proteins and one for metabolites from SMPDB, and then combined these two files. We need the analytes in our predefined pathways to have IDs consistent with those of our interaction network. If they aren't this way, an ID conversion is needed. You can read these GMT files into R with ezlimma::read_gmt.

We then transform the pathways into a matrix that PANTS accepts:

G <- gmt2Gmat(gmt)

Calculate Laplacian kernel matrix

This matrix is used for smoothing scores across the network, like how heat diffuses. It can be a dense matrix, a sparse matrix from package Matrix, or NULL to avoid smoothing.

ker <- graph2kernel(gr)

Pathway analysis

We execute the main function, pants, by testing a contrast (or comparison) on this dataset, e.g. case vs. control. We limit the number of permutations to speed up the calculation. In general, if the object or nperm is large, this step will be slow.

contr <- c(CASEvsCTRL="case-ctrl")
res.lst <- pants(object=M, phenotype=pheno.v, contrast.v=contr, ker=ker, Gmat=G, nperm=50, name="vinny")
pwy.stats <- res.lst$pwy.stats

Plot

Draw network diagram of driving nodes for pwy1.

feat.tab <- res.lst$feature.stats
PANTS::plot_pwy(feat.tab = feat.tab, gr=gr, ker=ker, Gmat=G, pwy="pwy1", ntop=2, name=NA)


jdreyf/PANTS documentation built on July 18, 2019, 10:12 a.m.