In biodataganache/leapR: Layered enrichment analysis of pathways R

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = FALSE
)

This is intended to be a short introduction to the leapr package. First we need to load the required libraries:

#load the library
library(leapR)
library(gplots)

Definitions

Dataset - an expression dataset, contained in the Bioconductor \code{ExpressionSet} object, that at the bare minimum has a matrix of components (rows) measured in the same system under multiple different conditions (columns)

Component - the things being measured, genes, proteins, methylation site, phosphosite, etc. For functional (currently) the component must be associated with a gene name. That is, there's not currently a way to calculate pathway enrichment using lipids.

Pathway - a set of components that works together to accomplish something or are related to each other in some other way. This includes classic signaling and metabolic pathways, but also molecular function and localization categories and other groups of related components, like genome location, conservation, etc.

Condition - a sample where the treatment, environmental conditions, patient, time point or some combination of those is varied.

The overall idea for functional enrichment is to determine which pathways are statistically over-represented in one group versus another, display statistically differential abundance from one group to another, or are statistically differentially distributed in a ranked list based on the abundance of one sample. Each of these purposes has a different underlying statistical test (or family of tests) and the results of each can be interpreted in somewhat different ways. The purpose of this vignette is to give the user a very brief introduction on how to use the package, not to discuss the underlying statistical choices that need to be made when analyzing such data.

There are a number of caveats (probably non-exhaustive) with doing this kind of analysis.

Important points for consideration

Data normalization

One important point is to use data that's been normalized in a particular way to do these analyses. Data here has been normalized as a Z score by row (gene/protein/etc.). So, for each row, calculate the mean and standard deviation across all the conditions (columns) and then express as a Z score.

Here's why. All high-throughput technologies (microarray, RNAseq, MS-assisted proteomics, metabolomics, lipidomics, etc.) suffer from the same limitation. The detectability of each molecule being detected (protein, RNA, etc.) is different and, in general, it's impossible to accurately determine how detectable each one is. The multi-omic functional enrichment process lumps together measurements from different components (proteins, genes, etc.) to summarize a pathway. If the component measurements aren't directly comparable (they aren't) then this can and will introduce significant systematic errors and won't produce the results you're looking for. Careful consideration must be given that the results of the analysis reflect the question being asked and that the normalization method hasn't obscured the desired results.

Background

The background of comparison for functional enrichment is always important, but really only impacts the Fisher's exact tests in my applications below. The background is the answer to "my functional group of interest is statistically enriched relative to what?" For Fisher's exact tests this is really important. Generally, it is best to compare enrichment versus those components observed in the data - not against the universe of all possible components. For example, a proteomics dataset from plasma may have a limited number of proteins observed relative to all possible human proteins. Functional enrichment using all possible proteins will yeild very different results than if it's performed using only those proteins observed in plasma as a background.

Multiple hypothesis correction

When testing the statistical significance of differences in a lot of pathways it's necessary to correct for multiple hypotheses. This essentially accounts for the possibility you might see SOMETHING significant by chance if you just test enough things- so it moves p values in a less significant direction. The more things you test, the greater this move will be. So pathway databases with lots of pathways are affected more by this correction, making it harder to get a significant result (which is a good thing actually).

Pathway databases

Two 'databases' (organized text files) are included for pathways. The first is taken from the NCI's Pathway Interaction Database (PID) and covers signaling pathways in human - but is no longer being actively maintained. The second is taken from the MSIGDB (the same database used by GSEA) that has lots of different pathways gathered from multiple different sources- some are more useful than others. The MSIGDB is good for giving more options- but runs into multiple hypothesis correction issues (see above). They can be loaded as follows:

``` {r pathdb} data(msigdb) data(ncipid)

### Identifiers
The identifiers (gene names, e.g.) for the data input MUST match the identifiers used in the pathway database. The two included human databases use the HGNC-approved gene names. Which means your data has to use the same identifiers.

###  Example data
A sample data set is included that is from the CPTAC study of 169 ovarian tumors. We include each dataset as an \code{ExpressionSet} object to enable interoperability with other tools, and store example files as `rda` on [Figshare](https://figshare.com/account/projects/251807/articles/29250845) as example.

This data can be loaded as follows:
``` {r omicsdata, message=FALSE, warning=FALSE}

pdata <- download.file('https://figshare.com/ndownloader/files/55781147',method='libcurl',destfile='protData.rda')
#protdata<-read.csv("protData.rda",check.names=FALSE,row.names=1)|>
#  as.matrix()
load('protData.rda')

p <- file.remove("protData.rda")


#transdata<-read.csv(paste0(datadir,'transdata.csv.gz'),check.names=FALSE,row.names=1)
tdata <- download.file("https://figshare.com/ndownloader/files/55781153",method='libcurl',destfile='transData.rda')
load('transData.rda')
p <- file.remove("transData.rda")

phdata<-download.file('https://figshare.com/ndownloader/files/55781120',method='libcurl',destfile = 'phosData.rda')
#phosphodata<-read.csv("phdata",check.names=FALSE,row.names=1)
load('phosData.rda')
p <- file.remove('phosData.rda')

We also include some groups of patients to compare stored as R data objects: ``` {r ptlists} data(shortlist) data(longlist)

columns that we want to use for results

cols_to_display = c('ingroup_n','outgroup_n','background_n','pvalue','BH_pvalue','SignedBH_pvalue')

The data are now loaded and ready to go through some of the examples.

# Examples

## Example 1: Comparison of one condition/group versus another condition/group.

There are a number of ways to do this. I generally use a simple approach which
assesses the statistical difference in distributions between the abundance values
from all the members of a pathway in all the group members from one group with those
from the other group using a t test.

### Caveat
This is a 'bag of values' approach and it does not pay attention to the
relationships between values in different groups (i.e. that each group has measurements
for the same component). There are likely issues that arise because of this and
caveats associated with it. However, it works fairly well.

### Description
In this example we are assessing the enrichment of pathways in
a group of short surviving patients versus in a group of long surviving patients.
We can also do a single patient-to-patient comparison or compare a single patient
to a group of patients.

### Interpretation
Better corrected p-values are more enriched. However, you can get good
p-values when the algorithm only considers a limited number of components from a pathway. That is, the pathway may have 30 members and the p-value is coming from values from just 3 members. You can look at the `ingroup_n` column from the result matrix to see this (and screen out if desired).

It is VERY important to also consider the effect size. That is, the difference between the mean of one group and the mean of the other group. If there are large numbers of components in the pathway being compared it is relatively easy to get a significant p value with small effect size. Though this may be a real difference it is often not as interesting as a smaller group with worse p value and greater effect size. You can look at the effect size by comparing the `ingroup_mean` and `outgroup_mean` columns.

```r

# in this example we lump a bunch of patients together (the 'short survivors') and compare
# them to another group (the 'long survivors')

###using enrichment_wrapper function
protdata.enrichment.svl = leapR::leapR(geneset = ncipid,
enrichment_method = 'enrichment_comparison',
eset = pset,
primary_columns = shortlist,
secondary_columns = longlist)

rmarkdown::paged_table(protdata.enrichment.svl[order(protdata.enrichment.svl[,"pvalue"]),cols_to_display])
# another application is to compare just one patient against another (this would be the
# equivalent of comparing one time point to another)

###using enrichment_wrapper function
protdata.enrichment.svl.ovo = leapR::leapR(geneset = ncipid,
enrichment_method = 'enrichment_comparison',
eset = pset,
primary_columns = shortlist[1],
secondary_columns = longlist[1])
rmarkdown::paged_table(protdata.enrichment.svl.ovo[order(protdata.enrichment.svl.ovo[,"pvalue"]),cols_to_display])

When we only compare one sample to another, we get no enriched pathways.

Example 2: Fisher's exact test

For this test I use Fisher's exact which is a simple comparison of the overlap of two sets (think of it like a statistical Venn diagram with two groups). It's also referred to as a hypergeometric test.

Caveats

Caveat 1. Fisher's exact does not consider abundance values but only lists of components. Generally this requires some separation of a group of interest using differential expression, module membership (from a network for example), or some other method.

Caveat 2. The choice of background for comparison can make a big difference on outcome. For example, in a proteomics experiment where you're looking at enrichment in a group of highly differentially expressed proteins, you could choose to use all possible proteins as a background, or you could use just those proteins that were observed by proteomics (generally a much more limited set). The second option is generally the best since the first options will result in (partly to mostly) functions that are enriched in proteins that are seen in proteomics. That is, the most abundant proteins, which is generally not the desired outcome.

Description

In the example below I construct a genelist of interest using a simple abundance threshold on the data then use a background of all the genes in the example dataset (which is a limited number). I then do a simple hierarchical clustering on the data, extract modules, and step through each module to calculate enrichment for them, outputting the results into a separate text file.

Interpretation

As with the t test comparison above it is important to look at the number of pathway members included in the comparison (look at the in_path column). There is no 'effect size' problem with Fisher's exact since it's just a set comparison, but it's important to note that significant p values can arise from a pathway being underrepresented in the genelist, which often times is not the desired result. The foldx column gives a ratio of in versus not in the genelist, values > 1 being enriched and <1 being depleted.

``` {r fishers, echo=TRUE, warning=FALSE, message=FALSE}

for this example we will construct a list of genes from the expression data

to emulate what you might be inputting

genelist = rownames(pset)[which(Biobase::exprs(pset)[,1]>0.5)]

background = rownames(protdata)

using enrichment_wrapper function

protdata.enrichment.order = leapR::leapR(geneset=ncipid, enrichment_method="enrichment_in_order", eset = pset, #threshold=.5, primary_columns="TCGA-13-1484")

rmarkdown::paged_table(protdata.enrichment.order[order(protdata.enrichment.order[,"pvalue"]),cols_to_display])

in this example we construct some modules from the hierarchical clustering of the

data

protdata_naf = Biobase::exprs(pset)

hierarchical clustering is not too happy with lots of missing values

so we'll do a zero fill on this to get the modules

protdata_naf[which(is.na(protdata_naf))] = 0

construct the hierarchical clustering using the 'wardD' method, which

seems to give more even sized modules

protdata_hc = hclust(dist(protdata_naf), method="ward.D2")

arbitrarily we'll chop the clusters into 5 modules

modules = cutree(protdata_hc, k=5)

sara: created list

clusters = lapply(unique(modules),function(x) names(which(modules==x)))

modules is a named list of values where each value is a module

number and the name is the gene name

To do enrichment for one module (module 1 in this case) do this

protdata.enrichment.sets.module_1 = leapR::leapR(geneset=ncipid, enrichment_method="enrichment_in_sets", background=pset, targets=names(modules[which(modules==1)]))

To do enrichment on all modules and return the list of enrichment results tables do this

protdata.enrichment.sets.modules <-do.call(rbind,leapR::cluster_enrichment(geneset=ncipid, clusters=clusters, background=modules, sigfilter=0.25))

nothing is enriched

rmarkdown::paged_table(protdata.enrichment.sets.modules[,cols_to_display])

## Example 3. The Kolmogorov–Smirnov test (KS)
Similar to the popular GSEA, KS tests whether a group of components (the pathway) is distributed in a statistically significant manner in a ranked list of components. That is, if all the members of the pathway are clustered together at the top of the list (highly abundant, e.g.) or at the bottom of the list (low abundance, e.g.) this will return good p values. I should note that GSEA uses a more sophisticated approach than this and their application has a lot of bells and whistles.

### Description
In the example below I'm simply calculating enrichment for one of the patients in the list (arbitrarily selected). The ranking value is relative protein abundance in this case, but can be any continuous measure or derived value. For example, you could calculate the topology of all proteins in a network and use the topology measure (degree) as the measure. 

### Interpretation
Similar to the other examples be cautious of pathways with good p values that consider a small number of pathway numbers (in_path column). The MeanPath column gives a measure that shows how far above or below the median the mean rank of the pathway is (normalized to -1,1). The Zscore column is a Zscore calculated on the basis of the mean percentage rank of the pathway relative to the mean of the entire list divided by the standard deviation of the pathway rank. The foldx column expresses the mean percentage rank of the pathway relative to the entire list - closer to 0 is higher in the list and closer to 1 is closer to the bottom of the list. Each of these should give consistent results, but will be somewhat different.

``` {r ks, echo=TRUE, warning = FALSE, message=FALSE}
# This is how you calculate enrichment in a ranked list (for example from topology)
###using enrichment_wrapper function
protdata.enrichment.sets = leapR::leapR(geneset=ncipid, "enrichment_in_order", 
                                            eset = pset, 
                                            primary_columns=shortlist[1])


#rmarkdown::paged_table(protdata.enrichment.sets[order(protdata.enrichment.sets[,"pvalue"]),][1:10,])
rmarkdown::paged_table(protdata.enrichment.sets[order(protdata.enrichment.sets[,"pvalue"]),cols_to_display])

In this section I'll go through the ideas behind and use of several different kinds of enrichment that I've been working on. Consider this section to be under development and so use at your risk.

Example 4. Enrichment in correlation

The idea here is to use the correlation of pathway members to each other versus to non-pathway members as a way to assess functional enrichment. This idea seems sound- pathways that are varying in a correlated way across a bunch of conditions (say time points or patients) may be more active and more important than others. However, more testing and validation is needed to show that this is the case.

Interpretation

The ingroup_mean gives the mean correlation of the pathway members to each other and outgroup_mean gives the correlation of the pathway members to non-pathway members. Background_mean gives the mean correlation of all non-pathway members. The pvalue and BH_pvalue are for the pathway members to each other versus those pathway members to non-pathway components. The pvalue_background and BH_pvalue_background are for the pathway member correlation relative to non-pathway member correlation (which is similar but slightly different than the other p-values).

``` {r correlation, echo=TRUE, warning=FALSE, message=FALSE}

using enrichment_wrapper function

protdata.enrichment.correlation = leapR::leapR(geneset=ncipid, enrichment_method = "correlation_enrichment", eset = pset)

rmarkdown::paged_table(head(protdata.enrichment.correlation[order(protdata.enrichment.correlation[,"pvalue"]),cols_to_display]))

protdata.enrichment.correlation.short = leapR::leapR(geneset=ncipid, enrichment_method = "correlation_enrichment", eset = pset[,shortlist]) rmarkdown::paged_table(head(protdata.enrichment.correlation.short[order(protdata.enrichment.correlation.short[,"pvalue"]),cols_to_display]))

protdata.enrichment.correlation.long = leapR::leapR(geneset=ncipid, enrichment_method = "correlation_enrichment", eset = pset[,longlist]) rmarkdown::paged_table(head(protdata.enrichment.correlation.long[order(protdata.enrichment.correlation.long[,"pvalue"]),cols_to_display]))

## Example 5. Phosphoproteomics data analysis

In this example we will use phosphoproteomics data to assess the enrichment in known kinase substrates (a proxy for kinase activity)

``` {r phosphoproteomics, echo=TRUE, warning=FALSE, message=FALSE, include=FALSE}

# load phosphoproteomics data for 69 tumors

#prots <- data.frame(psite=rownames(phosphodata),hgnc_id=phosphodata[,1])
#rownames(prots)<-prots$psite

#phset <- Biobase::ExpressionSet(as.matrix(phosphodata[,2:ncol(phosphodata)]),
#                                featureData=Biobase::AnnotatedDataFrame(prots), 
#                                annotation='Phosphoproteomics')
#data("phosphodata")
data("kinasesubstrates")

# for an individual tumor calculate the Kinase-Substrate Enrichment (similar to KSEA)
#     This uses the site-specific phosphorylation data to determine which kinases
#     might be active by assessing the enrichment of the phosphorylation of their known substrates

phosphodata.ksea.order = leapR::leapR(geneset=kinasesubstrates,
                               enrichment_method="enrichment_in_order",
                               eset = phset,
                               primary_columns="TCGA-13-1484")

rmarkdown::paged_table(phosphodata.ksea.order[order(phosphodata.ksea.order[,"pvalue"]),cols_to_display])


# now do the same thing but use a threshold
phosphodata.sets.order = leapR::leapR(geneset=kinasesubstrates,
                               enrichment_method="enrichment_in_sets",
                               eset = phset, threshold=0.5,
                               primary_columns="TCGA-13-1484")

rmarkdown::paged_table(phosphodata.sets.order[order(phosphodata.sets.order[,"pvalue"]),cols_to_display])

Figure 2

We will compare the ability of transcriptomics, proteomics, and phosphoproteomics to inform about differences between short and long surviving patient groups.

This spans the multiple enrichment methods in leapR and also includes multi-omics

The resulting heatmap is presented as Figure 2 in the paper.

``` {r figure_2, echo=TRUE, warning=FALSE, message=FALSE, error=FALSE}

load the single omic and multi-omic pathway databases

data("krbpaths") data("mo_krbpaths")

comparison enrichment in transcriptional data

transdata.comp.enrichment.svl = leapR::leapR(geneset=krbpaths, enrichment_method='enrichment_comparison', eset = tset, primary_columns=shortlist, secondary_columns=longlist)

comparison enrichment in proteomics data

this is the same code used above, just repeated here for clarity

protdata.comp.enrichment.svl = leapR::leapR(geneset=krbpaths, enrichment_method='enrichment_comparison', eset = pset, primary_columns=shortlist, secondary_columns=longlist)

comparison enrichment in phosphoproteomics data

phosphodata.comp.enrichment.svl = leapR::leapR(geneset=krbpaths, enrichment_method='enrichment_comparison', eset = phset, primary_columns=shortlist, secondary_columns=longlist, id_column='hgnc_id')

set enrichment in transcriptomics data

perform the comparison t tes

tset <- leapR::calcTTest(tset, shortlist, longlist)

we can then call enrichment on that object, now referencing the column name in the featureData

transdata.svl.ttest = t(sapply(rownames(transdata), function (r) {

if(length(which(!is.na(transdata[r,shortlist])))>2 && length(which(!is.na(transdata[r,longlist])))>1){

res <- t.test(transdata[r,shortlist], transdata[r,longlist])

# error=return(c(NA,NA)))

# if (is(res,'try-error')) return(c(NA, NA));

return(c(pvalue=res$p.value, difference =res$estimate[[1]]-res$estimate[[2]]))

}

else{

return(c(pvalue=NA,difference=NA))

}

}))

now we can get those tarets that are significant and feed them into the set enrichment

targets <- as.data.frame(transdata.svl.ttest)|>

subset(pvalue<0.05)|>

rownames()

now we need to run enrichment in sets with target list, not eset

transdata.set.enrichment.svl = leapR::leapR(geneset=krbpaths, eset=tset, enrichment_method='enrichment_in_sets', primary_columns="pvalue", greaterthan=FALSE, threshold=0.05)

perform the comparison t test

protdata.svl.ttest = t(sapply(rownames(protdata), function (r) {

if(length(which(!is.na(protdata[r,shortlist])))>2 && length(which(!is.na(protdata[r,longlist])))>1){

res <- t.test(protdata[r,shortlist], protdata[r,longlist])

# error=return(c(NA,NA)))

# if (is(res,'try-error')) return(c(NA, NA));

return(c(pvalue=res$p.value, difference =res$estimate[[1]]-res$estimate[[2]]))

}

else{

return(c(pvalue=NA,difference=NA))

}

}))

prot.targets<-subset(protdata.svl.ttest,!is.na('pvalue'))|>

subset(pvalue<0.05)|>

rownames()4

pset <- leapR::calcTTest(pset, shortlist, longlist)

protdata.set.enrichment.svl = leapR::leapR(geneset = krbpaths, eset=pset, enrichment_method = 'enrichment_in_sets', primary_columns = "pvalue", greaterthan=FALSE, threshold=0.05)

set enrichment in phosphoproteomics data

perform the comparison t test

phosphodata.svl.ttest = t(sapply(rownames(phosphodata), function (r) {

res=try(t.test(phosphodata[r,shortlist], phosphodata[r,longlist]),silent=TRUE);

if (is(res,'try-error')) return(c(NA, NA));

return(c(res$p.value, res$estimate[[1]]-res$estimate[[2]]))}))

phosphodata.svl.ttest = data.frame(protein=phosphodata[,1], pvalue=phosphodata.svl.ttest[,1],

difference=phosphodata.svl.ttest[,2])

colnames(phosphodata.svl.ttest) = c('protein', 'p-value','difference')

phset <- leapR::calcTTest(phset, shortlist, longlist)

phosphodata.set.enrichment.svl = leapR::leapR(geneset = krbpaths, enrichment_method = 'enrichment_in_sets', id_column="hgnc_id", eset = phset, primary_columns = "pvalue", greaterthan = FALSE, threshold = 0.05)

order enrichment in transcriptomics data

transdata.order.enrichment.svl = leapR::leapR(geneset=krbpaths, enrichment_method='enrichment_in_order', eset = tset, primary_columns="difference")

order enrichment in proteomics data

protdata.order.enrichment.svl = leapR::leapR(geneset = krbpaths, enrichment_method = 'enrichment_in_order', eset = pset, primary_columns = "difference")

order enrichment in phosphoproteomics data

phosphodata.order.enrichment.svl = leapR::leapR(geneset=krbpaths, enrichment_method='enrichment_in_order', id_column='hgnc_id', eset = phset, primary_columns="difference")

correlation difference in transcriptomics data

transdata.corr.enrichment.svl = leapR::leapR(geneset=krbpaths, enrichment_method="correlation_comparison", eset = tset, primary_columns=shortlist, secondary_columns=longlist)

correlation difference in proteomics data

protdata.corr.enrichment.svl = leapR::leapR(geneset=krbpaths, enrichment_method="correlation_comparison", eset = pset, primary_columns=shortlist, secondary_columns=longlist)

correlation difference in phosphoproteomics data

phosphodata.corr.enrichment.svl = leapR::leapR(geneset=krbpaths, enrichment_method="correlation_comparison", eset = phset, primary_columns=shortlist, secondary_columns=longlist, id_column='hgnc_id')

combine the omics data into one with prefix tags

comboset = leapR::combine_omics(list(pset, phset, tset), c(NA,'hgnc_id',NA))

comparison enrichment for combodata

when we use expression set, we do not need to use the mo_krbpaths since the id mapping column is used

combodata.enrichment.svl = leapR::leapR(geneset=krbpaths,#mo_krbpaths, enrichment_method='enrichment_comparison', eset = comboset, primary_columns=shortlist, secondary_columns=longlist, id_column='id')

set enrichment in combo data

perform the comparison t test

comboset <- leapR::calcTTest(comboset, shortlist, longlist)

combodata.svl.ttest = t(sapply(rownames(combodata), function (r) {

res=try(t.test(combodata[r,shortlist], combodata[r,longlist]),silent=TRUE);

if (is(res,'try-error')) return(c(NA, NA));

return(c(res$p.value, res$estimate[[1]]-res$estimate[[2]]))}))

combodata.svl.ttest = data.frame(protein=combodata[,1], pvalue=combodata.svl.ttest[,1],

difference=combodata.svl.ttest[,2])

colnames(combodata.svl.ttest) = c('protein', 'p-value','difference')

combodata.set.enrichment.svl = leapR::leapR(geneset=krbpaths, enrichment_method='enrichment_in_sets', eset = comboset, primary_columns="pvalue", id_column = 'id', greaterthan=FALSE, threshold=0.05)

order enrichment in combo data

combodata.order.enrichment.svl = leapR::leapR(geneset=krbpaths, enrichment_method='enrichment_in_order', eset = comboset, primary_columns="difference", id_column = 'id')

correlation difference in combo data

combodata.corr.enrichment.svl = leapR::leapR(geneset=krbpaths, enrichment_method="correlation_comparison", eset = comboset, primary_columns=shortlist, id_column = 'id', secondary_columns=longlist)

now take all these results and combine them into one figure

all_results = list(transdata.comp.enrichment.svl, protdata.comp.enrichment.svl, phosphodata.comp.enrichment.svl, combodata.enrichment.svl,

               transdata.set.enrichment.svl,
               protdata.set.enrichment.svl,
               phosphodata.set.enrichment.svl,
               combodata.set.enrichment.svl,

               transdata.order.enrichment.svl,
               protdata.order.enrichment.svl,
               phosphodata.order.enrichment.svl,
               combodata.order.enrichment.svl,

               transdata.corr.enrichment.svl,
               protdata.corr.enrichment.svl,
               phosphodata.corr.enrichment.svl,
               combodata.corr.enrichment.svl)

pathways_of_interest = c("KEGG_APOPTOSIS", "KEGG_CELL_CYCLE", "KEGG_ERBB_SIGNALING_PATHWAY", "KEGG_FOCAL_ADHESION", "KEGG_INSULIN_SIGNALING_PATHWAY", "KEGG_MAPK_SIGNALING_PATHWAY", "KEGG_MISMATCH_REPAIR", "KEGG_MTOR_SIGNALING_PATHWAY" , "KEGG_OXIDATIVE_PHOSPHORYLATION", "KEGG_P53_SIGNALING_PATHWAY", "KEGG_PATHWAYS_IN_CANCER", "KEGG_PROTEASOME", "KEGG_RIBOSOME", "KEGG_VEGF_SIGNALING_PATHWAY", "KEGG_WNT_SIGNALING_PATHWAY")

results.frame = data.frame(pathway=pathways_of_interest, td.comp=all_results[[1]][pathways_of_interest,"BH_pvalue"]<0.05, pd.comp=all_results[[2]][pathways_of_interest,"BH_pvalue"]<0.05, fd.comp=all_results[[3]][pathways_of_interest,"BH_pvalue"]<0.05, cd.comp=all_results[[4]][pathways_of_interest,"BH_pvalue"]<0.05, td.set=all_results[[5]][pathways_of_interest,"BH_pvalue"]<0.05, pd.set=all_results[[6]][pathways_of_interest,"BH_pvalue"]<0.05, fd.set=all_results[[7]][pathways_of_interest,"BH_pvalue"]<0.05, cd.set=all_results[[8]][pathways_of_interest,"BH_pvalue"]<0.05, td.order=all_results[[9]][pathways_of_interest,"BH_pvalue"]<0.05, pd.order=all_results[[10]][pathways_of_interest,"BH_pvalue"]<0.05, fd.order=all_results[[11]][pathways_of_interest,"BH_pvalue"]<0.05, cd.order=all_results[[12]][pathways_of_interest,"BH_pvalue"]<0.05, td.corr=all_results[[13]][pathways_of_interest,"BH_pvalue"]<0.05, pd.corr=all_results[[14]][pathways_of_interest,"BH_pvalue"]<0.05, fd.corr=all_results[[15]][pathways_of_interest,"BH_pvalue"]<0.05, cd.corr=all_results[[16]][pathways_of_interest,"BH_pvalue"]<0.05)

results.frame.or = data.frame(pathway=pathways_of_interest, td.comp=all_results[[1]][pathways_of_interest,"oddsratio"], pd.comp=all_results[[2]][pathways_of_interest,"oddsratio"], fd.comp=all_results[[3]][pathways_of_interest,"oddsratio"], cd.comp=all_results[[4]][pathways_of_interest,"oddsratio"], td.set=log(all_results[[5]][pathways_of_interest,"oddsratio"],2), pd.set=log(all_results[[6]][pathways_of_interest,"oddsratio"],2), fd.set=log(all_results[[7]][pathways_of_interest,"oddsratio"],2), cd.set=log(all_results[[8]][pathways_of_interest,"oddsratio"],2), td.order=all_results[[9]][pathways_of_interest,"oddsratio"], pd.order=all_results[[10]][pathways_of_interest,"oddsratio"], fd.order=all_results[[11]][pathways_of_interest,"oddsratio"], cd.order=all_results[[12]][pathways_of_interest,"oddsratio"], td.corr=all_results[[13]][pathways_of_interest,"oddsratio"], pd.corr=all_results[[14]][pathways_of_interest,"oddsratio"], fd.corr=all_results[[15]][pathways_of_interest,"oddsratio"], cd.corr=all_results[[16]][pathways_of_interest,"oddsratio"])

rownames(results.frame) = results.frame[,1] rownames(results.frame.or) = results.frame.or[,1] results.frame.sig = results.frame[,2:17]*results.frame.or[,2:17]

heatmap.2(as.matrix(results.frame.sig[,c(1:4,9:16)]), Colv=NA, trace="none", breaks=c(-1,-.1,-0.0001,0,0.1,1), col=c("blue","lightblue", "grey","pink", "red"), dendrogram="none")

Figure 3. An application of the KSEA-like approach in leapR as applied to our
          example data. In this example we are looking for known substrate sets
          of kinases (from Phosphosite Plus) that are enriched in the short vs
          long comparison of phosphopeptides.
``` {r figure_3, warning=FALSE, message=FALSE}
# this comparison of abundance in substrates between case and control
#     is lopsided in the sense that phosphorylation levels were previously
#     reported to be overall higher in the short survivors. Thus the
#     results are not terribly interesting (all kinases are in the same direction)
phosphodata.ksea.comp.svl = leapR::leapR(geneset=kinasesubstrates,
                              enrichment_method="enrichment_comparison",
                              eset = phset,
                              primary_columns=shortlist, secondary_columns=longlist)


# thus for the example we'll look at correlation between known substrates in the
#      case v control conditions
phosphodata.ksea.corr.svl = leapR::leapR(geneset=kinasesubstrates,
                                  enrichment_method="correlation_comparison",
                                  eset = phset, 
                                  primary_columns=shortlist,
                                  secondary_columns=longlist)

# for the example we are using an UNCORRECTED PVALUE
#     which will allow us to plot more values, but 
#     for real applications it's necessary to use the
#     CORRECTED PVALUE

# here are all the kinases that are *significant (*uncorrected) from the analysis
ksea_result = phosphodata.ksea.corr.svl[order(phosphodata.ksea.corr.svl[,"pvalue"]),][1:9,]
ksea_cols = rep("grey", 9)
ksea_cols[which(ksea_result[,"oddsratio"]>0)] = "black"

# plot left panel: correlation significance of top most significant kinases
barplot(ksea_result[,"oddsratio"], horiz=TRUE, xlim=c(-1,0.5),
         names.arg=rownames(ksea_result), las=1, col=ksea_cols)

# plot right panel: abundance comparison results of the same kinases
barplot(phosphodata.ksea.comp.svl[rownames(ksea_result),"oddsratio"],
         horiz=TRUE, names.arg=rownames(ksea_result), las=1, col="black")

Lastly we print out the session info!

sessionInfo()

biodataganache/leapR documentation built on July 16, 2025, 11:10 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

biodataganache/leapR Layered enrichment analysis of pathways R

In biodataganache/leapR: Layered enrichment analysis of pathways R

Definitions

Important points for consideration

Data normalization

Background

Multiple hypothesis correction

Pathway databases

columns that we want to use for results

Example 2: Fisher's exact test

Caveats

Description

Interpretation

for this example we will construct a list of genes from the expression data

to emulate what you might be inputting

background = rownames(protdata)

using enrichment_wrapper function

in this example we construct some modules from the hierarchical clustering of the

data

hierarchical clustering is not too happy with lots of missing values

so we'll do a zero fill on this to get the modules

construct the hierarchical clustering using the 'wardD' method, which

seems to give more even sized modules

arbitrarily we'll chop the clusters into 5 modules

sara: created list

modules is a named list of values where each value is a module

number and the name is the gene name

To do enrichment for one module (module 1 in this case) do this

To do enrichment on all modules and return the list of enrichment results tables do this

nothing is enriched

Example 4. Enrichment in correlation

Interpretation

using enrichment_wrapper function

Figure 2

load the single omic and multi-omic pathway databases

comparison enrichment in transcriptional data

comparison enrichment in proteomics data

this is the same code used above, just repeated here for clarity

comparison enrichment in phosphoproteomics data

set enrichment in transcriptomics data

perform the comparison t tes

we can then call enrichment on that object, now referencing the column name in the featureData

transdata.svl.ttest = t(sapply(rownames(transdata), function (r) {

if(length(which(!is.na(transdata[r,shortlist])))>2 && length(which(!is.na(transdata[r,longlist])))>1){

res <- t.test(transdata[r,shortlist], transdata[r,longlist])

# error=return(c(NA,NA)))

# if (is(res,'try-error')) return(c(NA, NA));

return(c(pvalue=res$p.value, difference =res$estimate[[1]]-res$estimate[[2]]))

}

else{

return(c(pvalue=NA,difference=NA))

}

}))

now we can get those tarets that are significant and feed them into the set enrichment

targets <- as.data.frame(transdata.svl.ttest)|>

subset(pvalue<0.05)|>

rownames()

now we need to run enrichment in sets with target list, not eset

perform the comparison t test

protdata.svl.ttest = t(sapply(rownames(protdata), function (r) {

if(length(which(!is.na(protdata[r,shortlist])))>2 && length(which(!is.na(protdata[r,longlist])))>1){

res <- t.test(protdata[r,shortlist], protdata[r,longlist])

# error=return(c(NA,NA)))

# if (is(res,'try-error')) return(c(NA, NA));

return(c(pvalue=res$p.value, difference =res$estimate[[1]]-res$estimate[[2]]))

}

else{

return(c(pvalue=NA,difference=NA))

}

}))

prot.targets<-subset(protdata.svl.ttest,!is.na('pvalue'))|>

subset(pvalue<0.05)|>

rownames()4

set enrichment in phosphoproteomics data

perform the comparison t test

phosphodata.svl.ttest = t(sapply(rownames(phosphodata), function (r) {

res=try(t.test(phosphodata[r,shortlist], phosphodata[r,longlist]),silent=TRUE);

if (is(res,'try-error')) return(c(NA, NA));

return(c(res$p.value, res$estimate[[1]]-res$estimate[[2]]))}))

phosphodata.svl.ttest = data.frame(protein=phosphodata[,1], pvalue=phosphodata.svl.ttest[,1],

biodataganache/leapR
Layered enrichment analysis of pathways R