Introduction

The multiGSEA package was designed to run a robust GSEA-based pathway enrichment for multiple omics layers. The enrichment is calculated for each omics layer separately and aggregated p-values are calculated afterwards to derive a composite multi-omics pathway enrichment.

Pathway definitions can be downloaded from up to eight different pathway databases by means of the graphite Bioconductor package [@graphite:2018]. Feature mapping for transcripts and proteins is supported towards Entrez Gene IDs, Uniprot, Gene Symbol, RefSeq, and Ensembl IDs. The mapping is accomplished through the AnnotationDbi package [@AnnotationDbi:2019] and currently supported for 11 different model organisms including human, mouse, and rat. ID conversion of metabolite features to Comptox Dashboard IDs (DTXCID, DTXSID), CAS-numbers, Pubchem IDs (CID), HMDB, KEGG, ChEBI, Drugbank IDs, or common metabolite names is accomplished through the AnnotationHub package metabliteIDmapping. This package provides a comprehensive ID mapping for more than 1.1 million entries.

This vignette covers a simple example workflow illustrating how the multiGSEA package works. The omics data sets that will be used throughout the example were originally provided by Quiros et al. [@Quiros:2017]. In their publication the authors analyzed the mitochondrial response to four different toxicants, including Actinonin, Diclofenac, FCCB, and Mito-Block (MB), within the transcriptome, proteome, and metabolome layer. The transcriptome data can be downloaded from NCBI GEO, the proteome data from the ProteomeXchange Consortium, and the non-targeted metabolome raw data can be found in the online supplement.

Installation

There are two different ways to install the multiGSEA package.

First, the multiGSEA package is part of Bioconductor. Hence, the best way to install the package is via BiocManager. Start R (>=4.0.0) and run the following commands:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# The following initializes usage of Bioc devel
BiocManager::install(version = "devel")

BiocManager::install("multiGSEA")

Alternatively, the multiGSEA package is hosted on our github page https://github.com/yigbt/multiGSEA and can be installed via the devtools package:

install.packages("devtools")
devtools::install_github("https://github.com/yigbt/multiGSEA")

Once installed, just load the multiGSEA package with:

library(multiGSEA)

Example workflow

A common workflow which is exemplified in this vignette is typically separated in the following steps:

  1. Load required libraries, including the multiGSEA package, and omics data sets.
  2. Create data structure for enrichment analysis.
  3. Download and customize the pathway definitions.
  4. Run the pathway enrichment for each omics layer.
  5. Calculate the aggregated pathway enrichment.

Load libraries and omics data

At first, we need to load the necessary packages to run the multi-omics enrichment analysis. In our example data we have to deal with omics data that has been measured in human cell lines. We therefore need the org.Hs.eg.db package [@CarlsonHs:2019] for transcript and protein mapping. In case the omics data was measured in mouse or rat, we would need the packages org.Mm.eg.db [@CarlsonMm:2019] and org.Rn.eg.db [@CarlsonRn:2019], respectively.

library("org.Hs.eg.db")

In principle, multiGSEA is able to deal with 11 different model organisms. A summary of supported organisms, their naming format within multiGSEA and their respective AnnotationDbi package is shown in Table \@ref(tab:organismsTable).

caption <- "Supported organisms, their abbreviations being used in `multiGSEA`,
            and mapping database that are needed for feature conversion.
            Supported abbreviations can be seen with `getOrganisms()`"
df <- data.frame(
  Organisms = c(
    "Human", "Mouse", "Rat", "Dog", "Cow", "Pig",
    "Chicken", "Zebrafish", "Frog", "Fruit Fly",
    "C.elegans"
  ),
  Abbreviations = c(
    "hsapiens", "mmusculus", "rnorvegicus",
    "cfamiliaris", "btaurus", "sscrofa",
    "ggallus", "drerio", "xlaevis",
    "dmelanogaster", "celegans"
  ),
  Mapping = c(
    "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db",
    "org.Cf.eg.db", "org.Bt.eg.db", "org.Ss.eg.db",
    "org.Gg.eg.db", "org.Xl.eg.db", "org.Dr.eg.db",
    "org.Dm.eg.db", "org.Ce.eg.db"
  )
)

knitr::kable(df, caption = caption)

To run the analysis of this vignette, load the installed version of multiGSEA.

library(multiGSEA)
library(magrittr)

Load the omics data for each layer where an enrichment should be calculated. The example data is provided within the package and already preprocessed such that we have log2 transformed fold changes and their associated p-values.

# load transcriptomic features
data(transcriptome)

# load proteomic features
data(proteome)

# load metabolomic features
data(metabolome)

Cautionary note

This example involves preprocessed omics data from public repositories, which means that the data might look different when you download and pre-process it with your own workflow. Therefore, we put our processed data as an example data in the R package. We here sketch out the pipeline described in the multiGSEA paper. We will not focus on the pre-processing steps and how to derive the necessary input format for the multi-omics pathway enrichment in terms of differentially expression analysis, since this is highly dependent on your experiment and analysis workflow.

However, the required input format is quite simple and exactly the same for each input omics layer: A data frame with 3 mandatory columns, including feature IDs, the log2-transformed fold change (logFC), and the associated p-value.

The header of the data frame can be seen in Table \@ref(tab:omicsTable):

caption <- "Structure of the necessary omics data. For each layer
            (here: transcriptome), feature IDs, log2FCs, and p-values
            are needed."

knitr::kable(
  transcriptome %>%
    dplyr::arrange(Symbol) %>%
    dplyr::slice(1:6),
  caption = caption
)

Create data structure

multiGSEA works with nested lists where each sublist represents an omics layer. The function rankFeatures calculates the pre-ranked features, that are needed for the subsequent calculation of the enrichment score. rankFeatures calculates the a local statistic ls based on the direction of the fold change and the magnitude of its significance:

\begin{equation} ls = sign( log_2(FC)) * log_{10}( p-value) \end{equation}

Please note, that any other rank metric will work as well and the choice on which one to chose is up to the user. However, as it was shown by Zyla et al. [@Zyla:2017], the choice of the applied metric might have a big impact on the outcome of your analysis.

# create data structure
omics_data <- initOmicsDataStructure(layer = c(
  "transcriptome",
  "proteome",
  "metabolome"
))

## add transcriptome layer
omics_data$transcriptome <- rankFeatures(
  transcriptome$logFC,
  transcriptome$pValue
)
names(omics_data$transcriptome) <- transcriptome$Symbol

## add proteome layer
omics_data$proteome <- rankFeatures(proteome$logFC, proteome$pValue)
names(omics_data$proteome) <- proteome$Symbol

## add metabolome layer
## HMDB features have to be updated to the new HMDB format
omics_data$metabolome <- rankFeatures(metabolome$logFC, metabolome$pValue)
names(omics_data$metabolome) <- metabolome$HMDB
names(omics_data$metabolome) <- gsub(
  "HMDB", "HMDB00",
  names(omics_data$metabolome)
)

The first elements of each omics layer are shown below:

omics_short <- lapply(names(omics_data), function(name) {
  head(omics_data[[name]])
})
names(omics_short) <- names(omics_data)
omics_short

Download and customize pathway definitions

Now we have to select the databases we want to query and the omics layer we are interested in before pathway definitions are downloaded and features are mapped to the desired format.

databases <- c("kegg", "reactome")
layers <- names(omics_data)

pathways <- getMultiOmicsFeatures(
  dbs = databases, layer = layers,
  returnTranscriptome = "SYMBOL",
  returnProteome = "SYMBOL",
  returnMetabolome = "HMDB",
  useLocal = FALSE
)

The first two pathway definitions of each omics layer are shown below:

pathways_short <- lapply(names(pathways), function(name) {
  head(pathways[[name]], 2)
})
names(pathways_short) <- names(pathways)
pathways_short

Run the pathway enrichment

At this stage, we have the ranked features for each omics layer and the extracted and mapped features from external pathway databases. In the following step we can use the multiGSEA function to calculate the enrichment for all omics layer separately.

# use the multiGSEA function to calculate the enrichment scores
# for all omics layer at once.
enrichment_scores <- multiGSEA(pathways, omics_data)

The enrichment score data structure is a list containing sublists named transcriptome, proteome, and metabolome. Each sublist represents the complete pathway enrichment for this omics layer.

Calculate the aggregated p-values

Making use of the Stouffers Z-method to combine multiple p-values that have been derived from independent tests that are based on the same null hypothesis. The function extractPvalues creates a simple data frame where each row represents a pathway and columns represent omics related p-values and adjusted p-values. This data structure can then be used to calculate the aggregated p-value. The subsequent calculation of adjusted p-values can be achieved by the function p.adjust.

multiGSEA provided three different methods to aggregate p-values. These methods differ in their way how they weight either small or large p-values. By default, combinePvalues will apply the Z-method or Stouffer's method [@Stouffer:1949] which has no bias towards small or large p-values. The widely used Fisher's combined probability test [@Fisher:1932] can also be applied but is known for its bias towards small p-values. Edgington's method goes the opposite direction by favoring large p-values [@Edgington:1972]. Those methods can be applied by setting the parameter method to "fisher" or "edgington".

df <- extractPvalues(
  enrichmentScores = enrichment_scores,
  pathwayNames = names(pathways[[1]])
)

df$combined_pval <- combinePvalues(df)
df$combined_padj <- p.adjust(df$combined_pval, method = "BH")

df <- cbind(data.frame(pathway = names(pathways[[1]])), df)

Finally, print the pathways sorted based on their combined adjusted p-values. For displaying reasons, only the adjusted p-values are shown in Table \@ref(tab:resultTable).

caption <- "Table summarizing the top 15 pathways where we can calculate an
            enrichment for all three layers . Pathways from KEGG, Reactome,
            and BioCarta are listed based on their aggregated adjusted p-value.
            Corrected p-values are displayed for each omics layer separately and
            aggregated by means of the Stouffer's Z-method."

knitr::kable(
  df %>%
    dplyr::arrange(combined_padj) %>%
    dplyr::filter(!is.na(metabolome_padj)) %>%
    dplyr::select(c(pathway, transcriptome_padj, proteome_padj, metabolome_padj, combined_pval)) %>%
    dplyr::slice(1:15),
  caption = caption
)

Customizable gene sets

In principle, multiGSEA can also be run as single/multi omics analysis on custom gene sets.

The pathways object storing the pathway features across multiple omics layers is a nested list, and hence can be designed manually to fit ones purposes.

In the following example, HALLMARK gene sets are retrieved from MSigDB and used to create a transcriptomics input list:

library(msigdbr)
library(dplyr)

hallmark <- msigdbr(species = "Rattus norvegicus", category = "H") %>%
  dplyr::select(gs_name, ensembl_gene) %>%
  dplyr::filter(!is.na(ensembl_gene)) %>%
  unstack(ensembl_gene ~ gs_name)

pathways <- list("transcriptome" = hallmark)

Please note, feature sets across multiple omics layers have to be in the same order and their names have to be identical, see the example presented above.

Session Information

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()

References



yigbt/multiGSEA documentation built on April 28, 2023, 10:02 p.m.