library(tidyverse)
library(prora)
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  error = FALSE,
  fig.width = 4,
  fig.height = 3,
  fig.align = "center",
  cache = FALSE
)
options(width = 80)

Abstract

A plethora of R packages exist on CRAN and Bioconductor to perform over-representation analysis (ORA) and gene set enrichment analysis (GSEA). However, consistency in the underlying nomenclature for specific analyses and user friendly implementation is still lacking. prora aims at unifying ID mapping and enrichment analysis in a syntactically coherent and intuitive way, while ensuring reproducibility of results. prora primarily consists of wrapper functions around the sigora and WebGestaltR packages from CRAN and rmarkdown based reports for visualisation and contextualisation of analysis results.

Introduction

Methods to identify differentially regulated pathways among a vast background of quantified genes or proteins differ in complexity of the underlying algorithms and computations, ranging from the application of Fisher's exact test in ORA and extensions to gene pairs [@Foroushani2013] to more complex statistical procdures in gene set enrichment analysis [@Subramanian15545] and network based procedures [@glaab2012enrichnet]. Almost all methods come with their own implementation as an R package, i.e. ORA in both sigora and WebGestaltR, network based gene enrichment analysis in enrichnet and the list keeps going on. Naturally, different implementations come with different -- and often inconsistent -- features, such as ID mapping or demand a lot of knowledge from the user in setting up appropriate data structures and other preconditions. prora aims at providing a consistent and user friendly interface, which will be introduced in the following section. Elaborating on the methodological and statistical aspects of the underlying methods is beyond the scope of this vignette and the reader is referred to the publications in the reference section.

Installation and loading

After installing the packages from Bioconductor as shown below, the package can be loaded using library(prora).

BiocManager::install("prora")
library(prora)

Example workflow

We shall analyse proteomic data with the following format:

data("exampleContrastData")
#glimpse(exampleContrastData)

The first column protein_Id contains FASTA formatted human protein accessions and the second column fold change estimates. e.g. as obtained by a limma::toptable() call. The first step is now to translate the FASTA headers into Uniprot accessions for subsequent ID mapping.

dd <- prora::get_UniprotID_from_fasta_header(exampleContrastData)
#glimpse(dd)

We see that we have lost r nrow(exampleContrastData) - nrow(dd) accessions which did not correspond to a FASTA format. Now we apply ORA, sigORA and GSEA to the obtained ranked protein list. To apply sigORA and ORA one must first compute the background GPS repository using makeGPS_wrappR(). The advantage is that backgrounds can be specified separately for each experiment or combined for several batches of experiments. However, computing the accession pair signatures and calculating their respective weights is computationally intensive and potentially time consuming.

data("idmap", package = "sigora")
myGPSrepo <- makeGPS_wrappR(ids = dd$UniprotID, target = "GO")


res <- sigoraWrappR(data =  dd,
                    threshold = 0.2,
                    score_col = "estimate",
                    GPSrepos = myGPSrepo$gps,
                    greater_than = TRUE)

The result of sigoraWrappR() is a list containing the sigORA and ORA results along with user specified inputs that will be used in the reports described in section \@ref(reports).

names(res)

Under the hood a lot of ID mapping has taken place of which the user is unaware. To check the ID mapping efficiency when mapping Uniprot IDs to GO, KEGG or ENTREZ accessions the package comes with a checkIDmappingEfficiency() function to check the number of lost IDs.

checkIDmappingEfficiency(dd$UniprotID, keytype = "UNIPROT") %>% 
  round(2)

Evidently, the mapping from Uniprot to KEGG IDs seems to be fairly inefficient and results should be interpreted with care.

Reports

Good reporting and reproducibility of all intermediate and final results is key to good scientific practice. In this spirit, prora provides a reproducible, rmarkdown based reports for each method. To highlight the most important targets, settings and parameters, a walkthrough is given in this part of the vignette.

To run all methods and produce the reports on the example data from the prior section, the following parameters must be set in advance.

organism <- "hsapiens" # Organism to which the FASTA headers correspond
ID_col <- "UniprotID" # Column containing the IDs
fc_col <- "estimate" # Column containing the estimates (in this case log fold changes)
target_GSEA <- c( # all possible targets for GSEA
  "geneontology_Biological_Process",
  "geneontology_Cellular_Component",
  "geneontology_Molecular_Function"
)
target_SIGORA <- # all possible targets for sigORA
  c("GO", "KEGG", "reactome")
fc_threshold <- 0.5 # threshold for ORA methods
greater <- TRUE # direction of the threshold
nperm <- 10 # number of permutations used to compute enrichment scores in GSEA
odir <- "tmp" # output directory to write all reports

To iterate over all possible target directly, the following workflow functions can be used. They take the above specified parameters as inputs and facilitate the enrichment analysis workflows tremendously. The results will be saved to the specified output directory with a subfolder for each method and target.

The following code chunk iterates the runGSEA() function over all possible GSEA target databases. Internally, runGSEA() calls WebGestaltR() for the analysis and compiles prora's in the same output directory.

sapply(target_GSEA, function(x) {
  runWebGestaltGSEA(
    data = dd,
    fpath = "",
    ID_col = ID_col,
    score_col = fc_col,
    organism = organism,
    target = x,
    nperm = nperm,
    outdir = file.path(odir, "WebGestaltGSEA")
  )
})

Analogously, runWebGestaltORA() and runSIGORA() embody a whole workflow and save nicely formatted results to the output directory.

sapply(target_GSEA, function(x) {
  runWebGestaltORA(
    data = dd,
    fpath = "",
    organism = organism,
    ID_col = ID_col,
    target = x,
    threshold = fc_threshold,
    greater = greater,
    nperm = nperm,
    score_col = fc_col,
    outdir = file.path(odir, "WebGestaltORA")
  )
})

runSIGORA() calls the sigoraWrappR() function which was introduced in section \@ref(Example workflow) and compiles the built-in report.

sapply(target_SIGORA, function(x) {
  runSIGORA(
    data = dd,
    target = x,
    score_col = fc_col,
    threshold = fc_threshold,
    greater = greater,
    outdir = file.path(odir, "sigORA")
  )
})

In the following the runGSEA() function's inner workings will be discussed in more detail. After creating the output directory outdir the provided data is fed into WebGestaltR's WebGestaltR(..., enrichMethod = "GSEA"). The results are then used to generate the report provided by the prora package^[The source files for all reports can be found in system.file("rmarkdown_reports", package = "fgczgseaora")]. All WebGestaltR and prora results can then be found in the output directory.

Command line tools

The enrichment methods in this package (ORA, GSEA sigORA) come with a docopt based command line tool to facilitate analysing batches of files^[Those scripts can be found in system.file("run_scripts", package = "prora")]. The main advantage of docopt is that all tools have to come with a thorough documentation where it is also possible to provide default values. For running GSEA, the command line tool is:

"WebGestaltR GSEA for multigroup reports

Usage:
  lfq_multigroup_gsea.R <grp2file> [--organism=<organism>] [--outdir=<outdir>] [--idtype=<idtype>] [--ID_col=<ID_col>]  [--nperm=<nperm>] [--score_col=<score_col>] [--contrast=<contrast>]

Options:
  -o --organism=<organism> organism [default: hsapiens]
  -r --outdir=<outdir> output directory [default: results_gsea]
  -t --idtype=<idtype> type of id used for mapping [default: uniprotswissprot]
  -i --ID_col=<ID_col> Column containing the UniprotIDs [default: UniprotID]
  -n --nperm=<nperm> number of permutations to calculate enrichment scores [default: 500]
  -e --score_col=<score_col> column containing fold changes [default: pseudo_estimate]
  -c --contrast=<contrast> column containing fold changes [default: contrast]

Arguments:
  grp2file  input file
" -> doc

library(docopt)

opt <- docopt(doc)

It can be run using the following command in a UNIX/Linux terminal, specifying the organism as mouse:

```{bash eval=FALSE} Rscript path/to/lfq_multigroup_gsea.R path/to/group2file.xlsx -o mmusculus

The help file can also be accessed via the command line using:

```{bash eval=FALSE}
Rscript path/to/lfq_multigroup_gsea.R -h

Session info {.unnumbered}

sessionInfo()

References



protViz/fgczgseaora documentation built on Dec. 14, 2021, 9:22 p.m.