library(tidyverse) library(prora) knitr::opts_chunk$set( echo = TRUE, warning = FALSE, message = FALSE, error = FALSE, fig.width = 4, fig.height = 3, fig.align = "center", cache = FALSE ) options(width = 80)
A plethora of R packages exist on CRAN and Bioconductor to perform over-representation analysis (ORA) and gene set enrichment analysis (GSEA). However, consistency in the underlying nomenclature for specific analyses and user friendly implementation is still lacking. prora aims at unifying ID mapping and enrichment analysis in a syntactically coherent and intuitive way, while ensuring reproducibility of results. prora primarily consists of wrapper functions around the sigora and WebGestaltR packages from CRAN and rmarkdown based reports for visualisation and contextualisation of analysis results.
Methods to identify differentially regulated pathways among a vast background of
quantified genes or proteins differ in complexity of the underlying algorithms
and computations, ranging from the application of Fisher's exact test in ORA and
extensions to gene pairs [@Foroushani2013] to more complex statistical procdures
in gene set enrichment analysis [@Subramanian15545] and network based procedures
[@glaab2012enrichnet]. Almost all methods come with their own implementation as
an R package, i.e. ORA in both sigora and
WebGestaltR, network based gene enrichment analysis in
enrichnet and the list keeps going on. Naturally, different
implementations come with different -- and often inconsistent -- features, such
as ID mapping or demand a lot of knowledge from the user in setting up appropriate
data structures and other preconditions. prora
aims at providing a
consistent and user friendly interface, which will be introduced in the following
section. Elaborating on the methodological and statistical aspects of the underlying
methods is beyond the scope of this vignette and the reader is referred to the
publications in the reference section.
After installing the packages from Bioconductor as shown below, the package
can be loaded using library(prora)
.
BiocManager::install("prora") library(prora)
We shall analyse proteomic data with the following format:
data("exampleContrastData") #glimpse(exampleContrastData)
The first column protein_Id
contains FASTA formatted human protein accessions
and the second column fold change estimates. e.g. as obtained by a
limma::toptable()
call. The first step is now to translate the FASTA headers
into Uniprot accessions for subsequent ID mapping.
dd <- prora::get_UniprotID_from_fasta_header(exampleContrastData) #glimpse(dd)
We see that we have lost r nrow(exampleContrastData) - nrow(dd)
accessions
which did not correspond to a FASTA format. Now we apply ORA, sigORA and GSEA to
the obtained ranked protein list. To apply sigORA and ORA one must first compute
the background GPS repository using makeGPS_wrappR()
. The advantage is that
backgrounds can be specified separately for each experiment or combined for several
batches of experiments. However, computing the accession pair signatures and
calculating their respective weights is computationally intensive and potentially
time consuming.
data("idmap", package = "sigora") myGPSrepo <- makeGPS_wrappR(ids = dd$UniprotID, target = "GO") res <- sigoraWrappR(data = dd, threshold = 0.2, score_col = "estimate", GPSrepos = myGPSrepo$gps, greater_than = TRUE)
The result of sigoraWrappR()
is a list containing the sigORA and ORA results
along with user specified inputs that will be used in the reports described in
section \@ref(reports).
names(res)
Under the hood a lot of ID mapping has taken place of which the user is unaware.
To check the ID mapping efficiency when mapping Uniprot IDs to GO, KEGG or ENTREZ
accessions the package comes with a checkIDmappingEfficiency()
function to
check the number of lost IDs.
checkIDmappingEfficiency(dd$UniprotID, keytype = "UNIPROT") %>% round(2)
Evidently, the mapping from Uniprot to KEGG IDs seems to be fairly inefficient and results should be interpreted with care.
Good reporting and reproducibility of all intermediate and final results is key
to good scientific practice. In this spirit, prora
provides a reproducible,
rmarkdown based reports for each method. To highlight the most
important targets, settings and parameters, a walkthrough is given in this part
of the vignette.
To run all methods and produce the reports on the example data from the prior section, the following parameters must be set in advance.
organism <- "hsapiens" # Organism to which the FASTA headers correspond ID_col <- "UniprotID" # Column containing the IDs fc_col <- "estimate" # Column containing the estimates (in this case log fold changes) target_GSEA <- c( # all possible targets for GSEA "geneontology_Biological_Process", "geneontology_Cellular_Component", "geneontology_Molecular_Function" ) target_SIGORA <- # all possible targets for sigORA c("GO", "KEGG", "reactome") fc_threshold <- 0.5 # threshold for ORA methods greater <- TRUE # direction of the threshold nperm <- 10 # number of permutations used to compute enrichment scores in GSEA odir <- "tmp" # output directory to write all reports
To iterate over all possible target directly, the following workflow functions can be used. They take the above specified parameters as inputs and facilitate the enrichment analysis workflows tremendously. The results will be saved to the specified output directory with a subfolder for each method and target.
The following code chunk iterates the runGSEA()
function over all possible
GSEA target databases. Internally, runGSEA()
calls WebGestaltR()
for the
analysis and compiles prora's
in the same output directory.
sapply(target_GSEA, function(x) { runWebGestaltGSEA( data = dd, fpath = "", ID_col = ID_col, score_col = fc_col, organism = organism, target = x, nperm = nperm, outdir = file.path(odir, "WebGestaltGSEA") ) })
Analogously, runWebGestaltORA()
and runSIGORA()
embody a whole workflow
and save nicely formatted results to the output directory.
sapply(target_GSEA, function(x) { runWebGestaltORA( data = dd, fpath = "", organism = organism, ID_col = ID_col, target = x, threshold = fc_threshold, greater = greater, nperm = nperm, score_col = fc_col, outdir = file.path(odir, "WebGestaltORA") ) })
runSIGORA()
calls the sigoraWrappR()
function which was introduced in
section \@ref(Example workflow) and compiles the built-in report.
sapply(target_SIGORA, function(x) { runSIGORA( data = dd, target = x, score_col = fc_col, threshold = fc_threshold, greater = greater, outdir = file.path(odir, "sigORA") ) })
In the following the runGSEA()
function's inner workings will be discussed in
more detail. After creating the output directory outdir
the provided data is
fed into WebGestaltR's WebGestaltR(..., enrichMethod = "GSEA")
.
The results are then used to generate the report provided by the prora
package^[The source files for all reports can be found in
system.file("rmarkdown_reports", package = "fgczgseaora")
]. All WebGestaltR
and prora
results can then be found in the output directory.
The enrichment methods in this package (ORA, GSEA sigORA) come with a
docopt based command line tool to facilitate analysing batches of
files^[Those scripts can be found in
system.file("run_scripts", package = "prora")
].
The main advantage of docopt is that all tools have to come with a thorough
documentation where it is also possible to provide default values. For running
GSEA, the command line tool is:
"WebGestaltR GSEA for multigroup reports Usage: lfq_multigroup_gsea.R <grp2file> [--organism=<organism>] [--outdir=<outdir>] [--idtype=<idtype>] [--ID_col=<ID_col>] [--nperm=<nperm>] [--score_col=<score_col>] [--contrast=<contrast>] Options: -o --organism=<organism> organism [default: hsapiens] -r --outdir=<outdir> output directory [default: results_gsea] -t --idtype=<idtype> type of id used for mapping [default: uniprotswissprot] -i --ID_col=<ID_col> Column containing the UniprotIDs [default: UniprotID] -n --nperm=<nperm> number of permutations to calculate enrichment scores [default: 500] -e --score_col=<score_col> column containing fold changes [default: pseudo_estimate] -c --contrast=<contrast> column containing fold changes [default: contrast] Arguments: grp2file input file " -> doc library(docopt) opt <- docopt(doc)
It can be run using the following command in a UNIX/Linux terminal, specifying the organism as mouse:
```{bash eval=FALSE} Rscript path/to/lfq_multigroup_gsea.R path/to/group2file.xlsx -o mmusculus
The help file can also be accessed via the command line using: ```{bash eval=FALSE} Rscript path/to/lfq_multigroup_gsea.R -h
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.