In wolski/p3003PBC: Slides and R exercises for FGCZ Bioinformatics course

Overview

Pathway analysis for proteomics quantification experiments
fgczgseaora
Outlook

library(tidyverse)
library(flextable)
knitr::opts_chunk$set(fig.width=4.25, fig.height=3.5, fig.retina=3,
message=FALSE, warning=FALSE, cache = TRUE,
autodep = TRUE, hiline=TRUE)
knitr::opts_hooks$set(fig.callout = function(options) {
if (options$fig.callout) {
options$echo <- FALSE
options$out.height <- "99%"
options$fig.width <- 16
options$fig.height <- 8
}
options
})
hook_source <- knitr::knit_hooks$get('source')
knitr::knit_hooks$set(source = function(x, options) {
if (!is.null(options$hiline) && options$hiline) {
x <- stringr::str_replace(x, "^ ?(.+)\\s?#<<", "*\\1")
}
hook_source(x, options)
})
options(htmltools.dir.version = FALSE, width = 90)
as_table <- function(...) knitr::kable(..., format='html', digits = 3)

background-image: url("../inst/images/fgczgseaora.png")

Protein quantification experiments

determine protein foldchanges
for various contrasts (comparisons of treatments)
up to thousands of proteins
only abundant proteins quantifed (detection bias)

.img-right[ ]

Pathway analysis

Pathway analysis uses a priori gene sets that have been grouped together by their involvement in the same biological pathway, or by proximal location on a chromosome. Examples of gene set database are Gene Ontology (GO), KEGG, Reactome and many more.

Over-Represenation Analysis (ORA)
Gene Set Enrichment Analysis (GSEA)

Over-Representation Analysis (ORA)

Dychotomize list of proteins
(e.g. using a threshold into overexpressed - Yes/No).
Test if a geneset is over-represented
in on of the sublists
(e.g. Fischers Exact Test).
how to choose the threshold?

tab <- matrix(c(12, 3, 7, 24), nrow = 2, byrow = TRUE)
dimnames(tab) <- list("GO Term" = c("Contained", "Not Contained"),
                      "Differentially expressed" = c("    Yes    ", "    No    "))

cat("Pathway GO:0003091")
tab
cat("p-value:", round(fisher.test(tab)$p.value, 5))

.img-right[ ]

Gene Set Enrichment Analysis (GSEA)

Protein ID's ranked by fold change,
t-statistics
locate genes of genesets in ranked list
compute enrichment score

.footnote[Gene Sets can be highly correlated, because they contain the same proteins. Multiplicity adjustment assumes indpendence (FDR).]

.img-right[ ]

fgczgseaora

Easily generate reports to be delivered to biologists.
For ORA we can only use tools which allow to specify detection background.
Map identifiers - support for sp identifiers
Ideally run packages locally
Provide a similar R and command line interface to run ORA GSEA.

Many R packages are available

benchmark <- list(
  c("WebGestaltR","CRAN","+","-","+","+","+"),
  c("FGNet","Bioc","+","(-)","(-)","-","+"),
  c("HTSanalyzeR","Bioc","-","(-)","-","+","+"),
  c("sigora","CRAN","+","+","(-)","+","-"),
  c("SetRank","CRAN","-","(-)","-","-","+"),
  c("STRINGdb","Bioc","+","-","(-)","+","+"),
  c("enrichR","CRAN","+","-","+","(+)","+"),         
  c("TopGO", "Bioc","...","","","",""))
benchmark <- do.call("rbind",benchmark)
colnames(benchmark) <- c("Package","Repo","Maintenance","offline","ID Mapping","ORA","GSEA")

flextable(data.frame(benchmark)) %>% set_caption("R packages for pathway analysis")

We did integrate:
WebgestaltR (online only)
sigORA (offline)

.footnote[WebgesaltR - Various gene set databases, id mapping, allows for downloading html results. sigORA - uses gene pair signatures. Searches background and pathways for protein pairs unique to a given pathway. By this it decreases the correlation among gene sets.]

Common R interface

.left-code[

runWebGestaltGSEA(
  data = dd,
  fpath = "",
  ID_col = "UniprotID",
  score_col =  "estimate",
  organism =  "hsapiens",
  target = "geneontology_Biological_Process",
  nperm = 500,
  outdir = file.path(odir, "WebGestaltGSEA")
)

] .right-code[

runWebGestaltORA(
  data = dd,
  fpath = "",
  ID_col = "UniprotID",
  score_col =  "estimate",
  organism =  "hsapiens",
  threshold = 1,
  greater = TRUE,
  target = "geneontology_Biological_Process",
  nperm = 500,
  outdir = file.path(odir, "WebGestaltORA")
)
runSIGORA(
  data = dd,
  score_col = "estimate",
  threshold = 1,
  greater = TRUE,
  target = "GO",
  outdir = file.path(odir, "sigORA")
)

]

Command line interface

```{bash eval=FALSE} Rscript lfq_multigroup_gsea.R ./foldchange_estimates.xlsx -o hsapiens Rscript lfq_multigroup_ora.R ./foldchange_estimates.xlsx -t uniprotswissprot

The enrichment methods in this package (ORA, GSEA sigORA) come with a
`docopt` based command line tool to facilitate analysing batches of files.

---

# Command line interface


```r
"WebGestaltR GSEA for multigroup reports

Usage:
  lfq_multigroup_gsea.R <grp2file> [--organism=<organism>] [--outdir=<outdir>] [--idtype=<idtype>] [--ID_col=<ID_col>]  [--nperm=<nperm>] [--score_col=<score_col>] [--contrast=<contrast>]

Options:
  -o --organism=<organism> organism [default: hsapiens]
  -r --outdir=<outdir> output directory [default: results_gsea]
  -t --idtype=<idtype> type of id used for mapping [default: uniprotswissprot]
  -i --ID_col=<ID_col> Column containing the UniprotIDs [default: UniprotID]
  -n --nperm=<nperm> number of permutations to calculate enrichment scores [default: 500]
  -e --score_col=<score_col> column containing fold changes [default: pseudo_estimate]
  -c --contrast=<contrast> column containing fold changes [default: contrast]

Arguments:
  grp2file  input file
" -> doc
library(docopt)
opt <- docopt(doc)

HTML outputs - Multiple Contrasts and Targets

For all contrasts
e.g. t - v, 8wk - 1wk etc.
and all selected target
creates folder structure with HTML files
visualizing the ORA and GSEA results

e.g. GO Bioprocess, GO Molecular Function - These files are linked from an index.html - can easily be stored and delivered as part of analysis.

.img-right[

]

HTML output - HTML report with method description

.img-left[ ]

.img-right[ ]

background-image: url("../inst/images/brussels.jpg")

Outlook

specify user stories
I want to compare multiple GSEA outputs
I have my own gene sets
Standardize R-API interface
Standardize return values and reports
add one or two more packages (enrichr, topGO, ?)

THANK YOU!

Acknowledgments:

Paolo Nanni, Christian Panse, Ralph Schlapbach, Tobias Kockmann

wolski/p3003PBC documentation built on Nov. 30, 2024, 7:14 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

wolski/p3003PBC
Slides and R exercises for FGCZ Bioinformatics course

In wolski/p3003PBC: Slides and R exercises for FGCZ Bioinformatics course

Overview

Protein quantification experiments

Pathway analysis

Over-Representation Analysis (ORA)

Gene Set Enrichment Analysis (GSEA)

fgczgseaora

Many R packages are available

Common R interface

Command line interface

HTML outputs - Multiple Contrasts and Targets

HTML output - HTML report with method description

Outlook

Outlook

THANK YOU!

Acknowledgments:

R Package Documentation

Browse R Packages

We want your feedback!

wolski/p3003PBC Slides and R exercises for FGCZ Bioinformatics course

In wolski/p3003PBC: Slides and R exercises for FGCZ Bioinformatics course

Overview

Protein quantification experiments

Pathway analysis

Over-Representation Analysis (ORA)

Gene Set Enrichment Analysis (GSEA)

fgczgseaora

Many R packages are available

Common R interface

Command line interface

HTML outputs - Multiple Contrasts and Targets

HTML output - HTML report with method description

Outlook

Outlook

THANK YOU!

Acknowledgments:

R Package Documentation

Browse R Packages

We want your feedback!

wolski/p3003PBC
Slides and R exercises for FGCZ Bioinformatics course