BiocStyle::markdown() suppressPackageStartupMessages({ library(edgeR) library(goseq) library(org.Hs.eg.db) library(GO.db) })
Is expression of genes in a gene set associated with experimental condition?
Many methods, a recent review is Kharti et al., 2012.
Any a priori classification of `genes' into biologically relevant groups
Sets do not need to be...
Gene Ontology (GO) Annotation (GOA)
Pathways
E.g., MSigDb
Initially based on a presentation by Simon Anders, CSAMA 2010
Steps
In gene set? | |||
Yes | No | ||
---|---|---|---|
Differentially | Yes | k | K |
expressed? | No | n - k | N - K |
fiser.test()
or r Biocpkg("GOstats")
Notes
r Biocpkg("GOstats")
Steps
E.g., Jiang \& Gentleman, 2007; \Biocpkg{Category}
Expression in NEG vs BCR/ABL samples for genes in the 'ribosome' KEGG
pathway; r Biocpkg("Category")
vignette.
Goemann & Bühlmann, 2007
E.g., Hummel et al., 2008, \Biocpkg{GlobalAncova}
r Biocpkg("limma")
romer()
and Wu \& Smythe 2012 camera()
for
enrichment (competitive null) linear modelsroast()
, mroast()
for self-contained null
linear modelsE.g., Tarca et al., 2009, \Biocpkg{SPIA}
Incorporate pathway topology (e.g., interactions between gene products) into signficance testing
Signaling Pathway Impact Analysis
Combined evidence: pathway over-representation $P_{NDE}$; unusual signaling $P_{PERT}$ (equation 1 of Tarca et al.)
Evidence plot, colorectal cancer. Points: pathway gene sets. Significant after Bonferroni (red) or FDR (blue) correction.
E.g., Young et al., 2010, r Biocpkg("goseq")
DE genes vs. transcript length. Points: bins of 300 genes. Line: fitted probability weighting function.
Example: Langfelder & Hovarth, WGCNA
list()
, where names of the list are sets, and each element
of the list is a vector of genes in the set.data.frame()
of set name / gene name pairsr Biocpkg("GSEABase")
Gene set enrichment classifications
Selected \Bioconductor{} Packages
| Approach | Packages |
|-------------------|---------------------------------------------|
| Hypergeometric | r Biocpkg("GOstats")
, r Biocpkg("topGO")
|
| Enrichment | r Biocpkg("limma")``::romer()
|
| Category $t$-test | r Biocpkg("Category")
|
| Linear model | r Biocpkg("GlobalAncova")
, r Biocpkg("GSEAlm")
, r Biocpkg("limma")``::roast()
|
| Pathway topology | r Biocpkg("SPIA")
|
| Sequence-specific | r Biocpkg("goseq")
|
| de novo | r CRANpkg("WGCNA")
|
This practical is based on section 6 of the r Biocpkg("goseq")
vignette.
This (relatively old) experiment examined the effects of androgen stimulation on a human prostate cancer cell line, LNCaP (Li et al., 2008). The experiment used short (35bp) single-end reads from 4 control and 3 untreated lines. Reads were aligned to hg19 using Bowtie, and counted using ENSEMBL 54 gene models.
Input the data to r Biocpkg("edgeR")
's DGEList
data structure.
library(edgeR) path <- system.file(package="goseq", "extdata", "Li_sum.txt") table.summary <- read.table(path, sep='\t', header=TRUE, stringsAsFactors=FALSE) counts <- table.summary[,-1] rownames(counts) <- table.summary[,1] grp <- factor(rep(c("Control","Treated"), times=c(4,3))) summarized <- DGEList(counts, lib.size=colSums(counts), group=grp)
Use a 'common' dispersion estimate, and compare the two groups using an exact test
disp <- estimateCommonDisp(summarized) tested <- exactTest(disp) topTags(tested)
Start by extracting all P values, then correcting for multiple
comparison using p.adjust()
. Classify the genes as differentially
expressed or not.
padj <- with(tested$table, { keep <- logFC != 0 value <- p.adjust(PValue[keep], method="BH") setNames(value, rownames(tested)[keep]) }) genes <- padj < 0.05 table(genes)
Under the hood, r Biocpkg("goseq")
uses Bioconductor annotation
packages (in this case r Biocannopkg("org.Hs.eg.db")
and r
Biocannopkg("GO.db")
to map from gene symbols to GO pathways.
Expore these packages through the columns()
and select()
functions. Can you map between ENSEMBL gene identifiers (the row names
of topTable()
) to GO pathway? What about 'drilling down' on
particular GO identifiers to discover the term's definition?
Calculate the weighting for each gene. This looks up the gene lengths in a pre-defined table (how could these be calculated using TxDb packages? What challenges are associated with calculating these 'weights', based on the knowledge that genes typically consist of several transcripts, each expressed differently?)
pwf <- nullp(genes,"hg19","ensGene") head(pwf)
Perform the main analysis. This includes association of genes to GO pathway
GO.wall <- goseq(pwf, "hg19", "ensGene") head(GO.wall)
Here we do the same operation, but ignore gene lengths
GO.nobias <- goseq(pwf,"hg19","ensGene",method="Hypergeometric")
Compare the over-represented P-values for each set, under the different methods
idx <- match(GO.nobias$category, GO.wall$category) plot(log10(GO.nobias[, "over_represented_pvalue"]) ~ log10(GO.wall[idx, "over_represented_pvalue"]), xlab="Wallenius", ylab="Hypergeometric", xlim=c(-5, 0), ylim=c(-5, 0)) abline(0, 1, col="red", lwd=2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.