BiocStyle::markdown()
suppressPackageStartupMessages({
    library(edgeR)
    library(goseq)
    library(org.Hs.eg.db)
    library(GO.db)
})

Motivation

Is expression of genes in a gene set associated with experimental condition?

Many methods, a recent review is Kharti et al., 2012.

What is a gene set?

Any a priori classification of `genes' into biologically relevant groups

Sets do not need to be...

Collections of gene sets

Gene Ontology (GO) Annotation (GOA)

Pathways

E.g., MSigDb

Statistical approaches

Initially based on a presentation by Simon Anders, CSAMA 2010

Approach 1: hypergeometric tests

Steps

  1. Classify each gene as 'differentially expressed' DE or not, e.g., based on P < 0.05
  2. Are DE genes in the set more common than DE genes not in the set?
In gene set?
Yes No
Differentially Yes k K
expressed? No n - k N - K
  1. Fisher hypergeometric test, via fiser.test() or r Biocpkg("GOstats")

Notes

Approach 2: enrichment score

Steps

Approach 3: category $t$-test

E.g., Jiang \& Gentleman, 2007; \Biocpkg{Category}

Expression in NEG vs BCR/ABL samples for genes in the 'ribosome' KEGG pathway; r Biocpkg("Category") vignette.

Competitive versus self-contained null hypothesis

Goemann & Bühlmann, 2007

Approach 4: linear models

E.g., Hummel et al., 2008, \Biocpkg{GlobalAncova}

r Biocpkg("limma")

Approach 5: pathway topology

E.g., Tarca et al., 2009, \Biocpkg{SPIA}

Evidence plot, colorectal cancer. Points: pathway gene sets. Significant after Bonferroni (red) or FDR (blue) correction.

Issues with sequence data?

E.g., Young et al., 2010, r Biocpkg("goseq")

DE genes vs. transcript length. Points: bins of 300 genes. Line: fitted probability weighting function.

Approach 6: de novo discovery

Example: Langfelder & Hovarth, WGCNA

Representing gene sets in R

Conclusions

Gene set enrichment classifications

Selected \Bioconductor{} Packages

| Approach | Packages | |-------------------|---------------------------------------------| | Hypergeometric | r Biocpkg("GOstats"), r Biocpkg("topGO")| | Enrichment | r Biocpkg("limma")``::romer() | | Category $t$-test | r Biocpkg("Category") | | Linear model | r Biocpkg("GlobalAncova"), r Biocpkg("GSEAlm"), r Biocpkg("limma")``::roast() | | Pathway topology | r Biocpkg("SPIA") | | Sequence-specific | r Biocpkg("goseq") | | de novo | r CRANpkg("WGCNA") |

Practical

This practical is based on section 6 of the r Biocpkg("goseq") vignette.

1-6 Experimental design, ..., Analysis of gene differential expression

This (relatively old) experiment examined the effects of androgen stimulation on a human prostate cancer cell line, LNCaP (Li et al., 2008). The experiment used short (35bp) single-end reads from 4 control and 3 untreated lines. Reads were aligned to hg19 using Bowtie, and counted using ENSEMBL 54 gene models.

Input the data to r Biocpkg("edgeR")'s DGEList data structure.

library(edgeR)
path <- system.file(package="goseq", "extdata", "Li_sum.txt")

table.summary <- read.table(path, sep='\t', header=TRUE, stringsAsFactors=FALSE)
counts <- table.summary[,-1]
rownames(counts) <- table.summary[,1]
grp <- factor(rep(c("Control","Treated"), times=c(4,3)))
summarized <- DGEList(counts, lib.size=colSums(counts), group=grp)

Use a 'common' dispersion estimate, and compare the two groups using an exact test

disp <- estimateCommonDisp(summarized)
tested <- exactTest(disp)
topTags(tested)

7. Comprehension

Start by extracting all P values, then correcting for multiple comparison using p.adjust(). Classify the genes as differentially expressed or not.

padj <- with(tested$table, {
    keep <- logFC != 0
    value <- p.adjust(PValue[keep], method="BH")
    setNames(value, rownames(tested)[keep])
})
genes <- padj < 0.05
table(genes)

Gene symbol to pathway

Under the hood, r Biocpkg("goseq") uses Bioconductor annotation packages (in this case r Biocannopkg("org.Hs.eg.db") and r Biocannopkg("GO.db") to map from gene symbols to GO pathways.

Expore these packages through the columns() and select() functions. Can you map between ENSEMBL gene identifiers (the row names of topTable()) to GO pathway? What about 'drilling down' on particular GO identifiers to discover the term's definition?

Probability weighting function

Calculate the weighting for each gene. This looks up the gene lengths in a pre-defined table (how could these be calculated using TxDb packages? What challenges are associated with calculating these 'weights', based on the knowledge that genes typically consist of several transcripts, each expressed differently?)

pwf <- nullp(genes,"hg19","ensGene")
head(pwf)

Over- and under-representation

Perform the main analysis. This includes association of genes to GO pathway

GO.wall <- goseq(pwf, "hg19", "ensGene")
head(GO.wall)

What if we'd ignored gene length?

Here we do the same operation, but ignore gene lengths

GO.nobias <- goseq(pwf,"hg19","ensGene",method="Hypergeometric")

Compare the over-represented P-values for each set, under the different methods

idx <- match(GO.nobias$category, GO.wall$category)
plot(log10(GO.nobias[, "over_represented_pvalue"]) ~
     log10(GO.wall[idx, "over_represented_pvalue"]),
     xlab="Wallenius", ylab="Hypergeometric",
     xlim=c(-5, 0), ylim=c(-5, 0))
abline(0, 1, col="red", lwd=2)

References



Bioconductor/UseBioconductor documentation built on May 6, 2019, 7:52 a.m.