gsaseq | R Documentation |
Given a user defined list of gene sets, gsaseq
will test whether
significantly differentially expressed genes are enriched in these gene sets.
gsaseq(
sig.de,
universe,
collection,
plot.bias = FALSE,
gene.length = NULL,
sort = TRUE
)
sig.de |
Character vector of significant differentially expressed genes to test for gene set enrichment. Must be Entrez Gene ID format. |
universe |
Character vector of all genes analysed in the experiment. Must be Entrez Gene ID format. |
collection |
A list of user specified gene sets to test. Can also be a single character vector gene set. Gene identifiers must be Entrez Gene IDs. |
plot.bias |
Logical, if true a plot showing gene length bias related to differential expression will be displayed. |
gene.length |
A vector containing the gene lengths for each gene in the
same order as |
sort |
Logical, if TRUE then the output dataframe is sorted by p-value. |
This function is a generalised version of goana
and kegga
from
the limma
package in that it can take a user-defined list of
differentially expressed genes and perform gene set enrichment analysis, and
is not limited to only testing GO and KEGG categories. It is not as flexible
as goana
and kegga
. Please note the vector of differentially
expressed genes and list of gene sets must be Entrez Gene IDs.
The gsaseq
function will test for enrichment using a hypergeometric
test if the gene.length
parameter is NULL. If the
gene.length
parameter is supplied then the p-values are derived from
Walllenius' noncentral hypergeometric distribution from the BiasedUrn
package. Please note that the gene.length
parameter must be in the
same order and of the same length as the universe
parameter.
A data frame with a row for each gene set and the following columns:
N |
number of genes in the gene set |
DE |
number of genes that are differentially expressed |
P.DE |
p-value for over-representation of the gene set |
FDR |
False discovery rate, calculated using the method of Benjamini and Hochberg (1995). |
Belinda Phipson
goana,kegga,camera,roast
## Not run: # to avoid timeout on Bioconductor build
library(org.Hs.eg.db)
# Use org.Hs.eg.db to extract GO terms
GOtoID <- suppressMessages(select(org.Hs.eg.db, keys=keys(org.Hs.eg.db),
columns=c("ENTREZID","GO"),
keytype="ENTREZID"))
head(GOtoID)
# Define the universe as random sample of 20000 genes in the annotation
universe <- sample(unique(GOtoID$ENTREZID),20000)
# Randomly sample 500 genes as DE
de.genes <- sample(universe, 500)
# Generate random gene lengths for genes in universe
# This is based on the true distribution of log(gene length) of genes in the
# hg19 genome
logGL <- rnorm(length(universe),mean=7.9, sd=1.154)
genelength <- exp(logGL)
# Define a list of gene sets containing two GO terms
setname1 <- GOtoID$GO[1]
setname1
keep.set1 <- GOtoID$GO %in% setname1
set1 <- GOtoID$ENTREZID[keep.set1]
setname2 <- GOtoID$GO[2]
setname2
keep.set2 <- GOtoID$GO %in% setname2
set2 <- GOtoID$ENTREZID[keep.set2]
# Make the gene sets into a list
sets <- list(set1, set2)
names(sets) <- c(setname1,setname2)
# Test for enrichment of gene sets with no gene length bias
# The genes are randomly selected so we don't expect significant results
gsaseq(sig.de = de.genes, universe = universe, collection = sets)
# Test for enrichment of gene sets taking into account gene length bias
# Since the gene lengths are randomly generated this shouldn't make much
# difference to the results
# Using log(gene length) or gene length doesn't make a difference to the
# p-values because the probability weighting function is transformation
# invariant
gsaseq(sig.de = de.genes, univers = universe, collection = sets,
gene.length = genelength)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.