gometh | R Documentation |
Tests gene ontology enrichment for significant CpGs from Illumina's Infinium HumanMethylation450 or MethylationEPIC array, taking into account two different sources of bias: 1) the differing number of probes per gene present on the array, and 2) CpGs that are annotated to multiple genes.
gometh(
sig.cpg,
all.cpg = NULL,
collection = c("GO", "KEGG"),
array.type = c("450K", "EPIC"),
plot.bias = FALSE,
prior.prob = TRUE,
anno = NULL,
equiv.cpg = TRUE,
fract.counts = TRUE,
genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR",
"ExonBnd"),
sig.genes = FALSE
)
sig.cpg |
Character vector of significant CpG sites to test for GO term enrichment. |
all.cpg |
Character vector of all CpG sites tested. Defaults to all CpG sites on the array. |
collection |
The collection of pathways to test. Options are "GO" and "KEGG". Defaults to "GO". |
array.type |
The Illumina methylation array used. Options are "450K" or "EPIC". Defaults to "450K". |
plot.bias |
Logical, if true a plot showing the bias due to the differing numbers of probes per gene will be displayed. |
prior.prob |
Logical, if true will take into account the probability of significant differential methylation due to numbers of probes per gene. If false, a hypergeometric test is performed ignoring any bias in the data. |
anno |
Optional. A |
equiv.cpg |
Logical, if true then equivalent numbers of cpgs are used
for odds calculation rather than total number cpgs. Only used if
|
fract.counts |
Logical, if true then fractional counting of Cpgs is used
to account for CpGs that are annotated to multiple genes. Only used if
|
genomic.features |
Character vector or scalar indicating whether the gene set enrichment analysis should be restricted to CpGs from specific genomic locations. Options are "ALL", "TSS200","TSS1500","Body","1stExon", "3'UTR","5'UTR","ExonBnd"; and the user can select any combination. Defaults to "ALL". |
sig.genes |
Logical, if true then the significant differentially methylated genes that overlap with the gene set of interest is outputted as the final column in the results table. Default is FALSE. |
This function takes a character vector of significant CpG sites, maps the
CpG sites to Entrez Gene IDs, and tests for GO term or KEGG pathway
enrichment using a Wallenius' non central hypergeometric test, taking into
account the number of CpG sites per gene on the 450K/EPIC array and
multi-gene annotated CpGs. Geeleher et al. (2013) showed
that a severe bias exists when performing gene set analysis for genome-wide
methylation data that occurs due to the differing numbers of CpG sites
profiled for each gene. gometh
is based on the goseq
method
(Young et al., 2010), and is a modification of the goana
function in
the limma
package. If prior.prob
is set to
FALSE, then prior probabilities are not used and it is assumed that each
gene is equally likely to have a significant CpG site associated with it.
The testing now also takes into account that some CpGs are annotated to
multiple genes. For a small number of gene families, this previously caused
their associated GO categories/gene sets to be erroneously overrepresented
and thus highly significant. If fract.counts=FALSE
then CpGs are
allowed to map to multiple genes (this is NOT recommended).
A new feature of gometh
and gsameth
is the ability to restrict
the input CpGs by genomic feature with the argument genomic.features
.
The possible options include "ALL", "TSS200", "TSS1500", "Body", "1stExon",
"3'UTR", "5'UTR" and "ExonBnd", and the user may specify any combination.
Please not that "ExonBnd" is not an annotated feature on 450K arrays. For
example if you are interested in the promoter region only, you could specify
genomic.features = c("TSS1500","TSS200","1stExon")
. The default
behaviour is to test all input CpGs sig.cpg
even if the user specifies
"ALL" and one or more other features.
Genes associated with each CpG site are obtained from the annotation package
IlluminaHumanMethylation450kanno.ilmn12.hg19
if the array type is
"450K". For the EPIC array, the annotation package
IlluminaHumanMethylationEPICanno.ilm10b4.hg19
is used. To use a
different annotation package, please supply it using the anno
argument.
If you are interested in which genes overlap with the genes in the gene set,
setting sig.genes
to TRUE will output an additional column in the
results data frame that contains all the significant differentially
methylated gene symbols, comma separated. The default is FALSE.
In order to get a list which contains the mapped Entrez gene IDs,
please use the getMappedEntrezIDs
function. gometh
tests all
GO or KEGG terms, and false discovery rates are calculated using the method
of Benjamini and Hochberg (1995). The topGSA
function can be used to
display the top 20 most enriched pathways.
For more generalised gene set testing where the user can specify the gene
set/s of interest to be tested, please use the gsameth
function.
If you are interested in performing gene set testing following a region
analysis, then the functions goregion
and gsaregion
can be
used.
A data frame with a row for each GO or KEGG term and the following columns:
Term |
GO term if testing GO pathways |
Ont |
ontology that the GO term belongs to if testing GO pathways. "BP" - biological process, "CC" - cellular component, "MF" - molecular function. |
Pathway |
the KEGG pathway being tested if testing KEGG terms. |
N |
number of genes in the GO or KEGG term |
DE |
number of genes that are differentially methylated |
P.DE |
p-value for over-representation of the GO or KEGG term term |
FDR |
False discovery rate |
SigGenesInSet |
Significant differentially methylated genes overlapping with the gene set of interest. |
Belinda Phipson
Phipson, B., Maksimovic, J., and Oshlack, A. (2016). missMethyl: an R package for analysing methylation data from Illuminas HumanMethylation450 platform. Bioinformatics, 15;32(2), 286–8.
Geeleher, P., Hartnett, L., Egan, L. J., Golden, A., Ali, R. A. R., and Seoighe, C. (2013). Gene-set analysis is severely biased when applied to genome-wide methylation data. Bioinformatics, 29(15), 1851–1857.
Young, M. D., Wakefield, M. J., Smyth, G. K., and Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, 11, R14.
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, gkv007.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series, B, 57, 289-300.
gsameth,goregion,gsaregion,
getMappedEntrezIDs
## Not run: # to avoid timeout on Bioconductor build
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(limma)
ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
# Randomly select 1000 CpGs to be significantly differentially methylated
sigcpgs <- sample(rownames(ann),1000,replace=FALSE)
# All CpG sites tested
allcpgs <- rownames(ann)
# GO testing with prior probabilities taken into account
# Plot of bias due to differing numbers of CpG sites per gene
gst <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO",
plot.bias = TRUE, prior.prob = TRUE, anno = ann)
# Total number of GO categories significant at 5% FDR
table(gst$FDR<0.05)
# Table of top GO results
topGSA(gst)
# GO testing ignoring bias
gst.bias <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO",
prior.prob=FALSE, anno = ann)
# Total number of GO categories significant at 5% FDR ignoring bias
table(gst.bias$FDR<0.05)
# Table of top GO results ignoring bias
topGSA(gst.bias)
# GO testing ignoring multi-mapping CpGs
gst.multi <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO",
plot.bias = TRUE, prior.prob = TRUE, fract.counts = FALSE,
anno = ann)
topGSA(gst.multi, n=10)
# Restrict to CpGs in promoter regions
gst.promoter <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs,
collection = "GO", anno = ann,
genomic.features=c("TSS200","TSS1500","1stExon"))
topGSA(gst.promoter)
# KEGG testing
kegg <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "KEGG",
prior.prob=TRUE, anno = ann)
# Table of top KEGG results
topGSA(kegg)
# Add significant genes to KEGG output
kegg.siggenes <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs,
collection = "KEGG", anno = ann, sig.genes = TRUE)
# Output top 5 KEGG pathways
topGSA(kegg.siggenes, n=5)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.