gsameth: Generalised gene set testing for Illumina's methylation array...
In missMethyl: Analysing Illumina HumanMethylation BeadChip Data

Given a user-specified list of gene sets to test, gsameth tests whether significantly differentially methylated CpG sites are enriched in these gene sets.

gsameth(
  sig.cpg,
  all.cpg = NULL,
  collection,
  array.type = c("450K", "EPIC"),
  plot.bias = FALSE,
  prior.prob = TRUE,
  anno = NULL,
  equiv.cpg = TRUE,
  fract.counts = TRUE,
  genomic.features = c("ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR",
    "ExonBnd"),
  sig.genes = FALSE
)

`sig.cpg`	Character vector of significant CpG sites to test for gene set enrichment.
`all.cpg`	Character vector of all CpG sites tested. Defaults to all CpG sites on the array.
`collection`	A list of user specified gene sets to test. Can also be a single character vector gene set. Gene identifiers must be Entrez Gene IDs.
`array.type`	The Illumina methylation array used. Options are "450K" or "EPIC". Defaults to "450K".
`plot.bias`	Logical, if true a plot showing the bias due to the differing numbers of probes per gene will be displayed.
`prior.prob`	Logical, if true will take into account the probability of significant differential methylation due to numbers of probes per gene. If false, a hypergeometric test is performed ignoring any bias in the data.
`anno`	Optional. A `DataFrame` object containing the complete array annotation as generated by the `minfi` `getAnnotation` function. Speeds up execution, if provided.
`equiv.cpg`	Logical, if true then equivalent numbers of CpGs are used for the odds calculation rather than the total number of CpGs. Only used if `prior.prob=TRUE`.
`fract.counts`	Logical, if true then fractional counting of CpGs is used to account for CpGs that map to multiple genes. Only used if `prior.prob=TRUE`.
`genomic.features`	Character vector or scalar indicating whether the gene set enrichment analysis should be restricted to CpGs from specific genomic locations. Options are "ALL", "TSS200","TSS1500","Body","1stExon", "3'UTR","5'UTR","ExonBnd"; and the user can select any combination. Defaults to "ALL".
`sig.genes`	Logical, if true then the significant differentially methylated genes that overlap with the gene set of interest are output as the final column in the results table. Default is FALSE.

This function extends gometh, which only tests GO and KEGG pathways. gsameth takes a list of user-specified gene sets and tests whether the significant CpG sites are enriched in these sets. It maps the CpG sites to Entrez Gene IDs and tests for enrichment using Wallenius' noncentral hypergeometric test, taking into account the number of CpG sites per gene on the 450K/EPIC arrays. Please note that the gene IDs in the collection of gene sets must be Entrez Gene IDs. If prior.prob is set to FALSE, prior probabilities are not used and each gene is assumed to be equally likely to have a significant CpG site associated with it.
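To make the prior.prob = FALSE case concrete, here is a minimal base R sketch of an ordinary one-sided hypergeometric over-representation test. It illustrates the idea only and is not the code gsameth runs; the counts and variable names (n.universe, n.set, n.sig, n.overlap) are invented for the example. With prior.prob = TRUE the test is replaced by Wallenius' noncentral version, which this sketch does not cover.

```r
# Toy counts (hypothetical, for illustration only)
n.universe <- 20000  # genes with at least one CpG tested
n.set      <- 150    # genes in the gene set
n.sig      <- 800    # genes with at least one significant CpG
n.overlap  <- 12     # significant genes that fall in the gene set

# P(X >= n.overlap) when n.sig genes are drawn at random from the universe,
# i.e. every gene is treated as equally likely to be significant
p.over <- phyper(n.overlap - 1, n.set, n.universe - n.set, n.sig,
                 lower.tail = FALSE)
p.over
```

The upper tail is taken at n.overlap - 1 because phyper with lower.tail = FALSE returns P(X > q), and the test needs P(X >= n.overlap).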

The testing now also takes into account that some CpGs map to multiple genes. For a small number of gene families, this previously caused their associated GO categories/gene sets to be erroneously overrepresented and thus highly significant. If fract.counts=FALSE then CpGs are allowed to map to multiple genes (this is NOT recommended).
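The idea behind fractional counting can be sketched in base R with a toy CpG-to-gene map. This is an illustration under assumed data, not the package's internal code: the identifiers and the cpg2gene table are hypothetical, and the package may apply further adjustments on top of these weights.

```r
# Toy CpG-to-gene map (hypothetical identifiers); cg02 maps to two genes
cpg2gene <- data.frame(
  cpg  = c("cg01", "cg02", "cg02", "cg03"),
  gene = c("A",    "A",    "B",    "C"),
  stringsAsFactors = FALSE
)
sig.cpg <- c("cg01", "cg02")

# Each CpG contributes 1 / (number of genes it maps to) to each of its genes
n.genes <- table(cpg2gene$cpg)
cpg2gene$weight <- 1 / as.numeric(n.genes[cpg2gene$cpg])

# Rows of the map that involve a significant CpG
sig.map <- cpg2gene[cpg2gene$cpg %in% sig.cpg, ]

# Without fractional counting, cg02 is counted once per gene it maps to
whole.total <- nrow(sig.map)
# With fractional counting, each significant CpG contributes a total of 1
fract.total <- sum(sig.map$weight)
c(whole = whole.total, fractional = fract.total)
```

In this toy case the two significant CpGs produce three gene-level hits without weighting, but a total weight of two with it, which is how multi-gene CpGs are prevented from inflating a gene set.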

A new feature of gometh and gsameth is the ability to restrict the input CpGs by genomic feature with the argument genomic.features. The possible options include "ALL", "TSS200", "TSS1500", "Body", "1stExon", "3'UTR", "5'UTR" and "ExonBnd", and the user may specify any combination. Please note that "ExonBnd" is not an annotated feature on 450K arrays. For example, if you are interested in the promoter region only, you could specify genomic.features = c("TSS1500","TSS200","1stExon"). The default behaviour is to test all input CpGs in sig.cpg; this also applies if the user specifies "ALL" together with one or more other features.
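As a rough picture of what restricting by genomic feature means, the base R sketch below filters a toy annotation table down to promoter CpGs. The table anno.toy and its column names are hypothetical, and the semicolon-separated label format is an assumption made for the example; this is not the package's own filtering code.

```r
# Toy annotation (hypothetical); a CpG can carry several
# semicolon-separated feature labels
anno.toy <- data.frame(
  cpg   = c("cg01", "cg02", "cg03", "cg04"),
  group = c("TSS200", "Body;TSS1500", "Body", "3'UTR"),
  stringsAsFactors = FALSE
)
promoter <- c("TSS1500", "TSS200", "1stExon")

# Split the labels and keep a CpG if any label is a requested feature
feat.list <- strsplit(anno.toy$group, ";", fixed = TRUE)
keep <- vapply(feat.list, function(x) any(x %in% promoter), logical(1))
promoter.cpgs <- anno.toy$cpg[keep]
promoter.cpgs
```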

Genes associated with each CpG site are obtained from the annotation package IlluminaHumanMethylation450kanno.ilmn12.hg19 if the array type is "450K". For the EPIC array, the annotation package IlluminaHumanMethylationEPICanno.ilm10b4.hg19 is used. To use a different annotation package, please supply it using the anno argument.

To get a list that contains the mapped Entrez Gene IDs, please use the getMappedEntrezIDs function.

If you are interested in which genes overlap with the genes in the gene set, setting sig.genes to TRUE will output an additional column in the results data frame that contains all the significant differentially methylated gene symbols, comma separated. The default is FALSE.
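Because the gene symbols are returned as one comma-separated string per gene set, they usually need to be split before further use. A minimal base R sketch follows, using a hand-made stand-in for the results table; res.toy, its row names and the gene symbols are invented for illustration.

```r
# Hypothetical stand-in for the SigGenesInSet column of a results table
res.toy <- data.frame(
  SigGenesInSet = c("TP53,BRCA1", "EGFR"),
  row.names = c("setA", "setB"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a character vector per gene set
genes.by.set <- strsplit(res.toy$SigGenesInSet, ",", fixed = TRUE)
names(genes.by.set) <- rownames(res.toy)
genes.by.set
```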

A data frame with a row for each gene set and the following columns:

`N`	number of genes in the gene set
`DE`	number of genes that are differentially methylated
`P.DE`	p-value for over-representation of the gene set
`FDR`	False discovery rate, calculated using the method of Benjamini and Hochberg (1995).
`SigGenesInSet`	Significant differentially methylated genes overlapping with the gene set of interest. Only present if `sig.genes=TRUE`.
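The relationship between the P.DE and FDR columns can be illustrated with base R's p.adjust, which implements the Benjamini and Hochberg adjustment. The p-values below are made up, and the sketch assumes the adjustment is taken over all gene sets tested.

```r
# Hypothetical raw p-values for five gene sets
p.de <- c(0.001, 0.012, 0.030, 0.200, 0.650)

# Benjamini-Hochberg adjustment across the five tests
fdr <- p.adjust(p.de, method = "BH")
data.frame(P.DE = p.de, FDR = fdr)
```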

Belinda Phipson

Phipson, B., Maksimovic, J., and Oshlack, A. (2016). missMethyl: an R package for analysing methylation data from Illumina's HumanMethylation450 platform. Bioinformatics, 32(2), 286–288.

Geeleher, P., Hartnett, L., Egan, L. J., Golden, A., Ali, R. A. R., and Seoighe, C. (2013). Gene-set analysis is severely biased when applied to genome-wide methylation data. Bioinformatics, 29(15), 1851–1857.

Young, M. D., Wakefield, M. J., Smyth, G. K., and Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, 11, R14.

Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47.

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57, 289–300.

gometh, getMappedEntrezIDs

## Not run:  # to avoid timeout on Bioconductor build
library(missMethyl)  # provides gsameth and topGSA
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(org.Hs.eg.db)
library(limma)
ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
# Randomly select 1000 CpGs to be significantly differentially methylated
sigcpgs <- sample(rownames(ann),1000,replace=FALSE)
# All CpG sites tested
allcpgs <- rownames(ann)
# Use org.Hs.eg.db to extract a GO term
GOtoID <- suppressMessages(select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), 
                                  columns=c("ENTREZID","GO"), 
                                  keytype="ENTREZID"))
setname1 <- GOtoID$GO[1]
setname1
keep.set1 <- GOtoID$GO %in% setname1
set1 <- GOtoID$ENTREZID[keep.set1]
setname2 <- GOtoID$GO[2]
setname2
keep.set2 <- GOtoID$GO %in% setname2
set2 <- GOtoID$ENTREZID[keep.set2]
# Make the gene sets into a list
sets <- list(set1, set2)
names(sets) <- c(setname1,setname2)
# Testing with prior probabilities taken into account
# Plot of bias due to differing numbers of CpG sites per gene
gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, 
                plot.bias = TRUE, prior.prob = TRUE)
topGSA(gst)

# Add significant gene symbols in each set to output
gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, 
                plot.bias = TRUE, prior.prob = TRUE, sig.genes = TRUE)
topGSA(gst)

# Testing ignoring bias
gst.bias <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, 
                    prior.prob = FALSE)
topGSA(gst.bias)

# Restrict to CpGs in gene bodies
gst.body <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets,
                    genomic.features = "Body")
topGSA(gst.body)


## End(Not run)