R/gometh.R
In missMethyl: Analysing Illumina HumanMethylation BeadChip Data

Documented in gometh

#' Gene ontology testing for Ilumina methylation array data
#' 
#' Tests gene ontology enrichment for significant CpGs from Illumina's Infinium
#' HumanMethylation450 or MethylationEPIC array, taking into account two 
#' different sources of bias: 1) the differing number of probes per gene 
#' present on the array, and 2) CpGs that are annotated to multiple genes.
#' 
#' This function takes a character vector of significant CpG sites, maps the
#' CpG sites to Entrez Gene IDs, and tests for GO term or KEGG pathway
#' enrichment using a Wallenius' non central hypergeometric test, taking into 
#' account the number of CpG sites per gene on the 450K/EPIC array and 
#' multi-gene annotated CpGs.  Geeleher et al. (2013) showed
#' that a severe bias exists when performing gene set analysis for genome-wide
#' methylation data that occurs due to the differing numbers of CpG sites
#' profiled for each gene. \code{gometh} is based on the \code{goseq} method
#' (Young et al., 2010), and is a modification of the \code{goana} function in 
#' the \code{limma} package. If \code{prior.prob} is set to
#' FALSE, then prior probabilities are not used and it is assumed that each
#' gene is equally likely to have a significant CpG site associated with it.
#' 
#' The testing now also takes into account that some CpGs are annotated to 
#' multiple genes. For a small number of gene families, this previously caused 
#' their associated GO categories/gene sets to be erroneously overrepresented 
#' and thus highly significant. If \code{fract.counts=FALSE} then CpGs are 
#' allowed to map to multiple genes (this is NOT recommended).
#' 
#' A new feature of \code{gometh} and \code{gsameth} is the ability to restrict 
#' the input CpGs by genomic feature with the argument \code{genomic.features}. 
#' The possible options include "ALL", "TSS200", "TSS1500", "Body", "1stExon", 
#' "3'UTR", "5'UTR" and "ExonBnd", and the user may specify any combination. 
#' Please not that "ExonBnd" is not an annotated feature on 450K arrays. For 
#' example if you are interested in the promoter region only, you could specify 
#' \code{genomic.features = c("TSS1500","TSS200","1stExon")}. The default 
#' behaviour is to test all input CpGs \code{sig.cpg} even if the user specifies
#'  "ALL" and one or more other features.
#' 
#' Genes associated with each CpG site are obtained from the annotation package
#' \code{IlluminaHumanMethylation450kanno.ilmn12.hg19} if the array type is
#' "450K". For the EPIC array, the annotation package
#' \code{IlluminaHumanMethylationEPICanno.ilm10b4.hg19} is used. To use a
#' different annotation package, please supply it using the \code{anno}
#' argument. 
#' 
#' If you are interested in which genes overlap with the genes in the gene set, 
#' setting \code{sig.genes} to TRUE will output an additional column in the 
#' results data frame that contains all the significant differentially 
#' methylated gene symbols, comma separated. The default is FALSE.
#' 
#' In order to get a list which contains the mapped Entrez gene IDs,
#' please use the \code{getMappedEntrezIDs} function. \code{gometh} tests all
#' GO or KEGG terms, and false discovery rates are calculated using the method
#' of Benjamini and Hochberg (1995).  The \code{topGSA} function can be used to 
#' display the top 20 most enriched pathways.
#' 
#' For more generalised gene set testing where the user can specify the gene
#' set/s of interest to be tested, please use the \code{gsameth} function.
#' If you are interested in performing gene set testing following a region 
#' analysis, then the functions \code{goregion} and \code{gsaregion} can be 
#' used.
#' 
#' @param sig.cpg Character vector of significant CpG sites to test for GO term
#' enrichment.
#' @param all.cpg Character vector of all CpG sites tested. Defaults to all CpG
#' sites on the array.
#' @param collection The collection of pathways to test. Options are "GO" and
#' "KEGG". Defaults to "GO".
#' @param array.type The Illumina methylation array used. Options are "450K" or
#' "EPIC". Defaults to "450K".
#' @param plot.bias Logical, if true a plot showing the bias due to the
#' differing numbers of probes per gene will be displayed.
#' @param prior.prob Logical, if true will take into account the probability of
#' significant differential methylation due to numbers of probes per gene. If
#' false, a hypergeometric test is performed ignoring any bias in the data.
#' @param anno Optional. A \code{DataFrame} object containing the complete
#' array annotation as generated by the \code{\link{minfi}}
#' \code{\link{getAnnotation}} function. Speeds up execution, if provided.
#' @param equiv.cpg Logical, if true then equivalent numbers of cpgs are used
#' for odds calculation rather than total number cpgs. Only used if 
#' \code{prior.prob=TRUE}.
#' @param fract.counts Logical, if true then fractional counting of Cpgs is used
#' to account for CpGs that are annotated to multiple genes. Only used if 
#' \code{prior.prob=TRUE}.
#' @param genomic.features Character vector or scalar indicating whether the 
#' gene set enrichment analysis should be restricted to CpGs from specific 
#' genomic locations. Options are "ALL", "TSS200","TSS1500","Body","1stExon",
#' "3'UTR","5'UTR","ExonBnd"; and the user can select any combination. Defaults
#' to "ALL".
#' @param sig.genes Logical, if true then the significant differentially 
#' methylated genes that overlap with the gene set of interest is outputted 
#' as the final column in the results table. Default is FALSE.
#' 
#' @return A data frame with a row for each GO or KEGG term and the following
#' columns: \item{Term}{ GO term if testing GO pathways } \item{Ont}{ ontology
#' that the GO term belongs to if testing GO pathways. "BP" - biological
#' process, "CC" - cellular component, "MF" - molecular function.  }
#' \item{Pathway}{ the KEGG pathway being tested if testing KEGG terms.  }
#' \item{N}{ number of genes in the GO or KEGG term } \item{DE}{ number of
#' genes that are differentially methylated } \item{P.DE}{ p-value for
#' over-representation of the GO or KEGG term term } \item{FDR}{ False
#' discovery rate } \item{SigGenesInSet}{ Significant differentially methylated 
#' genes overlapping with the gene set of interest. }
#' 
#' @author Belinda Phipson
#' @seealso \code{\link{gsameth},\link{goregion},\link{gsaregion},
#' \link{getMappedEntrezIDs}}
#' 
#' @references Phipson, B., Maksimovic, J., and Oshlack, A. (2016). missMethyl:
#' an R package for analysing methylation data from Illuminas
#' HumanMethylation450 platform. \emph{Bioinformatics}, \bold{15};32(2),
#' 286--8. 
#' 
#' Geeleher, P., Hartnett, L., Egan, L. J., Golden, A., Ali, R. A. R.,
#' and Seoighe, C. (2013). Gene-set analysis is severely biased when applied to
#' genome-wide methylation data. \emph{Bioinformatics}, \bold{29}(15),
#' 1851--1857. 
#' 
#' Young, M. D., Wakefield, M. J., Smyth, G. K., and Oshlack, A.
#' (2010). Gene ontology analysis for RNA-seq: accounting for selection bias.
#' \emph{Genome Biology}, 11, R14. 
#' 
#' Ritchie, M. E., Phipson, B., Wu, D., Hu, Y.,
#' Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential
#' expression analyses for RNA-sequencing and microarray studies. \emph{Nucleic
#' Acids Research}, gkv007. 
#' 
#' Benjamini, Y., and Hochberg, Y. (1995). Controlling
#' the false discovery rate: a practical and powerful approach to multiple
#' testing. \emph{Journal of the Royal Statistical Society Series}, B,
#' \bold{57}, 289-300.
#' 
#' @examples
#' 
#' \dontrun{ # to avoid timeout on Bioconductor build
#' library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
#' library(limma)
#' ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
#' # Randomly select 1000 CpGs to be significantly differentially methylated
#' sigcpgs <- sample(rownames(ann),1000,replace=FALSE)
#' # All CpG sites tested
#' allcpgs <- rownames(ann)
#' 
#' # GO testing with prior probabilities taken into account
#' # Plot of bias due to differing numbers of CpG sites per gene
#' gst <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO", 
#'               plot.bias = TRUE, prior.prob = TRUE, anno = ann)
#' # Total number of GO categories significant at 5% FDR
#' table(gst$FDR<0.05)
#' # Table of top GO results
#' topGSA(gst)
#' 
#' # GO testing ignoring bias
#' gst.bias <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO", 
#'                     prior.prob=FALSE, anno = ann)
#' # Total number of GO categories significant at 5% FDR ignoring bias
#' table(gst.bias$FDR<0.05)
#' # Table of top GO results ignoring bias
#' topGSA(gst.bias)
#' 
#' # GO testing ignoring multi-mapping CpGs
#' gst.multi <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "GO", 
#'               plot.bias = TRUE, prior.prob = TRUE, fract.counts = FALSE,
#'               anno = ann)
#' topGSA(gst.multi, n=10)
#' 
#' # Restrict to CpGs in promoter regions 
#' gst.promoter <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, 
#'                 collection = "GO", anno = ann, 
#'                 genomic.features=c("TSS200","TSS1500","1stExon"))
#' topGSA(gst.promoter)
#' 
#' # KEGG testing
#' kegg <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = "KEGG", 
#'                 prior.prob=TRUE, anno = ann)
#' # Table of top KEGG results
#' topGSA(kegg)
#' 
#' # Add significant genes to KEGG output
#' kegg.siggenes <- gometh(sig.cpg = sigcpgs, all.cpg = allcpgs, 
#'                         collection = "KEGG", anno = ann, sig.genes = TRUE)
#' # Output top 5 KEGG pathways
#' topGSA(kegg.siggenes, n=5)
#' }
#' 
#' @import org.Hs.eg.db statmod stringr
#' @importFrom methods is
#' @export gometh
gometh <- function(sig.cpg, all.cpg=NULL, collection=c("GO","KEGG"), 
                   array.type = c("450K","EPIC"), plot.bias=FALSE, 
                   prior.prob=TRUE, anno=NULL, equiv.cpg = TRUE, 
                   fract.counts = TRUE, 
                   genomic.features = c("ALL", "TSS200","TSS1500","Body",
                                       "1stExon","3'UTR","5'UTR","ExonBnd"),
                   sig.genes = FALSE)
  # Gene ontology testing or KEGG pathway analysis for Illumina methylation 
  # arrays based on goseq
  # Takes into account probability of differential methylation based on
  # numbers of probes on array per gene
  # Belinda Phipson
  # 28 January 2015. Last updated 1 September 2020.
  # EPIC functionality contributed by Andrew Y.F. Li Yim
{
  array.type <- match.arg(toupper(array.type), c("450K","EPIC"))    
  collection <- match.arg(toupper(collection), c("GO","KEGG"))
  genomic.features <- match.arg(genomic.features, c("ALL", "TSS200","TSS1500",
                                                    "Body", "1stExon","3'UTR",
                                                    "5'UTR","ExonBnd"), 
                                several.ok = TRUE)
  
  if(length(genomic.features) > 1 & any(grepl("ALL", genomic.features))){
    message("All input CpGs are used for testing.") 
    genomic.features <- "ALL"
  } 
  
  if(array.type == "450K" & any(grepl("ExonBnd", genomic.features))){
      stop("'ExonBnd' is not an annotated feature on 450K arrays,\n
           please remove it from your genomic.feature parameter\n
           specification.") 
  }
   
  if(collection == "GO"){
    go <- .getGO()
    result <- gsameth(sig.cpg=sig.cpg, all.cpg=all.cpg, collection=go$idList, 
                      array.type=array.type, plot.bias=plot.bias, 
                      prior.prob=prior.prob, anno=anno, equiv.cpg=equiv.cpg,
                      fract.counts=fract.counts, 
                      genomic.features = genomic.features,
                      sig.genes = sig.genes)
    result <- merge(go$idTable,result,by.x="GOID",by.y="row.names")
    rownames(result) <- result$GOID

  } else if(collection == "KEGG"){
    kegg <- .getKEGG()
    result <- gsameth(sig.cpg=sig.cpg, all.cpg=all.cpg, collection=kegg$idList, 
                      array.type=array.type, plot.bias=plot.bias, 
                      prior.prob=prior.prob, anno=anno, equiv.cpg=equiv.cpg,
                      fract.counts=fract.counts, 
                      genomic.features = genomic.features,
                      sig.genes = sig.genes)
    result <- merge(kegg$idTable,result,by.x="PathwayID",by.y="row.names")
    rownames(result) <- result$PathwayID
  }
  
  result[,-1]
}  

.getGO <- function(){
  if(!requireNamespace("org.Hs.eg.db", quietly = TRUE))
    stop("org.Hs.eg.db package required but not installed.")
  egGO2ALLEGS <- utils::getFromNamespace("org.Hs.egGO2ALLEGS", "org.Hs.eg.db")
  GeneID.PathID <- AnnotationDbi::toTable(egGO2ALLEGS)[,c("gene_id", "go_id", "Ontology")]
  d <- !duplicated(GeneID.PathID[, c("gene_id", "go_id")])
  GeneID.PathID <- GeneID.PathID[d, ]
  GOID.TERM <- suppressMessages(AnnotationDbi::select(GO.db::GO.db, 
                                                      keys=unique(GeneID.PathID$go_id), 
                                                      columns=c("GOID","ONTOLOGY","TERM"), 
                                                      keytype="GOID"))
  go <- tapply(GeneID.PathID$gene_id, GeneID.PathID$go_id, list)
    
  list(idList=go, idTable=GOID.TERM)
}

.getKEGG <- function(){
  GeneID.PathID <- limma::getGeneKEGGLinks(species.KEGG = "hsa", convert = TRUE)
  isna <- rowSums(is.na(GeneID.PathID[, 1:2])) > 0.5
  GeneID.PathID <- GeneID.PathID[!isna, ]
  ID.ID <- paste(GeneID.PathID[, 1], GeneID.PathID[, 2], sep = ".")
  d <- !duplicated(ID.ID)
  GeneID.PathID <- GeneID.PathID[d, ]
  PathID.PathName <- limma::getKEGGPathwayNames(species.KEGG = "Hsa", 
                                         remove.qualifier = TRUE)
  GeneID.PathID <- merge(GeneID.PathID, PathID.PathName, by="PathwayID")
  kegg <- tapply(GeneID.PathID$GeneID, GeneID.PathID$PathwayID, list)
  
  list(idList = kegg, idTable = PathID.PathName)
}  

.plotBias <- function(D,bias)
  # Plotting function to show gene level CpG density bias
  # Belinda Phipson
  # 5 March 2015
{
  o <- order(bias)
  splitf <- rep(1:100,each=200)[1:length(bias)]
  avgbias <- tapply(bias[o],factor(splitf),mean)
  sumDM <- tapply(D[o],factor(splitf),sum)
  propDM <- sumDM/table(splitf)
  graphics::par(mar=c(5,5,2,2))
  graphics::plot(avgbias,as.vector(propDM),
                 xlab="Number of CpGs per gene (binned)",
                 ylab="Proportion Differential Methylation",cex.lab=1.5,
                 cex.axis=1.2)
  graphics::lines(stats::lowess(avgbias,propDM),col=4,lwd=2)
}

.estimatePWF <- function(D,bias)
  # An alternative to goseq function nullp, which is transformation invariant
  # Belinda Phipson and Gordon Smyth
  # 6 March 2015
{
  prior.prob <- bias
  o <- order(bias)
  prior.prob[o] <- limma::tricubeMovingAverage(D[o],span=0.5)
  prior.prob
}

.getFlatAnnotation <- function(array.type=c("450K","EPIC"),anno=NULL)
  # flatten 450k or EPIC array annotation
  # Jovana Maksimovic
  # 18 September 2018
  # Updated 18 September 2018
  # Modified version of Belida Phipson's .flattenAnn code
{
  if(is.null(anno)){
    if(array.type=="450K"){
      anno <- minfi::getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19::IlluminaHumanMethylation450kanno.ilmn12.hg19)
    } else {
      anno <- minfi::getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19::IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
    }
  }
  
  # get rid of the non-CpG sites
  ann.keep<-anno[grepl("^cg",anno$Name),]
  
  # get rid of CpGs that are not annotated
  missing<-ann.keep$UCSC_RefGene_Name==""
  ann.keep<-ann.keep[!missing,]
  
  # get individual gene names for each CpG
  geneslist<-strsplit(ann.keep$UCSC_RefGene_Name,split=";")
  names(geneslist)<-rownames(ann.keep)
  
  grouplist<-strsplit(ann.keep$UCSC_RefGene_Group,split=";")
  names(grouplist)<-rownames(ann.keep)
  
  flat<-data.frame(symbol=unlist(geneslist),group=unlist(grouplist))
  flat$symbol<-as.character(flat$symbol)
  flat$group <- as.character(flat$group)
  
  flat$cpg<- substr(rownames(flat),1,10)
  
  #flat$cpg <- rownames(flat)
  flat$alias <- suppressWarnings(limma::alias2SymbolTable(flat$symbol))
  
  #eg <- toTable(org.Hs.egSYMBOL2EG)
  eg <- suppressMessages(AnnotationDbi::select(org.Hs.eg.db, 
                                keys=AnnotationDbi::keys(org.Hs.eg.db), 
                                columns=c("ENTREZID","SYMBOL"), 
                                keytype="ENTREZID"))
  colnames(eg) <- c("gene_id","symbol")
  
  m <- match(flat$alias,eg$symbol)
  flat$entrezid <- eg$gene_id[m]
  flat <- flat[!is.na(flat$entrezid),]
  
  # keep unique cpg by gene name annotation
  id<-paste(flat$cpg,flat$entrezid,sep=".")
  d <- duplicated(id)
  flat.u <- flat[!d,]
  flat.u
  # This randomly samples only 1 gene ID for multimapping CpGs
  #.reduceMultiMap(flat.u)
}

# .reduceMultiMap <- function(flat){
#   mm <- table(flat$cpg)
#   mm <- names(mm[mm > 2])
#   red <- tapply(flat$entrezid[flat$cpg %in% mm],
#                 flat$cpg[flat$cpg %in% mm], sample, 1)
#   key <- paste(flat$cpg,flat$entrezid,sep=".")
#   rkey <- paste(names(red),red,sep = ".")
#   w <- which(key %in% rkey)
#   
#   newmm <- flat[w,] 
#   newflat <- flat[-which(flat$cpg %in% mm),]
#   newflat <- rbind(newflat, newmm)
#   newflat
# }