goana: Gene Ontology or KEGG Pathway Analysis
In limma: Linear Models for Microarray Data

Description Usage Arguments Details Value Note Author(s) References See Also Examples

Test for over-representation of gene ontology (GO) terms or KEGG pathways in one or more sets of genes, optionally adjusting for abundance or gene length bias.

## S3 method for class 'MArrayLM'
goana(de, coef = ncol(de), geneid = rownames(de), FDR = 0.05, trend = FALSE, ...)
## S3 method for class 'MArrayLM'
kegga(de, coef = ncol(de), geneid = rownames(de), FDR = 0.05, trend = FALSE, ...)
## Default S3 method:
goana(de, universe = NULL, species = "Hs", prior.prob = NULL, covariate=NULL,
      plot=FALSE, ...)
## Default S3 method:
kegga(de, universe = NULL, restrict.universe = FALSE, species = "Hs", species.KEGG = NULL,
      convert = FALSE, gene.pathway = NULL, pathway.names = NULL,
      prior.prob = NULL, covariate=NULL, plot=FALSE, ...)
getGeneKEGGLinks(species.KEGG = "hsa", convert = FALSE)
getKEGGPathwayNames(species.KEGG = NULL, remove.qualifier = FALSE)

`de`	a character vector of Entrez Gene IDs, or a list of such vectors, or an `MArrayLM` fit object.
`coef`	column number or column name specifying for which coefficient or contrast differential expression should be assessed.
`geneid`	Entrez Gene identifiers. Either a vector of length `nrow(de)` or the name of the column of `de$genes` containing the Entrez Gene IDs.
`FDR`	false discovery rate cutoff for differentially expressed genes. Numeric value between 0 and 1.
`species`	character string specifying the species. Possible values include `"Hs"` (human), `"Mm"` (mouse), `"Rn"` (rat), `"Dm"` (fly) or `"Pt"` (chimpanzee), but other values are possible if the corresponding organism package is available. See `alias2Symbol` for other possible values. Ignored if `species.KEGG` or is not `NULL` or if `gene.pathway` and `pathway.names` are not `NULL`.
`species.KEGG`	three-letter KEGG species identifier. See http://www.kegg.jp/kegg/catalog/org_list.html or http://rest.kegg.jp/list/organism for possible values. Alternatively, if `de` contains KEGG ortholog Ids (`"k00001"` etc) instead of gene Ids, then set `species.KEGG="ko"`. This argument is ignored if `gene.pathway` and `pathway.names` are both not `NULL`.
`convert`	if `TRUE` then KEGG gene identifiers will be converted to NCBI Entrez Gene identifiers. Note that KEGG IDs are the same as Entrez Gene IDs for most species anyway.
`gene.pathway`	data.frame linking genes to pathways. First column gives gene IDs, second column gives pathway IDs. By default this is obtained automatically by `getGeneKEGGLinks(species.KEGG)`.
`remove.qualifier`	if `TRUE`, the species qualifier will be removed from the pathway names.
`pathway.names`	data.frame giving full names of pathways. First column gives pathway IDs, second column gives pathway names. By default this is obtained automatically using `getKEGGPathwayNames(species.KEGG, remove=TRUE)`.
`trend`	adjust analysis for gene length or abundance? Can be logical, or a numeric vector of covariate values, or the name of the column of `de$genes` containing the covariate values. If `TRUE`, then `de$Amean` is used as the covariate.
`universe`	vector specifying the set of Entrez Gene identifiers to be the background universe. If `NULL` then all Entrez Gene IDs associated with any gene ontology term will be used as the universe.
`restrict.universe`	logical, should the `universe` be restricted to gene identifiers found in at least one pathway in `gene.pathway`?
`prior.prob`	optional numeric vector of the same length as `universe` giving the prior probability that each gene in the universe appears in a gene set. Will be computed from `covariate` if the latter is provided. Ignored if `universe` is `NULL`.
`covariate`	optional numeric vector of the same length as `universe` giving a covariate against which `prior.prob` should be computed. Ignored if `universe` is `NULL`.
`plot`	logical, should the `prior.prob` vs `covariate` trend be plotted?
`...`	any other arguments in a call to the `MArrayLM` methods are passed to the corresponding default method.

These functions perform over-representation analyses for Gene Ontology terms or KEGG pathways. The default methods accept a gene set as a vector of Entrez Gene IDs or multiple gene sets as a list of such vectors. An over-represention analysis is then done for each set. The MArrayLM method extracts the gene sets automatically from a linear model fit object.

The p-values returned by goana and kegga are unadjusted for multiple testing. The authors have chosen not to correct automatically for multiple testing because GO terms and KEGG pathways are often overlapping, so standard methods of p-value adjustment may be very conservative. Users should be aware though that p-values are unadjusted, meaning that only very small p-values should be used for published results.

goana uses annotation from the appropriate Bioconductor organism package. The species can be any character string XX for which an organism package org.XX.eg.db is installed. Examples are "Hs" for human for "Mm" for mouse. See alias2Symbol for other possible values for species.

kegga reads KEGG pathway annotation from the KEGG website. For kegga, the species name can be provided in either Bioconductor or KEGG format. Examples of KEGG format are "hsa" for human, "mmu" for mouse of "dme" for fly. kegga can be used for any species supported by KEGG, of which there are more than 14,000 possibilities. By default, kegga obtains the KEGG annotation for the specified species from the http://rest.kegg.jp website using getGeneKEGGLinks and getKEGGPathwayNames. Alternatively one can supply the required pathway annotation to kegga in the form of two data.frames. If this is done, then an internet connection is not required.

The gene ID system used by kegga for each species is determined by KEGG. For human and mouse, the default (and only choice) is Entrez Gene ID. For Drosophila, the default is FlyBase CG annotation symbol. The format of the IDs can be seen by typing head(getGeneKEGGLinks(species)), for examplehead(getGeneKEGGLinks("hsa")) or head(getGeneKEGGLinks("dme")). Entrez Gene IDs can always be used. If Entrez Gene IDs are not the default, then conversion can be done by specifying "convert=TRUE".

Another possibility is to use KEGG orthology IDs as the gene IDs, and these can be used for any species. In that case, set species.KEGG="ko".

The ability to supply data.frame annotation to kegga means that kegga can in principle be used in conjunction with any user-supplied set of annotation terms.

The default goana and kegga methods accept a vector prior.prob giving the prior probability that each gene in the universe appears in a gene set. This vector can be used to correct for unwanted trends in the differential expression analysis associated with gene length, gene abundance or any other covariate (Young et al, 2010). The MArrayLM object computes the prior.prob vector automatically when trend is non-NULL.

If prior.prob=NULL, the function computes one-sided hypergeometric tests equivalent to Fisher's exact test. If prior probabilities are specified, then a test based on the Wallenius' noncentral hypergeometric distribution is used to adjust for the relative probability that each gene will appear in a gene set, following the approach of Young et al (2010).

The MArrayLM methods performs over-representation analyses for the up and down differentially expressed genes from a linear model analysis. In this case, the universe is all the genes found in the fit object.

trend=FALSE is equivalent to prior.prob=NULL. If trend=TRUE or a covariate is supplied, then a trend is fitted to the differential expression results and this is used to set prior.prob.

The statistical approach provided here is the same as that provided by the goseq package, with one methodological difference and a few restrictions. Unlike the goseq package, the gene identifiers here must be Entrez Gene IDs and the user is assumed to be able to supply gene lengths if necessary. The goseq package has additional functionality to convert gene identifiers and to provide gene lengths. The only methodological difference is that goana and kegga computes gene length or abundance bias using tricubeMovingAverage instead of monotonic regression. While tricubeMovingAverage does not enforce monotonicity, it has the advantage of numerical stability when de contains only a small number of genes.

The goana default method produces a data frame with a row for each GO term and the following columns:

`Term`	GO term.
`Ont`	ontology that the GO term belongs to. Possible values are `"BP"`, `"CC"` and `"MF"`.
`N`	number of genes in the GO term.
`DE`	number of genes in the `DE` set.
`P.DE`	p-value for over-representation of the GO term in the set.

The last two column names above assume one gene set with the name DE. In general, there will be a pair of such columns for each gene set and the name of the set will appear in place of "DE".

The goana method for MArrayLM objects produces a data frame with a row for each GO term and the following columns:

`Term`	GO term.
`Ont`	ontology that the GO term belongs to. Possible values are `"BP"`, `"CC"` and `"MF"`.
`N`	number of genes in the GO term.
`Up`	number of up-regulated differentially expressed genes.
`Down`	number of down-regulated differentially expressed genes.
`P.Up`	p-value for over-representation of GO term in up-regulated genes.
`P.Down`	p-value for over-representation of GO term in down-regulated genes.

The row names of the data frame give the GO term IDs.

The output from kegga is the same except that row names become KEGG pathway IDs, Term becomes Pathway and there is no Ont column.

kegga requires an internet connection unless gene.pathway and pathway.names are both supplied.

The default for kegga with species="Dm" changed from convert=TRUE to convert=FALSE in limma 3.27.8. Users wanting to use Entrez Gene IDs for Drosophila should set convert=TRUE, otherwise fly-base CG annotation symbol IDs are assumed (for example "Dme1_CG4637").

The default for restrict.universe=TRUE in kegga changed from TRUE to FALSE in limma 3.33.4.

Bug fix: results from kegga with trend=TRUE or with non-NULL covariate were incorrect prior to limma 3.32.3. The results were biased towards significant Down p-values and against significant Up p-values.

Gordon Smyth and Yifang Hu

Young, M. D., Wakefield, M. J., Smyth, G. K., Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology 11, R14. http://genomebiology.com/2010/11/2/R14

topGO, topKEGG

The goseq package provides an alternative implementation of methods from Young et al (2010). Unlike the limma functions documented here, goseq will work with a variety of gene identifiers and includes a database of gene length information for various species.

The gostats package also does GO analyses without adjustment for bias but with some other options.

See 10.GeneSetTests for a description of other functions used for gene set testing.

## Not run: 
## Linear model usage:

fit <- lmFit(y, design)
fit <- eBayes(fit)

# Standard GO analysis

go.fisher <- goana(fit, species="Hs")
topGO(go.fisher, sort = "up")
topGO(go.fisher, sort = "down")

# GO analysis adjusting for gene abundance

go.abund <- goana(fit, geneid = "GeneID", trend = TRUE)
topGO(go.abund, sort = "up")
topGO(go.abund, sort = "down")

# GO analysis adjusting for gene length bias
# (assuming that y$genes$Length contains gene lengths)

go.len <- goana(fit, geneid = "GeneID", trend = "Length")
topGO(go.len, sort = "up")
topGO(go.len, sort = "down")

## Default usage with a list of gene sets:

go.de <- goana(list(DE1 = EG.DE1, DE2 = EG.DE2, DE3 = EG.DE3))
topGO(go.de, sort = "DE1")
topGO(go.de, sort = "DE2")
topGO(go.de, ontology = "BP", sort = "DE3")
topGO(go.de, ontology = "CC", sort = "DE3")
topGO(go.de, ontology = "MF", sort = "DE3")

## Standard KEGG analysis

k <- kegga(fit, species="Hs")
k <- kegga(fit, species.KEGG="hsa") # equivalent to previous
topKEGG(k, sort = "up")
topKEGG(k, sort = "down")


## End(Not run)