go_enrich: Test gene sets for enrichment in GO-categories

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/go_enrich.R

Description

Tests GO-categories for enrichment of user defined gene sets, using either the hypergeometric (default), Wilcoxon rank-sum, binomial or 2x2 contingency table test.

Usage

1
2
3
go_enrich(genes, test = 'hyper', n_randsets = 1000, organismDb = 'Homo.sapiens', gene_len = FALSE,
    regions = FALSE, circ_chrom = FALSE, silent = FALSE, domains = NULL, orgDb = NULL,
    txDb = NULL, annotations = NULL, gene_coords = NULL,  godir = NULL)

Arguments

genes

a data.frame() with gene-symbols (character()) in the first column and test-dependent additional numeric() columns:
If test='hyper' (default) a second column with 1 for candidate genes and 0 for background genes. If no background genes are defined, all remaining genes from the internal dataset are used as background. All candidate genes are implicitly part of the background gene set.
If test='wilcoxon' a second column with the score that is associated with each gene.
If test='binomial' two additional columns with two gene-associated integers.
If test='contingency' four additional columns with four gene-associated integers.
For test='hyper' with regions=TRUE the first column describes chromosomal regions ('chr:start-stop', e.g. '9:0-39200000').
Note that each gene has to be unique in the data.frame; e.g. for test='wilcoxon' a gene cannot have two different scores assigned.

test

character(). 'hyper' (default) for the hypergeometric test, 'wilcoxon' for the Wilcoxon rank-sum test, 'binomial' for the binomial test and 'contingency' for the 2x2-contingency table test (fisher's exact test or chi-square).

n_randsets

integer defining the number of random sets for computing the FWER.

organismDb

character(). Annotation package for GO-annotations and gene coordinates. Besides the default 'Homo.sapiens' also 'Mus.musculus' and 'Rattus.norvegicus' are currently available on Bioconductor.

gene_len

logical. Correct for gene length. If TRUE the probability of a background gene to be chosen as a candidate gene in a random set is dependent on the gene length. If FALSE genes are chosen randomly with equal probability each. Only for test='hyper'.

regions

logical. If TRUE chromosomal regions are analyzed, and genes are automatically extracted from the regions defined in genes[,1] as e.g. '9:0-39200000'. Note that this option requires the input of background regions.

circ_chrom

logical. When genes defines chromosomal regions, circ_chrom=TRUE uses background regions from the same chromosome and allows randomly chosen blocks to overlap multiple background regions. Only if test='hyper'.

silent

logical. If TRUE all output to the screen except for warnings and errors is suppressed.

domains

optional character() vector containing one or more of the three GO-domains 'cellular_component', 'biological_process' and 'molecular_function'. If defined, the analysis will be reduced to those domains which saves time.\ cr If a custom ontology is specified in godir it might have a different domains.

orgDb

optional character() naming an OrgDb annotation package from Bioconductor. If orgDb is set, organismDb is not used. OrgDb annotations are available for a wider range of species, e.g. 'org.Pt.eg.db' for chimp and 'org.Gg.eg.db' for chicken. Note that options gene_len and regions also need a txDb for the gene coordinates (which are integrated in OrganismDb).

txDb

optional character() naming an TxDb annotation package from Bioconductor (e.g. 'TxDb.Ptroglodytes.UCSC.panTro4.refGene') for the gene coordinates. Only needed when orgDb is specified, and options gene_len or regions are set. Note that orgDb is needed whenever txDb is defined, even when custom annotations are provided, because the orgDb package is used for Entrez-ID to gene-symbol conversions.

annotations

optional data.frame() for custom annotations, with two character() columns: (1) gene-symbols and (2) GO-categories. Note that options gene_len and regions also need an organismDb or txDb + orgDb.

gene_coords

optional data.frame() for custom gene coordinates, with four columns: gene-symbols (character), chromosome (character), start (integer), end (integer). When gene_len=TRUE or regions=TRUE these custom gene coordinates are used instead of the ones obtained from organismDb or txDb.

godir

optional character() specifying a directory () that contains a custom GO-graph (files 'term.txt', 'term2term.txt' and 'graph_path.txt'). Alternative to the default integrated GO-graph.
For details please refer to the package's vignette.

Details

Please also refer to the package's vignette.
GO-annotations are taken from a Bioconductor annotation package (OrganismDb package 'Homo.sapiens' by default), but also other 'OrganismDb' or 'OrgDb' packages can be used. It is also possible to provide custom annotations as a data.frame().

The ontology graph is integrated, but a custom version can be defined as well with parameter 'godir'. As long as the ontology tables are in the right format (see link to description in vignette), any ontology can be used in GOfuncR, it is not restricted to the gene ontology.

The statistical analysis is based on the ontology enrichment software FUNC [2]. go_enrich offers four different statistical tests: (1) the hypergeometric test for a candidate and a background gene set; (2) the Wilcoxon rank-sum test for genes that are ranked by scores (e.g. p-value for differential expression); (3) the binomial test for genes that have two associated counts (e.g. amino-acid changes on the human and the chimp lineage); and (4) a 2x2-contingency table test for genes that have four associated counts (e.g. for a McDonald-Kreitman test).

To account for multiple testing family-wise error rates are computed using randomsets. Besides naming candidate genes explicitly, for the hypergeometric test it is also possible to provide entire genomic regions as input. The enrichment analysis is then performed for all genes located in or overlapping these regions and the multiple testing correction accounts for the spatial clustering of genes.

Value

A list with components

results

a data.frame() with the FWERs from the enrichment analyses per ontology category, ordered by 'FWER_overrep', 'raw_p_overrep', -'FWER_underrep', -'raw_p_underrep', 'ontology' and 'node_id', or the corresponding columns if another test then the hypergeometric test is used. This table contains the combined results for all three ontology domains. Note that GO-categories without any annotations of candidate or background genes are not listed.

genes

the input genes data.frame(), excluding those genes for which no GO-annotations are available and which therefore were not included in the enrichment analysis. If gene_len=TRUE, also genes without gene coordinates are excluded.

databases

a data.frame() with the used annotation packages and GO-graph.

min_p

a data.frame() with the minimum p-values from the randomsets that are used to compute the FWER.

Author(s)

Steffi Grote

References

[1] Ashburner, M. et al. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25-29. doi: 10.1038/75556
[2] Pruefer, K. et al. (2007). FUNC: A package for detecting significant associations between gene sets and ontological. BMC Bioinformatics 8: 41. doi: 10.1186/1471-2105-8-41

See Also

get_parent_nodes
get_child_nodes
get_anno_categories
get_anno_genes
plot_anno_scores
get_names
get_ids

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#### Note that argument 'n_randsets' is reduced 
#### to lower computational time in the following examples. 
#### Using the default value is recommended.

#### Perform a GO-enrichment analysis for some human genes 
#### with a defined background set
# create input dataframe that defines the candidate and backround genes
candi_gene_ids = c('NCAPG', 'APOL4', 'NGFR', 'NXPH4', 'C21orf59', 'CACNG2', 
    'AGTR1', 'ANO1', 'BTBD3', 'MTUS1', 'CALB1', 'GYG1', 'PAX2')
bg_gene_ids = c('FGR', 'NPHP1', 'DRD2', 'ABCC10', 'PTBP2', 'JPH4', 'SMARCC2',
    'FN1', 'NODAL', 'CYP1A2', 'ACSS1', 'CDHR1', 'SLC25A36', 'LEPR', 'PRPS2',
    'TNFAIP3', 'NKX3-1', 'LPAR2', 'PGAM2')
is_candidate = c(rep(1,length(candi_gene_ids)), rep(0,length(bg_gene_ids)))
genes = data.frame(gene_ids=c(candi_gene_ids, bg_gene_ids), is_candidate)
genes

# run enrichment analysis
go_res = go_enrich(genes, n_randset=100)

# go_enrich returns a list with 4 elements:
# 1) results from the anlysis
# (ordered by FWER for overrepresentation of candidate genes)
head(go_res[[1]])
# see the top GOs from every GO-domain
by(go_res[[1]], go_res[[1]][,'ontology'], head)
# 2) all valid input genes
go_res[[2]]
# 3) annotation databases used
go_res[[3]]
# 4) minimum p-values from randomsets
head(go_res[[4]])

#### see the package's vignette for more examples

GOfuncR documentation built on Nov. 8, 2020, 8:27 p.m.