Build a Minimalist Gene Ontology (GO) Database (GODB)
Barry Zeeberg
barryz2013@gmail.com
Motivation
The Gene Ontology (GO) Consortium (see https://geneontology.org/) maintains and provides a database relating genes and biological processes. This resource has been used extensively to analyze the results of gene expression studies in health and disease.
Building a full GO database (GODB) is fairly complicated: it involves downloading multiple database files and using them to build, for example, a MySQL database. Accessing that database is also complicated, since constructing reliable queries requires an intimate knowledge of its structure.
Here we have a more modest goal: GOGOA3, a stripped-down version of the GODB that is restricted to human genes as designated by the HUGO Gene Nomenclature Committee (HGNC) (see https://geneontology.org). It can be built in a matter of seconds from two easily downloaded files, and it can be queried to determine, for example, the mapping of a list of genes to GO categories.
Constructing the GODB
Two curated, publicly available files can be downloaded, processed, and then 'joined' to produce the desired minimalist GODB.
goa_human.gaf can be downloaded from https://current.geneontology.org/products/pages/downloads.html and processed by parseGOA() to generate a matrix (Figure 1) that relates human gene symbols to the identifiers of GO categories. On its own this matrix is not very useful, because it does not tell us what these categories are.
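As a rough illustration of the kind of processing this step involves, the sketch below reads GAF-formatted lines with base R and keeps the gene symbol and GO identifier columns. It is not the package's parseGOA(): the helper name sketchParseGOA, the column labels, and the toy records (including the placeholder references) are hypothetical. The only thing it relies on is the GAF layout, which is tab-separated, uses '!' for comment lines, and stores the symbol in column 3 and the GO identifier in column 5.

```r
# Hypothetical helper, not the package's parseGOA(): extract the gene symbol
# (GAF column 3) and the GO identifier (GAF column 5) from GAF-formatted lines.
sketchParseGOA <- function(lines) {
  lines  <- lines[!startsWith(lines, "!")]        # drop header/comment lines
  fields <- strsplit(lines, "\t", fixed = TRUE)   # GAF is tab-separated
  m <- cbind(HGNC = vapply(fields, `[`, character(1), 3),
             GO   = vapply(fields, `[`, character(1), 5))
  unique(m)                                       # one row per gene/category pair
}

# Toy records standing in for goa_human.gaf (made up for illustration)
gaf <- c("!gaf-version: 2.2",
         "UniProtKB\tP04637\tTP53\tenables\tGO:0003677\tPMID:1\tIDA\t\tF",
         "UniProtKB\tP04637\tTP53\tinvolved_in\tGO:0006915\tPMID:2\tIDA\t\tP",
         "UniProtKB\tP38398\tBRCA1\tinvolved_in\tGO:0006281\tPMID:3\tIDA\t\tP")
GOA <- sketchParseGOA(gaf)
print(GOA)
```

For the real file, the lines would come from something like readLines("goa_human.gaf") after downloading and decompressing it.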
Fortunately, go-basic.obo can be downloaded from https://geneontology.org/docs/download-ontology/ and processed by parseGOBASIC() to match up the GO identifiers and the category names (Figure 2).
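The matching of identifiers to names can be pictured with a small base R sketch that walks the [Term] stanzas of an OBO file and collects their id, name, and namespace tags. This is an assumption about the general approach, not the package's parseGOBASIC(): the helper name sketchParseOBO, the output column labels, and the toy stanzas are hypothetical.

```r
# Hypothetical helper, not the package's parseGOBASIC(): collect the id, name,
# and namespace tags of every [Term] stanza in OBO-formatted lines.
sketchParseOBO <- function(lines) {
  stanza <- cumsum(startsWith(lines, "["))   # a new stanza starts at each "[...]" line
  blocks <- split(lines, stanza)
  blocks <- blocks[vapply(blocks, function(b) b[1] == "[Term]", logical(1))]
  # Return the value of the first "key: value" line in a stanza
  tag <- function(b, key) {
    prefix <- paste0("^", key, ": ")
    sub(prefix, "", grep(prefix, b, value = TRUE)[1])
  }
  cbind(GO       = vapply(blocks, tag, character(1), "id", USE.NAMES = FALSE),
        NAME     = vapply(blocks, tag, character(1), "name", USE.NAMES = FALSE),
        ONTOLOGY = vapply(blocks, tag, character(1), "namespace", USE.NAMES = FALSE))
}

# Toy stanzas standing in for go-basic.obo
obo <- c("format-version: 1.2",
         "",
         "[Term]",
         "id: GO:0006915",
         "name: apoptotic process",
         "namespace: biological_process",
         "",
         "[Term]",
         "id: GO:0003677",
         "name: DNA binding",
         "namespace: molecular_function",
         "",
         "[Typedef]",
         "id: part_of",
         "name: part of")
GO <- sketchParseOBO(obo)
print(GO)
```

The header lines and the [Typedef] stanza are skipped because only stanzas whose first line is "[Term]" are kept.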
These two matrices can be 'joined' by joinGO() to produce the desired result, GOGOA (Figure 3). The entries in the column 'GO_NAME' combine the identifier with the descriptive name, eliminating colons and spaces so as to provide a 'safe' name in case it is used as a variable name or a filename in some applications.
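The join itself is conceptually an ordinary merge on the GO identifier. The sketch below uses base R merge() on two toy tables and then builds a 'safe' combined name. It is only an illustration, not the package's joinGO(): the input column names are hypothetical, and the exact format of GO_NAME (here, underscores replacing colons and spaces, with a double underscore between identifier and name) is an assumption.

```r
# Toy inputs with hypothetical column names, standing in for the two parsed matrices
GOA <- data.frame(HGNC = c("TP53", "TP53", "BRCA1"),
                  GO   = c("GO:0003677", "GO:0006915", "GO:0006281"))
GO  <- data.frame(GO       = c("GO:0003677", "GO:0006915", "GO:0006281"),
                  NAME     = c("DNA binding", "apoptotic process", "DNA repair"),
                  ONTOLOGY = c("molecular_function", "biological_process",
                               "biological_process"))

# Join gene-to-identifier rows with identifier-to-name rows on the shared GO column
GOGOA <- merge(GOA, GO, by = "GO")

# Assumed 'safe' name format: replace every colon and space with an underscore
GOGOA$GO_NAME <- gsub("[: ]", "_", paste(GOGOA$GO, GOGOA$NAME, sep = "__"))
print(GOGOA)
```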
GOGOA3 is a More Convenient Version of GOGOA!
GOGOA contains a column specifying the ontology ("biological_process", "molecular_function", or "cellular_component") for each row entry (Figure 3). In practice, a query of GOGOA will target one of these ontologies. Rather than requiring every query to filter for the desired ontology, the function subsetGOGOA() generates the more convenient database GOGOA3, which is essentially a list containing three separate versions of GOGOA, one for each ontology (Figure 4). GOGOA3 also has several additional components that provide convenient statistical information and metadata characterizing the three ontology databases (Figure 5).
Figure 4. Example of GOGOA3 'biological_process' Ontology
Figure 5. Components of GOGOA3
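The idea of splitting GOGOA once, so that later queries need no ontology filter, can be sketched with base R split(). This is not the package's subsetGOGOA(): the toy table and its column names are hypothetical, and of the list components shown here only the name 'ontologies' matches how GOGOA3 is accessed elsewhere in this document; the 'stats' component is an invented stand-in for the additional statistical information.

```r
# Toy GOGOA-like table (rows and column names are illustrative)
GOGOA <- data.frame(
  HGNC     = c("TP53", "TP53", "BRCA1", "BRCA1"),
  GO       = c("GO:0003677", "GO:0006915", "GO:0006281", "GO:0005634"),
  ONTOLOGY = c("molecular_function", "biological_process",
               "biological_process", "cellular_component")
)

# One sub-table per ontology, so a query can pick its ontology by name
ontologies <- split(GOGOA, GOGOA$ONTOLOGY)

# Hypothetical summary: distinct genes and categories in each ontology
stats <- vapply(ontologies,
                function(x) c(length(unique(x$HGNC)), length(unique(x$GO))),
                c(genes = 0L, categories = 0L))

GOGOA3sketch <- list(ontologies = ontologies, stats = stats)
print(GOGOA3sketch$stats)
```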
GOGOA3.RData and GOGOA.RData are too large to include in a CRAN package, but they can be generated by running the programs in the current package, or downloaded from https://github.com/barryzee/GO. For convenience, GO.RData, GOA.RData, and GODB.RData are provided in the data subdirectory and at https://github.com/barryzee/GO.
Using GOGOA3
GOGOA3 can be queried by a submitted list of genes to determine the distribution of mapping to GO categories (Figure 6).
    # GOGOA3 is a convenient structure representing the minimalist GODB
    # hgncList is a list of gene identifiers
    BP <- GOGOA3$ontologies[["biological_process"]]
    w <- which(BP[, "HGNC"] %in% hgncList)
    t <- table(BP[w, "NAME"])
Figure 6. Mapping of genes to GO categories
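To try the query pattern without the full database, the following sketch runs it against a tiny stand-in for GOGOA3. The toy rows, the gene list, and the final sorting step are illustrative assumptions; only the access path GOGOA3$ontologies[["biological_process"]] and the "HGNC" and "NAME" columns are taken from the query above.

```r
# Toy stand-in for GOGOA3 (rows are made up for illustration)
BP <- cbind(HGNC = c("TP53", "BRCA1", "TP53", "ATM", "EGFR"),
            NAME = c("apoptotic process", "DNA repair", "DNA repair",
                     "DNA repair", "cell population proliferation"))
GOGOA3 <- list(ontologies = list(biological_process = BP))

# Genes of interest
hgncList <- c("TP53", "BRCA1", "ATM")

# Keep the rows for the submitted genes, then count genes per category
bp     <- GOGOA3$ontologies[["biological_process"]]
hits   <- bp[bp[, "HGNC"] %in% hgncList, , drop = FALSE]
counts <- sort(table(hits[, "NAME"]), decreasing = TRUE)
print(counts)
```

Sorting the table in decreasing order puts the categories that capture the most submitted genes first, which is usually the part of the distribution of interest.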
My upcoming CRAN package GoMiner will use GOGOA3 to implement the GoMiner application first described in my previously published paper "GoMiner: a resource for biological interpretation of genomic and proteomic data" (see Zeeberg, B.R., Feng, W., Wang, G. et al. (2003)).