ontoProc: RDF ontology processing for Bioconductor



The ontoProc package includes tools for

Our primary objective is facilitating use of ontological metadata to simplify construction of formally annotated hierarchies of samples or features that should be traversed in analysis of complex genomic experiments.

The ontoProc package was developed to facilitate the coding of an ontology-driven visualizer of transcriptomic patterns in single-cell RNA-seq studies (tenXplore).


Application to cell type hierarchy

An enumeration of cell types

We used the Experimental Factor Ontology 'cell type' class (EFO_0000324) to obtain an enumeration of cell types. As of August 22, 2017, it is an open question whether Cell Ontology or Cellosaurus should be used for this purpose. The author's subjective impression is that EFO has a simpler collection of terms for cell types, while Cell Ontology has a better collection of terms for types of neurons.

Basic operations using ontologyIndex facilities

This package ships with an R serialization of an OBO representation of the Cell Ontology. This is created using get_OBO in the CRAN package ontologyIndex. (For ontologies available only in OWL format, the Python pronto package was used to convert to OBO.)

cellOnto = getCellOnto()

At this time, elementary manipulations of the ontology involve collecting the children, siblings, or labels for given term tags (for example, CL:0000540).

cochil = children_TAG("CL:0000540", cellOnto) 
label_TAG("CL:0000540", cellOnto)
siblings_TAG("CL:0000540", cellOnto) 
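
To make the three operations concrete, here is a minimal base-R sketch on a toy ontology. It is not how children_TAG, siblings_TAG, or label_TAG are implemented; the tags (TOY:0 and so on), the labels, and the helper names children_of and siblings_of are all hypothetical and exist only for illustration. The sketch assumes that an ontology can be reduced to a mapping from each term tag to its parent tags.

```r
# Toy ontology (hypothetical tags and labels): each tag maps to its parent tags.
parents <- list(
  "TOY:0" = character(0),
  "TOY:1" = "TOY:0",
  "TOY:2" = "TOY:1",
  "TOY:3" = "TOY:1",
  "TOY:4" = "TOY:0"
)
term_labels <- c(
  "TOY:0" = "cell",
  "TOY:1" = "neuron",
  "TOY:2" = "neuron subtype A",
  "TOY:3" = "neuron subtype B",
  "TOY:4" = "non-neuronal cell"
)

# Children: every term that lists `tag` among its parents.
children_of <- function(tag, parents) {
  names(parents)[vapply(parents, function(p) tag %in% p, logical(1))]
}

# Siblings: the other children of the parents of `tag`.
siblings_of <- function(tag, parents) {
  sibs <- unique(unlist(lapply(parents[[tag]], children_of, parents = parents)))
  as.character(setdiff(sibs, tag))
}

kids <- children_of("TOY:1", parents)
sibs <- siblings_of("TOY:1", parents)
lab  <- unname(term_labels["TOY:1"])
```

In this toy setting the label lookup is just indexing into a named character vector, which is also how the name component of an ontology object is queried in the grep examples later in this document.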

Application: finding genes annotated to neuron subtypes

We focus on mouse. The neuron subtypes identified as OWL subclasses of "neuron" have names

cleanNames = function(tset) {

We would like to see if the expression data would allow us to discriminate neurons of these different types.

Bridging from Cell Ontology to mouse genes

There is no formal linkage at present between terms of Cell Ontology and those of Gene Ontology. Research on inference of tissue of origin from expression signatures has led to accurate classifiers (Lee, Krishnan, Troyanskaya) and applications in cell mixture deconvolution (Houseman).
Formal work in ontology bridging has been described but the specific task of mapping from Cell Ontology terms to Gene Ontology terms has not culminated in any programmatically available resource.

We apply approximate pattern matching (agrep in R) to find Gene Ontology terms that are apparently relevant to cell type vocabulary terms of interest. These are then mapped to gene annotation. Simple (non-vectorized) functions that accomplish this in an organism-specific manner are straightforward to write using the OrgDb packages. We serialized all GO terms for convenience with this package, in the data object allGOterms.

cellTypeToGO("serotonergic neuron", gotab=allGOterms)
cellTypeToGenes("serotonergic neuron", orgDb=org.Mm.eg.db, gotab=allGOterms)
cellTypeToGenes("serotonergic neuron", orgDb=org.Hs.eg.db, gotab=allGOterms)
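
The two-step idea (approximate string match, then a join to gene annotation) can be sketched in base R without any annotation packages. This is only an illustration under stated assumptions: toy_terms stands in for allGOterms, toy_genes stands in for an OrgDb lookup, and every identifier, term name, and gene symbol below is made up. It is not the implementation of cellTypeToGO or cellTypeToGenes.

```r
# Made-up term table standing in for allGOterms (hypothetical IDs and names).
toy_terms <- data.frame(
  id = c("T:1", "T:2", "T:3", "T:4"),
  term = c("serotonergic neuron differentiation",
           "serotinergic neuron fate commitment",  # deliberate misspelling
           "neuron migration",
           "lipid metabolic process"),
  stringsAsFactors = FALSE
)

# Made-up term-to-gene table standing in for an OrgDb query.
toy_genes <- data.frame(
  id = c("T:1", "T:1", "T:3"),
  symbol = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)

# Step 1: approximate matching of the cell type phrase against term names.
# agrep tolerates a small number of edits, so the misspelled term also matches.
hit_idx <- agrep("serotonergic neuron", toy_terms$term, ignore.case = TRUE)
hits <- toy_terms[hit_idx, ]

# Step 2: join the matched terms to gene annotation (inner join on term id).
genes <- merge(hits, toy_genes, by = "id")
```

Note that a matched term with no annotated genes (T:2 here) silently drops out of the join, which is one reason the number of genes harvested can be much smaller than the number of matching terms.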

Discrimination of neuron types: exploratory multivariate analysis

At this point the API for selecting cell types, bridging to gene sets, and acquiring expression data is not well modularized. Thus the best ways to get a feel for it are to use the tenXplore() function and to read the source code. In brief, we often fail to find GO terms that approximately match, as strings, the Cell Ontology terms corresponding to cell subtypes. On the other hand, if we match on cell types, we get very large numbers of matches, which, at this time, need to be filtered to get manageable feature sets. We will introduce tools for generating additional RDF to improve gene harvesting in real time, but the associated statements will need to be curated. The EBI Webulous system should be useful for introducing new terms that facilitate better connections between anatomic structures and sets of genes or other genomic features.

Annotation of free text

The humrna data.frame supplied with the package is a small sample of metadata from NCBI Sequence Read Archive (SRA). The study title field has been serialized as minicorpus.


There is a convention in text analysis of identifying stop words that are unlikely to be very useful for interpretation. The dropStop function tokenizes the study titles and eliminates stop words.
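
As a rough illustration of what tokenizing and stop-word removal involve, here is a base-R sketch. It is not the implementation of dropStop; the function name drop_stop_sketch, the example titles, and the short stop-word list are assumptions made for this example only.

```r
# Hypothetical study titles and a deliberately tiny stop-word list.
titles <- c("RNA-seq of the human brain",
            "A study of doxycycline response in B cells")
stop_words <- c("a", "an", "and", "in", "of", "the")

drop_stop_sketch <- function(titles, stop_words) {
  lapply(titles, function(title) {
    # Lowercase, then split on runs of characters that are neither
    # alphanumeric nor a hyphen, so "RNA-seq" stays a single token.
    tokens <- strsplit(tolower(title), "[^[:alnum:]-]+")[[1]]
    tokens <- tokens[nzchar(tokens)]      # drop empty strings from leading separators
    tokens[!(tokens %in% stop_words)]     # remove stop words
  })
}

tokens <- drop_stop_sketch(titles, stop_words)
```

The result is a list with one character vector of retained tokens per title.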


My hope is that EMBL BioSolr will help index strings of this type with formal ontology terms. However, as a step in the general direction, we have the following examples.

cs = getCellosaurusOnto()
ch = getChebiOnto()
grep("P493", cs$name, value=TRUE, ignore.case=TRUE)
grep("doxycycline", ch$name, value=TRUE, ignore.case=TRUE)
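
The same lookup can be applied to a whole set of tokens at once. The sketch below assumes, as in the calls above, that ontology labels are held in a named character vector whose names are term tags; the vector toy_names, its tags, and the helper annotate_tokens are hypothetical and are not part of ontoProc.

```r
# Hypothetical label vector: names are term tags, values are term labels.
toy_names <- c("TOY:1" = "P493-6 cell line",
               "TOY:2" = "doxycycline",
               "TOY:3" = "HeLa")

# For each token, collect the ontology labels that contain it
# (case-insensitive, literal substring match), keeping the tags as names.
annotate_tokens <- function(tokens, onto_names) {
  hits <- lapply(tokens, function(tok) {
    onto_names[grepl(tolower(tok), tolower(onto_names), fixed = TRUE)]
  })
  names(hits) <- tokens
  hits[lengths(hits) > 0]   # keep only tokens with at least one match
}

ann <- annotate_tokens(c("p493-6", "doxycycline", "response"), toy_names)
```

Literal substring matching is a blunt instrument: short tokens will match many unrelated labels, so in practice some filtering or curation of the hits would still be needed.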

Based on PMID 10956386, P493-6 is an EBV-EBNA1 positive cell line, but that is not revealed in our image of the ontology. Will available tools help us to automate the systematic mapping of study concepts? Or will manual curation be necessary?

ontoProc documentation built on Nov. 1, 2018, 4:29 a.m.