---
title: "ontoProc: RDF ontology processing for Bioconductor"
author: "Vincent J. Carey, stvjc at"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{ontoProc: RDF ontology processing}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::pdf_document:
    toc: yes
    number_sections: yes
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

```{r setupp,echo=FALSE,results="hide"}
suppressWarnings({
suppressPackageStartupMessages({
library(ontoProc)
library(BiocStyle)
library(
library(
})
})
```

# Introduction

The `r Biocpkg("ontoProc")` package
includes tools for

- programming with ontology snapshots that are distributed with the package
- annotating free text with ontology tags

Our primary objective is facilitating use of ontological
metadata to simplify construction of formally
annotated hierarchies of samples or
features that should be traversed in analysis of complex
genomic experiments.

The ontoProc package was developed to facilitate
the coding of an ontology-driven visualizer of transcriptomic
patterns in single-cell RNA-seq studies (tenXplore).


# Application to cell type hierarchy

## An enumeration of cell types

We used the Experimental Factor Ontology 'cell type' class (EFO_0000324) to obtain
an enumeration of cell types.  As of August 22, 2017, it is an open question
whether Cell Ontology or
Cellosaurus should be used for this purpose.  The author's
subjective impression is that EFO has a simpler collection of terms for cell types,
while Cell Ontology has a better collection of terms for types of neurons.

## Basic operations using ontologyIndex facilities

This package ships with an R serialization
of an OBO representation of the Cell Ontology.
This is created using `get_OBO` in `r CRANpkg("ontologyIndex")`.
(For ontologies only available in OWL format, the Python pronto
package was used to convert to OBO.)
```{r useCO}
cellOnto = getCellOnto()
```

At this time, elementary manipulations of the ontology involve
collecting the children, siblings, or labels for given URIs.

```{r useCO2}
cochil = children_TAG("CL:0000540", cellOnto)
cochil
label_TAG("CL:0000540", cellOnto)
siblings_TAG("CL:0000540", cellOnto)
```
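To make these three operations concrete, here is a minimal base R sketch on a made-up ontology. It assumes only the general layout of an `ontologyIndex` object (a list with `id`, `name`, `parents`, and `children` components). The identifiers `T:1` to `T:4` and the helpers `label_of`, `children_of`, and `siblings_of` are hypothetical and are not part of ontoProc, whose `*_TAG` functions may differ in detail.

```r
# Toy ontology in the list layout used by ontologyIndex objects.
# All ids and labels are invented for illustration.
toy = list(
  id = c("T:1", "T:2", "T:3", "T:4"),
  name = c("T:1" = "cell", "T:2" = "neuron",
           "T:3" = "glial cell", "T:4" = "motor neuron"),
  parents = list("T:1" = character(0), "T:2" = "T:1",
                 "T:3" = "T:1", "T:4" = "T:2"),
  children = list("T:1" = c("T:2", "T:3"), "T:2" = "T:4",
                  "T:3" = character(0), "T:4" = character(0))
)

# Label of a term: look up its id in the named vector of labels.
label_of = function(id, onto) unname(onto$name[id])

# Direct children of a term.
children_of = function(id, onto) onto$children[[id]]

# Siblings: the other children of the term's parents.
siblings_of = function(id, onto) {
  kids = unname(unlist(onto$children[onto$parents[[id]]]))
  setdiff(kids, id)
}

label_of("T:2", toy)
children_of("T:1", toy)
siblings_of("T:2", toy)
```

Whether a term counts among its own siblings is a design choice; this sketch excludes it.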

# Application: finding genes annotated to neuron subtypes

We focus on mouse.  The neuron subtypes identified as
OWL subclasses of "neuron" have names
```{r getcl}
cleanNames = function(tset) {
```

We would like to see if the expression data would allow us to discriminate neurons of these different types.

## Bridging from Cell Ontology to mouse genes

There is no formal linkage at present between terms of Cell Ontology and those of Gene Ontology. Research on inference of tissue of origin from expression signatures has led to accurate classifiers (Lee, Krishnan, Troyanskaya) and applications in cell mixture deconvolution (Houseman). Formal work in ontology bridging has been described, but the specific task of mapping from Cell Ontology terms to Gene Ontology terms has not culminated in any programmatically available resource.

We apply approximate pattern matching (`agrep` in R) to find Gene Ontology terms that are apparently relevant to cell type vocabulary terms of interest. These are then mapped to gene annotation. Simple (non-vectorized) functions that accomplish this in an organism-specific manner are straightforward to write using the OrgDb packages. We serialized all GO terms for convenience with this package, in the data object `allGOterms`.

```{r lkfuns}
data(allGOterms)
cellTypeToGO("serotonergic neuron", gotab=allGOterms)
cellTypeToGenes("serotonergic neuron",, gotab=allGOterms)
cellTypeToGenes("serotonergic neuron",, gotab=allGOterms)
```
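
The approximate matching step can be tried out with base R alone. The sketch below runs `agrep` over a few hand-written strings that stand in for GO term names (they are illustrative and not drawn from `allGOterms`). The helper `fuzzy_terms` is hypothetical, and `cellTypeToGO` may use different matching settings.

```r
# Stand-in term names; hypothetical, for illustration only.
go_like = c("serotonergic neuron differentiation",
            "dopaminergic neuron differentiation",
            "protein folding")

# Return the candidate terms that approximately contain `celltype`.
# max.distance = 0.1 allows edits up to 10% of the pattern length.
fuzzy_terms = function(celltype, terms, max.distance = 0.1) {
  agrep(celltype, terms, max.distance = max.distance,
        ignore.case = TRUE, value = TRUE)
}

hits = fuzzy_terms("serotonergic neuron", go_like)
hits
```

Raising `max.distance` admits looser matches, so the tolerance is one knob that affects how many candidate terms come back for a given cell type label.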

## Discrimination of neuron types: exploratory multivariate analysis

At this point the API for selecting cell types, bridging to gene
sets, and acquiring expression data, is not well-modularized.  Thus
the best ways to get a feel for it are to use the `tenXplore()` function,
and to read the source code.  In brief, we often fail to find
GO terms that approximately match, as strings, Cell Ontology
terms corresponding to cell subtypes.  On the other hand, if we
match on cell types, we get very large numbers of matches, which,
at this time,
will need to be filtered to get manageable feature sets.  We 
will introduce tools for generating
additional RDF to improve gene harvesting in real time.  But the
associated statements will need to be curated.  The EBI Webulous
system should be useful for introducing new terms that
facilitate better connections between anatomic structures and
sets of genes or other genomic features.

# Annotation of free text

The `humrna` data.frame supplied with the package is a small
sample of metadata from NCBI Sequence Read Archive (SRA).  The
`study title` field has been serialized as `minicorpus`.

```{r lkmc}
```

There is a convention in text analysis of identifying stop words
that are unlikely to be very useful for interpretation.  The
`dropStop` function tokenizes the study titles and eliminates stop words.

```{r lksto}
dropStop(head(minicorpus))
```
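
The idea behind this step can be sketched in a few lines of base R. This is not the implementation of `dropStop`; the stop word list, the tokenization rule, the example titles, and the function name `drop_stop_sketch` are all assumptions made for illustration.

```r
# Hypothetical stop word list; a real list would be much longer.
stop_words = c("a", "an", "and", "by", "for", "in", "of", "the", "to")

# Lower-case each title, split on runs of non-alphanumeric characters,
# then drop empty tokens and stop words.
drop_stop_sketch = function(titles, stops = stop_words) {
  tokens = strsplit(tolower(titles), "[^a-z0-9]+")
  lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% stops)])
}

# Made-up titles standing in for entries of minicorpus.
cleaned = drop_stop_sketch(c("RNA-seq of the human brain",
                             "Analysis of T cells in mouse"))
cleaned
```

Splitting on punctuation breaks hyphenated terms such as "RNA-seq" into two tokens, which may or may not be desirable when the tokens are later matched against ontology labels.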

My hope is that EMBL BioSolr will help index strings of this
type with formal ontology terms.  However, as a step in that
general direction, we have the following examples.

```{r lk1}
cs = getCellosaurusOnto()
ch = getChebiOnto()
grep("P493", cs$name, value=TRUE)
grep("doxycyline", ch$name, value=TRUE)
```

Based on PMID 10956386, P493-6 is an EBV-EBNA1 positive cell line, but that is not revealed in our image of the ontology. Will available tools help us to automate the systematic mapping of study concepts? Or will manual curation be necessary?


ontoProc documentation built on Nov. 1, 2018, 4:29 a.m.