title: "RPhenoscape Tutorial" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{RPhenoscape} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The desired input of many comparative studies is a matrix that includes all the data available for a set of traits and a focal taxon. Bringing similar types of trait data together across studies is notoriously difficult, but is straightforward with ontology annotations. Trait data that are tagged using similar ontology terms can be automatically called using R functions.
This workshop uses trait data for vertebrates that has been annotated (tagged) with ontology terms. These data are accessed from the Phenoscape KB. In this lesson we use R functions to obtain and view character matrices from the KB, obtain synthetic matrices of inferred presence/absence characters, and understand the meaning and relations of ontology terms. The lesson assumes basic knowledge of R and RStudio.
You would like to understand the evolution of morphological traits in a taxonomic group, and you need a matrix to map to your (typically molecular) phylogenetic tree. You come to the Phenoscape KB to see what types of data are available across the members of your clade. In this exercise, you are an ichthyologist, interested in catfishes (Siluriformes), specifically the bullhead catfishes in the family Ictaluridae.
You begin by querying ‘Siluriformes’ in the Phenoscape KB faceted browsing page. You immediately see the scope of the data:
From left to right across the top tabs, for ‘Siluriformes’:
Taxon annotations are the ontology terms tagged to original free text descriptions of phenotypes for each siluriform taxon. Click on the ‘Taxon annotations’ tab, and then the ‘Sources’ box on the right hand side of each row to see the publication source(s) and free text description of the phenotype to which each ontology annotation was tagged.
You are specifically interested in Ictaluridae (bullhead catfishes), and you click on that family in the faceted browsing interface. Now filter using ‘fin’ as the anatomical entity under ‘Query’ and include parts (check the ‘parts’ box.)
Download these taxa and their morphological trait data using the link provided.
Go to the publications tab, which shows nine publications that contain fin data. Note that any of these studies, e.g., Lundberg (1992), can be entered under ‘publications’ in faceted browsing, but the original matrix cannot be downloaded.
Now we will repeat some of the above steps in R, beginning with a query for ‘Ictaluridae’ and ‘fin’ (including parts) to retrieve the list of studies that contain fin characters for this taxon.
The RPhenoscape package provides convenient access to the Phenoscape Knowledgebase in the form of R functions returning R-native data structures. So that we don't have to prefix every call of the package's functions with the package name, we will first attach the package:
library(rphenoscape)
First let’s look up information about the query terms. The find_term()
function searches the KB for terms matching a given search text, and will return the ID (term IRI), label, source ontology IRI, and type of match (e.g., exact, broad). For example, search for terms containing the string “fin” in the label or synonym:
find_term("fin")
Note that because the KB integrates model organism phenotypes, the results show types of fin specific to model organisms, such as "fin (zebrafish)".
There are several parameters available to narrow and filter search results. For example, the parameter matchType
allows restricting to exact matches only:
find_term("fin", matchTypes = c("exact"))
To retrieve more details about a given term, such as definition, synonyms and classification, you can pass the term IRI (e.g., 'http://purl.obolibrary.org/obo/UBERON_0008897' for fin) to the function as.terminfo()
. Classification details include the superclass (parent) and subclass (child) relationships for a given ontology term, along with other relationships such as part_of.
For example, view the relationships and term info for “pectoral fin”:
pecfin <- find_term('pectoral fin', matchType='exact') as.terminfo(pecfin)
For a specific taxon, such as Siluriformes, you can also view the taxon rank, whether or not the taxon is extinct, and its common name:
taxon <- find_term('Siluriformes') as.terminfo(taxon)
Now we will query for ‘Ictaluridae’ and ‘fin’ (including parts) to retrieve the list of studies that contain fin characters for this taxon.
slist <- get_studies(taxon = "Ictaluridae", entity = "fin", includeRels = "part of")
The list is returned as a data.frame containing the study IDs and citations. Navigate to Global Environment and click on ‘slist’ to view the ID and citation (author, year) for each study in the KB that contains any ictalurid species associated with fin data. Note that taxon, entity, and quality are optional arguments in this query, and we have also included fin parts. Here we found nine studies. Click on ‘slist’ in the Global Environment to view these studies and their IDs.
Next we want to obtain the character matrix for one of these studies. Using the study IDs from the previous step, we can get a list of the character matrices as NeXML objects. We will choose the Lundberg (1992) publication, the fifth item on the list above.
lundberg_nexml <- get_study_data(slist$id[5])
Now go to Global Environment pane and click on ‘lundberg_nexml’ to view the study ID and associated NeXML object.
To view the full Lundberg (1992) matrix within RStudio, we need to first retrieve the matrix as a data.frame from the NeXML object:
lundberg_matrix <- get_char_matrix(lundberg_nexml[[1]], otus_id = FALSE)
You can take a look at a small part of the Lundberg (1992) matrix (e.g., the first five taxa and first five characters):
lundberg_matrix[1:5, 1:7]
Now navigate to Global Environment and click on ‘lundberg_matrix’ to view the matrix, including the total number of taxa and characters (i.e., the dimensions of the data.frame). To view the entire matrix in this panel, click on the first icon that appears to the right when you hover over the second row.
We can use the information in the NeXML object to convert character state symbols to labels:
state_symbols2labels(lundberg_nexml[[1]], charmat = lundberg_matrix)[1:5, 1:7]
Note that all taxa and characters from this study are returned, not just those pertaining to our original search terms. In this example, many anatomical entities other than ‘fin’ and its parts are returned as part of the original study matrix. In a later step, we will learn how to subset a matrix using anatomy or taxonomy terms.
In the previous step, we queried the KB for individual studies that contain characters pertaining to the fins in ictalurid catfish, and we obtained character matrices for source publications that included those characters. Now we want to get a matrix that synthesizes these characters into a single matrix.
OntoTrace from the KB combines characters from multiple publications into a single synthetic matrix. It allows a user to specify any number or combination of anatomical entities and taxa. It returns a matrix as a nexml object that includes only presence or absence of anatomical entities. It uses asserted data of presence or absence, i.e., originally described by an author, or inferred data. Data are inferred from asserted characters, e.g., if an author describes a part of a fin such as ‘fin ray’, the fin is inferred present and scored as such in the matrix.
The OntoTrace default settings are to include parts of the anatomical entity and to include variable-only characters, but here we will include those that are invariant as well.
nex <- get_ontotrace_data(taxon = "Ictaluridae", entity = "fin", variable_only = FALSE)
To view the matrix within RStudio, get the matrix (m) as a data.frame
from the NeXML object. Opening this (‘m’) in the Global Environment shows the taxa, character names, and state assignments (0, absent; 1, present) (note: the otu and otus are NeXML-internal IDs):
m <- get_char_matrix(nex)
The character and taxon lists for the matrix can also be returned. Click on ‘meta’ in the Global Environment pane to view the data.frames for taxa and entities:
meta <- get_char_matrix_meta(nex)
The above returns 29 species of ictalurid catfishes and 124 anatomical entities for fins and their parts.
You will need to go to the Phenoscape KB to view the supporting states, i.e., the source data upon which inferences were made or those that were directly asserted.
For example, in the returned OntoTrace matrix (see ‘m’ in Global Environment (row 10)) the pelvic fin is scored as present ('1') for Ictalurus australis. Enter these data on the faceted browsing page:
Looking at ‘Taxon annotations’, you will find that there are zero phenotypes, i.e., no asserted character states to ‘pelvic fin present’. Now click the ‘inferred presence’ radio button under Phenotypic quality. You will now see two phenotypes for this taxon in the KB. These phenotypes ‘pelvic fin lepidotrichium amount’ and ‘pelvic splint present’ logically imply the presence of a pelvic fin. Thus, though an author did not directly assert the presence of a pelvic fin in Ictalurus australis, it is inferred and thus reported as present (‘1’) in the OntoTrace matrix.
In the above step you retrieved a synthetic matrix for the Ictaluridae. Now you would like to pare it down in taxonomic and anatomical scope. Let’s begin by trimming the matrix down to the presence and absence characters pertaining only to the genus Ictalurus. We will reduce the matrix to include only descendents of Ictalurus.
First, we need to parse the list to determine which taxa in the matrix (‘m’) are descendants of Ictalurus. This returns a list of TRUE or FALSE assignments for each taxon in the matrix:
is_desc <- is_descendant('Ictalurus', m$taxa)
We then subset the matrix m to the descendants of Ictalurus:
ictalurus_m <- m[is_desc, ]
View the new matrix by clicking on the ‘ictalurus_m’ data.frame
in the Global Environment.
We can also subset the matrix by anatomy. In this case, let’s select characters for the pectoral fin and its parts. We will use is_descendant()
to first parse the characters (column names; note the first three taxon-related columns are excluded) to determine which are descendents (types or parts) of the term ‘pectoral fin’. Because this function only looks for descendent terms, we also need to specify that any characters pertaining to 'pectoral fin' itself are included.
is_desc_pecfin <- is_descendant('pectoral fin', colnames(ictalurus_m)[-c(1,2,3)], includeRels = "part_of") | colnames(ictalurus_m)[-c(1,2,3)] == 'pectoral fin' table(is_desc_pecfin)
Using the ‘colnames’ command below, view the character names to verify that pectoral fin related characters will be selected:
colnames(ictalurus_m)[-c(1,2,3)][is_desc_pecfin]
We now subset the matrix m using ‘is_desc_pecfin’ and retain the first three taxon-related columns:
to_keep <- c(TRUE, TRUE, TRUE, is_desc_pecfin) pecfin_m <- ictalurus_m[, to_keep]
View this matrix of pectoral fin characters for Ictalurus by clicking on ‘pecfin_m’ in the Global Environment.
Knowing how the information about an anatomical structure may rely on other layers of anatomical information is critical, if the goal is to isolate and analyze independent features. We can query the KB for a presence-absence dependency matrix for a specified set of anatomical structures. The dependency matrix is derived from the formalized knowledge of relationships among entities encoded in the ontology. In the example below, the presence of ‘pelvic splint’ is dependent on the presence of ‘pelvic fin’. In other words, the pelvic fin is implied present when the pelvic splint is present because ‘pelvic splint’ is part_of ‘pelvic fin'.
tl <- c("pectoral fin", "pelvic fin", "pelvic splint", "antorbital", "preopercle") dep_mat <- pa_dep_matrix(tl, .names = "label") dep_mat
In the PARAMO tutorial, we will learn how to use these dependency matrices to reconstruct individual traits and entire phenotypes.
We can also obtain similarity metrics for phenotypes based on ontological distance. Three similarity metrics - Jaccard, Tanimoto, and Cosine - are computed by the KB to measure similarity between terms (see RPhenoscape documentation in the Help panel for how these are measured). For example, here we compute and visualize Jaccard similarity for premaxillary tooth phenotypes in Ictaluridae:
phens <- get_phenotypes(entity = "premaxillary tooth", taxon = "Ictaluridae") nrow(phens) sm <- jaccard_similarity(terms = phens$id, .labels = phens$label, .colnames = "label") plot(hclust(as.dist(1-sm)))
In the PARAMO tutorial, we'll use a semantic similarity matrix to visualize traits in a heat map. We are also developing methods that use semantic similarity to cluster phenotypes into characters and character states.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.