library("methods")
library("knitr")
opts_chunk$set(tidy = FALSE, warning = FALSE, message = FALSE, cache = 1, comment = NA, verbose = TRUE)
basename <- gsub(".Rmd", "", knitr:::knit_concord$get('infile'))
opts_chunk$set(fig.path = paste("components/figure/", basename, "-", sep=""), cache.path = paste("components/cache/", basename, "/", sep=""))
Users of the popular statistical and mathematical computing platform R [@R] enjoy a wealth of readily installable comparative phylogenetic methods and tools [@taskview]. Exploiting the opportunities arising from this wealth for complex and integrative comparative research questions relies on the ability to reuse and integrate previously generated or published data and metadata. The expanding data exchange needs of the evolutionary research community are rapidly outpacing the capabilities of most current and widely used data exchange standards [@Vos_2012], which were all developed a decade or more ago. This has resulted in a radiation of different data representations and exchange standard "flavors" that are no longer interoperable at the very time when the growth of available data and methods has made that interoperability most valuable. In response to the unmet needs for standardized data exchange in phylogenetics, a modern XML-based exchange standard, called NeXML, has recently been developed [@Vos_2012]. NeXML comprehensively supports current data exchange needs, is predictably machine-readable, and is forward compatible.
The exchange problem for phylogenetic data is particularly acute in light of the challenges in finding and sharing phylogenetic data without the otherwise common loss of most data and metadata semantics [@Drew_2013; @Stoltzfus_2012; @Cranston_2014]. For example, the still popular NEXUS file format [@Maddison_1997] cannot consistently represent horizontal gene transfer or ambiguity in reading a character (such as a DNA sequence base pair). This and other limitations have led to modifications of NEXUS in different ways for different needs, with the unfortunate result that NEXUS files generated by one program can be incompatible with another [@Vos_2012]. Without a formal grammar, software based on NEXUS files may also make inconsistent assumptions about tokens, quoting, or element lengths. @Vos_2012 estimate that as many as 15% of the NEXUS files in the CIPRES portal contain unrecoverable but hard-to-diagnose errors.
A detailed account of how the NeXML standard addresses these and other relevant challenges can be found in @Vos_2012. In brief, NeXML was designed with the following important properties. First, NeXML is defined by a precise grammar that can be programmatically validated; i.e., it can be verified whether a file precisely follows this grammar, so it is predictable whether the file can be read (parsed) without errors by software that uses the NeXML grammar (e.g. RNeXML). Second, NeXML is extensible: a user can define representations of new, previously unanticipated information (as we will illustrate) without violating its defining grammar. Third and most importantly, NeXML is rich in computable semantics: it is designed for expressing metadata such that machines can understand their meaning and make inferences from it. For example, OTUs in a tree or character matrix for frog species can be linked to concepts in a formally defined hierarchy of taxonomic concepts such as the Vertebrate Taxonomy Ontology [@Midford2013], which enables a machine to infer that a query for amphibia is to include the frog data in what is returned. (For a broader discussion of the value of such capabilities for evolutionary and biodiversity science we refer the reader to @Parr2011.)
To make the capabilities of NeXML available to R users in an easy-to-use form, and to lower the hurdles to adoption of the standard, we present RNeXML, an R package that aims to provide easy programmatic access to reading and writing NeXML documents, tailored for the kinds of use-cases that will be common for users and developers of the wealth of evolutionary analysis methods within the R ecosystem.
install.packages("RNeXML", dependencies=TRUE)
library(RNeXML)
The RNeXML package is written entirely in R and available under a Creative Commons Zero public domain waiver. The current development version can be found on GitHub at https://github.com/ropensci/RNeXML, and the stable version can be installed from the CRAN repository. RNeXML is part of the rOpenSci project. Users of RNeXML are encouraged to submit bug reports or feature requests in the issues log on GitHub, or to ask the phylogenetics R users group list at r-sig-phylo@r-project.org for help. Vignettes with more detailed examples of specific features of RNeXML are distributed with the R package and serve as a supplement to this manuscript. Each of the vignettes can be found at http://ropensci.github.io/RNeXML/.
Conceptually, a NeXML document has the following components: (1) phylogeny topology and branch length data, (2) character or trait data in matrix form, (3) operational taxonomic units (OTUs), and (4) metadata. To represent the contents of a NeXML document (currently in memory), RNeXML defines the `nexml` object type. This type therefore holds phylogenetic trees as well as character or trait matrices, and all metadata, which is similar to the phylogenetic data object types defined in the phylobase package [@phylobase], but contrasts with the more widely used ones defined in the ape package [@Paradis_2004], which represent trees alone.
When reading and writing NeXML documents, RNeXML aims to map their components to and from, respectively, their most widely used representations in R. As a result, the types of objects accepted or returned by the package's methods are the `phylo` and `multiPhylo` objects from the ape package [@Paradis_2004] for phylogenies, and R's native `data.frame` list structure for data matrices.
The method `nexml_read()` reads NeXML files, either from a local file, or from a remote location via its URL, and returns an object of type `nexml`:
nex <- nexml_read("components/trees.xml")
The method `get_trees_list()` can be used to extract the phylogenies as an `ape::multiPhylo` object, which can be treated as a list of `ape::phylo` objects:
phy <- get_trees_list(nex)
The `get_trees_list()` method is designed for use in scripts, providing a consistent and predictable return type regardless of the number of phylogenies a NeXML document contains. For greater convenience in interactive use, the method `get_trees()` returns the R object most intuitive given the arrangement of phylogeny data in the source NeXML document. For example, the method returns an `ape::phylo` object if the NeXML document contains a single phylogeny, an `ape::multiPhylo` object if it contains multiple phylogenies arranged in a single `trees` block, and a list of `ape::multiPhylo` objects if it contains multiple `trees` blocks (the capability for which NeXML inherits from NEXUS).
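As a minimal sketch of the difference between the two accessors, the following builds a `nexml` object holding a single tree (the `bird.orders` phylogeny shipped with ape) and inspects what each accessor returns. The variable names are illustrative, and the exact container class reported for `get_trees_list()` may depend on the installed RNeXML version.

```r
library(ape)
library(RNeXML)

# A nexml object that holds exactly one phylogeny
data(bird.orders)
nex_single <- add_trees(bird.orders)

# Convenience accessor: with a single tree this yields an ape::phylo object
one_tree <- get_trees(nex_single)
class(one_tree)

# Script-friendly accessor: always a list-like container of trees,
# however many trees the document holds
tree_list <- get_trees_list(nex_single)
class(tree_list)
length(tree_list)
```

In a script, indexing into the result of `get_trees_list()` avoids having to branch on how many trees a file happens to contain.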
If the location parameter with which the `nexml_read()` method is invoked is recognized as a URL, the method will automatically download the document to the local working directory and read it from there. This gives convenient and rapid access to phylogenetic data published in NeXML format on the web, such as the content of the phylogenetic data repository TreeBASE [@Piel_2002; @Piel_2009]. For example, the following plots a tree in TreeBASE (using ape's plot function):
tb_nex <- nexml_read("https://raw.github.com/TreeBASE/supertreebase/master/data/treebase/S100.xml")
tb_phy <- get_trees_list(tb_nex)
plot(tb_phy[[1]])
The method `get_characters()` obtains character data matrices from a `nexml` object, and returns them as a standard `data.frame` R object with columns as characters and rows as taxa:
nex <- nexml_read("components/comp_analysis.xml")
get_characters(nex)
A NeXML data matrix can be of molecular (for molecular sequence alignments), discrete (for most morphological character data), or continuous type (for many trait data). To enable strict validation of data types, NeXML allows, and if their data types differ requires, multiple data matrices to be separated into different "blocks". Since the `data.frame` data structure in R has no such constraints, the `get_characters()` method combines such blocks as separate columns into a single `data.frame` object, provided they correspond to the same taxa. Otherwise, a list of `data.frame`s is returned, with list elements corresponding to character blocks. Similar to the methods for obtaining trees, there is also a method `get_characters_list()`, which always returns a list of `data.frame`s, one for each character block.
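The two character accessors can be compared on a small in-memory example. This is only a sketch: the trait values, column names, and taxon labels are invented for illustration, and it assumes `add_characters()` is called without an existing `nexml` object so that a new one is created.

```r
library(RNeXML)

# Hypothetical continuous trait data; row names serve as taxon labels
traits <- data.frame(
  body_mass = c(12.1, 15.3, 9.8),
  wing_span = c(30.2, 35.9, 27.4),
  row.names = c("taxon_a", "taxon_b", "taxon_c")
)
nex_traits <- add_characters(traits)

# Convenience accessor: blocks sharing the same taxa are merged
combined <- get_characters(nex_traits)

# Script-friendly accessor: always a list with one element per block
per_block <- get_characters_list(nex_traits)
length(per_block)
```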
The method `nexml_write()` generates a NeXML file from its input parameters. In its simplest invocation, the method writes a tree to a file:
data(bird.orders)
nexml_write(bird.orders, file = "birds.xml")
The first argument to `nexml_write()` is either an object of type `nexml`, or any object that can be coerced to it, such as in the above example an `ape::phylo` phylogeny. Alternatively, passing a `multiPhylo` object would write a list of phylogenies to the file.
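Writing several trees at once might look like the following sketch. The random trees come from `ape::rmtree()` and the output goes to a temporary file; both are choices made here for illustration and are not part of the original examples.

```r
library(ape)
library(RNeXML)

set.seed(42)
# Three random 5-tip trees, returned by ape as a multiPhylo object
random_trees <- rmtree(N = 3, n = 5)

# Write all three phylogenies into one NeXML file
out_file <- tempfile(fileext = ".xml")
nexml_write(random_trees, file = out_file)

# Read the file back and retrieve the trees again
round_trip <- get_trees_list(nexml_read(out_file))
```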
In addition to trees, the `nexml_write()` method also accepts character data as another parameter. The following example uses data from the comparative phylogenetics R package geiger [@Pennell2014].
library("geiger")
data(geospiza)
nexml_write(trees = geospiza$phy, characters = geospiza$dat, file="geospiza.xml")
Note that the NeXML format is well-suited for incomplete data: for instance, here it does not assume the character matrix has data for every tip in the tree.
File validation is a central feature of the NeXML format which ensures that any properly implemented NeXML parser will always be able to read the NeXML file. The `nexml_validate()` function takes the path to any NeXML file and returns `TRUE` to indicate a valid file, or `FALSE` otherwise, along with a display of any error messages generated by the validator.
nexml_validate("geospiza.xml")
The `nexml_validate()` function performs this validation using the online NeXML validator (when a network connection is available), which performs additional checks not expressed in the NeXML schema itself [@Vos_2012]. If a network connection is not available, the function falls back on the schema validation method from the XML package [@XML].
`nexml` objects
Instead of packaging the various components for a NeXML file at the time of writing the file, RNeXML also allows users to create and iteratively populate in-memory `nexml` objects. The methods to do this are `add_characters()`, `add_trees()`, and `add_meta()`, for adding characters, trees, and metadata, respectively. Each of these functions will automatically create a new `nexml` object if not supplied with an existing one as the last (optional) argument.
For example, here we use `add_trees()` to first create a `nexml` object with the phylogeny data, and then add the character data to it:
nexObj <- add_trees(geospiza$phy)
nexObj <- add_characters(geospiza$dat, nexObj)
The data with which a `nexml` object is populated need not share the same OTUs. RNeXML automatically adds new, separate OTU blocks into the NeXML file for each data matrix and tree that uses a different set of OTUs.
Other than storage size, there is no limit to the number of phylogenies and character matrices that can be included in a single NeXML document. This makes it possible, for example, to capture samples from a posterior probability distribution of inferred or simulated phylogenies and character states in a single NeXML file.
NeXML allows attaching ("annotating") metadata to any data element, and even to metadata themselves. Whether at the level of the document as a whole or an individual data matrix or phylogeny, metadata can provide bibliographic and provenance information, for example about the study as part of which the phylogeny was generated or applied, which data matrix and which methods were used to generate it. Metadata can also be attached to very specific elements of the data, such as specific traits, individual OTUs, nodes, or even edges of the phylogeny.
As described in @Vos_2012, to encode metadata annotations NeXML uses the "Resource Description Framework in Annotations" (RDFa) [@W3C_2014]. This standard provides for a strict machine-readable format yet enables future backwards compatibility with compliant NeXML parsers (and thus RNeXML), because the capacity of a tool to parse annotations is not predicated on understanding the meaning of annotations it has not seen before.
To lower the barriers to sharing well-documented phylogenetic data, RNeXML aims to make recording useful and machine-readable metadata easier at several levels.
First, when writing a NeXML file the package adds certain basic metadata automatically if they are absent, using default values consistent with recommended best practices [@Cranston_2014]. Currently, this includes naming the software generating the NeXML, a time-stamp of when a tree was produced, and an open data license. These are merely default arguments to `add_basic_meta()` and can be configured.
Second, RNeXML provides a simple method, called `add_basic_meta()`, to set metadata attributes commonly recommended for inclusion with data to be publicly archived or shared [@Cranston_2014]. The currently accepted parameters include `title`, `description`, `creator`, `pubdate`, `rights`, `publisher`, and `citation`. Behind the scenes the method automatically anchors these attributes in common vocabularies (such as Dublin Core).
Third, RNeXML integrates with the R package taxize [@Chamberlain_2013] to mitigate one of the most common obstacles to reuse of phylogenetic data, namely the misspellings and inconsistent taxonomic naming with which OTU labels are often fraught. The `taxize_nexml()` method in RNeXML uses taxize to match OTU labels against the NCBI database, and, where a unique match is found, it annotates the respective OTU with the matching NCBI identifier.
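As an illustration of the second level, basic metadata could be attached to a tree roughly as follows. This is a sketch only: the title, description, creator, and date are placeholder values, and the existing object is assumed to be passed through the `nexml` argument.

```r
library(ape)
library(RNeXML)

data(bird.orders)
birds_nex <- add_trees(bird.orders)

# Placeholder values; replace with the details of the actual study
birds_nex <- add_basic_meta(
  title       = "Phylogeny of bird orders",
  description = "Example tree taken from the ape package",
  creator     = "Jane Doe",
  pubdate     = "2014-01-01",
  nexml       = birds_nex
)

# Inspect the annotations now attached at the document level
get_metadata(birds_nex)
```

Parameters left unspecified fall back to the package defaults described above.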
The RNeXML interface described above for built-in metadata allows users to create precise and semantically rich annotations without confronting any of the complexity of namespaces and ontologies. Nevertheless, advanced users may desire the explicit control over these semantic tools that takes full advantage of the flexibility and extensibility of the NeXML specification [@Vos_2012; @Parr2011]. In this section we detail how to accomplish these more complex uses in RNeXML.
Using a vocabulary or ontology terms rather than simple text strings to describe data is crucial for allowing machines to not only parse but also interpret and potentially reason over their semantics. To achieve this benefit for custom metadata extensions, the user necessarily needs to handle certain technical details from which the RNeXML interface shields her otherwise, in particular the globally unique identifiers (normally HTTP URIs) of metadata terms and vocabularies. To be consistent with XML terminology, RNeXML calls vocabulary URIs namespaces, and their abbreviations prefixes. For example, the namespace for the Dublin Core Metadata Element Set vocabulary is "http://purl.org/dc/elements/1.1/". Using its common abbreviation "dc", a metadata property "dc:title" expands to the identifier "http://purl.org/dc/elements/1.1/title". This URI resolves to a human and machine-readable (depending on access) definition of precisely what the term `title` in Dublin Core means. In contrast, just using the text string "title" could also mean the title of a person, a legal title, the verb title, etc. URI identifiers of metadata vocabularies and terms are not mandated to resolve, but if machines are to derive the maximum benefit from them, they should resolve to a definition of their semantics in RDF.
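The expansion of a prefixed property into its full identifier can be sketched in a few lines of base R. The helper `expand_term()` below is hypothetical and not part of RNeXML; it only illustrates the prefix-to-namespace substitution described above.

```r
# Prefix-to-namespace table, as a named character vector
namespaces <- c(
  dc   = "http://purl.org/dc/elements/1.1/",
  skos = "http://www.w3.org/2004/02/skos/core#"
)

# Hypothetical helper: replace the prefix of "prefix:term" by its namespace URI
expand_term <- function(term, ns) {
  parts <- strsplit(term, ":", fixed = TRUE)[[1]]
  if (length(parts) != 2 || !(parts[1] %in% names(ns))) {
    stop("Unknown or malformed prefixed term: ", term)
  }
  paste0(ns[[parts[1]]], parts[2])
}

title_uri <- expand_term("dc:title", namespaces)
title_uri
# "http://purl.org/dc/elements/1.1/title"
```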
RNeXML includes methods to obtain and manipulate metadata properties, values, identifiers, and namespaces. The `get_namespaces()` method accepts a `nexml` object and returns a named list of namespace prefixes and their corresponding identifiers known to the object:
birds <- nexml_read("birds.xml")
prefixes <- get_namespaces(birds)
prefixes["dc"]
The `get_metadata()` method returns, as a named list, the metadata annotations for a given `nexml` object at a given level, with the whole NeXML document being the default level (`"all"` extracts all metadata objects):
meta <- get_metadata(birds)
otu_meta <- get_metadata(birds, level="otu")
The returned list does not include the data elements to which the metadata are attached. Therefore, a different approach, documented in the metadata vignette, is recommended for accessing the metadata attached to data elements.
The `meta()` method creates a new metadata object from a property name and content (value). For example, the following creates a modification date metadata object, using a property in the PRISM vocabulary:
modified <- meta(property = "prism:modificationDate", content = "2013-10-04")
Metadata annotations in NeXML can be nested within another annotation, which the `meta()` method accommodates by accepting a parameter `children`, with the list of nested metadata objects (which can themselves be nested) as value.
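A nested annotation might be assembled as in the sketch below. It assumes that the parent annotation is identified through the `rel` argument of `meta()` (a relation to a resource rather than a literal value), and the property names and contents are chosen purely for illustration.

```r
library(RNeXML)

# Child annotations describing a (hypothetical) person
name  <- meta(property = "foaf:name", content = "Jane Doe")
email <- meta(property = "foaf:mbox", content = "jane.doe@example.org")

# Parent annotation that groups the children; `rel` is assumed here to be
# the appropriate way to name a relation that carries nested annotations
creator <- meta(rel = "dc:creator", children = list(name, email))
creator
```

A property whose prefix is not already known to the `nexml` object (such as `foaf` here) would need its namespace declared when the annotation is added.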
The `add_meta()` function adds metadata objects as annotations to a `nexml` object at a specified level, with the default level being the NeXML document as a whole:
birds <- add_meta(modified, birds)
If the prefix used by the metadata property is not among the built-in ones (which can be obtained using `get_namespaces()`), it has to be provided along with its URI as the `namespaces` parameter. For example, the following uses the "Simple Knowledge Organization System" (SKOS) vocabulary to add a note to the trees in the `nexml` object:
history <- meta(property = "skos:historyNote", content = "Mapped from the bird.orders data in the ape package using RNeXML")
birds <- add_meta(history, birds, level = "trees", namespaces = c(skos = "http://www.w3.org/2004/02/skos/core#"))
Alternatively, additional namespaces can also be added in batch using the `add_namespaces()` method.
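A batch registration could look like the following sketch. The `ex` prefix and its URI are hypothetical, and the call is made without an existing object, so a new `nexml` object is assumed to be created.

```r
library(RNeXML)

# Prefixes to register; "ex" is a made-up vocabulary used for illustration
extra <- c(
  skos = "http://www.w3.org/2004/02/skos/core#",
  ex   = "http://example.org/terms/"
)

# With no nexml object supplied, a new one is created
nex_ns <- add_namespaces(extra)

# The new prefixes are now listed alongside the built-in ones
known <- get_namespaces(nex_ns)
known[c("skos", "ex")]
```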
By virtue of subsetting the S4 `nexml` object, RNeXML also offers fine control of where a `meta` element is added, for which the package vignette on S4 subsetting of `nexml` contains examples.
Because NeXML expresses all metadata using the RDF standard, and stores them compliant with RDFa, they can be extracted as an RDF graph, queried, analyzed, and mashed up with other RDF data, local or on the web, using a wealth of off-the-shelf tools for working with RDF (see @W3C_2014 or @Hartig_2012). Examples for these possibilities are included in the RNeXML SPARQL vignette (SPARQL is a recursive acronym for SPARQL Protocol and RDF Query Language, see http://www.w3.org/TR/rdf-sparql-query/), and the package also comes with a demonstration that can be run from R using the following command: `demo("sparql", "RNeXML")`.
NeXML was designed to prevent the need for future non-interoperable "flavors" of the standard in response to new research directions. Its solution to this inevitable problem is a highly flexible metadata system that does not sacrifice strict validation of syntax and structure.
Here we illustrate how RNeXML's interface to NeXML's metadata system can be used to record and share a type of phylogenetic data not taken into account when NeXML was designed, in this case stochastic character maps [@Huelsenbeck_2003]. Such data assign certain parts (corresponding to time) of each branch in a time-calibrated phylogeny to a particular "state" (typically of a morphological characteristic). The current de-facto format for sharing stochastic character maps, created by simmap [@Bollback_2006], a widely used tool for creating such maps, is a non-interoperable modification of the standard Newick tree format. This means that computer programs designed to read Newick or NEXUS formats may fail when trying to read in a phylogeny that includes simmap annotations.
In contrast, by allowing new data types to be added as (sometimes complex) metadata annotations, NeXML can accommodate data extensions without compromise to its grammar and thus syntax. To illustrate how RNeXML facilitates extending the NeXML standard in this way, we have implemented two functions in the package, `nexml_to_simmap` and `simmap_to_nexml`. These functions show how simmap data can be represented as `meta` annotations on the branch length elements of a NeXML tree, and provide routines to convert between this NeXML representation and the extended `ape::phylo` representation of a simmap tree in R that was introduced by @Revell_2012. We encourage readers interested in this capability to consult the example code in `simmap_to_nexml` to see how this is implemented.
Extensions to NeXML must also be defined in the file's namespaces in order to be valid. This provides a way to ensure that a URI providing documentation of the extension is always included. Our examples here use the prefix `simmap` to group the newly introduced metadata properties in a vocabulary, for which the `add_namespaces()` method can be used to give a URI as an identifier:
nex <- add_namespaces(c(simmap = "https://github.com/ropensci/RNeXML/tree/master/inst/simmap.md"))
Here the URI does not resolve to a fully machine-readable definition of the terms and their semantics, but it can nonetheless be used to provide at least a human-readable informal definition of the terms.
Data archiving is increasingly required by scientific journals, including in evolutionary biology, ecology, and biodiversity (e.g. @Rausher_2010). The effort involved with preparing and submitting properly annotated data to archives remains a notable barrier to the broad adoption of data archiving and sharing as a normal part of the scholarly publication workflow [@Tenopir_2011; @Stodden_2014]. In particular, the majority of phylogenetic trees published in the scholarly record are inaccessible or lost to the research community [@Drew_2013].
One of RNeXML's aims is to promote the archival of well-documented phylogenetic data in scientific data repositories, in the form of NeXML files. To this end, the method `nexml_publish()` provides an API directly from within R that allows data archival to become a step programmed into data management scripts. Initially, the method supports the data repository Figshare (http://figshare.com):
doi <- nexml_publish(birds, repository="figshare")
This method reserves a permanent identifier (DOI) on the figshare repository that can later be made public through the figshare web interface. This also acts as a secure backup of the data to a repository and a way to share with collaborators prior to public release.
RNeXML allows R's ecosystem to read and write data in the NeXML format through an interface that is no more involved than reading or writing data from other phylogenetic data formats. It also carries immediate benefits for its users compared to other formats. For example, comparative analysis R packages and users frequently add their own metadata annotations to the phylogenies they work with, such as annotations of species, stochastic character maps, trait values, model estimates and parameter values. RNeXML affords R the capability to harness machine-readable semantics and an extensible metadata schema to capture, preserve, and share these and other kinds of information, all through an API instead of having to understand in detail the schema underlying the NeXML standard. To assist users in meeting the rising bar for best practices in data sharing in phylogenetic research [@Cranston_2014], RNeXML captures metadata information from the R environment to the extent possible, and applies reasonable defaults.
The goals for continued development of RNeXML revolve primarily around better interoperability with other existing phylogenetic data representations in R, such as those found in the phylobase package [@phylobase]; and better integration of the rich metadata semantics found in ontologies defined in the Web Ontology Language (OWL), including programmatic access to machine reasoning with such metadata.
This project was supported in part by the National Evolutionary Synthesis Center (NESCent) (NSF #EF-0905606), and grants from the National Science Foundation (DBI-1306697) and the Alfred P Sloan Foundation (Grant 2013-6-22). RNeXML started as a project idea for the Google Summer of Code(TM), and we thank Kseniia Shumelchyk for taking the first steps to implement it. We are grateful to F. Michonneau for helpful comments on an earlier version of this manuscript, and reviews by Matthew Pennell, Associate Editor Richard FitzJohn, and an anonymous reviewer. At their behest, the reviews of FitzJohn and Pennell can be found in this project's GitHub page at github.com/ropensci/RNeXML/issues/121 and github.com/ropensci/RNeXML/issues/120, together with our replies and a record of our revisions.
All software, scripts and data used in this paper can be found in the permanent data archive Zenodo under the digital object identifier doi:10.5281/zenodo.13131 [@zenodo]. This DOI corresponds to a snapshot of the GitHub repository at github.com/ropensci/RNeXML.
unlink("birds.xml")
unlink("geospiza.xml")