In cboettig/taxald: A High-Performance Local Taxonomic Database Interface

library(kableExtra)
library(magrittr)
library(tidyverse)
library(taxadb)
library(printr)
library(rticles)
library(knitr)

## see https://blog.rstudio.com/2017/06/13/dplyr-0-7-0/, 
## https://community.rstudio.com/t/why-does-na-match-na-when-joining-two-dataframes/28785/3
pkgconfig::set_config("dplyr::na_matches" = "never")

## Display NAs as "-" because "NA" looks like a category
options(knitr.kable.NA = '-') 

printtable <- function(df, ...){
  df %>%
    kableExtra::kable("latex", booktabs = TRUE, ...) %>%
    kableExtra::kable_styling(full_width = FALSE, latex_options = "hold_position")
}
knitr::opts_chunk$set(cache=FALSE, message=FALSE, warning=FALSE)

# no scientific notation for integers, please
options(scipen=999)

taxadb::td_disconnect()

As ecologists and evolutionary biologists synthesize datasets across larger and larger assemblies of species, we face a continual challenge of maintaining consistent taxonomy. How many species are in the combined data? Do the studies use the same names for the same species, or do they use different synonyms for the same species? Failing to correct for such differences can lead to significant inflation of species counts and misaligned datasets. These challenges have become particularly acute as it becomes increasingly common for researchers to work across a larger number and diversity of species in any given analysis, which may preclude the resources or substantive taxonomic expertise for all clades needed to resolve scientific names [@Patterson2010].

While these issues have long been recognized in the literature [@boyle2013; @dayrat2005; @bortolus2008; @maldonado2015; @remsen2016], and a growing number of databases and tools have emerged over the past few decades [e.g. @itis; @ncbi; @col; @rees2014; @alvarez2018; @wagner2016; @foster2018; @gries2014], it remains difficult to resolve taxonomic names to a common authority in a transparent, efficient, and automatable manner. Here, we present an R package, taxadb, which seeks to address this gap.

Databases of taxonomic names such as the Integrated Taxonomic Information System [ITIS; @itis], the National Center for Biotechnology Information's (NCBI) Taxonomy database [@ncbi], the Catalogue of Life [COL; @col], and over one hundred other providers have sought to address these problems by providing expert-curated lists of accepted taxonomic names, synonyms, associated taxonomic rank, hierarchical classifications, and scientific authority (e.g. author and date) establishing a scientific name. The R language [@R] is widely used in ecology and evolution [@Lai2019], and the taxize package [@Chamberlain2013] has become a popular way for R users to interact with naming providers and name resolution services. taxize implements bindings to the web APIs (Application Programming Interfaces) hosted by many popular taxonomic name providers. As a result, functions in taxize inherit several major drawbacks from the design of these central API servers:

Queries require internet access at all times.
Queries are slow and inefficient to implement and perform, frequently requiring a separate API call for each taxonomic name.
The type of query is highly limited by the API design. For instance, it is usually impossible to make queries across the entire corpus of names, such as "which accepted name has the most known synonyms?".
Both query formats and responses differ substantially across naming providers, making it difficult to apply a script designed for one provider to a different provider.
Most queries are not reproducible, as the results depend on the state of the central server (and potentially the quality of the internet connection) [@rees2017]. Many names providers update the server data either continuously or at regular intervals, including both revising existing names (for spelling or changes in accepted name designation) and adding new names.

Instead of binding existing web APIs, taxadb is built around a set of compressed text files which are automatically downloaded, imported, and stored in a local database by taxadb. The largest of the taxonomic naming providers today contain under 6 million name records with uncompressed file sizes over 1 GB, which can be compressed to around 50 MB and downloaded in under a minute on a 1 MB/s connection. By using a local database as the backend, taxadb allows R users to interact with large data files without large memory (RAM) requirements. A query for a single name over the web API requires a remote server to respond, execute the query, and serialize the response, which can take several seconds. Thus it does not take many taxa before transferring the entire data set to query locally is more efficient. Moreover, this local copy can be cached on the user's machine, requiring only the one-time setup, and enabling offline use and reproducible queries. Rather than returning data in whatever format is given by the provider, taxadb provides a data structure with a consistent, standardized layout or schema following Darwin Core, which provides standard terms for biodiversity data [@Wieczorek2012]. Table 1 summarizes the list of all naming providers currently accessed by taxadb. More details are provided in the Data Sources Vignette, https://docs.ropensci.org/taxadb/articles/data-sources.html.
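The break-even point between per-name API queries and a one-time download can be made concrete with a back-of-the-envelope calculation. The sketch below takes the download size and connection speed from the figures quoted above, and assumes 2 seconds per API call purely as an illustrative stand-in for "several seconds". It ignores the one-time cost of importing the files into the local database, so it should be read as a rough lower bound rather than a benchmark.

```r
# Figures quoted in the text: ~50 MB compressed download on a 1 MB/s connection
download_mb    <- 50
bandwidth_mb_s <- 1

# ASSUMPTION: 2 seconds per remote query (illustrative value only)
seconds_per_call <- 2

# One-time cost of fetching the whole provider
download_seconds <- download_mb / bandwidth_mb_s

# Total time to resolve n names with one API call per name
api_seconds <- function(n_names) n_names * seconds_per_call

# Smallest number of names at which per-name queries take
# at least as long as the one-time download
break_even_names <- ceiling(download_seconds / seconds_per_call)

break_even_names        # 25 names under these assumptions
api_seconds(750) / 60   # 25 minutes for 750 names, one call each
```

Under these assumptions the download pays for itself after a few dozen names, and every later query against the cached copy avoids the network entirely.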

#provider descriptions
desc <- c(itis = "originally formed to standardize taxonomic name usage across many agencies in the United States federal government",
          ncbi = "nomenclature for sequences in the International Nucleotide Sequence Database Collaboration database",
          col = "comprehensive taxonomic effort, includes some other providers (e.g. itis)",
          gbif = "taxonomic backbone of the GBIF database, assembled from other sources including COL",
          fb = "nomenclature for global database of fishes",
          ott = "comprehensive tree of life based on phylogenetic trees and taxonomic data",
          iucn = "taxonomy for classification of species status")

tibble(provider = c("Integrated Taxonomic Information System (ITIS 2019)",
                                 "National Center for Biotechnology Information's Taxonomy database (Biotechnology Information 2019)",
                                 "Catalogue of Life (Roskov Y. 2018)",
                                 "Global Biodiversity Information Facility Taxonomic Backbone (GBIF 2019)",
                                 "FishBase (Froese and Pauly 2019)",
                                 "Open Tree Taxonomy (J. A. Rees and Cranston 2017)",
                                 "International Union for Conservation of Nature and Natural Resources (IUCN 2019)"),
                    abbreviation = c("itis", "ncbi", "col", "gbif", "fb", "ott", "iucn"),
                    total_identifiers = map(abbreviation,
                                            ~taxa_tbl(.x) %>%
                                              select(acceptedNameUsageID) %>% pull() %>% n_distinct()),
                    description = desc
) %>%
  knitr::kable(format = "latex", col.names = c("Provider", "Abbreviation", "Number of \nIdentifiers", "Description"), escape = FALSE, 
               caption = "Descriptions of the providers supported by taxadb with their reference abbreviation and the total number of identifiers contained by each provider.") %>%
  kable_styling(latex_options= c("scale_down", "hold_position")) %>%
  kableExtra::column_spec(1, width = "5cm") %>%
  kableExtra::column_spec(4, width = "10cm")

Package Overview

library(tidyverse)
library(taxadb)

After loading taxadb, along with the tidyverse packages to ease manipulation of function output, we look up the taxonomic identifier for Atlantic cod, Gadus morhua, and perform the complementary lookup of the name for a given identifier:

get_ids("Gadus morhua")
get_names("ITIS:164712")

Our first call to any taxadb function will automatically set up a local, persistent database if one has not yet been created. This one-time setup will download, extract, and import the compressed data into persistent database storage (using the appropriate location specified by the operating system [see @rappdirs], or configured using the environment variable TAXADB_HOME). The example above searches for names in ITIS, the default provider, which can be changed using the provider argument. Any later call to this or any other function using data from the same provider will be able to access this data rapidly without the need for processing or an internet connection.

Users can also explicitly trigger this one-time setup using td_create() and specifying the provider abbreviation (see Table 1), or simply using "all" to install all available providers:

td_create("all")

taxadb functions like get_ids() and td_create() take an optional argument, db, which supplies an external database connection. taxadb will work with most DBI-compliant databases such as MySQL or Postgres, but will be much faster when using a column-oriented database engine such as duckdb or MonetDBLite. These latter options are also much easier for most users, since each can be installed directly as an R package. taxadb will default to the fastest available option. taxadb can also run without a database backend by setting db=NULL, though some functions will then require a large amount (2-20 GB) of free RAM to work with many of the larger providers.

taxadb uses the widely known SQLite database by default, but users are encouraged to install the optional, suggested database backends by passing the option dependencies = TRUE to the install command. This installs a MonetDBLite database instance [@monetdblite], a column-oriented relational database requiring no additional installation while also providing persistent disk-based storage. This also installs duckdb, another local columnar database which is rapidly emerging as an alternative to MonetDB and SQLite. taxadb will automatically detect and use these database engines if available, and automatically handles opening, caching, and closing the database connection. For large queries, MonetDBLite or duckdb can be orders of magnitude faster. Our benchmark on resolving the 750 species names in the Breeding Bird Survey against over 3 million names known in the 2019 Catalogue of Life takes 8 minutes in SQLite but less than a second in MonetDBLite.

Functions in taxadb are organized into several families:

queries that return vectors: get_ids() and its complement, get_names(),
queries that filter the underlying taxonomic data frames: filter_name(), filter_rank(), filter_id(), and filter_common(),
database functions: td_create(), td_connect(), and taxa_tbl(),
and helper utilities, such as clean_names().

Taxonomic Identifiers

Taxonomic identifiers provide a fundamental abstraction which lies at the heart of managing taxonomic names. For instance, by resolving scientific names to identifiers, we can identify which names are synonyms -- different scientific names used to describe the same species -- and which names are not recognized. Each naming authority provides its own identifiers for the names it recognizes. For example, the name Homo sapiens has the identifier 9606 in NCBI and 180092 in ITIS. To avoid possible confusion, taxadb always prefixes the identifier with the naming provider, e.g. NCBI:9606. Some taxonomic naming providers include separate identifiers for synonyms, see Box 1. Unmatched names may indicate an error in data entry or otherwise warrant further investigation. Taxon identifiers are also easily resolved to the original authority (scientific publication) establishing the name. The common practice of appending an author and year to a scientific name, e.g. Poa annua annua (Smith 1912), serves a valuable role in disambiguating different uses of the same name but is notoriously harder to resolve to the appropriate reference, while variation in this convention creates many distinct versions of the same name [@Patterson2010].

These issues are best illustrated using a minimal example. We'll consider the task of combining data on bird extinction risk as assessed by the IUCN [@iucn] with data on average adult biomass, as estimated in the Elton Traits v1.0 database [@elton-traits]. To keep the example concise enough for visual presentation, we will focus on a subset involving just 10 species (Tables 2 and 3).
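Because every identifier carries its provider as a prefix, the two parts can be separated again with ordinary string handling. The following is a minimal base R sketch using the two Homo sapiens identifiers quoted above; the helper split_id() is hypothetical and is not a taxadb function.

```r
# Prefixed identifiers as described in the text, plus a missing value
ids <- c("NCBI:9606", "ITIS:180092", NA)

# Hypothetical helper (not part of taxadb): split "PROVIDER:id" into
# its provider prefix and the provider-local identifier
split_id <- function(x) {
  data.frame(
    provider = sub(":.*$", "", x),     # everything before the first colon
    local_id = sub("^[^:]*:", "", x),  # everything after the first colon
    stringsAsFactors = FALSE
  )
}

parsed <- split_id(ids)
parsed
```

Keeping the prefix attached in the data means that the bare numbers from two providers can never be mistaken for one another when tables are combined.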

trait_data <- read_tsv(system.file("extdata", "trait_data.tsv", package="taxadb"))
status_data <- read_tsv(system.file("extdata", "status_data.tsv", package="taxadb"))

status_data %>% printtable(caption = "The subset of the IUCN status data used for subsequent taxonomic identifier examples.") %>%
  column_spec(1, italic = TRUE)

trait_data %>% printtable(caption = "The subset of the Elton trait data used for subsequent taxonomic identifier examples.") %>%
  column_spec(1, italic = TRUE)

If we attempted to join these data directly on the species names provided by each table, we would find very little overlap, with only one species name having both a body mass and an IUCN threat status resolved (Table 4).

joined <- full_join(trait_data, status_data, by = c("elton_name" = "iucn_name"))

joined %>%
  printtable(caption = "Example IUCN and trait data joined directly on scientific name showing only one match. While common, joining on scientific name does not account for nomenclatural and taxonomic inconsistencies between databases and therefore results in seemingly very little overlap in species representation between the two.") %>%
  column_spec(1, italic = TRUE)

If we first resolve the names used in each data set into shared identifiers (for instance, using the Catalogue of Life), we discover that there is far more overlap in species coverage than we might have initially realized. First, we add an ID column to each table by looking up the Catalogue of Life identifier for the name provided:

traits <- trait_data %>% mutate(id = get_ids(elton_name, "col"))
status <- status_data %>% mutate(id = get_ids(iucn_name, "col"))

We can now join on the id column instead of names directly:

joined <- full_join(traits, status, by = "id")

## Just for pretty-printing
joined %>%  
  tidyr::replace_na(list(category = "-", elton_name = "-", iucn_name = "-")) %>%
  select(elton_name, iucn_name, mass, category, id) %>%
  printtable(caption = "Example IUCN and trait data joined on taxonomic ID. Multiple species have a different scientific name in the Elton and IUCN Red List databases but can be matched based on their COL taxonomic ID.") %>%
  column_spec(c(1,2), italic = TRUE)

This results in many more matches (Table 5), as different scientific names are recognized by the naming provider (Catalogue of Life 2018 in this case) as synonyms for the same species, and thus resolve to the same taxonomic identifier. While we have focused on a small example for visual clarity here, the get_ids() function in taxadb can quickly resolve hundreds of thousands of species names to unique identifiers, thanks to the performance of fast joins in a local MonetDBLite database.
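The mechanism behind this improvement can be reproduced without any downloaded data. The sketch below uses an entirely hypothetical naming table (placeholder names and "TOY:" identifiers) and a hypothetical helper, lookup_id(), standing in for the idea of a name-to-identifier lookup; it is not how get_ids() is implemented.

```r
# Hypothetical naming table: one accepted name and two synonyms share "TOY:1"
naming <- data.frame(
  scientificName      = c("Aus bus", "Aus cus", "Xus bus", "Dus eus"),
  taxonomicStatus     = c("accepted", "synonym", "synonym", "accepted"),
  acceptedNameUsageID = c("TOY:1", "TOY:1", "TOY:1", "TOY:2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map each name to its accepted identifier (NA if unknown)
lookup_id <- function(names, table = naming) {
  table$acceptedNameUsageID[match(names, table$scientificName)]
}

# Two toy data sets that use different synonyms for the same species
traits <- data.frame(trait_name = c("Aus cus", "Dus eus"),
                     mass = c(10, 20), stringsAsFactors = FALSE)
status <- data.frame(status_name = c("Xus bus", "Dus eus"),
                     category = c("LC", "EN"), stringsAsFactors = FALSE)

# Joining on raw names pairs only the name spelled identically in both tables
by_name <- merge(traits, status, by.x = "trait_name", by.y = "status_name")

# Joining on the resolved identifier also pairs the two synonyms
traits$id <- lookup_id(traits$trait_name)
status$id <- lookup_id(status$status_name)
by_id <- merge(traits, status, by = "id")

nrow(by_name)  # 1
nrow(by_id)    # 2
```

Names that fail to resolve receive a missing identifier, so a join on identifiers should be configured so that missing values do not match each other.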

\dummy{\Begin{tcolorbox}[title= Box 1: Taxonomic Identifiers and Synonyms, lower separated=false]}

get_ids() returns the acceptedNameUsageID, the identifier associated with the accepted name. Some naming providers, such as ITIS and NCBI, provide taxonomic identifiers for both synonyms and accepted names. Other providers, such as COL and GBIF, only provide identifiers for accepted names. Common practice in Darwin Core archives is to provide an acceptedNameUsageID only for names which are synonyms, and otherwise to provide a taxonID. For accepted names, the acceptedNameUsageID is then given as missing (NA), while for synonyms, the taxonID may be missing (NA). In contrast, taxadb lists the acceptedNameUsageID for accepted names (where it matches the taxonID), as well as for known synonyms. This is semantically identical, but more convenient for database interfaces, since it allows a name to be mapped to its accepted identifier (or an identifier to be mapped to its accepted name usage) without additional logic. For consistency, we will use the term "identifier" to mean the acceptedNameUsageID rather than the more ambiguous taxonID (which is undefined for synonyms listed by many providers), unless explicitly stated otherwise.

\End{tcolorbox}
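The difference between the two layouts described in Box 1 amounts to a single fill step. The sketch below applies that step to hypothetical records (placeholder names and "TOY:" identifiers) in base R; it illustrates the convention and is not taken from the taxadb source.

```r
# Hypothetical records in the common Darwin Core archive layout:
# the accepted name has no acceptedNameUsageID, the synonym does
dwc <- data.frame(
  scientificName      = c("Aus bus", "Aus cus"),
  taxonomicStatus     = c("accepted", "synonym"),
  taxonID             = c("TOY:1", "TOY:7"),
  acceptedNameUsageID = c(NA, "TOY:1"),
  stringsAsFactors = FALSE
)

# Layout described in Box 1: accepted names carry their own taxonID
# as the acceptedNameUsageID, so every row has an accepted identifier
filled <- dwc
missing_accepted <- is.na(filled$acceptedNameUsageID)
filled$acceptedNameUsageID[missing_accepted] <- filled$taxonID[missing_accepted]

filled[, c("scientificName", "acceptedNameUsageID")]
```

After the fill, a lookup from name to accepted identifier is a plain column read, with no branching on taxonomicStatus.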

Unresolved names

get_ids() offers a first pass at matching scientific names to identifiers, but names may remain unresolved for a number of reasons. First, a name may match multiple accepted names, as in the case of a species that has been split. By design, these cases are left to be resolved by the researcher, using the filter_ functions to query the underlying taxonomic tables for additional information. A name may also be unresolved due to typos or improper formatting. clean_names() addresses common formatting issues such as names with a missing species epithet (e.g. Accipiter sp.) that prevent matches to the genus, or intraspecific epithets such as Colaptes auratus cafer that prevent matches to the binomial name. These modifications are not appropriate in all settings and should be used with care. Spell-checking of input names is outside the scope of taxadb; however, existing tools such as those developed by the Global Names Architecture (http://globalnames.org/apps/) could be incorporated into a taxadb workflow.

Names may also have an ambiguous resolution wherein a name may be resolved by a different provider than the one specified, either as an accepted name or a synonym. Mapping between providers represents a meaningful scientific statement requiring an understanding of the underlying taxonomic concepts of each provider [@franz2009; @Franz2018; @lepage2014]. The spirit of taxadb is not to automate steps that require expert knowledge, but to provide access to multiple potential "taxonomic theories".
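The kind of formatting clean-up described above for clean_names() can be illustrated with two small string operations. The helper below, simplify_name(), is a hypothetical base R sketch written for this illustration; it is not the implementation of clean_names() and handles only the two cases named in the text.

```r
# Hypothetical helper (not taxadb::clean_names()): normalise names so that
# genus-only placeholders and trinomials can match a genus or binomial
simplify_name <- function(x) {
  x <- trimws(x)
  # Drop a trailing "sp." / "spp." placeholder so the genus can match
  x <- sub("\\s+spp?\\.?$", "", x)
  # Keep at most the first two words so a trinomial reduces to its binomial
  words <- strsplit(x, "\\s+")
  out <- vapply(words, function(w) paste(head(w, 2), collapse = " "),
                character(1))
  # Preserve missing values rather than turning them into the string "NA"
  out[is.na(x)] <- NA_character_
  out
}

simplify_name(c("Accipiter sp.", "Colaptes auratus cafer", "Gadus morhua"))
# "Accipiter" "Colaptes auratus" "Gadus morhua"
```

Both steps discard information (an unidentified species becomes a genus, a subspecies becomes a species), which is why such modifications are not appropriate in all settings.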

`filter_` functions for access to underlying tables

Underlying data tables can be accessed through the family of filter_ functions, which filter by certain attributes such as scientific name, id, common name, and rank. These functions allow us to ask general questions such as, how many bird species are there?

filter_rank("Aves", rank="class", provider = "col") %>%
  filter(taxonomicStatus == "accepted", taxonRank == "species") %>%
  pull(taxonID) %>%
  n_distinct()

We can also use this to gain a detailed look at specific species or ids. For example, we can explore why get_ids fails to resolve a seemingly common species:

multi_match <- filter_name("Abies menziesii", provider = "col")

multi_match %>%
  select(1:5, genus, specificEpithet) %>%
  unite(acceptedScientificName, genus, specificEpithet, sep = " ") %>%
  printtable(caption = "Some names may not resolve to an identifier using get\\_ids() because they match to more than one accepted ID. In such cases filter\\_ functions give further detail, as in the example of *Abies menziesii* below which has two accepted ID matches.", escape = FALSE) %>%
  column_spec(3, italic = TRUE)

We see that Abies menziesii is a synonym for two accepted names, which the user will have to choose between (Table 6). This is an example of how taxadb seeks to provide users with information from existing authorities and names providers, rather than make a potentially arbitrary decision. Because they return data.frames, filter_ functions provide both potential matches. Note that the simpler get_ functions, such as get_ids(), return NA for the id when a name has multiple matches, making them suitable for automated pipelines where manual resolution of duplicates is not an option.
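The contrast between returning every candidate row and returning NA for an ambiguous name can be sketched on toy data. The naming table and the helper resolve_unique() below are hypothetical and written in base R; they mirror the behaviour described above rather than the taxadb internals.

```r
# Hypothetical naming table: "Aus bus" is a synonym of two accepted names
naming <- data.frame(
  scientificName      = c("Aus bus", "Aus bus", "Dus eus"),
  acceptedNameUsageID = c("TOY:1", "TOY:2", "TOY:3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return an identifier only when a name maps to
# exactly one accepted identifier, and NA otherwise
resolve_unique <- function(names, table = naming) {
  vapply(names, function(nm) {
    hits <- unique(table$acceptedNameUsageID[table$scientificName %in% nm])
    if (length(hits) == 1L) hits else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

# Vector result: ambiguous and unknown names both come back as NA
resolve_unique(c("Aus bus", "Dus eus", "Zus zus"))

# Table result: all candidate rows stay visible for manual resolution
naming[naming$scientificName == "Aus bus", ]
```

The vector form keeps one value per input name, which is what allows it to be used directly inside mutate() in an automated pipeline.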

Direct database access

The full taxonomic record in the database can also be directly accessed by taxa_tbl(), allowing for whole-database queries that are not possible through the API or web interface of many providers. For example, we can easily check the coverage of accepted species names in each of the classes of vertebrates within the Catalogue of Life (Table 7):

verts <- taxa_tbl("col") %>%
  filter(taxonomicStatus == "accepted", phylum == "Chordata", taxonRank == "species") %>% 
  count(class, sort = TRUE)

verts %>%
  printtable(caption = "taxadb also provides direct access to the database, allowing dplyr or SQL queries which can compute across the entire dataset, such as counting accepted species in all vertebrate classes shown here.  This kind of query is effectively impossible in most REST API-based interfaces.")

\dummy{\Begin{tcolorbox}[title= Box 2: Common Names, lower separated=false]}

taxadb can also resolve common names to their identifier by mapping each common name to the accepted scientific name. Common names suffer from many of the same issues as scientific names, only more frequently (e.g. matching to more than one accepted name, non-standardized formatting). Common names are accessed via filter_common(), which takes a vector of common names. The user can then resolve discrepancies.

\End{tcolorbox}

Discussion

Some taxonomic name providers (e.g. OTT, COL, NCBI) offer periodic releases of a static names list, while many other providers (e.g. ITIS, FB, IUCN) offer name data on a rolling basis (i.e. the data returned by a given download URL is updated continuously or at arbitrary intervals without any additional indication of whether and how that data has changed). taxadb's td_create() function downloads and stores cached snapshots from each provider, which follow an annual release model to support reproducible analyses. All taxadb functions that download or access data include an optional argument, version, to indicate which version of the provider data should be used. By default, taxadb will determine the latest version available (at the time of writing this is version 2019). Appropriate metadata is stored with each snapshot, including scripts used to access and reformat the data files, as described in the "Data Sources" vignette, https://docs.ropensci.org/taxadb/articles/data-sources.html.

Taxonomic identifiers are an essential first step for maintaining taxonomic consistency, a key task for a wide variety of applications. Despite multiple taxonomic standardization efforts, resolving names to taxonomic identifiers is often not a standard step in the research workflow, due to difficulty in accessing providers and the time-consuming API queries necessary for resolving even moderately sized data sets. taxadb fills an important gap between existing tools and typical research patterns by providing a fast, reproducible approach for matching names to taxonomic identifiers. It could also be used to verify that conclusions are robust to the choice of naming provider. taxadb is not intended as an improvement on or replacement for any existing approaches to taxonomic name resolution. In particular, taxadb is not a replacement for the APIs or databases provided, but merely an interface to the taxonomic naming information contained within that data.

Lastly, we note that the local database design used in taxadb is not unique to taxonomic names. Despite the rapid expansion of REST API-based interfaces to ecological data [@ropensci], in our experience, much of the data relevant to ecologists and evolutionary biologists today would also be amenable to the local database design. The local database approach is much easier for data providers (who can leverage static scientific database repositories instead of maintaining REST servers) and often much faster for data consumers.

Acknowledgments

We thank the many researchers who contributed to the data and infrastructure of the various taxonomic providers we access through our package. Support for the development of this package was provided by the United States Department of Energy through the Computational Sciences Graduate Fellowship (DOE CSGF) under grant number DE-FG02-97ER25308 awarded to K.E.A.N.

Data Availability

Code for the R package can be found on GitHub at https://github.com/ropensci/taxadb and is archived on Zenodo at DOI:10.5281/zenodo.3903858 [@taxadb]. The taxonomic database is also stored on GitHub at https://github.com/boettiger-lab/taxadb-cache. The original taxonomic data are stored by the individual providers, see "Catalogue of Life", http://www.catalogueoflife.org/ [@col], "ITIS", https://www.itis.gov [@itis], "NCBI", https://www.ncbi.nlm.nih.gov/taxonomy [@ncbi], "GBIF", https://gbif.org [@gbif], "FishBase", https://fishbase.se [@fishbase], "Open Tree Taxonomy", https://tree.opentreeoflife.org [@Rees2017], "IUCN", https://www.iucnredlist.org/resources/tax-sources [@iucn].

Authors' Contributions

K.E.A.N., S.C., and C.B. contributed to conceptual development of the package. K.E.A.N. and C.B. developed the package and contributed to the manuscript.

\pagebreak

td_disconnect()

References

cboettig/taxald documentation built on April 20, 2024, 12:55 p.m.
