taxadb: A High-Performance Local Taxonomic Database Interface

The goal of taxadb is to provide fast, consistent access to taxonomic data, supporting common tasks such as resolving taxonomic names to identifiers, looking up the higher classification ranks of given species, or returning a list of all species below a given rank. These tasks are particularly common when synthesizing data across large species assemblages, such as combining occurrence records with trait records.

Existing approaches to these problems typically rely on web APIs, which can make them impractical for work with large numbers of species or in more complex pipelines. Queries and returned formats also differ across the different taxonomic authorities, making tasks that query multiple authorities particularly complex. taxadb creates a local database of most readily available taxonomic authorities, each of which is transformed into consistent, standard, and researcher-friendly tabular formats.

Install and initial setup

To get started, install the development version directly from GitHub:

devtools::install_github("ropensci/taxadb")

library(taxadb)
library(dplyr) # Used to illustrate how a typical workflow combines nicely with `dplyr`

Create a local copy of the Catalogue of Life (2018) database:

td_create("col")
#> Importing 2019_common_col.tsv.bz2 in 100000 line chunks:
#> 
#> [-] chunk 2
#> [\] chunk 3
#> [|] chunk 4
#> [/] chunk 5 ...Done! (in 9.042107 secs)

Read in the species list used by the Breeding Bird Survey:

bbs_species_list <- system.file("extdata/bbs.tsv", package="taxadb")
bbs <- read.delim(bbs_species_list)

Getting names and ids

Two core functions are get_ids() and get_names(). These functions take a vector of names or ids (respectively), and return a vector of ids or names (respectively). For instance, we can use this to attempt to resolve all the bird names in the Breeding Bird Survey against the Catalogue of Life:

birds <- bbs %>% 
  select(species) %>% 
  mutate(id = get_ids(species, "col"))

head(birds, 10)
#>                          species           id
#> 1         Dendrocygna autumnalis COL:35517330
#> 2            Dendrocygna bicolor COL:35517332
#> 3                Anser canagicus COL:35517329
#> 4             Anser caerulescens COL:35517325
#> 5  Chen caerulescens (blue form)         <NA>
#> 6                   Anser rossii COL:35517328
#> 7                Anser albifrons COL:35517308
#> 8                Branta bernicla COL:35517301
#> 9      Branta bernicla nigricans COL:35537100
#> 10             Branta hutchinsii COL:35536445

Note that some names cannot be resolved to an identifier. This can occur because of misspellings, non-standard formatting, or the use of a synonym not recognized by the naming provider. Names that cannot be uniquely resolved because they are known synonyms of multiple different species will also return NA. The filter_name() function can help us resolve this last case (see below).
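One quick way to see which names failed to resolve is to filter on missing ids. The sketch below is an illustration only: it uses a small hand-built data frame (the hypothetical `birds_demo`, with rows copied from the printed output above) instead of querying the database, so it runs without a local taxadb copy.

```r
# Toy stand-in for the `birds` table above (values copied from the printed output)
birds_demo <- data.frame(
  species = c("Dendrocygna autumnalis",
              "Chen caerulescens (blue form)",
              "Anser rossii"),
  id = c("COL:35517330", NA, "COL:35517328"),
  stringsAsFactors = FALSE
)

# Names for which no identifier was returned
unresolved <- birds_demo$species[is.na(birds_demo$id)]
print(unresolved)
```

The same `is.na()` test should apply unchanged to the real `birds` table, where it would list every name that needs manual attention.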

get_ids() returns the IDs of accepted names, that is, the Darwin Core dwc:acceptedNameUsageID. We can resolve these IDs into accepted names:

birds %>% 
  mutate(accepted_name = get_names(id, "col")) %>% 
  head()
#>                         species           id        accepted_name
#> 1        Dendrocygna autumnalis COL:35517330      Tringa flavipes
#> 2           Dendrocygna bicolor COL:35517332    Picoides dorsalis
#> 3               Anser canagicus COL:35517329   Setophaga castanea
#> 4            Anser caerulescens COL:35517325  Bombycilla cedrorum
#> 5 Chen caerulescens (blue form)         <NA>       Icteria virens
#> 6                  Anser rossii COL:35517328 Somateria mollissima

This illustrates that some of our names, e.g. Dendrocygna bicolor, are accepted in the Catalogue of Life, while others, e.g. Anser canagicus, are known synonyms of a different accepted name: Chen canagica. Resolving synonyms and accepted names to identifiers helps us avoid the mismatches that can arise when the same species is known by two different names.
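To flag which input names are synonyms, one option is to compare each supplied name with its accepted name. The following is a minimal sketch under stated assumptions: the `resolved` data frame is built by hand from the two cases described in the text, and the `is_synonym` column is a name introduced here for illustration, not part of taxadb.

```r
# Hand-built table: input names alongside the accepted names
# described in the text (not queried from the database)
resolved <- data.frame(
  species = c("Dendrocygna bicolor", "Anser canagicus"),
  accepted_name = c("Dendrocygna bicolor", "Chen canagica"),
  stringsAsFactors = FALSE
)

# TRUE when the supplied name differs from the accepted name,
# i.e. the supplied name is a synonym
resolved$is_synonym <- resolved$species != resolved$accepted_name
print(resolved)
```

On real data, rows whose name could not be resolved would yield NA rather than TRUE or FALSE in this comparison, so they are best filtered out first.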

Taxonomic Data Tables

Local access to taxonomic data tables lets us do much more than look up names and ids. A family of filter_* functions in taxadb helps us work directly with subsets of the taxonomic data. As we noted above, this can be useful in resolving certain ambiguous names.

For instance, Trochalopteron henrici gucenense does not resolve to an identifier in ITIS:

get_ids("Trochalopteron henrici gucenense") 
#> [1] NA

Using filter_name(), we find this is because the name resolves not to zero matches, but to more than one match:

filter_name("Trochalopteron henrici gucenense") 
#> # A tibble: 2 x 17
#>    sort taxonID scientificName taxonRank acceptedNameUsa… taxonomicStatus update_date kingdom phylum class order family genus
#>   <int> <chr>   <chr>          <chr>     <chr>            <chr>           <chr>       <chr>   <chr>  <chr> <chr> <chr>  <chr>
#> 1     1 ITIS:9… Trochaloptero… subspeci… ITIS:916117      synonym         <NA>        Animal… Chord… Aves  Pass… Leiot… Troc…
#> 2     1 ITIS:9… Trochaloptero… subspeci… ITIS:916116      synonym         <NA>        Animal… Chord… Aves  Pass… Leiot… Troc…
#> # … with 4 more variables: specificEpithet <chr>, vernacularName <chr>, infraspecificEpithet <chr>, input <chr>

filter_name("Trochalopteron henrici gucenense")  %>%
  mutate(acceptedNameUsage = get_names(acceptedNameUsageID)) %>% 
  select(scientificName, taxonomicStatus, acceptedNameUsage, acceptedNameUsageID)
#> # A tibble: 2 x 4
#>   scientificName                   taxonomicStatus acceptedNameUsage       acceptedNameUsageID
#>   <chr>                            <chr>           <chr>                   <chr>              
#> 1 Trochalopteron henrici gucenense synonym         Trochalopteron henrici  ITIS:916117        
#> 2 Trochalopteron henrici gucenense synonym         Trochalopteron elliotii ITIS:916116
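The ambiguity shown above can also be detected programmatically by counting how many distinct accepted ids each input name maps to. This is only a sketch: the `matches` data frame is typed in by hand from the two rows printed above, and `n_accepted` and `ambiguous` are illustrative names.

```r
# Hand-copied subset of the filter_name() result shown above
matches <- data.frame(
  scientificName = rep("Trochalopteron henrici gucenense", 2),
  acceptedNameUsageID = c("ITIS:916117", "ITIS:916116"),
  stringsAsFactors = FALSE
)

# Number of distinct accepted ids for each input name
n_accepted <- vapply(
  split(matches$acceptedNameUsageID, matches$scientificName),
  function(ids) length(unique(ids)),
  integer(1)
)

# Names that map to more than one accepted taxon are ambiguous
ambiguous <- names(n_accepted)[n_accepted > 1]
print(ambiguous)
```

A name that appears here needs a manual decision about which accepted taxon is intended, which is consistent with get_ids() returning NA for it.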

Similar functions, filter_id(), filter_rank(), and filter_common(), take IDs, taxonomic ranks, or common names, respectively. Here, we can get taxonomic data on all bird names in the Catalogue of Life:

filter_rank(name = "Aves", rank = "class", provider = "col")
#> # A tibble: 35,398 x 21
#>     sort taxonID scientificName acceptedNameUsa… taxonomicStatus taxonRank kingdom phylum class order family genus
#>    <int> <chr>   <chr>          <chr>            <chr>           <chr>     <chr>   <chr>  <chr> <chr> <chr>  <chr>
#>  1     1 COL:35… Sturnella mag… COL:35520416     accepted        species   Animal… Chord… Aves  Pass… Icter… Stur…
#>  2     1 COL:35… Tauraco porph… COL:35530219     accepted        infraspe… Animal… Chord… Aves  Muso… Musop… Taur…
#>  3     1 COL:35… Pyroderus scu… COL:35534370     accepted        infraspe… Animal… Chord… Aves  Pass… Cotin… Pyro…
#>  4     1 COL:35… Dromaius minor COL:35552206     synonym         infraspe… Animal… Chord… Aves  Casu… Droma… Drom…
#>  5     1 COL:35… Lepidocolapte… COL:35525495     accepted        species   Animal… Chord… Aves  Pass… Furna… Lepi…
#>  6     1 COL:35… Casuarius pap… COL:35552204     synonym         infraspe… Animal… Chord… Aves  Casu… Casua… Casu…
#>  7     1 COL:35… Forpus modest… COL:35536431     accepted        species   Animal… Chord… Aves  Psit… Psitt… Forp…
#>  8     1 COL:35… Pterocnemia p… COL:35552203     synonym         infraspe… Animal… Chord… Aves  Rhei… Rheid… Rhea 
#>  9     1 COL:35… Ceyx lepidus … COL:35532279     accepted        infraspe… Animal… Chord… Aves  Cora… Alced… Ceyx 
#> 10     1 COL:35… Rhea tarapace… COL:35552202     synonym         infraspe… Animal… Chord… Aves  Rhei… Rheid… Rhea 
#> # … with 35,388 more rows, and 9 more variables: specificEpithet <chr>, infraspecificEpithet <chr>, taxonConceptID <chr>,
#> #   isExtinct <chr>, nameAccordingTo <chr>, namePublishedIn <chr>, scientificNameAuthorship <chr>, vernacularName <chr>,
#> #   input <chr>

Combining these with dplyr functions can make it easy to explore this data: for instance, which families have the most species?

filter_rank(name = "Aves", rank = "class", provider = "col") %>%
  filter(taxonomicStatus == "accepted", taxonRank=="species") %>% 
  group_by(family) %>%
  count(sort = TRUE) %>% 
  head()
#> # A tibble: 6 x 2
#> # Groups:   family [6]
#>   family           n
#>   <chr>        <int>
#> 1 Tyrannidae     401
#> 2 Thraupidae     374
#> 3 Psittacidae    370
#> 4 Trochilidae    338
#> 5 Muscicapidae   314
#> 6 Columbidae     312

Using the database connection directly

filter_* functions by default return in-memory data frames. Because they are filtering functions, they return only the subset of the full data that matches a given query (names, ids, ranks, etc.), so the returned data frames are smaller than the full record of a naming provider. Working directly with the SQL connection to the local database (DuckDB, as shown in the output below) gives us access to all the data. The taxa_tbl() function provides this connection:

taxa_tbl("col")
#> # Source:   table<2019_dwc_col> [?? x 19]
#> # Database: duckdb_connection
#>    taxonID scientificName acceptedNameUsa… taxonomicStatus taxonRank kingdom phylum class order family genus specificEpithet
#>    <chr>   <chr>          <chr>            <chr>           <chr>     <chr>   <chr>  <chr> <chr> <chr>  <chr> <chr>          
#>  1 COL:31… Limacoccus br… COL:316423       accepted        species   Animal… Arthr… Inse… Hemi… Beeso… Lima… brasiliensis   
#>  2 COL:31… Coccus bromel… COL:316424       accepted        species   Animal… Arthr… Inse… Hemi… Cocci… Cocc… bromeliae      
#>  3 COL:31… Apiomorpha po… COL:316425       accepted        species   Animal… Arthr… Inse… Hemi… Erioc… Apio… pomaphora      
#>  4 COL:31… Eriococcus ch… COL:316441       accepted        species   Animal… Arthr… Inse… Hemi… Erioc… Erio… chaoticus      
#>  5 COL:31… Eriococcus ch… COL:316442       accepted        species   Animal… Arthr… Inse… Hemi… Erioc… Erio… chathamensis   
#>  6 COL:31… Eriococcus ch… COL:316443       accepted        species   Animal… Arthr… Inse… Hemi… Erioc… Erio… chilensis      
#>  7 COL:31… Eriococcus ci… COL:316444       accepted        species   Animal… Arthr… Inse… Hemi… Erioc… Erio… cingulatus     
#>  8 COL:31… Eriococcus ci… COL:316445       accepted        species   Animal… Arthr… Inse… Hemi… Erioc… Erio… cistacearum    
#>  9 COL:31… Eriococcus bu… COL:316447       accepted        species   Animal… Arthr… Inse… Hemi… Erioc… Erio… busariae       
#> 10 COL:31… Eriococcus au… COL:316450       accepted        species   Animal… Arthr… Inse… Hemi… Erioc… Erio… australis      
#> # … with more rows, and 7 more variables: infraspecificEpithet <chr>, taxonConceptID <chr>, isExtinct <chr>,
#> #   nameAccordingTo <chr>, namePublishedIn <chr>, scientificNameAuthorship <chr>, vernacularName <chr>

We can still use most familiar dplyr verbs to perform common tasks. For instance: which species has the most known synonyms?

taxa_tbl("col") %>% 
  count(acceptedNameUsageID, sort=TRUE)
#> # Source:     lazy query [?? x 2]
#> # Database:   duckdb_connection
#> # Ordered by: desc(n)
#>    acceptedNameUsageID     n
#>    <chr>               <dbl>
#>  1 COL:43082445          456
#>  2 COL:43081989          373
#>  3 COL:43124375          329
#>  4 COL:43353659          328
#>  5 COL:43223150          322
#>  6 COL:43337824          307
#>  7 COL:43124158          302
#>  8 COL:43081973          296
#>  9 COL:43333057          253
#> 10 COL:23162697          252
#> # … with more rows
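Because accepted rows carry their own acceptedNameUsageID (as in the taxa_tbl() output above), the counts above presumably include the accepted name itself as well as its synonyms. To count synonyms only, one could filter on taxonomicStatus first. The sketch below shows the idea in base R on a made-up table; `taxa_demo`, its placeholder names, and the `X:` ids are hypothetical and not real Catalogue of Life records.

```r
# Made-up table with the same column names as the taxonomic tables above
taxa_demo <- data.frame(
  scientificName = c("Name A", "Name A1", "Name A2", "Name B", "Name B1"),
  acceptedNameUsageID = c("X:1", "X:1", "X:1", "X:2", "X:2"),
  taxonomicStatus = c("accepted", "synonym", "synonym", "accepted", "synonym"),
  stringsAsFactors = FALSE
)

# Keep synonym rows only, then count rows per accepted id, largest first
synonyms_only <- taxa_demo[taxa_demo$taxonomicStatus == "synonym", ]
synonym_counts <- sort(table(synonyms_only$acceptedNameUsageID),
                       decreasing = TRUE)
print(synonym_counts)
```

With dplyr and the database connection, the equivalent would be a filter() on taxonomicStatus placed before count() in the pipeline above.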