knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

taxadb:::td_disconnect()
MonetDBLite::monetdblite_shutdown()
library(taxadb)
library(dplyr)
library(tidyr)

Introduction: the problem of name matching

We frequently want to combine different data sources using species names. Perhaps we have one table which gives us occurrance information for a given list of species, and another table that contains trait information for species, and maybe a third source that provides a phylogenetic tree. If our system of scientific species names was perfect, we could simply do a "table join" using species name as the joining key, and all would be well.

Unfortunately, as anyone who has attempted this kind of exercise over more than a handful of species has discovered, this often works for most but very rarely for all of the species names involved. There are several reasons this process runs into problems.

Synonyms. One of the most common problems is the existence of synonyms: different names or different spellings of a given scientific name. While this could include definite miss-spellings, for there are many species for which authorities can differ in the preferred name -- essentially, two or more different names correspond to the same species, and should be treated as such in our species join.

For instance, we consider the 233 primate species names used in the geiger package primates data. To avoid installing that dependency-heavy package, a copy of these names has been cached in taxadb and can be loaded as follows:

ex <- system.file("extdata", "primates.tsv.bz2", package = "taxadb", mustWork = TRUE)
primates <- readr::read_tsv(ex)
primates

Our goal will be to associate each name with a definitive taxonomic ID of an existing naming authority. This will allow us to merge on IDs directly, rather than names. By mapping both recognized names and synonyms to the same corresponding taxonomic ID, we can be sure that we can join the relevant data correctly.

Setup

To get started, we'll install all the data sources available to taxadb into a local database. This may take a while, particularly over a slow internet connection, but it needs to be done only once. The downloaded size of all data is around 3 GB. Once this task completes, our subsequent operations should all be quite fast.

td_create(authorities = "all")

Resolving names

Match a list of 233 species names against a naming authority:

my_taxa <- primates %>%  # 233 taxa
  mutate(id = get_ids(species, format = "prefix")) 
my_taxa %>% filter(is.na(id))

Only 3 species are missing ids, so we have managed to resolve 230 of the 233 species. Not bad! Actually,two of these do appear in ITIS records, but map to no accepted name according to ITIS:

unmatched <- my_taxa %>% filter(is.na(id)) %>% pull(species) 
synonyms(unmatched)

Under the hood, get_ids checks the names against synonyms known to the authority as well. If we only looke for direct matches, we would have faired far worse.

Mimicking taxize function of the same name get_ids is returning just a vector of ids, which is useful for operations such as mutate above. However, it is often convenient to return the full table of matched names. Here we can see that many of the names we have matched against are actually synonyms, and that get_ids has been able to resolve them into accepted ids.

ids(primates$species)

Note that name is the name we provided (species names in this case, but it they do not have to be). Because synonyms-check is TRUE by default, we get a list of accepted names recognized (by ITIS in this case, our default authority).

We aren't stuck with ITIS though, we can work with other naming authorities as well:

my_taxa <- primates %>%  
  mutate(itis = get_ids(species, "itis"),
         ncbi = get_ids(species, "ncbi"),
         col = get_ids(species, "col"),
         gbif = get_ids(species, "gbif"),
         wd = get_ids(species, "wd"),
         tpl = get_ids(species, "tpl"),
         fb = get_ids(species, "fb"),
         slb = get_ids(species, "slb")) 
my_taxa

Can any single authority resolve all species names?

my_taxa %>% 
  select(-species) %>% 
  purrr::map_dbl(function(x) sum(!is.na(x)))

Looks like itis has the most matches with 230. (Of course taxon-specific authorities like The Plant List, FishBase, and SeaLifeBase contain no primates).

We can also ask: do any species have no match in any authority?

has_match <- my_taxa %>% 
  select(-species) %>%
  purrr::map_dfc(function(x) !is.na(x)) %>% 
  rowSums() > 0

my_taxa %>% filter(!has_match)

One species still fails to resolve under any authority. A web search does resolve this as a misspelling, to Leontopithecus chrysomelas, but not one that is recongized automatically by the ITIS known synonyms.

Hierarchy

Once we have resolved our ids, we can also get full classification information. This example uses the default authority, ITIS:

primate_ids <- get_ids(primates$species, format = "prefix")
classification(id = primate_ids)

the classification function can work on species names directly, but this will not resolve syonyms to identifiers first. As a result, only the 180 already recognized species names can be resolved in this case:

classification(species = primates$species)
taxadb:::td_disconnect()
MonetDBLite::monetdblite_shutdown()


cboettig/taxadb documentation built on April 17, 2024, 6:34 p.m.