Load libraries and data to get started

library(tidyverse)
library(taxadb)
taxadb::td_create("all") # only needed once.
algae <- read_csv("~/projects/algae-names/algae_uncleanednames_NAdropped.csv")

Summarizing the data

Before we begin, let's take a look around the data to get a sense of what we have:

dim(algae)[[1]] # 57,700 rows. 
names(algae) # cols are: source, family, species, is_source_herbaria
algae %>% count(family) %>% arrange(desc(n))  # 1,126 families (uncleaned)
# 25 sources, including MACROALGAE, iDigBio, GBIF, OBIS, ...
algae %>% count(source) %>% arrange(desc(n))

Many names are duplicated across sources: there are only 37,505 unique species names.

algae %>% count(species) %>% arrange(desc(n))

However, some of these duplicated names are assigned to different families:

unique_sp_family <- algae %>% count(species, family) %>% arrange(desc(n))
unique_sp_family  # 42,242 rows

This is sometimes due to missing data or differences in capitalization, but there are also cases where the same species name is assigned to different families:

species_multiple_families <- 
  unique_sp_family %>% 
  filter(species != "sp.", species != "Indet. sp.") %>%
  select(species) %>% 
  count(species) %>% 
  arrange(desc(n)) %>%
  left_join(unique_sp_family %>% select(-n), by = "species")
species_multiple_families %>%  arrange(desc(n))
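To separate genuine family disagreements from capitalization or missing-data artifacts, one option is to normalize the family strings before counting. The following is a minimal base R sketch on invented toy data (the rows and the deliberately wrong second family for Fucus vesiculosus are illustrative only, not taken from the algae table):

```r
# Toy data, invented for illustration: one species differs only by
# capitalization / missing family, the other has two different families.
toy <- data.frame(
  species = c("Ulva lactuca", "Ulva lactuca", "Ulva lactuca",
              "Fucus vesiculosus", "Fucus vesiculosus"),
  family  = c("Ulvaceae", "ulvaceae", NA, "Fucaceae", "Sargassaceae"),
  stringsAsFactors = FALSE
)

# Normalize capitalization and ignore rows with a missing family
toy$family_norm <- tolower(toy$family)
known <- toy[!is.na(toy$family_norm), ]

# Number of distinct (normalized) families recorded for each species
n_families <- tapply(known$family_norm, known$species,
                     function(f) length(unique(f)))

# Species that still have more than one family after normalization
conflicting <- names(n_families)[n_families > 1]
conflicting
```

Only species that remain in `conflicting` reflect a real difference in family assignment.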

We will assume these differences correspond to changes in taxonomic group assignment at the family level, and not to two distinct species with the same scientific name (genus+specific epithet) belonging to different families.


Resolving names to IDs

So we'll begin by focusing on resolving the unique species names:

names <- unique(algae$species) # 37,505

Without further ado, let's start matching names. We'll resolve against the Open Tree Taxonomy (OTT) first, because it is an assembly that already includes names from WORMS, GBIF, and others.

ott_ids <- names %>% ids("ott")

The resulting data frame has at least one row per name provided, though that row may be all NA if nothing matched.

This database includes synonyms, which do not have their own OTT identifiers in taxonID, but can be resolved to accepted names with accepted identifiers. Thus, we want to look at the acceptedNameUsageID column to see which names matched and which didn't. (This nomenclature for column names probably feels clumsy, but it comes from the Darwin Core standard and thus lets us be explicitly consistent with how these terms are used elsewhere.)

The ids function also automatically normalizes all strings to lowercase before matching against the (lowercased) scientificNames in the database. ids returns these strings in the input column. Standardizing between upper and lowercase reveals that many of the 37,505 names are still duplicates:

length(ott_ids$input)
length(unique(ott_ids$input))

The sort column indicates the position of each name in the original input data. Without this column, input names that are identical after correcting for capitalization result in identical rows. We can drop these duplicates by removing sort and keeping only the distinct rows:

clean_ott_ids <- ott_ids %>% select(-sort) %>% distinct()
clean_ott_ids # 29,878 rows
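As a rough illustration of why dropping the position column exposes duplicates, here is a small base R sketch on made-up input names. The data frame only mimics the input and sort columns described above; it is not output from ids:

```r
# Made-up names that differ only in capitalization
raw_names <- c("Ulva lactuca", "ulva lactuca", "ULVA LACTUCA",
               "Fucus vesiculosus")

# Mimic the two columns discussed above: lowercased input plus position
toy_ids <- data.frame(
  sort  = seq_along(raw_names),
  input = tolower(raw_names),
  stringsAsFactors = FALSE
)

# With `sort` present every row is unique ...
nrow(unique(toy_ids))

# ... but once it is dropped, rows that differed only by position collapse
deduped <- unique(toy_ids[, "input", drop = FALSE])
nrow(deduped)
```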

We can now check how many unique names were matched:

matched <- clean_ott_ids %>% filter(!is.na(acceptedNameUsageID))
length(unique(matched$input)) # 18,568
length(unique(matched$acceptedNameUsageID)) #17,766

matched %>% filter(taxonomicStatus != "accepted") # 1,766
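The count of matched names can exceed the count of accepted IDs because a synonym and its accepted name share one acceptedNameUsageID. A toy base R sketch of that bookkeeping follows; the names and the "OTT:1"-style identifiers are placeholders invented for illustration:

```r
# Placeholder rows: "name b" is a synonym that resolves to the same
# accepted ID as "name a"
toy_matched <- data.frame(
  input               = c("name a", "name b", "name c"),
  acceptedNameUsageID = c("OTT:1", "OTT:1", "OTT:2"),
  taxonomicStatus     = c("accepted", "synonym", "accepted"),
  stringsAsFactors = FALSE
)

n_names    <- length(unique(toy_matched$input))               # matched names
n_ids      <- length(unique(toy_matched$acceptedNameUsageID)) # accepted IDs
n_synonyms <- sum(toy_matched$taxonomicStatus != "accepted")  # synonym rows

c(names = n_names, ids = n_ids, synonyms = n_synonyms)
```

Note that a synonym only reduces the ID count when its accepted name is also among the inputs, so the difference between names and IDs need not equal the number of synonyms.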

We matched 18,568 unique names (out of r length(unique(clean_ott_ids$input))), which resolved to 17,766 unique IDs in OTT, in part because 1,766 of the matched names are recognized by OTT as synonyms. Meanwhile, in the unmatched names set, this leaves us with:

unmatched <- clean_ott_ids %>% filter(is.na(acceptedNameUsageID)) %>% pull(input)
length(unique(unmatched))

That leaves 10,911 unmatched names still to resolve (a little more than one third of the unique names). Counting the whitespace in each unmatched name gives a sense of what they look like:

wordcount <- 
  data.frame(input = unmatched) %>% 
  mutate(n = stringi::stri_count_regex(input, "\\s")) %>% 
  arrange(desc(n))

wordcount

Some of our entries have far more words than a species name: the top 6 entries have dozens of words. Many more (1,551) have 4 words (three spaces), indicating subspecies and varieties:

wordcount %>% filter(n==3)
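The same whitespace count can be reproduced without stringi. This is a small base R sketch on invented strings, shown only to make the relationship between the count n and the number of words concrete:

```r
# Invented example strings, from a bare genus to a free-text note
x <- c("Ulva",
       "Ulva lactuca",
       "Ulva lactuca var. rigida",
       "collected near the shore at low tide")

# Count whitespace characters per string; regmatches() returns an empty
# vector when there is no match, so a single word correctly counts as 0
n_spaces <- lengths(regmatches(x, gregexpr("\\s", x)))

data.frame(input = x, n = n_spaces, words = n_spaces + 1L)
```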

If we restrict ourselves to matching on the first two words of each name, we have a reasonable chance of resolving names to the genus or species level. clean_names applies up to three transformations: it drops the missing-specific-epithet indicator sp, leaving only the genus name; it keeps only the first two words of the name; and it standardizes the delimiters between words.

clean_unmatched <- unmatched %>% clean_names() %>% unique()
length(clean_unmatched)

This gives us r length(clean_unmatched) unmatched names to resolve.
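To make the three transformations concrete, here is a hypothetical base R helper, simple_clean, that approximates them with regular expressions. It is a sketch under stated assumptions, not the implementation of taxadb's clean_names, and the exact rules (for example, which delimiters count) are guesses:

```r
# Hypothetical helper for illustration only; NOT taxadb::clean_names()
simple_clean <- function(x) {
  # 1. Standardize delimiters: underscores and runs of whitespace
  #    become a single space (assumed set of delimiters)
  x <- trimws(gsub("[[:space:]_]+", " ", x))
  # 2. Drop a missing-epithet marker ("sp", "sp.", "spp.") and anything
  #    after it, leaving only the genus
  x <- sub("^(\\S+) (sp|spp)\\.?( .*)?$", "\\1", x)
  # 3. Keep only the first two words (genus + specific epithet)
  sub("^(\\S+ \\S+) .*$", "\\1", x)
}

cleaned <- simple_clean(c("Ulva_lactuca",
                          "Ulva sp.",
                          "Ulva spinulosa",
                          "Ulva lactuca var. rigida",
                          "Fucus  vesiculosus L."))
cleaned
```

The third input checks that an epithet merely beginning with "sp" is left alone.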

ott_ids2 <- clean_unmatched %>% ids("ott")
unmatched2 <- ott_ids2 %>% filter(is.na(acceptedNameUsageID)) %>% pull(input)

We can bind the new matches to our existing matched names table:

matched2 <- ott_ids2 %>% filter(!is.na(acceptedNameUsageID)) %>% bind_rows(matched)

With these clean names, another r length(clean_unmatched) - length(unmatched2) can be matched to OTT IDs, leaving r length(unmatched2) unmatched:

length(unmatched2)
head(unmatched2, 10)

Working from the names directly:

df <- db_mutate(r_fn = clean_names,
          tbl = "ott",
          db = td_connect(),
          col = "scientificName",
          new_column = "input") 
sp <- tibble(input = unmatched2)

match_binomial <- right_join(df, sp, copy=TRUE, by = "input")  %>% collect()
match_binomial %>% filter(is.na(acceptedNameUsageID)) %>% distinct()

dim(match_binomial)

We can find at least some matches for these remaining names in most of the other databases:

unmatched2 %>% ids("slb") %>% summarise(matched = sum(!is.na(acceptedNameUsageID)))
unmatched2 %>% ids("col") %>% summarise(matched = sum(!is.na(acceptedNameUsageID)))
unmatched2 %>% ids("gbif") %>% summarise(matched = sum(!is.na(acceptedNameUsageID)))
unmatched2 %>% ids("itis") %>% summarise(matched = sum(!is.na(acceptedNameUsageID)))
unmatched2 %>% ids("ncbi") %>% summarise(matched = sum(!is.na(acceptedNameUsageID)))
unmatched2 %>% ids("wd") %>% summarise(matched = sum(!is.na(acceptedNameUsageID)))
unmatched2 %>% ids("iucn") %>% summarise(matched = sum(!is.na(acceptedNameUsageID)))
unmatched2 %>% ids("tpl") %>% summarise(matched = sum(!is.na(acceptedNameUsageID)))
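The repeated lines above all compute the same tally for a different provider. The idea can be sketched in one pass with vapply. In this toy base R version each provider is stood in for by a made-up vector of known names; in practice the lookup would be the ids call shown above, which is not reproduced here:

```r
# Made-up stand-ins for provider name lists (illustration only)
providers <- list(
  gbif = c("ulva lactuca", "fucus vesiculosus"),
  itis = c("ulva lactuca"),
  ncbi = character(0)
)

# Made-up names still awaiting a match
remaining <- c("ulva lactuca", "fucus vesiculosus", "not a name")

# Number of remaining names found in each provider's list
matched_per_provider <- vapply(
  providers,
  function(ref) sum(remaining %in% ref),
  integer(1)
)

matched_per_provider
best <- names(which.max(matched_per_provider))
best
```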

GBIF does the best among these, resolving another 1,614 names exactly.

gbif_ids <- unmatched2 %>% ids("gbif")
unmatched3 <- gbif_ids %>% filter(is.na(acceptedNameUsageID)) %>% pull(input)
length(unmatched3) # number of names still unmatched after GBIF

FIXME also try matching against clean_names versions of scientificName column on the database side; in particular, on the synonyms.



cboettig/taxadb documentation built on April 17, 2024, 6:34 p.m.