library(tidyverse)
library(taxadb)

Populate the taxadb database. This is only required right after installing taxadb; on later runs it simply reports the tables that are already installed.

td_create("all")
algae <- read_csv("~/projects/algae-names/algae_uncleanednames_NAdropped.csv")
## Input table with clean names
## `sort` records the original row order so it can be restored after joining
algae <- algae %>% mutate(input = clean_names(species), sort = seq_along(input))

## Let's get some matches
taxa <- taxa_tbl("ott") %>% 
        mutate_db(clean_names, "scientificName", "input") %>%
        right_join(algae, copy=TRUE, by="input") %>% 
        arrange(sort)  %>% 
        collect()

## lots of duplicate matches; keep one per input for now
## (top_n keeps the largest acceptedNameUsageID and drops rows where it is NA):
matched <- taxa %>% select(acceptedNameUsageID, sort) %>% distinct() %>% 
  group_by(sort) %>% top_n(1, acceptedNameUsageID)


# 46,045 / 57,700 have been matched!

## Who is unmatched? (sort id is in algae table but not in the matched table)
unmatched <- anti_join(algae, matched, by="sort")

# 11,655 are still unmatched.  Many appear to be synonyms known to AlgaeBase...
unmatched %>% count(source) %>% arrange(desc(n))
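
The bookkeeping above (46,045 matched + 11,655 unmatched = 57,700 inputs) can be checked on a small scale. The following is a minimal sketch on made-up data, not the real algae table: `algae_toy`, `taxa_toy`, and the `OTT:` identifiers are hypothetical stand-ins. It also swaps in `slice_max()` with an explicit `NA` filter in place of `top_n()`, which is an assumption about an equivalent way to keep one ID per input.

```r
library(dplyr)

# Hypothetical stand-in for the cleaned input table.
algae_toy <- tibble(
  input = c("ulva lactuca", "fucus vesiculosus", "not a real name"),
  sort  = 1:3
)

# Hypothetical stand-in for the joined result: input 1 has two candidate
# matches, input 2 has one, and input 3 has none (NA after the right join).
taxa_toy <- tibble(
  sort = c(1L, 1L, 2L, 3L),
  acceptedNameUsageID = c("OTT:1", "OTT:2", "OTT:3", NA)
)

# Keep exactly one accepted ID per input, dropping inputs with no match.
matched_toy <- taxa_toy %>%
  filter(!is.na(acceptedNameUsageID)) %>%
  distinct() %>%
  group_by(sort) %>%
  slice_max(acceptedNameUsageID, n = 1, with_ties = FALSE) %>%
  ungroup()

# Inputs whose sort id never appears in the matched table.
unmatched_toy <- anti_join(algae_toy, matched_toy, by = "sort")

n_matched   <- nrow(matched_toy)
n_unmatched <- nrow(unmatched_toy)

# Every input is either matched or unmatched, never both.
stopifnot(n_matched + n_unmatched == nrow(algae_toy))
```

Because `matched_toy` holds one row per matched input and the anti-join returns the rest, the two counts always sum to the number of input rows.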

I'm using the names given for red, green, and brown algae in Guiry (2012).

Open Tree Taxonomy (OTT) has 32,347 names recognized as belonging to one of the three phyla:

ott_phyla <- bind_rows(
  descendants(name = "Cyanobacteria", rank = "phylum", authority = "ott"),
  descendants(name = "Rhodophyta", rank = "phylum", authority = "ott"),
  descendants(name = "Phaeophyceae", rank = "phylum", authority = "ott")
)

ott_phyla

GBIF has 7,412 recognized names:

gbif_phyla <- bind_rows(
  descendants(name = "Cyanobacteria", rank = "phylum", authority = "gbif"),
  descendants(name = "Rhodophyta", rank = "phylum", authority = "gbif"),
  descendants(name = "Phaeophyceae", rank = "phylum", authority = "gbif")
)

gbif_phyla

How many GBIF names also match the names given in OTT? Looks like only 2,923 exact matches.

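## Join on the columns the two tables share, stated explicitly
gbif_in_ott <- gbif_phyla %>% 
  select(gbif_id = taxonID, scientificName, taxonRank) %>%
  inner_join(ott_phyla, by = c("scientificName", "taxonRank"))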
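
To see how the exact-match count and its complement fall out of the two joins, here is a small sketch on made-up data. The tables `gbif_toy` and `ott_toy` and all identifiers in them are hypothetical, and the join keys are assumed to be `scientificName` and `taxonRank`.

```r
library(dplyr)

# Hypothetical GBIF-side names, already reduced to the three columns of interest.
gbif_toy <- tibble(
  gbif_id        = c("GBIF:1", "GBIF:2", "GBIF:3"),
  scientificName = c("Ulva lactuca", "Fucus vesiculosus", "Chondrus crispus"),
  taxonRank      = c("species", "species", "species")
)

# Hypothetical OTT-side names.
ott_toy <- tibble(
  taxonID        = c("OTT:10", "OTT:11"),
  scientificName = c("Ulva lactuca", "Fucus vesiculosus"),
  taxonRank      = c("species", "species")
)

keys <- c("scientificName", "taxonRank")

# Names present in both providers (exact string match on name and rank).
both <- inner_join(gbif_toy, ott_toy, by = keys)

# GBIF names with no exact counterpart in OTT.
gbif_only <- anti_join(gbif_toy, ott_toy, by = keys)

n_exact <- n_distinct(both$scientificName)
```

The anti-join is the useful half in practice: its rows are the candidates for fuzzy or synonym-based matching, since an exact string comparison misses spelling variants and differing authorship strings.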


cboettig/taxadb documentation built on April 17, 2024, 6:34 p.m.