Taxonomic ranks are "whole thing", as I like to say when something is complicated/messy. Taxonomic ranks are the relative placement of a taxon in a taxonomic hierarchy. If there was only one taxonomic hierarchy with the same taxonomic rank names life would be so much easier for all of us. Unfortunately, this is not the case. Nearly every source of taxonomic information has slightly different taxonomic rank names they use, even if many are the same across sources. This article highlights some of the sticky messes in taxonomic ranks.

NCBI is weird

NCBI taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy) is an outlier among the taxonomic data sources. They use a lot of rank names that hold no information as to the placement of that taxon within a taxonomic hierarchy. The most common of these is no rank. If you've retrieved taxonomic information from NCBI you may have noticed this supposed rank name. When you get data from NCBI in taxize you will most certainly run into this rank; just know that it doesn't tell you where it falls within a taxonomic hierarchy. You can however use classification() to hopefully get other rank names that do hold actual information.

Another one that holds no information which has started to become more common with NCBI is clade. It's started to replace no rank in at least some cases. As far as we can tell this rank name also holds no information about it's placement within a taxonomic hierarchy.

Downstream

downstream() and related functions in taxize require a rank to be given to the downto parameter, which says you want taxonomic names down to a particular rank.

Some ranks are not useful for the downto parameter. They include "no rank", "unspecified", "unranked", and "clade"; there are possibly more. They can't be used because they have no rank placement in a hierarchy - they can be anywhere in the hierarchy.

As a work around, try looking at the taxonomic hierarchy for your taxa of interest using classification() or on the data providers website and see if there's a taxonomic rank below the rank you want, and use that rank in combination with setting intermediate=TRUE in your downstream() function call. From the intermediate data you can fetch data you need.

Rank information within taxize

taxize maintains a taxonomic hierarchy reference within the package. There are two:

head(rank_ref)
#>   rankid                    ranks
#> 1     01                   domain
#> 2     05             superkingdom
#> 3     10                  kingdom
#> 4     20               subkingdom
#> 5     25 infrakingdom,superphylum
#> 6     30          phylum,division

We reference this rank information in classification(), tax_name(), and all of the exported downstream functions. We use the rankid numeric column to be able to filter data using </>/etc.

We often reference rank information inside the helper functions which_rank() and prune_too_low(), utility functions to help filter results from data providers by rank information.

We manage these two datasets using the script at inst/ignore/rank_ref_script.R in the package, which ends with updating each dataset in the data/ folder of the package.

Let us know

Let us know by opening an issue (https://github.com/ropensci/taxize/issues) if you think you have run into any issues with rank information. The most common issues we see with ranks is with the downstream() function and the data source specific functions that support it.



ropensci/taxize documentation built on Jan. 25, 2024, 6:49 p.m.