knitr::opts_chunk$set(echo = TRUE)
library(kableExtra)

Overview

taxonomyCleanr is easy to use and requires no taxonomic expertise. Send your data through a series of cleaning functions (count_taxa, trim_taxa, replace_taxa, remove_taxa), pass the resulting output to the resolver functions (resolve_sci_taxa, resolve_comm_taxa), and create a revision of your raw data (revise_taxa). Voila! Clean taxonomic data!

Below is a demonstration of this process using example data that comes installed with the taxonomyCleanr package.

Installation

Install taxonomyCleanr from the project GitHub.

# Install from GitHub
# remotes::install_github('EDIorg/taxonomyCleanr')
library(taxonomyCleanr)

Load data

Load the taxonomic data into RStudio as a data frame (or tibble). The taxa must be listed in a single column of class character, not factor.

# Load test data installed with the taxonomyCleanr package
data <- data.table::fread(file = system.file('example_data.txt', package = 'taxonomyCleanr'))
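If your own data stores the taxa as a factor, the column can be converted before the taxa map is created. The following is a minimal base R sketch; `my_data` and its values are hypothetical and are not part of the package's example data.

```r
# Hypothetical data frame whose taxa column was read in as a factor
my_data <- data.frame(
  Species = factor(c("Achillea millefolium", "Poa pratensis", "Poa pratensis")),
  Biomass = c(1.2, 0.4, 0.9)
)

# Convert the factor column to character before cleaning
my_data$Species <- as.character(my_data$Species)

# Confirm the class of the taxa column
class(my_data$Species)
```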

This data table has 6 columns:

knitr::kable(data, caption = "Test data containing a column of taxa to be cleaned.") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = '100%', height = '400px')

Create taxa map

The taxa map (taxa_map.csv) links the raw data to the cleaned data. Each cleaning function logs its changes to taxa_map.csv, which records how the data were changed and provides a means to update the raw data table. A thorough explanation of the map's contents is provided after the cleaning and resolver processes have been run on these example data.

# Create the taxa map
my_path <- tempdir()
taxa_map <- create_taxa_map(path = my_path, x = data, col = 'Species')

Count taxa

Get the unique taxa names and their respective counts with count_taxa. This function helps identify issues that should be fixed before sending the taxa list to the resolver functions. Fixing them first increases the chance of a successful authority match. Notice that some of the taxa in the test data are obviously misspelled (e.g. Achillea millefolium(lanulosa) and Achillea millefolium(lanulosaaaa) likely represent the same taxon), and some of the listed names are clearly not taxa (e.g. -9999 and Miscellaneous litter).

# Get unique taxa and counts
output <- count_taxa(x = data, col = 'Species', path = my_path)
# View unique taxa and counts
knitr::kable(output, caption = "Unique taxa and their respective counts. Several issues exist with these taxa.") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = '100%', height = '400px')
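To make the idea of a unique-taxa count concrete, here is a minimal base R sketch of the same kind of tally. It is not the implementation of count_taxa; the vector `taxa` and the column names `taxon` and `count` are hypothetical.

```r
# Hypothetical taxa vector with a misspelling and a non-taxon code
taxa <- c("Poa pratensis", "Poa pratensis", "Poa pratensiss", "-9999")

# Tabulate the unique names and their frequencies
counts <- as.data.frame(table(taxa), stringsAsFactors = FALSE)
names(counts) <- c("taxon", "count")

# Sort so the most frequent names appear first
counts <- counts[order(-counts$count), ]
counts
```

Names that occur only once or twice next to a frequent, similarly spelled name are good candidates for replacement.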

Trim taxa

Several of the taxa carry variations of suffixes (e.g. c.f. and sp.) that are common in taxonomic data but frequently cause issues when searching taxonomic authorities. The trim_taxa function removes these excess characters, as well as leading and trailing white space and underscore characters.

# Trim excess characters from the taxa list
output <- trim_taxa(path = my_path)

Running count_taxa on the raw data frame (i.e. data), in combination with the information logged to taxa_map.csv from trim_taxa, creates a view of the updated taxa list.

# View the taxa after running trim_taxa
output <- count_taxa(x = data, col = 'Species', path = my_path)
# View unique taxa after trimming
knitr::kable(output, caption = "Unique taxa and counts after trim_taxa. Notice, extraneous characters (e.g. c.f., spp., and underscores) have been removed.") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = '100%', height = '400px')
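The kind of trimming described above can be sketched with base R string functions. This is only an illustration under stated assumptions, not the rules trim_taxa applies: the vector `raw`, the choice to turn underscores into spaces, and the list of qualifiers in the pattern are all hypothetical.

```r
# Hypothetical raw names with trailing qualifiers, underscores, and stray spaces
raw <- c(" Poa_pratensis ", "Carex sp.", "Salix spp.", "Quercus rubra cf.")

# Assumption: underscores separate words, so convert them to spaces
trimmed <- gsub("_", " ", raw)

# Drop leading and trailing white space
trimmed <- trimws(trimmed)

# Drop one trailing qualifier (sp, spp, cf, c.f), with or without a final period
trimmed <- sub("\\s+(sp|spp|cf|c\\.f)\\.?$", "", trimmed)
trimmed
```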

Replace taxa

Some of the taxa are misspelled. Use replace_taxa to replace the misspelled taxa with the correct spelling, or the best guess of the correct spelling. Use count_taxa to verify these changes.

# Replace misspelled taxa with the correct spelling
output <- replace_taxa(path = my_path, input = 'Achillea millefolium(lanulosa)', output = 'Achillea millefolium')
output <- replace_taxa(path = my_path, input = 'Achillea millefolium(lanulosaaaa)', output = 'Achillea millefolium')
output <- replace_taxa(path = my_path, input = 'Achillea millefolium(lanulosabb)', output = 'Achillea millefolium')
output <- replace_taxa(path = my_path, input = 'Achillea millefolium(lanulosacc)', output = 'Achillea millefolium')

# Get the list of unique taxa
output <- count_taxa(x = data, col = 'Species', path = my_path)
# View unique taxa after replacement
knitr::kable(output, caption = "Unique taxa counts after replacing misspelled taxa.") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = '100%', height = '400px')

Remove taxa

Some taxa in the list are clearly not taxa, and should be removed with remove_taxa before attempting to resolve to an authority.

# Remove taxa
output <- remove_taxa(path = my_path, input = '')
output <- remove_taxa(path = my_path, input = '-9999')
output <- remove_taxa(path = my_path, input = 'Unsorted biomass')
output <- remove_taxa(path = my_path, input = 'Miscellaneous litter')

# Get unique taxa and counts
output <- count_taxa(x = data, col = 'Species', path = my_path)
# View unique taxa after removal
knitr::kable(output, caption = "Unique taxa and counts after non-taxa have been removed.") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = '100%', height = '400px')

Resolve scientific taxa

Now the list of taxa looks reasonable. Extraneous characters have been removed, occurrences of similarly spelled taxa have been harmonized, and non-taxa names have been removed. Send the list of taxa to resolve_sci_taxa, along with a preferred list of authorities to search, and successful hits will return the accepted scientific spelling, taxonomic serial number, and taxonomic rank. resolve_sci_taxa gives preference to authorities in the order they are input to the function. View the list of authorities supported by resolve_sci_taxa with view_taxa_authorities.

# Supported authorities are listed in the column titled resolve_sci_taxa
view_taxa_authorities()
# View the supported authorities as a table
output <- view_taxa_authorities()
knitr::kable(output, caption = "Authorities supported by resolve_sci_taxa and resolve_comm_taxa") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = '100%', height = '400px')

The authorities ITIS and WoRMS will be used.

# Resolve taxa using ITIS and WoRMS
output <- resolve_sci_taxa(path = my_path, data.sources = c(3,9))
knitr::kable(output, caption = "Output from resolve_sci_taxa call") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = '100%', height = '400px')

The taxa that could be resolved to ITIS and WoRMS were logged to taxa_map.csv, along with their taxonomic serial numbers and taxonomic ranks.
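The effect of authority ordering can be illustrated with a small base R sketch. It is not how resolve_sci_taxa is implemented; the data frame `hits`, the vector `preferred`, and the identifier values are hypothetical placeholders that stand in for resolver results for a single taxon.

```r
# Hypothetical resolver hits for one taxon, one row per authority that matched
# (identifier values are placeholders, not real serial numbers)
hits <- data.frame(
  authority = c("WoRMS", "ITIS"),
  authority_id = c("w-001", "i-001"),
  stringsAsFactors = FALSE
)

# Preferred order of authorities, most preferred first
preferred <- c("ITIS", "WoRMS")

# Rank each hit by its position in the preference list and keep the top one
best <- hits[order(match(hits$authority, preferred)), ][1, ]
best$authority
```

Swapping the order of `preferred` would select the other row, which is why the order of the authorities passed to the resolver matters when more than one of them contains a taxon.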

Resolve common taxa

Some of the taxa couldn't be resolved by resolve_sci_taxa because they are listed by their common names. Use resolve_comm_taxa to attempt resolution of these common names to an authority. resolve_comm_taxa is similar to resolve_sci_taxa in that it requires a preferred list of authorities to search against. Select from the authorities supported by resolve_comm_taxa.

# View the list of authorities supported by resolve_comm_taxa
view_taxa_authorities()
# View authorities
output <- view_taxa_authorities()
knitr::kable(output, caption = "Authorities supported by resolve_sci_taxa and resolve_comm_taxa") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = '100%', height = '400px')
# Resolve common names using ITIS
output <- resolve_comm_taxa(path = my_path, data.sources = 3)
knitr::kable(output, caption = "Output from resolve_comm_taxa call") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = '100%', height = '400px')

Taxa map overview

Throughout the cleaning process, results have been logged to taxa_map.csv facilitating understanding of the changes to the raw taxa list. The taxa map will be used to create a revision of the raw taxa list, but first an explanation of the columns of this file is warranted. Information about taxa_map.csv can also be found in the documentation for create_taxa_map (i.e. ?create_taxa_map). The taxa map has 10 columns:

Below is the taxa_map.csv for the cleaning procedures implemented on the test data. Some noteworthy features of this map:

output <- read_taxa_map(my_path)
# View taxa_map.csv after running resolve_common
knitr::kable(output, caption = "taxa_map.csv after all the cleaning procedures have been applied.") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = '100%', height = '400px')

Revise taxa

Now that the taxa have been cleaned as well as they can be, the raw data table can be updated with the new taxonomic information. This new information is contained in 4 new columns, which have the same definitions as listed in the taxa map:

These 4 columns are appended to the raw data table and written to a file named "taxonomyCleanr_output".

# Revise the raw data table and write to file
output <- revise_taxa(path = my_path, x = data, col = 'Species', sep = '\t')
knitr::kable(output, caption = "A revision of the raw data table with new taxonomic data appended") %>%
  kableExtra::kable_styling() %>%
  kableExtra::scroll_box(width = '100%', height = '400px')
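Conceptually, revising the raw table is a lookup of each raw name in the taxa map. The base R sketch below shows that idea under stated assumptions; it is not the implementation of revise_taxa. The objects `raw_data` and `map` and the column names `taxa_raw` and `taxa_clean` are hypothetical stand-ins chosen for this illustration.

```r
# Hypothetical raw data with repeated and non-taxon entries
raw_data <- data.frame(
  Species = c("Poa_pratensis", "-9999", "Poa_pratensis"),
  stringsAsFactors = FALSE
)

# Hypothetical two-column stand-in for a taxa map (removed taxa map to NA)
map <- data.frame(
  taxa_raw = c("Poa_pratensis", "-9999"),
  taxa_clean = c("Poa pratensis", NA),
  stringsAsFactors = FALSE
)

# Look up each raw name in the map and append the cleaned name as a new column
raw_data$taxa_clean <- map$taxa_clean[match(raw_data$Species, map$taxa_raw)]
raw_data
```

Because the lookup appends a column rather than overwriting `Species`, the original values stay available for comparison.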

Make taxonomicCoverage EML

When creating EML (Ecological Metadata Language) metadata, it is good practice to include the taxonomic entities and their respective hierarchies to facilitate search and discovery.

# Create the taxonomicCoverage EML node set and write to file
output <- make_taxonomicCoverage(path = my_path, write.file = TRUE)
output <- XML::xmlTreeParse(file.path(my_path, 'taxonomicCoverage.xml'))
output$doc$children


EDIorg/taxonomyCleanr documentation built on April 9, 2023, 2:43 a.m.