knitr::opts_chunk$set(echo = TRUE) library(kableExtra)
The taxonomyCleanr
is easy to use and requires no taxonomic expertiese. Simply send your data through a series of cleaning functions (count_taxa
, trim_taxa
, replace_taxa
, remove_taxa
), send the resultant output to the resolver functions (resolve_sci_taxa
, resolve_comm_taxa
), and create a revision of your raw data (revise_taxa
). Voila! Clean taxonomic data!
Below is a demonstration of this process using example data that comes installed with the taxonomyCleanr
package.
Install taxonomyCleanr
from the project GitHub.
# Install from GitHub # remotes::install_github('EDIorg/taxonomyCleanr') library(taxonomyCleanr)
Load the taxonomic data into RStudio as a data frame (or tibble). The taxa must be listed in a single column of character type class, not factor type class.
# Load test data installed with the taxonomyCleanr package data <- data.table::fread(file = system.file('example_data.txt', package = 'taxonomyCleanr'))
This data table has 6 columns:
knitr::kable(data, caption = "Test data containing a column of taxa to be cleaned.") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = '100%', height = '400px')
The taxa map (taxa_map.csv) links the raw data to the cleaned data. Each cleaning function logs changes to taxa_map.csv thereby facilitating an understanding of how the data were changed and a means by which to update the raw data table. A thorough explanation of the maps contents will be provided after the cleaning and resolver processes have been run on these example data.
# Create the taxa map my_path <- tempdir() taxa_map <- create_taxa_map(path = my_path, x = data, col = 'Species')
Get the unique taxa names and respective counts with count_taxa
. This function helps identify issues that should be fixed before sending the taxa list to the resolver functions. Doing so increases the success of an authority match. Notice, some of the taxa in the test data are obviously misspelled (e.g. Achillea millefolium(lanulosa) and Achillea millefolium(lanulosaaaa) likely represent the same taxon), and some of the listed names are clearly not taxa (e.g. -9999 and Miscellaneous litter).
# Get unique taxa and counts output <- count_taxa(x = data, col = 'Species', path = my_path)
# Count taxa_map.csv knitr::kable(output, caption = "Unique taxa and their respective counts. Several issues exist with these taxa.") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = '100%', height = '400px')
Several of the taxa have variations of common suffixes found in taxonomic data (e.g. c.f. and sp.), but frequently cause issues when searching taxonomic authorities. The trim_taxa
function removes these excess characters as well as leading and trailing white spaces and under score characters.
# Trim excess characters from the taxa list output <- trim_taxa(path = my_path)
Running count_taxa
on the raw data frame (i.e. data), in combination with the information logged to taxa_map.csv from trim_taxa
, creates a view of the updated taxa list.
# View the taxa after running trim_taxa output <- count_taxa(x = data, col = 'Species', path = my_path)
# View unique taxa after trimming knitr::kable(output, caption = "Unique taxa and counts after trim_taxa. Notice, extraneous characters (e.g. c.f., spp., and underscores) have been removed.") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = '100%', height = '400px')
Some of the taxa are misspelled. Use replace_taxa
to replace the misspelled taxa with the correct spelling, or the best guess of the correct spelling. Use count_taxa
to verify these changes.
# Replace misspelled taxa with the correct spelling output <- replace_taxa(path = my_path, input = 'Achillea millefolium(lanulosa)', output = 'Achillea millefolium') output <- replace_taxa(path = my_path, input = 'Achillea millefolium(lanulosaaaa)', output = 'Achillea millefolium') output <- replace_taxa(path = my_path, input = 'Achillea millefolium(lanulosabb)', output = 'Achillea millefolium') output <- replace_taxa(path = my_path, input = 'Achillea millefolium(lanulosacc)', output = 'Achillea millefolium') # Get the list of unique taxa output <- count_taxa(x = data, col = 'Species', path = my_path)
# View unique taxa after trimming knitr::kable(output, caption = "Unique taxa counts after replacing misspelled taxa.") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = '100%', height = '400px')
Some taxa in the list are clearly not taxa, and should be removed with remove_taxa
before attempting to resolve to an authority.
# Remove taxa output <- remove_taxa(path = my_path, input = '') output <- remove_taxa(path = my_path, input = '-9999') output <- remove_taxa(path = my_path, input = 'Unsorted biomass') output <- remove_taxa(path = my_path, input = 'Miscellaneous litter') # Get unique taxa and counts output <- count_taxa(x = data, col = 'Species', path = my_path)
# View unique taxa after removal knitr::kable(output, caption = "Unique taxa and counts after non-taxa have been removed.") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = '100%', height = '400px')
Now the list of taxa looks reasonable. Extraneous characters have been removed, occurences of similarly spelled taxa have been harmonized, and non-taxa names have been removed. Send the list of taxa to resolve_sci_taxa
, along with a preferred list of authorities to search, and successful hits will return the accepted scientific spelling, taxonomic serial number, and taxonomic rank. resolve_sci_taxa
will give preference to the ordering of the taxonomic authorites input to the function. View the list of authorities supported by resolve_sci_taxa
with view_taxa_authorities
# Supported authorities are listed in the column titled resolve_sci_taxa view_taxa_authorities()
# View taxa_map.csv after running resolve_taxa output <- view_taxa_authorities() knitr::kable(output, caption = "Authorities supported by resolve_sci_taxa and resolve_comm_taxa") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = '100%', height = '400px')
The authorities ITIS and WORMS will be used.
# Resolve taxa using ITIS and WORMS output <- resolve_sci_taxa(path = my_path, data.sources = c(3,9))
knitr::kable(output, caption = "Output from resolve_sci_taxa call") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = '100%', height = '400px')
The taxa that could be resolved to ITIS and WORMS were logged to taxa_map.csv, along with their taxonomic serial numbers and taxonomic ranks.
Some of the taxa that couldn't be resolved by resolve_sci_taxa
is because their common names were listed. Use resolve_comm_taxa
to attempt resolution of these common names to an authority. resolve_comm_taxa
is similar to resolve_sci_taxa
in that it requires a preferred list of authorities to search against. Select authorities supported by resolve_comm_taxa
.
# View the list of authorities supported by resolve_comm_taxa view_taxa_authorities()
# View authorities output <- view_taxa_authorities() knitr::kable(output, caption = "Authorities supported by the resolve_sci_taxa and resolve_comm_taxa") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = '100%', height = '400px')
# Resolve common using ITIS output <- resolve_comm_taxa(path = my_path, data.sources = 3)
knitr::kable(output, caption = "Output from resolve_comm_taxa call") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = '100%', height = '400px')
Throughout the cleaning process, results have been logged to taxa_map.csv facilitating understanding of the changes to the raw taxa list. The taxa map will be used to create a revision of the raw taxa list, but first an explanation of the columns of this file is warranted. Information about taxa_map.csv can also be found in the documentation for create_taxa_map
(i.e. ?create_taxa_map
). The taxa map has 10 columns:
create_taxa_map
.trim_taxa
, with the resultant name listed.replace_taxa
, with the resultant replacement listed.remove_taxa
. TRUE if taxa was removed, NA otherwise.resolve_sci_taxa
or resolve_comm_taxa
. The listed taxa spelling is accepted by the matched authority.resolve_sci_taxa
or a value of Common if resolved by resolve_comm_taxa
.resolve_sci_taxa
or resolve_comm_taxa
.Below is the taxa_map.csv for the cleaning procedures implemented on the test data. Some noteworthy features of this map:
output <- read_taxa_map(my_path) # View taxa_map.csv after running resolve_common knitr::kable(output, caption = "taxa_map.csv after all the cleaning procedures have been applied.") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = '100%', height = '400px')
Now that the taxa have been cleaned, as best they can, the raw data table can be updated with the new taxonomic information. This new information is contained in 4 new columns, which have the same definitions as listed in the taxa map:
These 4 columns are appended to the raw data table and written to a file named "taxonomyCleanr_output".
# Revise the raw data table and write to file output <- revise_taxa(path = my_path, x = data, col = 'Species', sep = '\t')
knitr::kable(output, caption = "A revision of the raw data table with new taxonomic data appended") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = '100%', height = '400px')
When creating EML metadata (Ecological Metadata Language), it is a good practice to include the taxonomic entities and their respective hierarchies to facilitate search and discovery.
# Create the taxonomicCoverage EML node set and write to file output <- make_taxonomicCoverage(path = my_path, write.file = TRUE) output <- XML::xmlTreeParse(paste0(my_path, '/taxonomicCoverage.xml')) output$doc$children
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.