knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# Increase width for printing tibbles
old <- options(width = 220)
dwctaxon has two major purposes: (1) editing and (2) validation of taxonomic data in Darwin Core (DwC) format. This vignette is about the latter.
Start by loading packages and setting the random number generator seed since this vignette involves some random samples.
library(dwctaxon)
library(dplyr)
set.seed(12345)
As before, we will use the example dataset that comes with dwctaxon, dct_filmies:
dct_filmies
However, dct_filmies is already well-formatted and would pass all validation checks! So let's introduce some noise to make things more interesting.
filmies_dirty <-
  dct_filmies |>
  # Change taxonomic status of one row to 'good'
  dct_modify_row(taxonID = "54115096", taxonomicStatus = "good") |>
  # Duplicate some rows at the end
  bind_rows(tail(dct_filmies)) |>
  # Insert bad values for `acceptedNameUsageID` of 5 random rows
  rows_update(
    tibble(
      taxonID = sample(dct_filmies$taxonID, 5),
      acceptedNameUsageID = sample(letters, 5)
    ),
    by = "taxonID"
  )

filmies_dirty
The first few rows may look the same, but we know that these data now have some problems.
dct_validate() is the workhorse function for validating DwC data.
In default mode, dct_validate() will issue an error the first time it finds something wrong with the data (in other words, on the first check that fails):
dct_validate(filmies_dirty)
dup_taxid <-
  dct_validate(filmies_dirty, on_fail = "summary") |>
  filter(stringr::str_detect(error, "taxonID .* duplicated value")) |>
  pull(taxonID) |>
  knitr::combine_words()
dwctaxon tries to provide useful error messages that help you determine what in the data is causing the problem. Here, we see that rows with taxonID `r dup_taxid` are duplicated. Of course, we know that's because we duplicated them on purpose; in a real dataset, you could use this information to search out the duplicated values and fix them.
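As a rough illustration of hunting down duplicated identifiers, the sketch below tallies taxonID values with dplyr. The data frame `toy` and its values are made up for this example and are not part of dwctaxon; dropping exact copies with distinct() is only one possible fix and assumes the duplicated rows are identical in every column.

```r
library(dplyr)

# Hypothetical toy data; these taxonID values are invented for illustration
toy <- tibble(
  taxonID = c("t1", "t2", "t3", "t2", "t3"),
  scientificName = c("Sp. a", "Sp. b", "Sp. c", "Sp. b", "Sp. c")
)

# Tally each taxonID and keep those that appear more than once
dup_ids <- toy |>
  count(taxonID, name = "n_rows") |>
  filter(n_rows > 1)

dup_ids

# One possible fix when duplicated rows are exact copies: keep one of each
toy_fixed <- distinct(toy)
```

If the duplicated rows differ in other columns, they would need to be inspected by hand rather than dropped automatically.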
If you are troubleshooting a DwC taxon dataset, it may be more useful to know about all of the problems at once instead of fixing them one at a time. In that case, set the on_fail argument to "summary" (on_fail can be either its default value "error" or "summary"):
dct_validate(filmies_dirty, on_fail = "summary")
(You may need to scroll to the right in the output below to see all the text).
In this case, dct_validate() still issues a warning to let us know validation did not pass. The error and check columns describe what went wrong; the other columns tell us where in the data to find the errors.
With this detailed summary, we should definitely be able to hunt down the bugs in this dataset!
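Because the summary is an ordinary data frame, it can be explored with dplyr. The following is a minimal sketch, not the only way to work with the output: it rebuilds a small dirty dataset (the names `dirty` and `problems` are chosen here for illustration) and counts how many entries each check flagged.

```r
library(dwctaxon)
library(dplyr)

# Recreate one of the problems from above: duplicated rows at the end
dirty <- bind_rows(dct_filmies, tail(dct_filmies))

# on_fail = "summary" warns that validation failed; the warning is
# suppressed here because the failure is expected
problems <- suppressWarnings(
  dct_validate(dirty, on_fail = "summary")
)

# Number of summary rows per check, most frequent first
problems |>
  count(check, sort = TRUE)
```

Tallying by check first can help decide which class of problem to fix before rerunning the validation.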
You may be wondering why the summary output has separate "error" and "check" columns.
That is because dct_validate() conducts many smaller checks, each of which can be turned on or off. For a complete description, run ?dct_validate(). In turn, each check can identify several distinct problems; the most granular description is given in the "error" column.
Furthermore, each of the checks run by dct_validate() can also be run as an individual function. For example, let's just check that all values of acceptedNameUsageID have a corresponding taxonID (in other words, that all synonyms map properly):
filmies_dirty |> dct_check_mapping()
It is important to note that not all checks are compatible with each other. For example, check_sci_name checks that all scientific names (DwC term scientificName) are non-missing and unique; check_status_diff checks that in cases of identical scientific names, the taxonomic status of each name is different. The default settings for dct_validate() are to use the former but not the latter. Whether you expect all scientific names to be unique or not depends on how you set up your data^[According to the rules of taxonomic nomenclature, of course each full scientific name should be unique, but there have been errors in the past where the same author published the same name more than once!].
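If your data are expected to contain repeated scientific names, the two checks can be swapped. The sketch below assumes that check_sci_name and check_status_diff are accepted as logical arguments of dct_validate() (see ?dct_validate() to confirm); it is run on the clean dct_filmies data only to show the call, and `result` is a name chosen for this example.

```r
library(dwctaxon)

# Alternative to the defaults: do not require unique scientific names,
# but require that identical names differ in taxonomic status
result <- dct_validate(
  dct_filmies,
  check_sci_name = FALSE,
  check_status_diff = TRUE
)
```

Which combination is appropriate depends on whether your dataset deliberately lists the same name under more than one status.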
Some DwC taxon terms are expected to take only a small number of values from a controlled vocabulary. For example, taxonomicStatus (taxonomic status of a scientific name) may only be expected to include the values "accepted", "synonym", etc. This is unlike, e.g., scientificName, where we would not try to control the range of possible values.
However, although DwC recommends using a controlled vocabulary for such terms, it does not specify the actual values! So dwctaxon lets you set those yourself (and tries to employ reasonable defaults), as shown in the next section.
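To make the idea of a controlled vocabulary concrete, here is a small base R sketch of what such a check amounts to. It is not dwctaxon's implementation; the vectors `allowed` and `status` are hypothetical, and the allowed values are written as one comma-separated string only to mirror the style used for valid_tax_status in this vignette.

```r
# Allowed values, written as a single comma-separated string
allowed <- "accepted, synonym"
allowed_vec <- trimws(strsplit(allowed, ",")[[1]])

# Hypothetical status values from a dataset
status <- c("accepted", "synonym", "good", "accepted")

# Values that fall outside the vocabulary
bad_values <- unique(status[!status %in% allowed_vec])
bad_values
```

Here "good" is the only value outside the vocabulary, so it is the one that would need to be corrected or added to the allowed set.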
Say you want to use a different set of allowed values for taxonomicStatus. Here, let's include "good" so that the data will pass the check for taxonomic status (remember we modified the data so the taxonomicStatus of one of the rows was "good").
One way would be to use the valid_tax_status argument of dct_validate() or dct_check_tax_status():
filmies_dirty |>
  dct_check_tax_status(
    valid_tax_status = "good, accepted, synonym",
    on_success = "logical" # Issue "TRUE" if the check passes
  )
But specifying this argument every time you want to check something gets tedious.
So we can change the default setting for valid_tax_status with dct_options() like so:
# First save the current settings before making any changes
old_settings <- dct_options()

# Change valid_tax_status setting
dct_options(valid_tax_status = "good, accepted, synonym")
Now we can run dct_check_tax_status() and it will use the new default value:
filmies_dirty |> dct_check_tax_status(on_success = "logical")
You can change back to the original default values with reset = TRUE:
dct_options(reset = TRUE)
Now running the same code as above throws an error:
filmies_dirty |> dct_check_tax_status(on_success = "logical")
Many settings can be modified. See ?dct_options() for a description of each.
You can view the current values of all options by running dct_options() with no arguments:
dct_options()
Or check the value of one particular setting by extracting it by name with the $ operator:
dct_options()$valid_tax_status
We can restore the settings as they were before any of these changes were applied by running do.call() on the settings we saved above:
do.call(dct_options, old_settings)
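The same save-and-restore pattern can be wrapped in a function so that a temporary setting never leaks into the rest of a session. This is a sketch only: check_status_with() is a hypothetical helper written for this example, not a dwctaxon function, and it relies on base R's on.exit() to restore the saved settings even if the check throws an error.

```r
library(dwctaxon)

# Hypothetical helper: run the taxonomic status check with a temporary
# vocabulary, then restore whatever settings were in place before
check_status_with <- function(tax_dat, valid) {
  old_settings <- dct_options()
  on.exit(do.call(dct_options, old_settings), add = TRUE)
  dct_options(valid_tax_status = valid)
  dct_check_tax_status(tax_dat, on_success = "logical")
}

passed <- check_status_with(dct_filmies, "good, accepted, synonym")
passed
```

After the call returns, dct_options()$valid_tax_status should hold the value it had beforehand.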
# Reset options
options(old)