Validating DwC taxon data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# Increase width for printing tibbles
old <- options(width = 220)

dwctaxon has two major purposes: (1) editing and (2) validation of taxonomic data in Darwin Core (DwC) format. This vignette is about the latter.

Setup

Start by loading packages and setting the random number generator seed since this vignette involves some random samples.

library(dwctaxon)
library(dplyr)

set.seed(12345)

The data

As before, we will use the example dataset that comes with dwctaxon, dct_filmies:

dct_filmies

However, dct_filmies is already well-formatted and would pass all validation checks! So let's introduce some noise to make things more interesting.

filmies_dirty <-
  dct_filmies |>
  # Change taxonomic status of one row to 'good'
  dct_modify_row(taxonID = "54115096", taxonomicStatus = "good") |>
  # Duplicate some rows at the end
  bind_rows(tail(dct_filmies)) |>
  # Insert bad values for `acceptedNameUsageID` of 5 random rows
  rows_update(
    tibble(
      taxonID = sample(dct_filmies$taxonID, 5),
      acceptedNameUsageID = sample(letters, 5)
    ),
    by = "taxonID"
  )

filmies_dirty

The first few rows may look the same, but we know that these data now have some problems.

Error on failure

dct_validate() is the workhorse function for validating DwC data.

In default mode, dct_validate() will issue an error the first time it finds something wrong with the data (in other words, on the first check that fails):

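dct_validate(filmies_dirty)

# Collect the duplicated taxonID values reported in the summary
# (referred to as dup_taxid in the text that follows)
dup_taxid <- dct_validate(filmies_dirty, on_fail = "summary") |>
  filter(stringr::str_detect(error, "taxonID .* duplicated value")) |>
  pull(taxonID) |>
  knitr::combine_words()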

dwctaxon tries to provide useful error messages that help you determine what in the data is causing the problem. Here, we see that the rows whose taxonID values are collected in dup_taxid are duplicated. Of course, we know that is because we duplicated them on purpose; in a real dataset, you could use this information to search out the duplicated values and fix them.
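
As a minimal sketch of how one might track down such duplicates by hand, the base R snippet below flags every row whose taxonID occurs more than once. The data frame toy_taxa and its values are made up for illustration (they are not taken from dct_filmies); the same idea should carry over to a real table.

```r
# Toy data (hypothetical values, not taken from dct_filmies)
toy_taxa <- data.frame(
  taxonID = c("t1", "t2", "t3", "t2"),
  scientificName = c("Genus alpha", "Genus beta", "Genus gamma", "Genus beta"),
  stringsAsFactors = FALSE
)

# duplicated() only marks the second and later occurrences, so scan from
# both ends to flag every copy of a repeated taxonID
is_dup <- duplicated(toy_taxa$taxonID) |
  duplicated(toy_taxa$taxonID, fromLast = TRUE)

# All rows that share a taxonID with another row
dup_rows <- toy_taxa[is_dup, ]
dup_rows

# One possible fix: keep only the first occurrence of each taxonID
toy_clean <- toy_taxa[!duplicated(toy_taxa$taxonID), ]
toy_clean
```

Whether dropping the later copies is the right fix depends on the dataset; if the duplicated rows differ in other columns, they need to be reconciled by hand.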

Summary on failure

If you are troubleshooting a DwC taxon dataset, it may be more useful to know about all of the problems at once instead of fixing them one at a time. In that case, set the on_fail argument to "summary" (on_fail can be either its default value "error" or "summary"):

dct_validate(filmies_dirty, on_fail = "summary")

(You may need to scroll to the right in the output to see all of the text.)

In this case, dct_validate() still issues a warning to let us know validation did not pass. The error and check columns describe what went wrong; the other columns tell us where in the data to find the errors.

With this detailed summary, we should definitely be able to hunt down the bugs in this dataset!
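
To make the difference between the two modes concrete, here is a toy validator written in base R. It is only a sketch of the general pattern (stop at the first failing check versus collect all failures into a table); validate_toy(), its two checks, and its messages are hypothetical and are not how dct_validate() is implemented.

```r
# Hypothetical toy validator; NOT how dct_validate() is implemented
validate_toy <- function(x, on_fail = c("error", "summary")) {
  on_fail <- match.arg(on_fail)

  # Each check returns a character vector of problems (empty if none)
  checks <- list(
    check_taxon_id = function(x) {
      dups <- unique(x$taxonID[duplicated(x$taxonID)])
      if (length(dups) > 0) {
        paste("taxonID duplicated:", dups)
      } else {
        character(0)
      }
    },
    check_sci_name = function(x) {
      n_missing <- sum(is.na(x$scientificName))
      if (n_missing > 0) {
        paste(n_missing, "missing scientificName value(s)")
      } else {
        character(0)
      }
    }
  )

  summary_list <- list()
  for (check_name in names(checks)) {
    problems <- checks[[check_name]](x)
    if (length(problems) == 0) next
    # "error" mode: stop at the first failing check
    if (on_fail == "error") {
      stop(check_name, ": ", paste(problems, collapse = "; "), call. = FALSE)
    }
    # "summary" mode: record the problems and keep going
    summary_list[[check_name]] <- data.frame(
      check = check_name,
      error = problems,
      stringsAsFactors = FALSE
    )
  }

  # Nothing failed: return the input invisibly
  if (length(summary_list) == 0) return(invisible(x))
  warning("validation did not pass", call. = FALSE)
  do.call(rbind, unname(summary_list))
}

# Toy data (made up) with a duplicated taxonID and a missing name
toy_taxa <- data.frame(
  taxonID = c("t1", "t2", "t2"),
  scientificName = c("Genus alpha", NA, "Genus beta"),
  stringsAsFactors = FALSE
)

# Summary mode reports both problems (and warns)
toy_summary <- suppressWarnings(validate_toy(toy_taxa, on_fail = "summary"))
toy_summary

# Error mode stops at the first failing check; capture the message
toy_error <- tryCatch(
  validate_toy(toy_taxa),
  error = function(e) conditionMessage(e)
)
toy_error
```

The summary form is convenient for troubleshooting because the result is an ordinary data frame that can be filtered or counted by check.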

Checks

You may be wondering why there are separate "error" and "check" columns in the summary output.

That is because dct_validate() conducts many smaller checks, each of which can be turned on or off. For a complete description, run ?dct_validate(). In turn, the checks can each identify different particular problems; the most granular description is given in the "error" column.

Furthermore, each of the checks run by dct_validate() can also be run as an individual function. For example, let's just check that all values of acceptedNameUsageID have a corresponding taxonID (in other words, that all synonyms map properly):

filmies_dirty |>
  dct_check_mapping()
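
Conceptually, the core of this check can be sketched in a few lines of base R: every non-missing acceptedNameUsageID should appear among the taxonID values. The snippet below uses made-up toy data and is an illustration of the idea only; dct_check_mapping() itself may verify more than this.

```r
# Toy data (hypothetical values)
toy_taxa <- data.frame(
  taxonID = c("t1", "t2", "t3"),
  acceptedNameUsageID = c(NA, "t1", "x9"),
  stringsAsFactors = FALSE
)

# A synonym maps properly if its acceptedNameUsageID is found among the
# taxonID values; rows with a missing acceptedNameUsageID are skipped
has_target <- is.na(toy_taxa$acceptedNameUsageID) |
  toy_taxa$acceptedNameUsageID %in% toy_taxa$taxonID

# Rows whose acceptedNameUsageID points at a taxonID that does not exist
bad_mapping <- toy_taxa[!has_target, ]
bad_mapping
```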

It is important to note that not all checks are compatible with each other. For example, check_sci_name checks that all scientific names (DwC term scientificName) are non-missing and unique; check_status_diff checks that in cases of identical scientific names, the taxonomic status of each name is different. The default settings for dct_validate() are to use the former but not the latter. Whether you expect all scientific names to be unique or not depends on how you set up your data^[According to the rules of taxonomic nomenclature, of course each full scientific name should be unique, but there have been errors in the past where the same author published the same name more than once!].
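
The following base R sketch illustrates why these two rules cannot both be meaningfully enforced on the same data. It restates the rules as described above on a made-up toy table; it is not the code dwctaxon uses.

```r
# Toy data (hypothetical): one name appears twice with different statuses
toy_taxa <- data.frame(
  scientificName = c("Genus alpha", "Genus alpha", "Genus beta"),
  taxonomicStatus = c("accepted", "synonym", "accepted"),
  stringsAsFactors = FALSE
)

# Rule 1 (as described for check_sci_name): names are non-missing and unique
names_unique <- !anyNA(toy_taxa$scientificName) &&
  !any(duplicated(toy_taxa$scientificName))

# Rule 2 (as described for check_status_diff): rows sharing a name must
# each have a different taxonomic status
status_differs <- all(
  tapply(
    toy_taxa$taxonomicStatus,
    toy_taxa$scientificName,
    function(status) !any(duplicated(status))
  )
)

names_unique    # FALSE: "Genus alpha" is repeated
status_differs  # TRUE: the two "Genus alpha" rows differ in status
```

If the first rule holds, the second has nothing to test, since no name is repeated; the second rule only matters for data that deliberately allow repeated names.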

Controlled vocabularies

Some DwC taxon terms are expected to take only a small number of values from a controlled vocabulary. For example, taxonomicStatus (taxonomic status of a scientific name) may only be expected to include the values "accepted", "synonym", etc. This is unlike, e.g., scientificName, where we would not try to control the range of possible values.

However, although DwC recommends using a controlled vocabulary for such terms, it does not specify the actual values! So dwctaxon lets you set those yourself (and tries to employ reasonable defaults), as shown in the next section.

Changing the defaults

Say you want to use a different set of allowed values for taxonomicStatus. Here, let's include "good" so that the data will pass the check for taxonomic status (remember we modified the data so the taxonomicStatus of one of the rows was "good").

One way would be to use the valid_tax_status argument of dct_validate() or dct_check_tax_status():

filmies_dirty |>
  dct_check_tax_status(
    valid_tax_status = "good, accepted, synonym",
    on_success = "logical" # Issue "TRUE" if the check passes
  )
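
Note that the allowed values are given as a single comma-separated string. As a rough illustration of how such a string relates to a set of allowed values, the base R sketch below splits it and compares it against some made-up status values. The helper parse_vocab() is hypothetical, and dwctaxon's own parsing may differ.

```r
# Hypothetical helper; dwctaxon's own parsing may differ
parse_vocab <- function(x) {
  # Split on commas, then strip surrounding whitespace from each value
  trimws(strsplit(x, ",", fixed = TRUE)[[1]])
}

allowed <- parse_vocab("good, accepted, synonym")
allowed

# Toy status values (made up); "valid" is not in the vocabulary
toy_status <- c("accepted", "good", "valid", "synonym")
not_allowed <- setdiff(toy_status, allowed)
not_allowed
```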

But specifying this argument every time you want to check something gets tedious.

So we can change the default setting for valid_tax_status with dct_options() like so:

# First save the current settings before making any changes
old_settings <- dct_options()

# Change valid_tax_status setting
dct_options(valid_tax_status = "good, accepted, synonym")

Now we can run dct_check_tax_status() and it will use the new default value:

filmies_dirty |>
  dct_check_tax_status(on_success = "logical")

You can change back to the original default values with reset = TRUE:

dct_options(reset = TRUE)

Now running the same code as above throws an error:

filmies_dirty |>
  dct_check_tax_status(on_success = "logical")

Many other settings can be modified. See ?dct_options() for a description of each.

You can view the current status of all options (default values) by running dct_options() with no arguments:

dct_options()

Or check the value of one particular setting by extracting it by name with the $ operator:

dct_options()$valid_tax_status

We can restore the settings as they were before any of these changes were applied by running do.call() on the settings we saved above:

do.call(dct_options, old_settings)
# Reset options
options(old)
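
The do.call() pattern works because the saved settings are a named list, and do.call() passes each element of that list as a named argument. The sketch below demonstrates the same save, change, restore cycle with a hypothetical stand-in function, toy_settings(), so that it runs without dwctaxon; the setting names in it are invented for illustration.

```r
# Hypothetical stand-in for a settings function such as dct_options():
# called with no arguments it returns the current settings as a named list;
# called with named arguments it updates them
toy_settings <- local({
  current <- list(quiet = FALSE, valid_status = "accepted, synonym")
  function(...) {
    new <- list(...)
    if (length(new) == 0) return(current)
    current[names(new)] <<- new
    invisible(current)
  }
})

# Save the current settings, then change one
saved <- toy_settings()
toy_settings(valid_status = "good")
changed <- toy_settings()$valid_status

# do.call() spreads the saved list into named arguments, so this call is
# equivalent to toy_settings(quiet = FALSE, valid_status = "accepted, synonym")
do.call(toy_settings, saved)
restored <- toy_settings()
restored
```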



dwctaxon documentation built on May 29, 2024, 5:53 a.m.