knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) # Increase width for printing tibbles old <- options(width = 140) knitr::read_chunk(system.file("extdata", "vascan_url.R", package = "dwctaxon"))
This vignette demonstrates using dwctaxon on "real life" data found in the wild. Our goal is to import the data and validate it.
First, load the packages used for this vignette.
library(dwctaxon) library(readr) library(tibble) library(dplyr)
We will use the Database of Vascular Plants of Canada (VASCAN), which is available as a Darwin Core Archive.
The data can be obtained manually by going to the VASCAN website, downloading the Darwin Core Archive, and unzipping it^[If you download the data manually, it may be a different version than the one used here, v37.12].
Alternatively, it can be downloaded and unzipped with R. First, we set up some temporary folders for downloading and specify the URL:
# - Specify temporary folder for downloading data temp_dir <- tempdir() # - Set name of zip file temp_zip <- paste0(temp_dir, "/dwca-vascan.zip") # - Set name of unzipped folder temp_unzip <- paste0(temp_dir, "/dwca-vascan")
# Check if file can be downloaded safely, quit early if not # Make sure this URL matches the one in the next chunk if (!dwctaxon:::safe_to_download(vascan_url)) { cat( paste0( "Vignette rendering stopped. The zip file (", vascan_url, ") could not be downloaded. Check your internet connection and the URL." ) ) knitr::knit_exit() }
Next, download and unzip the zip file.
# Download data download.file(url = vascan_url, destfile = temp_zip, mode = "wb") # Unzip unzip(temp_zip, exdir = temp_unzip) # Check the contents of the unzipped data (the Darwin Core Archive) list.files(temp_unzip)
Finally, load the taxonomic data (taxon.txt
) into R. It is a tab-separated text file, so we use readr::read_tsv()
to load it.
vascan <- read_tsv(paste0(temp_unzip, "/taxon.txt")) # Take a peak at the data vascan
The dataset includes r nrow(vascan)
rows (taxa) and r ncol(vascan)
columns.
Let's see if the dataset passes validation with dwctaxon.
It is usually a good idea to just run dct_validate()
with default settings the first time.
If it passes, you can move on.
dct_validate(vascan)
Looks like we've got problems...
To dig into these in more detail, let's run dct_validate()
again, but this time output a summary of errors.
validation_res <- dct_validate(vascan, on_fail = "summary") validation_res
The summary lists one taxonID
per row. Let's count these to get a higher-level view of what's going on.
validation_res %>% count(check, error)
validation_res_sum <- validation_res %>% count(check, error) n_error_types <- nrow(validation_res_sum) %>% english::english() n_bad_cols <- validation_res_sum %>% filter(error == "Invalid column names detected: id") %>% pull(n) %>% english::english() n_bad_sci_name <- validation_res_sum %>% filter(error == "scientificName detected with duplicated value") %>% pull(n)
We see there are r n_error_types
kinds of errors.
There is r n_bad_cols
column with an invalid name (id
) and r n_bad_sci_name
rows with duplicated scientific names.
Let's take a closer look at some of those duplicated names.
dup_names <- validation_res %>% filter(grepl("scientificName detected with duplicated value", error)) %>% arrange(scientificName) dup_names
We can join back to the original data to investigate these names.
inner_join( select(dup_names, taxonID), vascan, by = "taxonID" ) %>% # Just look at the first 6 columns select(1:6)
We see that in some cases, multiple entries for the exact same scientific name (for example, Arnica monocephala Rydberg
) differ only in the value for nameAccordingToID
.
So this seems like something the database manager should fix.
Let's see what is in the id
column.
vascan %>% select(id) n_distinct(vascan$id)
id
contains numbers that are all unique.
In other words, these appear to be unique key values to each row in the dataset (as one would expect from the name id
).
It is probably the case that this dataset has a good reason for using the id
column, even though it is not a standard DwC column.
Let's see if we can get this dataset to pass validation.
First, let's remove the duplicated names. This is something that should be done with more thought, but here let's just keep the first name of each pair.
vascan_fixed <- vascan %>% filter(!duplicated(scientificName))
Next, we will run validation again, but this time allow id
as an extra column.
dct_validate( vascan_fixed, extra_cols = "id" )
It passes, so we have now confirmed that the only steps needed to obtain correctly formatted DwC data are to de-duplicate the species names and account for the id
column.
This vignette shows how dwctaxon can be used on DwC data to find possible problems in a taxonomic dataset. We were able to identify several rows with duplicated scientific names and one column that does not follow DwC standards. Other than that, it passes validation, giving us confidence that this dataset can be used for downstream taxonomic analyses.
# Reset options options(old)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.