Identification of Modified Data

Code here written by Erica Krimmel.

Here, we explore a situation where data from the provider were modified by iDigBio during ingestion as part of its data quality processing. Understanding how data were modified can help collections staff identify updates that need to be made on their specimen records as well as issues that the aggregator may need to remedy in their data quality processing.

General Overview

In this demo we will cover how to:

  1. Write a query to search for specimens using idig_search_records
  2. Compare the data provider's data with the aggregator's data
  3. Identify specimen records that need to be reviewed

Load Packages

# Load core libraries; install these packages if you have not already
library(ridigbio)
library(tidyverse)

# Load library for making nice HTML output
library(kableExtra)
verify_df_names <- FALSE

# Test that the examples will run by making a small trial query
tryCatch({
    # A successful search overwrites `verify_df_names` with a data frame
    verify_df_names <- idig_search_records(rq = list(recordset = "5082e6c8-8f5b-4bf6-a930-e3e6de7bf6fb"),
                    fields = c("uuid",
                               "data.dwc:occurrenceID",
                               "data.dwc:catalogNumber",
                               "family",
                               "data.dwc:family",
                               "genus",
                               "data.dwc:genus",
                               "specificepithet",
                               "data.dwc:specificEpithet",
                               "infraspecificepithet",
                               "data.dwc:infraspecificEpithet",                             
                               "data.dwc:scientificName",
                               "flags"),
                    # Set the limit for how many records are returned by the
                    # search to a low number for the purposes of this demo
                    limit = 10)
}, error = function(e) {
    # Code to run if an error occurs
    cat("An error occurred during the idig_search_records call: ", e$message, "\n")
    cat("Vignettes will not be fully generated. Please try again after resolving the issue.\n")
    # Use `<<-` so the flag is reset outside the handler; a plain `<-` would
    # only create a variable local to this handler function
    verify_df_names <<- FALSE
})

Write a query to search for specimen records

First, let's find all the specimen records from a given recordset, e.g. all of the records published by a single collection. Do this using the idig_search_records function from the ridigbio package. You can learn more about this function from the iDigBio API documentation and ridigbio documentation. In this example, we want to start by searching for specimens from the Invertebrate Paleontology collection at the Natural History Museum of Los Angeles.

# Edit the value after `recordset` to search for data from a different collection
# and the fields (e.g. `uuid`) in `fields` to adjust the columns returned in
# your results
df_names <- idig_search_records(rq = list(recordset = "5082e6c8-8f5b-4bf6-a930-e3e6de7bf6fb"),
                    fields = c("uuid",
                               "data.dwc:occurrenceID",
                               "data.dwc:catalogNumber",
                               "family",
                               "data.dwc:family",
                               "genus",
                               "data.dwc:genus",
                               "specificepithet",
                               "data.dwc:specificEpithet",
                               "infraspecificepithet",
                               "data.dwc:infraspecificEpithet",                             
                               "data.dwc:scientificName",
                               "flags"),
                    # Cap the number of records returned to keep this demo
                    # fast; raise the limit to retrieve more records
                    limit = 1000) %>% 
  # Rename fields to more easily reflect their provenance (either from the
  # data provider directly or modified by the data aggregator)
  rename(occurrenceID = `data.dwc:occurrenceID`,
         catalogNumber = `data.dwc:catalogNumber`,
         provider_family = `data.dwc:family`,
         provider_genus = `data.dwc:genus`,
         provider_species = `data.dwc:specificEpithet`,
         provider_subspecies = `data.dwc:infraspecificEpithet`,
         provider_scientificName = `data.dwc:scientificName`,
         aggregator_family = `family`,
         aggregator_genus = `genus`,
         aggregator_species = `specificepithet`,
         aggregator_subspecies = `infraspecificepithet`) %>% 
  # Reorder columns for easier viewing
  select(uuid, occurrenceID, catalogNumber, aggregator_family, provider_family,
         aggregator_genus, aggregator_species, aggregator_subspecies, 
         provider_genus, provider_species, provider_subspecies,
         provider_scientificName, flags)

Here is what our query result data looks like, with the data from the aggregator's processing highlighted in red text:

# Subset `df_names` to show example
df_names[1:50,] %>% 
  select(-flags) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                font_size = 12,
                fixed_thead = TRUE) %>% 
  column_spec(c(4,6,7,8), color = "red") %>% 
  scroll_box(width = "100%", height = "400px")

Explore differences in the data

We can already see that there are some formatting differences between the data from the provider and that modified by the aggregator. For example, iDigBio converts all data to lowercase, which was historically useful for standardizing and indexing data across all of the recordsets represented in the iDigBio database. Family and genus names are capitalized by convention, so we will reformat those fields here:

# Reformat aggregator fields to title case
df_names <- df_names %>% 
  mutate(aggregator_family = str_to_title(aggregator_family)) %>% 
  mutate(aggregator_genus = str_to_title(aggregator_genus))

# Subset `df_names` to show example
df_names[1:5,] %>% 
  select(-flags) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                font_size = 12,
                fixed_thead = TRUE) %>% 
  column_spec(c(4,6,7,8), color = "red") %>% 
  scroll_box(width = "100%", height = "400px")
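
As a small aside, the same reformatting can be sketched with `across()`, which applies one function to several columns in a single `mutate()` call. The `toy` data below is invented for illustration and is not drawn from the recordset; using `across()` is a stylistic alternative, not the approach this demo depends on.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the aggregator columns of `df_names`
toy <- tibble(
  aggregator_family  = c("naticidae", "tegulidae"),
  aggregator_genus   = c("neverita", "tegula"),
  aggregator_species = c("reclusiana", "funebralis")
)

# Title-case family and genus only; species epithets stay lowercase
retitled <- toy %>%
  mutate(across(c(aggregator_family, aggregator_genus), str_to_title))

retitled
```

Listing the columns explicitly keeps the species and subspecies epithets untouched, since those are lowercase by convention.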

Let's use the power of R to filter out data that have not been modified so that we can focus on rows where the aggregator has made changes. As an example, we will look at rows where the genus name does not match between the provider and the aggregator:

# Filter for rows where genus does not match
df_names %>% 
  filter(provider_genus != aggregator_genus) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                font_size = 12,
                fixed_thead = TRUE) %>% 
  column_spec(c(4,6,7,8), color = "red") %>% 
  scroll_box(width = "100%", height = "400px")
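
One caveat with the `!=` comparison above: in R, comparing a value against `NA` yields `NA`, and `filter()` drops rows where the condition is `NA`. A record where only one of the two genus values is missing would therefore not appear in the filtered table. The sketch below, on invented data (the `toy` tibble is hypothetical and not drawn from the recordset), shows one possible way to keep such rows.

```r
library(dplyr)

# Hypothetical stand-in for `df_names`; values are invented for illustration
toy <- tibble(
  catalogNumber    = c("A1", "A2", "A3", "A4"),
  provider_genus   = c("Glossaulax", "Polinices", "Neverita", NA),
  aggregator_genus = c("Neverita", "Polinices", NA, NA)
)

# Plain `!=` silently drops any row where either side is NA
strict <- toy %>%
  filter(provider_genus != aggregator_genus)

# Treat "exactly one side missing" as a mismatch; skip rows missing on both sides
na_aware <- toy %>%
  filter(
    coalesce(provider_genus != aggregator_genus, FALSE) |
      xor(is.na(provider_genus), is.na(aggregator_genus))
  )

strict$catalogNumber    # "A1"
na_aware$catalogNumber  # "A1" "A3"
```

Whether a missing value should count as a modification depends on your review goals, so treat the second filter as an option rather than a replacement.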

When iDigBio modifies data, it records the action with a data quality flag. For instance, all of the rows in the filtered data above have the flag "dwc_genus_replaced". We could also have used values in the flags field, like "dwc_genus_replaced", to search for records at the beginning of this demo. You can learn more about the flags that iDigBio uses in iDigBio's documentation of its data quality flags.
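
As a rough sketch of working with flags once the data are in R, the example below filters records by flag locally. It assumes that the `flags` column is a list column holding a character vector of flag names for each record; check this with `str(df_names$flags)` before relying on it. The `toy` data and the flag name "some_other_flag" are placeholders made up for illustration.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for `df_names`; `flags` is ASSUMED to be a list column
toy <- tibble(
  catalogNumber = c("A1", "A2", "A3"),
  flags = list(
    c("dwc_genus_replaced", "some_other_flag"),  # placeholder second flag
    character(0),                                # record with no flags
    c("some_other_flag")
  )
)

# Keep records whose flag vector contains the flag of interest
genus_replaced <- toy %>%
  filter(map_lgl(flags, ~ "dwc_genus_replaced" %in% .x))

genus_replaced$catalogNumber  # "A1"
```

Filtering on the flag instead of comparing names directly finds modified records without any case reformatting, provided the flag is applied consistently.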

Summarize differences in the data

If you want to make changes based on the modifications we have discovered here, it may be helpful to summarize the distinct modifications, as opposed to seeing them repeated across many individual specimen records. We can summarize the distinct modifications for genus names using the group_by and tally functions from the dplyr package.

# Summarize modifications made to genus names
df_names %>% 
  filter(provider_genus != aggregator_genus) %>% 
  # Because of the nature of scientific names, it makes sense to group data by
  # all of the primary fields that comprise a scientific name
  group_by(provider_genus, provider_species, provider_subspecies,
           aggregator_genus, aggregator_species, aggregator_subspecies,
           provider_scientificName) %>% 
  # Count how many rows are affected by this modification made to genus name
  tally() %>% 
  # Order by frequency of rows affected
  arrange(desc(n)) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                font_size = 12,
                fixed_thead = TRUE) %>% 
  column_spec(c(4,5,6), color = "red") %>% 
  scroll_box(width = "100%", height = "400px")
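
To see the mechanics of this summary in isolation, here is a minimal sketch on invented data. It also shows `count()`, a dplyr shorthand that is not used elsewhere in this demo, which groups, tallies, and sorts in one step. The genus pairs in `toy` are illustrative only.

```r
library(dplyr)

# Hypothetical provider/aggregator genus pairs; values are invented
toy <- tibble(
  provider_genus   = c("Glossaulax", "Glossaulax", "Polinices", "Glossaulax"),
  aggregator_genus = c("Neverita", "Neverita", "Euspira", "Neverita")
)

# group_by() + tally(), following the pattern of the summary above
by_tally <- toy %>%
  group_by(provider_genus, aggregator_genus) %>%
  tally() %>%
  arrange(desc(n)) %>%
  ungroup()

# count() with sort = TRUE produces the same table in one call
by_count <- toy %>%
  count(provider_genus, aggregator_genus, sort = TRUE)

by_count$n  # 3 1
```

Either form works; `count()` is shorter, while the explicit `group_by()` makes it easier to add other summaries later.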

After reviewing the summarized data, you may wish to review individual specimens and possibly update their data. We can use the information from the summary above to look up the catalog numbers of the specimens that need review.

# Search for specimen records of an example modified genus name
df_names %>% 
  filter(provider_genus == "Glossaulax" & provider_species == "reclusiana") %>%
  select(catalogNumber)
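
If you would rather build a review list for every modified genus name at once, one possible approach is to collapse the catalog numbers for each distinct modification into a single row. This is a sketch on invented data: the `toy` tibble, its catalog numbers, and the `review_list` name are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for `df_names`; values are invented for illustration
toy <- tibble(
  catalogNumber    = c("A1", "A2", "A3", "A4"),
  provider_genus   = c("Glossaulax", "Polinices", "Glossaulax", "Tegula"),
  aggregator_genus = c("Neverita", "Polinices", "Neverita", "Chlorostoma")
)

# One row per distinct modification, with the affected catalog numbers
review_list <- toy %>%
  filter(provider_genus != aggregator_genus) %>%
  group_by(provider_genus, aggregator_genus) %>%
  summarise(n = n(),
            catalogNumbers = paste(catalogNumber, collapse = "; "),
            .groups = "drop")

review_list
```

A table like this could be written out with `write.csv()` and shared with collections staff as a checklist.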



ridigbio documentation built on Oct. 1, 2024, 9:06 a.m.