Identification of Suspicious Coordinates

Code here written by Erica Krimmel. Code here is modified from original given in a presentation at the 2019 ADBC Summit in Gainesville, FL.

General Overview

Here we explore a situation where geographic coordinate data from the provider was modified by iDigBio during its data quality assurance process. See here for more information about iDigBio's data quality flags.

In this demo we will cover how to:

  1. Write a query to search for specimens using idig_search_records
  2. Identify and map suspicious coordinates

Load Packages.

# Load core libraries; install these packages if you have not already
library(ridigbio)
library(tidyverse)

# Load library for making nice HTML output
library(kableExtra)

# Load libraries for visualizing geographic data
library(leaflet)

library(cowplot)
verify_df_flagCoord <- FALSE

#Test that examples will run
tryCatch({
    # Your code that might throw an error
    verify_df_flagCoord <- idig_search_records(
        rq = list(flags = "rev_geocode_corrected", institutioncode = "lacm"),
        limit = 10
    )
}, error = function(e) {
    # Code to run if an error occurs
    cat("An error occurred during the idig_search_records call: ", e$message, "\n")
    cat("Vignettes will not be fully generated. Please try again after resolving the issue.")
    # Optionally, you can return NULL or an empty dataframe
    verify_df_flagCoord <- FALSE
})

Write a query to search for specimen records

First, let's find all the specimen records for the data quality flag we are interested in. Do this using the idig_search_records function from the ridigbio package. You can learn more about this function from the iDigBio API documentation and ridigbio documentation. In this example, we want to start by searching for specimens flagged with "rev_geocode_corrected."

# Edit the fields (e.g. `flags`) and values (e.g. "rev_geocode_corrected") in
# `list()` to adjust your query and the fields (e.g. `uuid`) in `fields` to
# adjust the columns returned in your results
df_flagCoord <- idig_search_records(rq = list(flags = "rev_geocode_corrected",
                                              institutioncode = "lacm"),
                    fields = c("uuid",
                               "institutioncode",
                               "collectioncode",
                               "country",
                               "data.dwc:country",
                               "stateprovince",
                               "county",
                               "locality",
                               "geopoint",
                               "data.dwc:decimalLongitude",
                               "data.dwc:decimalLatitude",
                               "flags"),
                    limit = 100000) %>% 
  # Rename fields to more easily reflect their provenance (either from the
  # data provider directly or modified by the data aggregator)
  rename(provider_lon = `data.dwc:decimalLongitude`,
         provider_lat = `data.dwc:decimalLatitude`,
         provider_country = `data.dwc:country`,
         aggregator_lon = `geopoint.lon`,
         aggregator_lat = `geopoint.lat`,
         aggregator_country = country,
         aggregator_stateprovince = stateprovince,
         aggregator_county = county,
         aggregator_locality = locality) %>% 
  # Reorder columns for easier viewing
  select(uuid, institutioncode, collectioncode, provider_lat, aggregator_lat,
         provider_lon, aggregator_lon, provider_country, aggregator_country,
         aggregator_stateprovince, aggregator_county, aggregator_locality,
         flags)

Here is what our query result data looks like:

# Subset `df_flagCoord` to show example
df_flagCoord[1:50,] %>% 
  select(-flags) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                font_size = 12,
                fixed_thead = T) %>% 
  scroll_box(width = "100%", height = "400px")

Visualize suspicious coordinates

One example of a geographic coordinate data quality issue would be that the latitude/longitude has a reversed sign, e.g. the data provider gave the value latitude = "7.1789" but meant latitude = "-7.1789." In the map below we can see a few examples of specimen records published to iDigBio where this is the case. These data have been adjusted by iDigBio and this action is recorded with the data quality flag "rev_geocode_flip_lat_sign."

# Create function to allow subsetting the `df_flagCoord` dataset by other flags
# found on these same records
df_flagSubset <- function(subsetFlag) {
  df_flagCoord %>% 
  filter(grepl(subsetFlag, flags)) %>% 
  select(uuid, matches("_lat|_lon")) %>% 
  unite(provider_coords, c("provider_lat", "provider_lon"), sep = ",") %>% 
  unite(aggregator_coords, c("aggregator_lat", "aggregator_lon"), sep = ",") %>% 
  gather(key = type, value = coordinates, -uuid) %>% 
  separate(coordinates, c("lat","lon"), sep = ",") %>% 
  mutate(lat = as.numeric(lat)) %>% 
  mutate(lon = as.numeric(lon)) %>% 
  arrange(uuid, type)}

# Subset `df_flagCoord` by records flagged for having had their latitude negated
# to place point in stated country by reverse geocoding process
df_rev_geocode_lat_sign <- df_flagSubset("rev_geocode_lat_sign")

# Create map displaying a few examples of records with the
# rev_geocode_flip_lat_sign flag
pal <- leaflet::colorFactor(palette = c("#d7191c", "#fdae61", "#ffffbf", "#abdda4", "#2b83ba"),
                   domain = df_rev_geocode_lat_sign$uuid[1:10])

map <- df_rev_geocode_lat_sign[1:10,] %>% 
  mutate(popup = str_c(type, " = ", lat, ", ", lon, sep = "")) %>% 
  leaflet() %>%
  addTiles() %>% 
  addCircleMarkers(
    lng = ~lon,
    lat = ~lat,
    radius = 10,
    weight = 1,
    color = ~pal(uuid),
    stroke = FALSE,
    fillOpacity = 100,
    popup = ~popup) %>% 
    addLegend("bottomright", pal = pal, values = ~uuid,
    title = "Specimen Records",
    opacity = 1)

We can visualize this data on a map to better understand what the data quality flag is telling us. For example, in the map below you can see the effect of accidentally reversing the latitude on three example georeferenced specimen records.

map

Summarize and explore data

The iDigBio API provides a means for an institution to examine data quality issues across collections, which sometimes is not possible internally when data in different collections are managed in different databases.

# Summarize flagged records by collection type
spmByColl <- df_flagCoord %>% 
  group_by(collectioncode) %>% 
  tally()

# Generate graph to display counts of flagged records by collection within the
# institution
graph_spmByColl <- ggplot(spmByColl, 
                          aes(x = reorder(collectioncode, -n), 
                              y = n,
                              fill = collectioncode)) +
  geom_col() +
  theme(panel.background = element_blank(),
        legend.title = element_blank(),
        axis.title.x = element_text(face = "bold"),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(size = 12, face = "bold")) +
  labs(x = "collection", 
       y = "# of specimen records",
       title = "LACM records flagged with geo-coordinate data quality issues by iDigBio") +
  geom_text(aes(label = n, vjust = -0.5))

# Get count of total records published by the institution using function
# `idig_count_records`
totalInstSpm <- idig_count_records(rq = list(institutioncode = "lacm"))

# Calculate flagged records as percent of total records
percentFlagged <- sum(spmByColl$n)/totalInstSpm*100

For example, we can ask how many specimen records from which collections at the Natural History Museum of Los Angeles (LACM) have been flagged as "rev_geocode_corrected" by iDigBio. As an aside, although this graph highlights the number of specimen records with data quality issues, these represent only r round(percentFlagged, 2)% of the total specimen records published by LACM.

graph_spmByColl <- graph_spmByColl +
                   theme_minimal_grid() +
                   theme(
                       text = element_text(size = 22),
                       axis.text = element_text(size = 22),
                       plot.title = element_text(size = 22, face = "bold")
                   )

knitr::include_graphics(save_plot("plot.png", graph_spmByColl, base_height = 10, base_width = 24))

We can also explore what other data quality flags these specimen records have been flagged with.

# Collate `df_flagAssoc` to describe other data quality flags that are associated
# with rev_geocode_corrected in `df_flagCoord`
df_flagAssoc <- df_flagCoord %>% 
  select(uuid, flags) %>% 
  unnest(flags) %>% 
  group_by(flags) %>% 
  tally() %>% 
  mutate("category" = case_when(str_detect(flags, "geo|country|state")
                              ~ "geography",
                      str_detect(flags, "dwc_datasetid_added|dwc_multimedia_added|datecollected_bounds")
                              ~ "other",
                      str_detect(flags, "gbif|dwc|tax")
                              ~ "taxonomy")) %>% 
  mutate("percent" = n/(nrow(df_flagCoord))*100) %>% 
  arrange(category, desc(n))

# Visualize associated data quality flags
graph_spmByColl <- ggplot(df_flagAssoc, aes(x = reorder(flags, -percent), y = percent, fill = category)) +
  geom_col() +
  theme(axis.title.x = element_text(face = "bold"),
        axis.text.x = element_text(angle = 75, hjust = 1),
        axis.ticks.y = element_blank(),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(size = 12, face = "bold")
        ) +
  labs(x = "additional iDigBio data quality flag", 
       y = "% specimen records",
       title = "LACM records flagged for geo-coordinate issues are also flagged for...",
       fill = "flag category")
graph_spmByColl <- graph_spmByColl +
                   theme_minimal_grid() +
                   theme(
                       text = element_text(size = 22),
                       axis.text = element_text(size = 22),
                       axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
                       plot.title = element_text(size = 22, face = "bold")
                   )

knitr::include_graphics(save_plot("plot2.png", graph_spmByColl, base_height = 10, base_width = 24))


Try the ridigbio package in your browser

Any scripts or data that you put into this service are public.

ridigbio documentation built on Oct. 1, 2024, 9:06 a.m.