In walkerjeffd/myrwaR: Tools for the Mystic River Watershed Association

knitr::opts_chunk$set(collapse = FALSE, comment = "#>")

This document describes functions in the myrwaR package for handling STORET data. First, we'll walk through how to fetch MyRWA data that is currently in STORET. Then we will compare the STORET dataset to what is currently in the Access database.

library(myrwaR)
library(lubridate)
library(dplyr)
library(tidyr)

Fetching Data from STORET

Locations Table

The MyRWA locations currently in STORET can be fetched using the get_storet_locations() function. This function will download a zip file from STORET, unzip the contents, and load the resulting station.csv into a data frame. Note that verbose is optional and TRUE by default, but you can set it to FALSE to suppress the log output.

storet_locations <- get_storet_locations(verbose = TRUE)
str(storet_locations)

Results Table

The MyRWA results table currently in STORET can be fetched using the get_storet_results() function. This function will download a zip file from STORET, unzip the contents, and load the resulting results.csv into a data frame. Note that the verbose argument can be used here too.

storet_results <- get_storet_results()
str(storet_results)

Comparing STORET Data to MyRWA Database

This section compares the number of samples by different variables (e.g. location, characteristic, year) between STORET and the current water quality database.

First, we need to extract the dataset from the current database using the function wqx_results() (this will convert the dataset to WQX format, which is similar, though not identical, to the format provided by STORET):

ch <- db_connect("D:/Dropbox/Work/mystic/db/MysticDB_20160208.accdb")
db_results <- wqx_results(ch, projects='BASE')
close(ch)

Next, we'll convert the ActivityStartDate column in the db_results data frame to Date types so we can extract the Year later, and also prepend "MYRWA-" to the MonitoringLocationID and ActivityID columns since the data returned from STORET includes this prefix but the WQX export format does not.

db_results <- mutate(db_results,
                     ActivityStartDate    = as.Date(ActivityStartDate),
                     MonitoringLocationID = paste("MYRWA", MonitoringLocationID, sep = "-"),
                     ActivityID           = paste("MYRWA", ActivityID, sep = "-"))

Compare Sample Counts

To compare the STORET dataset to that in the current database, we can use the storet_compare_count() function. This function takes two arguments, one for each dataset. Each argument should be a named list containing one or more columns from the original results table. The names in each list let you rename the columns from the original dataset (e.g. LocationID = storet_results$MonitoringLocationIdentifier)) and also define names for new columns (e.g. Year = year(storet_results$ActivityStartDate)). The names in the two list arguments (storet and db) must be identical and cannot be "Source" or "n", which are created internally within the function. For each list, the function will count the number of samples for each unique combination of values, and then compares those counts between the two datasets to identify any differences.

By Location ID

First, let's compare the number of samples by Location ID. Note how the names of the original columns are different in storet_results and db_results, and thus are renamed to both be LocationID within the list arguments.

storet_compare_count(storet = list(LocationID = storet_results$MonitoringLocationIdentifier),
                     db     = list(LocationID = db_results$MonitoringLocationID))

By Characteristic Name

Next, let's compare the number of samples by CharacteristicName:

storet_compare_count(storet = list(CharacteristicName = storet_results$CharacteristicName),
                     db     = list(CharacteristicName = db_results$CharacteristicName))

By Year

And then, we'll compare counts by Year. Note how the columns are converted from dates to years in each list using the year() function.

storet_compare_count(storet = list(Year = year(storet_results$ActivityStartDate)),
                     db     = list(Year = year(db_results$ActivityStartDate)))

By Date

Finally, we'll compare counts by ActivityStartDate:

storet_compare_count(storet = list(Date = storet_results$ActivityStartDate),
                     db     = list(Date = db_results$ActivityStartDate))

Summary

These comparisons show that there are a number of sample events from 2015 that are in the database but not yet in STORET. This approach can thus be useful for figuring out which sample events should be exported for the next STORET upload. It's also important to recognize that the differences shown in these individual tables do not reflect individual discrepancies, but rather that there are a handful of sampling events that have not yet been uploaded to STORET. In other words, the differences in sample counts for individual locations or characteristics simply reflect that some of the visits have not been uploaded yet.

Compare Sample Counts among STORET Visits Only

To verify that the counts by location and characteristic match among only the visits that have been uploaded to STORET, we can simply filter the database dataset to only include sample dates that are also in the STORET dataset.

db_results <- filter(db_results, ActivityStartDate %in% unique(storet_results$ActivityStartDate))

storet_compare_count(storet = list(LocationID = storet_results$MonitoringLocationIdentifier),
                     db     = list(LocationID = db_results$MonitoringLocationID))
storet_compare_count(storet = list(CharacteristicName = storet_results$CharacteristicName),
                     db     = list(CharacteristicName = db_results$CharacteristicName))
storet_compare_count(storet = list(Year = year(storet_results$ActivityStartDate)),
                     db     = list(Year = year(db_results$ActivityStartDate)))
storet_compare_count(storet = list(Date = storet_results$ActivityStartDate),
                     db     = list(Date = db_results$ActivityStartDate))

Based on these comparisons using only the dates that are in STORET, we see one issue in the counts by location ID. There were 8 samples where the LocationID appears to have been changed in the database from BEI001 to BEI093. We can then identify the specific sample date associated with this difference by doing a comparison on both Date and LocationID.

storet_compare_count(storet = list(Date       = storet_results$ActivityStartDate,
                                   LocationID = storet_results$MonitoringLocationIdentifier),
                     db     = list(Date       = db_results$ActivityStartDate,
                                   LocationID = db_results$MonitoringLocationID))

And we see that the difference in LocationID occurs in the 2015-01-23 sample visit.

Another thing we can do is look for differences in the ActivityID, which shows the same result. Note that the lab (L) and field (F) samples each have different ActivityIDs, so there are four rows total, two for each dataset.

storet_compare_count(storet = list(ActivityID = storet_results$ActivityIdentifier),
                     db     = list(ActivityID = db_results$ActivityID))

We now have some useful information to identify and correct the issue. Here are the corresponding rows from the STORET and Database result tables.

filter(storet_results,
       MonitoringLocationIdentifier == "MYRWA-BEI093",
       ActivityStartDate == as.Date("2015-01-23")) %>%
  select(ActivityIdentifier, ActivityStartDate, ActivityStartTime.Time,
         MonitoringLocationIdentifier, CharacteristicName, ResultMeasureValue,
         ResultMeasure.MeasureUnitCode)

filter(db_results,
       MonitoringLocationID == "MYRWA-BEI001",
       ActivityStartDate == as.Date("2015-01-23")) %>%
  select(ActivityID, ActivityStartDate, ActivityStartTime, MonitoringLocationID,
         CharacteristicName, ResultValue, ResultUnit)