knitr::opts_chunk$set(collapse = FALSE, comment = "#>")
This document describes functions in the myrwaR
package for handling STORET data. First, we'll walk through how to fetch MyRWA data that is currently in STORET. Then we will compare the STORET dataset to what is currently in the Access database.
library(myrwaR) library(lubridate) library(dplyr) library(tidyr)
The MyRWA locations currently in STORET can be fetched using the get_storet_locations()
function. This function will download a zip file from STORET, unzip the contents, and load the resulting station.csv
into a data frame. Note that verbose
is optional and TRUE
by default, but you can set it to FALSE
to suppress the log output.
storet_locations <- get_storet_locations(verbose = TRUE) str(storet_locations)
The MyRWA results table currently in STORET can be fetched using the get_storet_results()
function. This function will download a zip file from STORET, unzip the contents, and load the resulting results.csv
into a data frame. Note that the verbose
argument can be used here too.
storet_results <- get_storet_results() str(storet_results)
This section compares the number of samples by different variables (e.g. location, characteristic, year) between STORET and the current water quality database.
First, we need to extract the dataset from the current database using the function wqx_results()
(this will convert the dataset to WQX format, which is similar, though not identical, to the format provided by STORET):
ch <- db_connect("D:/Dropbox/Work/mystic/db/MysticDB_20160208.accdb") db_results <- wqx_results(ch, projects='BASE') close(ch)
Next, we'll convert the ActivityStartDate
column in the db_results
data frame to Date
types so we can extract the Year later, and also prepend "MYRWA-"
to the MonitoringLocationID
and ActivityID
columns since the data returned from STORET includes this prefix but the WQX export format does not.
db_results <- mutate(db_results, ActivityStartDate = as.Date(ActivityStartDate), MonitoringLocationID = paste("MYRWA", MonitoringLocationID, sep = "-"), ActivityID = paste("MYRWA", ActivityID, sep = "-"))
To compare the STORET dataset to that in the current database, we can use the storet_compare_count()
function. This function takes two arguments, one for each dataset. Each argument should be a named list containing one or more columns from the original results table. The names in each list let you rename the columns from the original dataset (e.g. LocationID = storet_results$MonitoringLocationIdentifier)
) and also define names for new columns (e.g. Year = year(storet_results$ActivityStartDate)
). The names in the two list arguments (storet
and db
) must be identical and cannot be "Source"
or "n"
, which are created internally within the function. For each list, the function will count the number of samples for each unique combination of values, and then compares those counts between the two datasets to identify any differences.
First, let's compare the number of samples by Location ID. Note how the names of the original columns are different in storet_results
and db_results
, and thus are renamed to both be LocationID
within the list arguments.
storet_compare_count(storet = list(LocationID = storet_results$MonitoringLocationIdentifier), db = list(LocationID = db_results$MonitoringLocationID))
Next, let's compare the number of samples by CharacteristicName
:
storet_compare_count(storet = list(CharacteristicName = storet_results$CharacteristicName), db = list(CharacteristicName = db_results$CharacteristicName))
And then, we'll compare counts by Year
. Note how the columns are converted from dates to years in each list using the year()
function.
storet_compare_count(storet = list(Year = year(storet_results$ActivityStartDate)), db = list(Year = year(db_results$ActivityStartDate)))
Finally, we'll compare counts by ActivityStartDate
:
storet_compare_count(storet = list(Date = storet_results$ActivityStartDate), db = list(Date = db_results$ActivityStartDate))
These comparisons show that there are a number of sample events from 2015 that are in the database but not yet in STORET. This approach can thus be useful for figuring out which sample events should be exported for the next STORET upload. It's also important to recognize that the differences shown in these individual tables do not reflect individual discrepancies, but rather that there are a handful of sampling events that have not yet been uploaded to STORET. In other words, the differences in sample counts for individual locations or characteristics simply reflect that some of the visits have not been uploaded yet.
To verify that the counts by location and characteristic match among only the visits that have been uploaded to STORET, we can simply filter the database dataset to only include sample dates that are also in the STORET dataset.
db_results <- filter(db_results, ActivityStartDate %in% unique(storet_results$ActivityStartDate)) storet_compare_count(storet = list(LocationID = storet_results$MonitoringLocationIdentifier), db = list(LocationID = db_results$MonitoringLocationID)) storet_compare_count(storet = list(CharacteristicName = storet_results$CharacteristicName), db = list(CharacteristicName = db_results$CharacteristicName)) storet_compare_count(storet = list(Year = year(storet_results$ActivityStartDate)), db = list(Year = year(db_results$ActivityStartDate))) storet_compare_count(storet = list(Date = storet_results$ActivityStartDate), db = list(Date = db_results$ActivityStartDate))
Based on these comparisons using only the dates that are in STORET, we see one issue in the counts by location ID. There were 8 samples where the LocationID appears to have been changed in the database from BEI001 to BEI093. We can then identify the specific sample date associated with this difference by doing a comparison on both Date
and LocationID
.
storet_compare_count(storet = list(Date = storet_results$ActivityStartDate, LocationID = storet_results$MonitoringLocationIdentifier), db = list(Date = db_results$ActivityStartDate, LocationID = db_results$MonitoringLocationID))
And we see that the difference in LocationID
occurs in the 2015-01-23
sample visit.
Another thing we can do is look for differences in the ActivityID
, which shows the same result. Note that the lab (L
) and field (F
) samples each have different ActivityIDs
, so there are four rows total, two for each dataset.
storet_compare_count(storet = list(ActivityID = storet_results$ActivityIdentifier), db = list(ActivityID = db_results$ActivityID))
We now have some useful information to identify and correct the issue. Here are the corresponding rows from the STORET and Database result tables.
filter(storet_results, MonitoringLocationIdentifier == "MYRWA-BEI093", ActivityStartDate == as.Date("2015-01-23")) %>% select(ActivityIdentifier, ActivityStartDate, ActivityStartTime.Time, MonitoringLocationIdentifier, CharacteristicName, ResultMeasureValue, ResultMeasure.MeasureUnitCode) filter(db_results, MonitoringLocationID == "MYRWA-BEI001", ActivityStartDate == as.Date("2015-01-23")) %>% select(ActivityID, ActivityStartDate, ActivityStartTime, MonitoringLocationID, CharacteristicName, ResultValue, ResultUnit)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.