knitr::opts_chunk$set( echo = TRUE, warning = FALSE, message = FALSE )
We will walk through the specific use case of query, downloading, preparing, and QA/QCing data from the San Juan and Animas Rivers near Farmington, New Mexico with a focus on total dissolved solids and conductivity.
This vignette is designed to be concise and provide an example of how you might use TADA functions for basic data discovery and preparation. It leverages functions from TADA Modules 1 and 2. For more details on each function, see the TADA Function Reference or the more comprehensive module vignettes:
The fist step is to load the TADA package and other packages used in our workflow. (If you are a first-time TADA user, see TADA Module 1 Beginner Training for TADA package installation instructions.)
We'll also record the start time for our analysis, beginning after loading packages.
# Load required packages library(EPATADA) library(dplyr) library(yaml) library(lwgeom) library(sf) library(lubridate) library(knitr) library(DT) # Record start time start.time <- Sys.time()
To help define the area of interest, we can start by taking a look at How's My Waterway by searching for Farmington, New Mexico: HMW_Farmington, NM. We can identify the Assessment Unit Identifiers for the segments of the San Juan and Animas Rivers closest to their confluence. These are:
1) San Juan River (Navajo bnd at Hogback to Animas River) - NM-2401_10
2) Animas River (San Juan River to Estes Arroyo) - NM-2403.A_00
We can also identify the HUC-12s for each segment. This information will help us build our Water Quality Portal query to find results for the segments of interest plus other nearby water bodies. The HUC-12s are 140801050505 (San Juan River) and 140801041006 (Animas River).
The water chemistry data we will use are downloaded from the Water Quality Portal (https://www.waterqualitydata.us/). The WQP integrates publicly available water quality data from the USGS National Water Information System (NWIS) and the EPA Water Quality Exchange (WQX) Data Warehouse. The EPA water quality data originate from the Water Quality Exchange, the EPA's repository of water quality monitoring data collected by water resource management groups across the country. Organizations, including states, tribes, watershed groups, other federal agencies, volunteer groups, and universities, submit data to the WQX.
One of the fields that can be used to query the WQP is HUC. As we know
the HUC-12s of interest, we will use them to retrieve WQP data using
TADA_DataRetrieval
. As we are most interested in total dissolved
solids and conductivity and these are in the "Physical" characteristic
type group in the WQP, we will also include characteristic group type in
our query. We will not set start and end dates as we are interested in
all available data for these locations and characteristics. .
# Import data from WQP data <- TADA_DataRetrieval( statecode = "NM", huc = c("140801050505", "140801041006"), characteristicType = "Physical", applyautoclean = TRUE, ask = FALSE )
As part of TADA_DataRetrieval
, TADA_AutoClean
is performed
automatically unless specified otherwise by the user (by setting
applyautoclean = FALSE). When TADA_AutoClean
is run, the following
functions are performed on the data retrieved from the WQP:
TADA_ConvertSpecialChars
- converts result value columns to
numeric and flags non-numeric values that could not be converted.
TADA_ConvertResultUnits
- unifies result units for easier quality
control and review
TADA_ConvertDepthUnits
- converts depth units to a consistent unit
(meters).
TADA_IDCensoredData
- categorizes detection limit data and
identifies mismatches in result detection condition and result
detection limit type.
Other helpful actions - converts important text columns to all upper-case letters, and uses WQX format rules to harmonize specific NWIS metadata conventions (e.g. move characteristic speciation from the TADA.ResultMeasure.MeasureUnitCode column to the TADA.MethodSpeciationName column)
As a general rule, TADA functions do not change any contents in the WQP-served columns. Instead, they add new columns with the prefix "TADA." This allows users to easily review any changes TADA functions have made.
Now we can review the results and decide which ones we would like to retain for further QC and analysis. First let's take a look at all unique values in the TADA.ComparableDataIdentifier column and see how how many results are associated with each. TADA.ComparableDataIdentifier concatenates TADA.CharacteristicName, TADA.ResultSampleFractionText, TADA.MethodSpeciationName, and TADA.ResultMeasure.MeasureUnitCode.
# use TADA_FieldValuesTable to create a table of the number of results per TADA.ComparableDataIdentifier compid_table <- TADA_FieldValuesTable(data, field = "TADA.ComparableDataIdentifier") DT::datatable(compid_table, fillContainer = TRUE)
As we scroll through the table, we can see that there are some repeated
characteristic names. It may be possible that some of these are synonyms
which can be harmonized using TADA_HarmonizeSynonyms
so their results
can be directly compared. However, for purposes of this demo we will not
explore harmonization and will instead focus on the two
TADA.ComparableDataIdentifiers with the greatest number of results,
"TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L" and "SPECIFIC
CONDUCTANCE_TOTAL_NA_US/CM @25C". For information on harmonization, see
the EPATADA Function
Reference.
Now, we can filter the data set to retain only the TADA.ComparableDataIdentifiers of interest.
data <- data %>% dplyr::filter(TADA.ComparableDataIdentifier %in% c( "TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L", "SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C" ))
Before performing any analysis, there are some additional data preparation and QC steps that can be performed using TADA functions.
First, we can check the data for additional conditions which might cause
us to exclude certain results from further analysis. For example, we can
use TADA_FlagMethod
to check for invalid analytical
method-characteristic combinations; TADA_FlagSpeciation
to check for
invalid characteristics-method speciation combinations,
TADA_FlagResultUnit
to check the the validity of each
characteristic-media-result unit combination, and TADA_FlagFraction
to
check the validity of characteristic-fraction combinations. In this
simple example, we will remove all invalid combinations identified by
each of these functions. When setting up your own workflow, you will
want to review each function and its output more carefully before moving
on.
```r # Flag and remove results} data <- data %>% TADA_FlagMethod(clean = TRUE) %>% TADA_FlagSpeciation(clean = "both") %>% TADA_FlagResultUnit(clean = "both") %>% TADA_FlagFraction(clean = TRUE)
Another set of TADA flagging functions, `TADA_FlagAboveThreshold` and `TADA_FlagBelowThreshold`, can be used to check results against national lower and upper thresholds. As in the previous examples, we will set these functions to remove any results that fall outside the national thresholds. ```r # Flag and remove results} data <- data %>% TADA_FlagAboveThreshold(clean = TRUE) %>% TADA_FlagBelowThreshold(clean = TRUE)
We can also review the data set to see if it contains potential duplicate results from within a single organization or potential duplicates from within multiple organizations (such as when two or more organizations monitor the same location and may submit duplicate results).
We will select to keep only unique samples from
TADA_FindPotentialDuplicatesSingleOrg
by filtering for
TADA.SingleOrgDup.Flag equals "Unique". Then we will search for
potential duplicates from multiple orgs with
TADA_FindPotentialDuplicatesMultipleOrgs
and filter only to retain
only one value per potential duplicate by filtering for
TADA.ResultSelectedMultipleOrgs equals "Y".
If you would like to prioritize results from one organization over
another, this can be done using the org_hierarchy argument in
TADA_FindPotentialDuplicatesMultipleOrgs
. Information on how to this
can be found in the function reference for
TADA_FindPotentialDuplicatesMultipleOrgs.
In this example, we will use the default setting "none" where a single
representative is selected randomly from each duplicate group.
```r # Flag and remove results} data <- data %>% TADA_FindPotentialDuplicatesSingleOrg() %>% dplyr::filter(TADA.SingleOrgDup.Flag == "Unique") %>% TADA_FindPotentialDuplicatesMultipleOrgs( dist_buffer = 100, org_hierarchy = "none" ) %>% dplyr::filter(TADA.ResultSelectedMultipleOrgs == "Y")
Next we can remove QC and qualified samples (with qualifiers identified as suspect) using `TADA_FindQCActivities` and `TADA_FlagMeasureQualifierCode`, respectively. See the TADA Module 1 vignettes for more information on how these functions identify and flag results for removal as you may wish to make different decisions than the TADA defaults regarding which results to retain. ```r # Flag and remove results} data <- data %>% TADA_FindQCActivities(clean = TRUE) %>% TADA_FlagMeasureQualifierCode(clean = TRUE)
We can use a TADA function to identify any censored data in the data set.
data <- TADA_IDCensoredData(data)
Because there are no censored data in the data set, we do not need to
take any further action. If censored data do exist,
TADA_SimpleCensoredMethods
could be used. See the TADA Module 1
vignettes or for more information.
All geospatial data used in this example are downloaded from the Assessment Total Maximum Daily Load (TMDL) Tracking and Implementation System (ATTAINS) (https://www.epa.gov/waterdata/attains). ATTAINS is an online system for accessing information about conditions in the Nation's surface waters.
This information reported to EPA by states is available in ATTAINS. The public information is made available via ATTAINS web services. New TADA functions leverage the ATTAINS geospatial services to make geospatial information including catchment and assessment unit geometry easily accessible to R users.
We can use the function, TADA_GetATTAINS
to obtain geospatial data
from ATTAINS relevant to the Monitoring Locations included in the
WQP data set. See TADA Module
2 for a much
more detailed look at the logic behind this and the other TADA
geospatial functions.
# Import data from ATTAINS geospatial services ATTAINS_data <- TADA_GetATTAINS(data)
The TADA_ViewATTAINS
function allows us to see where the monitoring
locations from the WQP data set are relative to ATTAINS-indexed
catchments and assessment units. This can be helpful when deciding which
Monitoring Locations should be retained for additional analysis. It can
also allow us to sub
# View catchments and assessment units on map ATTAINS_map <- TADA_ViewATTAINS(ATTAINS_data) ATTAINS_map
Now that we've associated geospatial data from ATTAINS with the WQP data, we can generate pie charts for each TADA.ComparableDataIdentifier to give us some information about how many results are available per each monitoring location group.
# Select TADA data with ATTAINS data data <- ATTAINS_data$TADA_with_ATTAINS %>% # Remove geometry to reduce size of data set sf::st_drop_geometry() %>% # Adding "Other" as name for unnamed assessment units dplyr::mutate( ATTAINS.assessmentunitname = ifelse(is.na(ATTAINS.assessmentunitname), "Other", ATTAINS.assessmentunitname ) ) # Create pie chart of results for each AUID group for total dissolved solids TADA_FieldValuesPie(data, field = "ATTAINS.assessmentunitname", characteristicName = "TOTAL DISSOLVED SOLIDS" ) # Create pie chart of results for each AUID group for specific conductance TADA_FieldValuesPie(data, field = "ATTAINS.assessmentunitname", characteristicName = "SPECIFIC CONDUCTANCE" )
We can also create a table to give us a little more information about assessment unit each monitoring location belongs to, the number of results per each parameter and the oldest and most recent sampling date.
# Create table of monitoring location identifiers MonitoringLocations <- data %>% dplyr::select( MonitoringLocationName, MonitoringLocationIdentifier, OrganizationFormalName, ATTAINS.assessmentunitname, TADA.CharacteristicName, ActivityStartDate ) %>% dplyr::group_by(MonitoringLocationIdentifier) %>% dplyr::mutate( n_TDS = length(TADA.CharacteristicName[TADA.CharacteristicName == "TOTAL DISSOLVED SOLIDS"]), n_SpecificConductance = length(TADA.CharacteristicName[TADA.CharacteristicName == "SPECIFIC CONDUCTANCE"]), StartDate = min(ActivityStartDate), EndDate = max(ActivityStartDate) ) %>% dplyr::select( MonitoringLocationIdentifier, ATTAINS.assessmentunitname, n_TDS, n_SpecificConductance, StartDate, EndDate ) %>% dplyr::distinct() DT::datatable(MonitoringLocations, fillContainer = TRUE)
Now, let's create subsets of the data by assessment unit name. These will be useful to create some basic data visualizations.
data_animas <- data %>% dplyr::filter(ATTAINS.assessmentunitname == "Animas River (San Juan River to Estes Arroyo)") data_sanjuan <- data %>% dplyr::filter(ATTAINS.assessmentunitname == "San Juan River (Navajo bnd at Hogback to Animas River)") data_other <- data %>% dplyr::filter(ATTAINS.assessmentunitname == "Other")
First, let's create a two characteristic scatterplot of total dissolved
solids and specific conductance from the entire data set to begin
visualizing the data, using TADA_TwoCharacteristicScatterplot
.
# choose two and generate scatterplot TADA_TwoCharacteristicScatterplot(data, id_cols = "TADA.ComparableDataIdentifier", groups = c("SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C", "TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L"))
While there are some older records (including a specific conductance result from 1900), we may decide we are interested only in more recent data. We could filter the data set to exclude any results collected prior to 2015 and recreate the scatterplot.
# choose two and generate scatterplot TADA_TwoCharacteristicScatterplot( data %>% dplyr::filter (ActivityStartDate > "2014-12-31"), id_cols = "TADA.ComparableDataIdentifier", groups = c( "SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C", "TOTAL DISSOLVED SOLIDS_DISSOLVED_NA_UG/L" ) )
We could also choose to explore one characteristic from one of the assessment units more closely. For example, we could create a histograms for both characteristics for the Animas River only, excluding samples prior to 2014.
TADA_Histogram( data %>% dplyr::filter( ActivityStartDate > "2014-12-31", TADA.CharacteristicName == "SPECIFIC CONDUCTANCE", ATTAINS.assessmentunitname == "Animas River (San Juan River to Estes Arroyo)" ), id_cols = "TADA.ComparableDataIdentifier" )
Or we could create a boxplot for a particular characteristic and assessment unit, for example, total dissolved solids for the San Juan River.
TADA_Boxplot( data %>% dplyr::filter( ActivityStartDate > "2014-12-31", TADA.CharacteristicName == "SPECIFIC CONDUCTANCE", ATTAINS.assessmentunitname == "San Juan River (Navajo bnd at Hogback to Animas River)" ), id_cols = "TADA.ComparableDataIdentifier" )
We might also want to directly compare the same characteristic overtime
between the two rivers. We can do this by revisiting
TADA.TwoCharacteristicScatterplot
.
# create two characteristic scatterplot using TADA_TWoCharacteristicScatterplot twochar_scatter <- TADA_TwoCharacteristicScatterplot( data %>% dplyr::filter( ActivityStartDate > "2014-12-31", TADA.ComparableDataIdentifier == "SPECIFIC CONDUCTANCE_TOTAL_NA_US/CM @25C" ), id_cols = "ATTAINS.assessmentunitname", groups = c("San Juan River (Navajo bnd at Hogback to Animas River)", "Animas River (San Juan River to Estes Arroyo)") ) %>% # remove default plot features that are not applicable for a location comparison plotly::layout( yaxis2 = list(overlaying = "y", side = "right", title = "", visible = FALSE), title = TADA_InsertBreaks("SPECIFIC CONDUCTANCE for the San Juan and Animas Rivers Over Time") ) # modify plot titles to reflect group/location names twochar_scatter <- plotly::plotly_build(twochar_scatter) twochar_scatter$x$data[[2]]$name <- "San Juan River" twochar_scatter$x$data[[3]]$name <- "Animas River" # rebuild graph with corrected legend labels plotly::plotly_build(twochar_scatter)
We can also use TADA_Stats
to generate a summary table for the entire
data set or subsets of it for each river.
# create table with all data data_stats <- TADA_Stats(data) DT::datatable(data_stats, fillContainer = TRUE)
# create table with animas data animas_stats <- TADA_Stats(data %>% dplyr::filter(ATTAINS.assessmentunitname == "Animas River (San Juan River to Estes Arroyo)")) DT::datatable(animas_stats, fillContainer = TRUE)
# create table with san juan data sanjuan_stats <- TADA_Stats(data %>% dplyr::filter(ATTAINS.assessmentunitname == "San Juan River (Navajo bnd at Hogback to Animas River)")) DT::datatable(sanjuan_stats, fillContainer = TRUE)
This workflow is reproducible and the decisions at each step are well documented. This means that it is easy to go back and review every step, understand the decisions that were made, make changes as necessary, and run it again.
For example, we could modify the code to repeat the same analysis at a
different location. Or we could run TADA_ConvertResultUnits
an
additional time outside to convert units for one or more
characteristics.
We could change or add additional characteristics to our analysis. We could even write additional custom functions or even use functions from other packages as needed for more complex standards, evaluate trends, or answer other questions using a TADA dataframe.
The code chunk below displays the elapsed time it took to run the code chunks and create this document, providing an example of how incorporating TADA in your workflow can increase efficiency.
end.time <- Sys.time() end.time - start.time
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.