knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  eval = FALSE
)
We will walk through the specific use case of querying, downloading, preparing, and QA/QCing data from a single Assessment Unit with multiple Monitoring Locations sampled by several organizations. Three characteristics, pH, water temperature, and total nitrogen, are used in this example. Two of the characteristics, pH and water temperature, will be compared to a state water quality standard from Illinois. The third characteristic, total nitrogen, does not have an Illinois water quality standard, but is used to demonstrate some helpful functions for data exploration and visualization.
This vignette is designed to be concise and provide an example of how you might use TADA functions for basic data discovery and analysis for a single Assessment Unit and a small subset of characteristics. It leverages functions from TADA Modules 1 and 2. For more details on each function, see the TADA Function Reference or the more comprehensive module vignettes.
First, install and load the remotes package, specifying the repo. This is needed before installing TADA because TADA is only available on GitHub (not CRAN).
install.packages("remotes", repos = "http://cran.us.r-project.org" ) library(remotes)
Next, install and load EPATADA using the remotes package.
remotes::install_github("USEPA/EPATADA", ref = "develop", dependencies = TRUE )
remotes::install_github("USEPA/EPATADA", ref = "prerelease", dependencies = TRUE ) library(EPATADA)
The first step is to load the TADA package and other packages used in our workflow. (If you are a first-time TADA user, see TADA Module 1 Beginner Training for TADA package installation instructions.)
We'll also record the start time for our analysis, beginning after loading packages.
list.of.packages <- c("dplyr", "yaml", "lwgeom", "sf", "lubridate", "knitr", "DT")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)

# Load required packages
library(dplyr)
library(yaml)
library(lwgeom)
library(sf)
library(lubridate)
library(knitr)
library(DT)

# Record start time
start.time <- Sys.time()
The water chemistry data we will use are downloaded from the Water Quality Portal (https://www.waterqualitydata.us/). The WQP integrates publicly available water quality data from the USGS National Water Information System (NWIS) and the EPA Water Quality Exchange (WQX) Data Warehouse. The EPA water quality data originate from the Water Quality Exchange, the EPA's repository of water quality monitoring data collected by water resource management groups across the country. Organizations, including states, tribes, watershed groups, other federal agencies, volunteer groups, and universities, submit data to the WQX.
We are interested in Assessment Unit "IL_I-84". This is a Mississippi River Assessment Unit. Because we don't have a comprehensive list of the Monitoring Locations in this Assessment Unit, we will need to create a broad query to retrieve all results that might be associated with IL_I-84. One way to do this is to include all HUC10s that are relevant to the Assessment Unit.
We can use TADA_DataRetrieval to retrieve WQP data. In this example, we have specified lists of HUCs and Characteristic Types to reduce the size of the query, since we have a general idea of the location of our assessment unit of interest and know which characteristics we'd like to explore. One quick way to find this information is to view IL_I-84's Waterbody Report on How's My Waterway: WaterbodyReport/IL_EPA/IL_I-84/2022.
We'll also need to set start and end dates. For this example, we'll use results collected between 2010 and 2020. We will also set the Characteristic Type to "Nutrient" and "Physical", because these types contain the nitrogen, pH, and water temperature Characteristics. It is possible to query by a specific Characteristic Name, but different organizations may enter the same name in a variety of ways. Therefore, we recommend keeping your initial queries as broad as possible so that all relevant results are downloaded and useful data from synonymous Characteristics are not missed. There are TADA functions that can be used to harmonize these synonyms.
# Import data from WQP
data <- TADA_DataRetrieval(
  statecode = "IL",
  startDate = "2010-01-01",
  endDate = "2020-12-31",
  huc = c("0714010505", "0714010504", "0714010508", "0714010501", "0714010503"),
  characteristicType = c("Nutrient", "Physical"), # as described above
  applyautoclean = TRUE,
  ask = FALSE
)
As part of TADA_DataRetrieval, TADA_AutoClean is performed automatically unless specified otherwise by the user (by setting applyautoclean = FALSE). When TADA_AutoClean is run, the following functions are applied to the data retrieved from the WQP:
- TADA_ConvertSpecialChars - converts result value columns to numeric and flags non-numeric values that could not be converted.
- TADA_ConvertResultUnits - unifies result units for easier quality control and review.
- TADA_ConvertDepthUnits - converts depth units to a consistent unit (meters).
- TADA_IDCensoredData - categorizes detection limit data and identifies mismatches in result detection condition and result detection limit type.
- Other helpful actions - converts important text columns to all upper-case letters, and uses WQX format rules to harmonize specific NWIS metadata conventions (e.g., moves characteristic speciation from the TADA.ResultMeasure.MeasureUnitCode column to the TADA.MethodSpeciationName column).
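If you set applyautoclean = FALSE in TADA_DataRetrieval, you can run these steps yourself later. A minimal sketch, assuming a smaller version of the query above:

# Retrieve raw data without autocleaning, then apply TADA_AutoClean manually
raw_data <- TADA_DataRetrieval(
  statecode = "IL",
  startDate = "2010-01-01",
  endDate = "2020-12-31",
  huc = c("0714010505"),
  applyautoclean = FALSE,
  ask = FALSE
)
cleaned_data <- TADA_AutoClean(raw_data)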
As a general rule, TADA functions do not change any contents in the WQP-served columns. Instead, they add new columns with the prefix "TADA." This allows users to easily review any changes TADA functions have made.
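For example, you can place a WQP-served column next to its TADA counterpart to review what changed (a quick check on the autocleaned data object created above):

# Compare original and TADA-prefixed columns side by side
data %>%
  dplyr::select(
    ResultMeasureValue, TADA.ResultMeasureValue,
    ResultMeasure.MeasureUnitCode, TADA.ResultMeasure.MeasureUnitCode
  ) %>%
  head()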
All geospatial data used in this example are downloaded from the Assessment, Total Maximum Daily Load (TMDL) Tracking and Implementation System (ATTAINS) (https://www.epa.gov/waterdata/attains). ATTAINS is an online system for accessing information about conditions in the Nation's surface waters.
Information reported to EPA by states is available in ATTAINS, and the public information is made available via ATTAINS web services. New TADA functions leverage the ATTAINS geospatial services to make geospatial information, including catchment and assessment unit geometry, easily accessible to R users.
We can use the TADA_GetATTAINS function to obtain geospatial data from ATTAINS relevant to the Monitoring Locations included in the WQP data set. See TADA Module 2 for a much more detailed look at the logic behind this and the other TADA geospatial functions.
# Import data from ATTAINS geospatial services
ATTAINS_data <- TADA_GetATTAINS(data)
The TADA_ViewATTAINS function allows us to see where the monitoring locations from the WQP data set are relative to ATTAINS-indexed catchments and assessment units. This can be helpful when deciding which Monitoring Locations should be retained for additional analysis. For this demo, we are focusing on a single Assessment Unit, IL_I-84, with multiple monitoring locations (MonitoringLocationIdentifier).
# View catchments and assessment units on map
ATTAINS_map <- TADA_ViewATTAINS(ATTAINS_data)
ATTAINS_map
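If you want to keep the interactive map for later review, one option is to save it to a standalone HTML file with the htmlwidgets package (a sketch, assuming the map object is an htmlwidget such as a leaflet map):

# Save the interactive map to a standalone HTML file
htmlwidgets::saveWidget(ATTAINS_map, "ATTAINS_map.html")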
Now that we've associated geospatial data from ATTAINS with the WQP data, we can filter the data set to retain only results that were collected in the specified Assessment Unit. We can also generate a new table to give us some information about the individual monitoring locations within the assessment unit. After this filter step is complete, we can remove the "ATTAINS." prefixed columns to reduce the size of the data set, as we won't need them for the remaining steps in this example.
We can also create a table with some basic information about the Monitoring Locations in Assessment Unit IL_I-84 and a pie chart to display the relative number of results contributed by each organization.
# Filter data for specified assessment unit
AUID_data <- ATTAINS_data$TADA_with_ATTAINS %>%
  dplyr::filter(ATTAINS.assessmentunitidentifier == "IL_I-84")

Analysis_data <- ATTAINS_data$TADA_with_ATTAINS %>%
  dplyr::filter(ATTAINS.assessmentunitidentifier == "IL_I-84") %>%
  dplyr::select(-contains("ATTAINS.")) %>%
  sf::st_drop_geometry() %>%
  TADA_RetainRequired()

# Create table of monitoring location identifiers
MonitoringLocations <- Analysis_data %>%
  dplyr::select(
    MonitoringLocationName,
    MonitoringLocationIdentifier,
    OrganizationFormalName
  ) %>%
  dplyr::distinct()

DT::datatable(MonitoringLocations, fillContainer = TRUE)

# Create pie chart of results by organization
Orgs_pie <- TADA_FieldValuesPie(Analysis_data, field = "OrganizationFormalName")
Orgs_pie
# Determine number of organizations
norgs <- length(unique(Analysis_data$OrganizationFormalName))

# Determine number of monitoring locations
nmls <- length(unique(Analysis_data$MonitoringLocationIdentifier))

# Determine number of results
nres <- length(Analysis_data$TADA.ResultMeasureValue)

# Determine earliest sample date
earlydate <- min(Analysis_data$ActivityStartDate)

# Determine latest sample date
latedate <- max(Analysis_data$ActivityStartDate)
This allows us to easily review how many monitoring locations, organizations, and results are in the dataframe for the selected time period.
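For example, these values can be combined into a one-line summary of the dataframe (a simple sketch using the objects created above):

# Print a quick summary of the filtered dataframe
cat(
  nres, "results from", nmls, "monitoring locations and", norgs,
  "organizations, collected between", format(earlydate), "and", format(latedate), "\n"
)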
Before we can begin comparisons between the water chemistry results and a water quality standard, there are some additional data preparation and QC steps that can be performed using TADA functions.
First, we can check the data for additional conditions that might cause us to exclude certain results from further analysis. For example, we can use TADA_FlagMethod to check for invalid analytical method-characteristic combinations; TADA_FlagSpeciation to check for invalid characteristic-method speciation combinations; TADA_FlagResultUnit to check the validity of each characteristic-media-result unit combination; and TADA_FlagFraction to check the validity of characteristic-fraction combinations. In this simple example, we will remove all invalid combinations identified by each of these functions. When setting up your own workflow, you will want to review each function and its output more carefully before moving on.
# Flag and remove results
Analysis_data <- Analysis_data %>%
  TADA_FlagMethod(clean = TRUE) %>%
  TADA_FlagSpeciation(clean = "both") %>%
  TADA_FlagResultUnit(clean = "both") %>%
  TADA_FlagFraction(clean = TRUE)
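When reviewing rather than removing, you might first run a flagging function with clean = FALSE and tabulate the resulting flag column before deciding what to drop. A sketch (the flag column name, TADA.AnalyticalMethod.Flag, is based on the package documentation and may differ by version):

# Review method flags without removing any results
flag_check <- TADA_FlagMethod(Analysis_data, clean = FALSE)
table(flag_check$TADA.AnalyticalMethod.Flag)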
Another set of TADA flagging functions, TADA_FlagAboveThreshold and TADA_FlagBelowThreshold, can be used to check results against national lower and upper thresholds. As in the previous examples, we will set these functions to remove any results that fall outside the national thresholds.
# Flag and remove results
Analysis_data <- Analysis_data %>%
  TADA_FlagAboveThreshold(clean = TRUE) %>%
  TADA_FlagBelowThreshold(clean = TRUE)
We can also review the data set to see if it contains potential duplicate results from within a single organization or across multiple organizations (such as when two or more organizations monitor the same location and may submit duplicate results).
We will keep only unique samples from TADA_FindPotentialDuplicatesSingleOrg by filtering for TADA.SingleOrgDup.Flag equal to "Unique". Then we will search for potential duplicates from multiple organizations with TADA_FindPotentialDuplicatesMultipleOrgs and retain only one value per potential duplicate group by filtering for TADA.ResultSelectedMultipleOrgs equal to "Y".

If you would like to prioritize results from one organization over another, this can be done using the org_hierarchy argument in TADA_FindPotentialDuplicatesMultipleOrgs. Information on how to do this can be found in the function reference for TADA_FindPotentialDuplicatesMultipleOrgs. In this example, we will use the default setting "none", in which a single representative is selected randomly from each duplicate group.
# Flag and remove results
Analysis_data <- Analysis_data %>%
  TADA_FindPotentialDuplicatesSingleOrg() %>%
  dplyr::filter(TADA.SingleOrgDup.Flag == "Unique") %>%
  TADA_FindPotentialDuplicatesMultipleOrgs(
    dist_buffer = 100,
    org_hierarchy = "none"
  ) %>%
  dplyr::filter(TADA.ResultSelectedMultipleOrgs == "Y")
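If you did want to prioritize one organization's results over another's, the call might instead look like the following (the organization identifiers here are hypothetical placeholders; substitute values from your own OrganizationIdentifier column):

# Hypothetical: prefer results from the first-listed organization
# Analysis_data <- TADA_FindPotentialDuplicatesMultipleOrgs(
#   Analysis_data,
#   dist_buffer = 100,
#   org_hierarchy = c("ORG-A", "ORG-B")
# )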
Next, we can remove QC and qualified samples (with qualifiers identified as suspect) using TADA_FindQCActivities and TADA_FlagMeasureQualifierCode, respectively. See the TADA Module 1 vignettes for more information on how these functions identify and flag results for removal, as you may wish to make different decisions than the TADA defaults regarding which results to retain.
# Flag and remove results
Analysis_data <- Analysis_data %>%
  TADA_FindQCActivities(clean = TRUE) %>%
  TADA_FlagMeasureQualifierCode(clean = TRUE)
Censored data are measurements for which the true value is not known, but we can estimate the value based on lower or upper detection conditions and limit types. TADA fills missing TADA.ResultMeasureValue and TADA.ResultMeasure.MeasureUnitCode values with values and units from TADA.DetectionQuantitationLimitMeasure.MeasureValue and TADA.DetectionQuantitationLimitMeasure.MeasureUnitCode, respectively, using the TADA_AutoClean function.
The TADA package currently has functions that summarize censored data incidence in the dataset and perform simple substitutions of censored data values, including x times the detection limit and random selection of a value between 0 and the detection limit. The user may specify the methods used for non-detects and over-detects separately in the input to the TADA_SimpleCensoredMethods function. The next step we take in this example is to perform simple conversions to the censored data in the dataset: we keep over-detects as-is (no conversion made) and convert non-detect values to 0.5 times the detection limit (half the detection limit).

See the TADA Module 1 vignettes for more detailed information on the logic behind TADA_SimpleCensoredMethods.

After running TADA_SimpleCensoredMethods, we will also filter the data to remove any remaining NAs.
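Before substitution, it can also be helpful to see how many results fall into each censored-data category. A quick check (this assumes the TADA.CensoredData.Flag column added by TADA_IDCensoredData during autocleaning; the column name may differ by package version):

# Review censored-data categories prior to substitution
table(Analysis_data$TADA.CensoredData.Flag)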
Analysis_data <- TADA_SimpleCensoredMethods(Analysis_data,
  nd_method = "multiplier",
  nd_multiplier = 0.5,
  od_method = "as-is",
  od_multiplier = "null"
) %>%
  dplyr::filter(!is.na(TADA.ResultMeasureValue))
The TADA_GetSynonymRef function generates a synonym reference table that is specific to the input dataframe. Users can review how their input data relate to standard TADA values for the following elements: TADA.CharacteristicName, TADA.ResultSampleFractionText, TADA.MethodSpeciationName, and TADA.ResultMeasure.MeasureUnitCode, and can edit the reference file to meet their needs if desired. The download argument can be used to save the harmonization file to your current working directory when download = TRUE; the default is download = FALSE. In this example, we will modify the harmonization file to group results with the fraction "NA" and results with the fraction "TOTAL" for pH and water temperature.
The TADA_HarmonizeSynonyms function then compares the input dataframe to the TADA Synonym Reference Table and makes conversions where target characteristics/fractions/speciations/units are provided. This function also appends a column called TADA.Harmonized.Flag, indicating which results had metadata changed/converted by this function. The purpose of this function is to make similar data consistent and therefore easier to compare and analyze.
UniqueHarmonizationRef <- TADA_GetSynonymRef(Analysis_data)

UniqueHarmonizationRef_edit <- UniqueHarmonizationRef %>%
  dplyr::mutate(
    Target.TADA.CharacteristicName = ifelse(TADA.CharacteristicName == "TEMPERATURE, WATER",
      "TEMPERATURE", TADA.CharacteristicName
    ),
    Target.TADA.ResultSampleFractionText = ifelse(TADA.CharacteristicName == "PH",
      NA, Target.TADA.ResultSampleFractionText
    ),
    HarmonizationGroup = ifelse(TADA.CharacteristicName == "PH",
      "PH", HarmonizationGroup
    ),
    HarmonizationGroup = ifelse(Target.TADA.CharacteristicName == "TEMPERATURE",
      "TEMPERATURE", HarmonizationGroup
    )
  )

Harmonized_data <- TADA_HarmonizeSynonyms(Analysis_data,
  ref = UniqueHarmonizationRef_edit
)
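After harmonization, the TADA.Harmonized.Flag column mentioned above can be tabulated to confirm which results had metadata converted:

# Review which results were changed during harmonization
table(Harmonized_data$TADA.Harmonized.Flag)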
TADA_CalculateTotalNP uses the Nutrient Aggregation logic to add together specific subspecies to obtain a total. TADA adds one more equation to the mix: total particulate nitrogen + total dissolved nitrogen. The function uses as many subspecies as possible to calculate a total for each given site, date, and depth group, but it will estimate total nitrogen with whatever subspecies are present. This function creates NEW total nutrient measurements (total nitrogen unfiltered as N and total phosphorus unfiltered as P) and adds them to the dataframe.

Users can use the default summation worksheet (see TADA_GetNutrientSummationRef) or customize it to suit their needs. The function also requires a daily aggregation value: minimum, maximum, or mean. The default is 'max', which means that if multiple measurements of the same subspecies-fraction-speciation-unit occur on the same day at the same site and depth, the function will pick the maximum value to use in summation calculations. In this example, we will use the default summation worksheet to obtain total nitrogen.
Harmonized_data <- TADA_CalculateTotalNP(Harmonized_data, daily_agg = "max")
Now we can filter to our three parameters of interest and use TADA_Stats to take a look at some basic statistics for each, including location count, measurement count, min, max, and more.
# Review unique identifiers
unique(Harmonized_data$TADA.ComparableDataIdentifier)

# Filter for three comparable data identifiers of interest
Filtered_data <- Harmonized_data %>%
  dplyr::filter(TADA.ComparableDataIdentifier %in% c(
    "TEMPERATURE_NA_NA_DEG C",
    "PH_NA_NA_STD UNITS",
    "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L"
  ))

# Generate stats table
Filtered_data_stats <- TADA_Stats(Filtered_data)
DT::datatable(Filtered_data_stats, fillContainer = TRUE)
Next, we will create a two-characteristic scatterplot of pH and temperature to begin visualizing the data, using TADA_TwoCharacteristicScatterplot.
# Choose two characteristics and generate scatterplot
TADA_TwoCharacteristicScatterplot(Harmonized_data,
  id_cols = "TADA.ComparableDataIdentifier",
  groups = c("TEMPERATURE_NA_NA_DEG C", "PH_NA_NA_STD UNITS")
)
Let's focus on just one characteristic, pH. We can filter the dataframe to retain only pH samples. We can also add a column to indicate whether each result falls within the water quality standard range for pH (6.5-9). From this pH-only dataframe we can create a table with just a few columns for easier review, and a single-characteristic scatterplot using TADA_Scatterplot. In this example, horizontal lines have been added to indicate the upper and lower ends of the water quality standard. These are not a default option in TADA_Scatterplot but can be added via the plotly package.
# Comparison to standard for pH
pH_Standard <- Filtered_data %>%
  dplyr::filter(TADA.ComparableDataIdentifier == "PH_NA_NA_STD UNITS") %>%
  dplyr::mutate(MeetsStandard = ifelse(TADA.ResultMeasureValue >= 6.5 &
    TADA.ResultMeasureValue <= 9, "Yes", "No"))

# Subset of pH data with fewer columns
pH_Table <- pH_Standard %>%
  dplyr::select(
    MonitoringLocationIdentifier,
    OrganizationFormalName,
    ActivityStartDate,
    TADA.ResultMeasureValue,
    MeetsStandard
  )

DT::datatable(pH_Table, fillContainer = TRUE)

# pH scatterplot with standard lines added via plotly
pH_Scatter <- TADA_Scatterplot(pH_Standard, id_cols = "TADA.ComparableDataIdentifier") %>%
  plotly::add_lines(
    y = 6.5,
    x = c(min(pH_Standard$ActivityStartDate), max(pH_Standard$ActivityStartDate)),
    inherit = FALSE,
    showlegend = FALSE,
    line = list(color = "red"),
    hoverinfo = "none"
  ) %>%
  plotly::add_lines(
    y = 9,
    x = c(min(pH_Standard$ActivityStartDate), max(pH_Standard$ActivityStartDate)),
    inherit = FALSE,
    showlegend = FALSE,
    line = list(color = "red"),
    hoverinfo = "none"
  )

pH_Scatter
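As with the organization pie chart earlier, you might also summarize pH standard attainment with TADA_FieldValuesPie on the MeetsStandard column we just created:

# Pie chart of pH results meeting / not meeting the standard
pH_Pie <- TADA_FieldValuesPie(pH_Standard, field = "MeetsStandard")
pH_Pie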
Comparison of temperature to water quality standards is a little more complicated, as the maximum temperature varies by season and there are multiple components of the temperature standard. For this example, we will focus on the "shall never exceed" seasonal temperature standards. These are:

- 17.7 deg C for January, February, March, and December
- 33.7 deg C for April, May, June, July, August, September, October, and November
We can use lubridate and dplyr functions to assign the appropriate standard to each result: lubridate::month identifies the month in which each sample was collected, and dplyr::mutate creates a new column for the temperature standard. Then the results can be compared to the standard.
We can use TADA_Scatterplot again to visualize the data. Due to the seasonal variation in the standard and the 2010-2020 date range in our data set, we may want to consider other visualizations to review the number of results not meeting the standard. We can use TADA_FieldValuesPie on the MeetsStandard field we created to see how many samples met or did not meet the standard.
Creating a table also allows for easy filtering to identify the results that did not meet the standard and to review the years and seasons in which they occurred.
# Comparison to standard for temperature
Temp_Standard <- Filtered_data %>%
  dplyr::filter(TADA.ComparableDataIdentifier == "TEMPERATURE_NA_NA_DEG C") %>%
  dplyr::mutate(
    MonthForAnalysis = lubridate::month(ActivityStartDate),
    TempStandard = ifelse(MonthForAnalysis %in% c(1, 2, 3, 12), 17.7, 33.7),
    MeetsStandard = ifelse(TADA.ResultMeasureValue < TempStandard, "Yes", "No")
  )

# Create scatterplot for temperature
Temp_Scatter <- TADA_Scatterplot(Temp_Standard, id_cols = "TADA.ComparableDataIdentifier")
Temp_Scatter

# Create pie chart for temperature
Temp_Pie <- TADA_FieldValuesPie(Temp_Standard, field = "MeetsStandard")
Temp_Pie

# Create subset table for temperature
Temp_Table <- Temp_Standard %>%
  dplyr::select(
    MonitoringLocationIdentifier,
    OrganizationFormalName,
    ActivityStartDate,
    TADA.ResultMeasureValue,
    TempStandard,
    MeetsStandard
  )

DT::datatable(Temp_Table, fillContainer = TRUE)
We may also want to use TADA functions for exploration and visualization of characteristics without water quality standards. For this example, we will use "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L".
To better understand the distribution of results, we can use the TADA functions TADA_Histogram and TADA_Boxplot. TADA_Histogram can be useful for identifying the overall shape of the data.
# Filter dataframe to comparable data identifier of interest
Nitrogen_data <- dplyr::filter(
  Filtered_data,
  TADA.ComparableDataIdentifier == "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L"
)

# Generate a histogram
Nitrogen_Histogram <- TADA_Histogram(Nitrogen_data, id_cols = "TADA.ComparableDataIdentifier")

# View histogram
Nitrogen_Histogram
TADA_Boxplot can be useful for identifying skewness and percentiles.
Nitrogen_Boxplot <- TADA_Boxplot(Nitrogen_data, id_cols = "TADA.ComparableDataIdentifier")
Nitrogen_Boxplot
Filtering the results of TADA_Stats for only "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L" and selecting a smaller subset of columns, or creating a single-characteristic scatterplot, may also provide useful information.
# Create table with nitrogen stats
Nitrogen_stats <- Filtered_data_stats %>%
  dplyr::filter(TADA.ComparableDataIdentifier == "TOTAL NITROGEN, MIXED FORMS_UNFILTERED_AS N_MG/L") %>%
  dplyr::select(Location_Count, Measurement_Count, Min, Max, Mean)

DT::datatable(Nitrogen_stats, fillContainer = TRUE)

# Create nitrogen scatterplot
Nitrogen_Scatterplot <- TADA_Scatterplot(Nitrogen_data, id_cols = "TADA.ComparableDataIdentifier")
Nitrogen_Scatterplot
Major benefits of this type of workflow are that it is reproducible and the decisions at each step are well documented. This means it is easy to go back and review every step, understand the decisions that were made, make changes as necessary, and run the analysis again. If someone asks "How were data filtered prior to analysis?" or "How did you identify additional Monitoring Locations within the Assessment Unit?", you can refer to the code and function documentation to provide answers.
For example, we could change the Assessment Unit of interest, modify the relevant code chunks (potentially including the WQP query, if the new Assessment Unit is in a different HUC or state), and repeat the same analysis for the same characteristics at a different location. We could also modify the harmonization reference to group additional comparable data (if appropriate), or add a step to run TADA_ConvertResultUnits an additional time outside of TADA_AutoClean if we wanted to convert temperature results to deg F for analysis.
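As a quick manual alternative outside of TADA, the converted values could also be computed directly with dplyr (a sketch; the Temp_DegF column name is ours, not a TADA convention):

# Hypothetical manual conversion of temperature results to deg F
Temp_F <- Temp_Standard %>%
  dplyr::mutate(Temp_DegF = TADA.ResultMeasureValue * 9 / 5 + 32)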
Or we could change or add characteristics to our analysis. We could even write custom functions, or use functions from other packages, to handle more complex standards, evaluate trends, or answer other questions using a TADA dataframe.
The code chunk below displays the elapsed time it took to run the code chunks and create this document, providing an example of how incorporating TADA in your workflow can increase efficiency.
end.time <- Sys.time()
end.time - start.time