```r
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(comment = "")
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(collapse = TRUE)
# bug? https://community.rstudio.com/t/glimpse-output-encoding-issue-with-pkgdown/117526/4
options(cli.num_colors = 1)
library(tidyr)
library(dplyr)
library(purrr)
library(hera)
library(darleq3)
library(rict)
set.seed(42)
```
```r
# directly adding css to output html without ruining css style
# https://stackoverflow.com/questions/29291633/adding-custom-css-tags-to-an-rmarkdown-html-document
cat("
<style>
img {
  border: 0px;
  outline: 0;
}
</style>
")
```
WORK IN PROGRESS - Drafting
WARNING: Blue sky thinking ahead
Keywords: collection, modelling, prediction, classification, forecast, assessment, scenario
TL;DR - To get a flavour of running multiple classification tools using a shared interface and data structures, check out the hera demo website GUI.
We propose regulatory tools share a common set of design principles, interfaces and data structures.
Specifically, we propose an official collection of R packages designed to provide a collaborative workflow for building and using assessment tools. In turn, these packages will be unified into a single package called 'hera', which provides a common interface for running regulatory assessments. We expect this process will facilitate code re-use, faster integration and knowledge exchange between method developers and practitioners.
UKTAG has guided the development of an impressive range of classification tools. This has involved many developers, researchers and experts dedicating their time and effort to creating tools to better understand pressures on the environment. We are confident that there are many future opportunities for collaboration and tool development in response to changing environmental pressures and improving scientific understanding. As access to modelling tools becomes more routine, we expect a proliferation of models and indices in the years ahead: for instance, new tools for diagnosing pressures, updates to existing tools and catchment-scale planning. To better understand the environment by effectively combining multiple models and tools, we propose they share a common design philosophy to aid integration and collaboration.
The aims of a shared design philosophy for regulatory classification R packages include:
In the next 10 years...
We assume it is likely that our aquatic ecological models will be subsumed into larger-scale environmental and climate models. These 'total environment' models may, for instance, use climate change models to forecast impacts on invertebrates, water-use models to predict impacts on fish, or spatial planning tools to predict impacts on nutrient levels. The outputs will be used across regulatory reporting for RBMP, flood management, biodiversity improvements, carbon sequestering etc., allowing multi-discipline assessment of impacts and trade-offs for each planning scenario and proposed measure, and ensuring well-informed decision-making.
All ecological data, along with supporting data such as chemistry, climate, meteorological, geological and satellite imagery, will be freely and easily accessible. We assume agencies will upload all data, such as fish counter data, plant DNA or aerial imagery, into a 'lake' of environmental data.
To take a step towards this vision, the underlying design of models and tools must be modular and easy to connect and integrate in a variety of ways.
To aid collaboration and respond to changing environmental pressures, we propose creating a joint collection of packages to share understanding of the environment, while providing the software infrastructure to lighten the burden of the more mundane tasks involved in maintaining and deploying new models and interfaces.
This proposal is influenced by work within the climate change research community, such as the Climate Modelling and Diagnostics Toolkit and the Met Office's Unified Model approach, as well as software best practices such as the on-going work of rOpenSci in the R community and research projects such as the Virtual Observatory.
Currently we have a number of classification tool R packages shared on GitHub:
We propose these tools become part of an official collection of packages and that we work towards making them interoperable via the `hera` package.
Here we introduce a prototype R package called `hera`. The key idea is that `hera` provides a common interface for existing WFD packages and future developments. This is achieved through a set of shared functions required to run and report classification. It builds on best-practice ideas for running many models simultaneously in R, while keeping the input and output data formats simple and homogeneous. We explain each function in detail below. In summary, each step represents a function within the `hera` package, allowing code and rules to be re-used between existing and future tools.
Steps
1. Validation
2. Indices/metrics
3. Assessment
4. Reporting
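As a rough sketch, the step functions chain together as follows (function names and behaviour are per this draft RFC and may change):

```r
# Rough sketch of the intended pipeline - names per this draft RFC
library(dplyr) # for the pipe
library(hera)

hera::demo_data %>%
  validation() %>% # step 1: return the passing data
  assess()         # steps 2-4: metrics, assessments and reporting output
```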
The examples below are illustrative of how this RFC could be implemented. Keep in mind these examples are not full, complete or accurate; the data structure, naming and details could all change. They are presented as a rough draft.
The `validation()` function returns the passing data and a list of warnings/fails.
```r
# install.packages("devtools")
# devtools::install_github("ecodata1/hera")
demo_data <- hera::demo_data
data <- validation(demo_data)
data[5, ]
```
Using the `assess()` function, we can calculate as required:
1. Indices/Metrics (summary stats)
2. Predictions (output from models)
3. Classification/Reports (assessments combining summary and prediction stats)
The 'assessment' stage includes any type of index, test or categorisation of our data. This could be WFD classification, compliance check, bathing water or discharge assessment. Broadly, any type of grading or assessment.
By default, the `assess()` function runs all the assessments possible based on the data provided.
```r
data %>%
  assess() %>%
  select(sample_id, question, response) %>%
  slice_sample(n = 4)
```
Alternatively, we can select a list of specific assessments from the `catalogue` to run as required.
```r
catalogue %>%
  filter(assessment %in% c("Macroinvertebrate Metrics", "DARLEQ3")) %>%
  select(assessment)
```
```r
data %>%
  assess(name = c("Macroinvertebrate Metrics", "DARLEQ3")) %>%
  select(sample_id, question, response) %>%
  slice_sample(n = 4)
```
The `catalogue` can contain complex reporting assessments that compile, group and present one or more assessment outputs, for instance running and presenting a number of WFD assessments, grouped by water body and hierarchy for a River Basin Management Plan.
```r
# TODO!
assess(demo_data, 'rbmp_report')
```
As well as classification of water quality, we additionally need to diagnose potential pressures. Again, we can add an assessment to the `catalogue` for diagnostic reporting. We use the `assess()` function to build a custom report to help diagnose potential pressures.
```r
# TODO!
assess(demo_data, 'diagnose_wfd_pressures') %>%
  select(sample_id, question, response) %>%
  head()
```
To facilitate general data exploration, an `aggregate()` helper function can group the outputs by season, year, multi-year or water body as required. This provides a standard way to aggregate data within the `hera` toolset.
```r
aggregated <- aggregate(demo_data, c("year", "season", "waterbody"))
head(aggregated)
```
A `compare()` helper function allows generic comparison between years, locations or samples for general data analysis.
```r
# TODO!
compare_report <- compare(new_data, old_data)
compare_report
```
Another example compares two samples (perhaps upstream and downstream):
```r
# TODO!
compare_report <- compare(site_one, site_two)
compare_report
```
A number of forecasting or scenario tools could be incorporated for either projecting current trends or assessing the impact of proposed measures.
```r
# TODO!
assess(demo_data, trends, scenario = "wfd_forecast") %>% head()
assess(demo_data, measures, scenario = "measure_impact") %>% head()
```
For the most part we don't expect users to go through each of these steps. But for developers and researchers it is useful to think about classification within this framework of discrete steps. The majority of end users, agency staff or consultants, can open the GUI with `hera_app()` or visit the website directly.
Furthermore, agencies can integrate these functions into their systems using web services. Please see the demo web service and API documentation for using OpenCPU-hosted packages.
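For example, here is a minimal sketch of calling `assess()` over HTTP using the standard OpenCPU API pattern (`POST /ocpu/library/{package}/R/{function}/json`); the server URL is a hypothetical placeholder:

```r
# Sketch only: call an OpenCPU server hosting hera from R.
# "https://example.org" is a hypothetical placeholder URL.
library(httr)
library(jsonlite)

response <- POST(
  "https://example.org/ocpu/library/hera/R/assess/json",
  body = list(data = hera::demo_data),
  encode = "json" # arguments are posted as a JSON object
)

# Parse the JSON assessment results back into a data frame
results <- fromJSON(content(response, as = "text"))
head(results)
```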
Models rely on observed sampling for training. Samples come in a range of forms: points, transects, images, grabs, DNA etc. But modelling in general is based on being able to predict what we expect to find from whatever sampling technique we deploy. The sample is the fundamental observation which we compare against our prediction. Samples are discrete, either observed instantaneously or perhaps over a few minutes or hours (where dynamic changes are not significant).
Multiple samples can be aggregated to smooth variance but the sample still remains the fundamental building block. The sample could be a single pixel from an aerial image or a salmon moving through a fish counter. We still make predictions of what we expect this sample to be like even if the true picture only emerges after several samples are aggregated or compared.
Therefore, all our data share similarities: they consist of samples and observations. Additionally, each sample has predictor variables that allow us to predict expected reference values or outcomes.
There are a number of data dictionaries, metadata standards and semantic data definitions across disciplines and organisations. It is unclear exactly how we should define our input and output datasets. We therefore propose using common definitions and standards where possible, in a similar way to the climate modelling community's use of the Climate and Forecast Standard Names.
The European Environment Agency (EEA) has produced a data dictionary for reporting. However, this is mainly for high-level reporting; in particular, taxonomic results are not exchanged using this data structure. Nonetheless, we use some aspects of this standard within hera to aid onward reporting to the EEA.
A small demo dataset, `demo_data`, contains diatom, macrophyte and invertebrate quality elements.
```r
hera::demo_data %>%
  select(location_description, date_taken, sample_id, quality_element) %>%
  group_by(quality_element) %>%
  slice_sample() %>%
  head(5)
```
First of all, we have 'book-keeping' variables. These allow us to reference data associated with particular samples, locations or WFD methods, and allow results to be aggregated at different levels.
```r
hera::demo_data %>%
  select(location_id, sample_id, date_taken, quality_element) %>%
  head(5)
```
All data passed into hera must have these four variables. For ad hoc reporting by consultancies, students and others who don't routinely record unique sample ids, a `sample_id` is generated if `date_taken` and `location_id` are provided.
These variables are the minimum required, but in practice `water_body_id` may be required for aggregation, or simply `location_description`, `NGR` etc. to help reference sites more easily. There is no restriction on the number of extra columns, and these extra columns will be appended to outputs.
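As an illustration of how a surrogate `sample_id` might be generated (a sketch only, not necessarily the actual `hera` implementation), combining `location_id` and `date_taken` would suffice:

```r
library(dplyr)
library(tibble)

# Hypothetical ad hoc data with no unique sample ids recorded
ad_hoc_data <- tibble(
  location_id = c("8175", "8175", "57432"),
  date_taken  = as.Date(c("2021-05-10", "2021-06-12", "2021-05-10"))
)

# Derive a surrogate sample_id from location and date
ad_hoc_data %>%
  mutate(sample_id = paste(location_id, date_taken, sep = "-"))
```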
An observation consists of two variables: `question` and `response`. The `question` variable identifies what is being determined, such as `alkalinity` or `depth`. The `response` is the value observed or recorded for that question.
Below is an example of diatom records, invertebrate data and river flow in a shared input format.
```r
demo_data %>%
  select(question, response)
```
In theory, this is all that is required. However, for ease of interaction with existing datasets and ad hoc data, a third column, `label`, is useful due to the historic way taxon data is usually stored.
```r
demo_data %>%
  select(question, response, label) %>%
  slice_sample(n = 5)
```
Predictive variables such as `temperature`, `altitude` and `slope` are added as additional columns. There is a trade-off here, as predictor variables are repeated for each row in the dataset, increasing its size. However, this does make data analysis straightforward, and the repeated data can easily be compressed if size becomes an issue, for instance as nested JSON or nested dataframes in R (see the sketch after the example below).
```r
demo_data %>%
  select(`mean_alkalinity`, `grid_reference`, `slope`)
```
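As a sketch of the compression mentioned above, the repeated per-sample columns collapse naturally when the observations are nested (the column names here are those of the demo data):

```r
library(dplyr)
library(tidyr)

# Nest the question/response/label observations under each sample, leaving
# one row per sample with its book-keeping and predictor columns.
nested <- hera::demo_data %>%
  nest(observations = c(question, response, label))
head(nested)
```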
The outputs are all presented in a consistent format, making results from different models instantly comparable and portable.
For students and consultancies requiring ad hoc usage, templates and documentation for preparing data will be provided.
For Agencies, data queries can be written to prepare outputs in the correct format.
For instance, here is a prototype function to pull data from the Environment Agency's data.gov.uk web service and convert it to the required `hera` input format.
```r
environment_agency_data <- hera:::get_data(location_id = c(43378, 92751))
tibble(environment_agency_data)
```
We can then run this EA data through `hera`:
```r
report <- hera::assess(environment_agency_data)
report %>%
  select(date_taken, question, response) %>%
  unique()
```
The following data tables are shared through `hera`:
```r
hera::catalogue
```
This RFC mainly looks at a shared design for the inputs and outputs of classification tools.
This framework does, however, encourage shared principles in thinking about the modelling approach which drives the classification method. We see no need to prescribe a modelling program or software: researchers can download the reference and predictor data required and use any software they desire. Ultimately, the model needs to be callable from R, so either the model is written in R or in a language which can be called from R (Python, Fortran, C++ etc.).
Alternatively, if researchers can't provide an API for R to call, the recommendation is to use R, which integrates more directly into the pipeline.
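For instance, a model fitted and saved in Python could be called from R via the reticulate package; a minimal sketch, assuming a hypothetical pre-trained scikit-learn model saved as `model.pkl`:

```r
# Sketch only: call a Python model from R using reticulate.
# "model.pkl" is a hypothetical pre-trained model file.
library(reticulate)

joblib <- import("joblib")        # Python library for model persistence
model <- joblib$load("model.pkl") # load the pre-trained model

# Predict expected values from sample-level predictors
predictors <- data.frame(alkalinity = 75, altitude = 120, slope = 2.5)
model$predict(predictors)
```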
Once modelling is completed, the model object is saved and deployed. Any existing or future data collected using the platform will be run through the model at the sample level.
Researchers can then build tools to display and aggregate the sample-level results as required (water body, year, catchment etc.). This makes it possible to share techniques for producing confidence of class, assessments of data suitability, adjustment factors etc.
As new regulatory developments and updated requirements are identified, the lead contacts from the agencies and method developers are 'on-boarded' to demonstrate the design principles and collaborative framework of packages. Where skill development is required, further training can be provided, or additional external or internal support from the agency commissioning the work.
A workshop for lead data experts / R coders from each agency delivers institutional knowledge on how internally developed tools will fit with the shared design philosophy, as well as setting expectations for collaboration with external researchers.
Hera allows multiple ecological elements to be assessed through the same interface. But it is not just the interface that is shared; other areas of the infrastructure are shared, including:
All UKTAG sub-groups and their nominated leads in the devolved agencies would contribute new method developments and tools to the shared collection of packages. Where tools are agency specific, these could also make use of the platform if required.
As agencies commission new tools to be developed, researchers can upload their predictive variables, reference data and models into a central repo for easier collaboration.
Status Quo - Future development takes the form of bespoke, custom code in self-contained R packages or Excel spreadsheets, with no code re-use, collaboration or consistency across tools/metrics/assessments.
Other computational document standards - Python, Julia or JS based Jupyter or Quarto documents. Currently, most of our code is either Excel or R based, and we wish to avoid re-writing R code in other languages. But this doesn't preclude future developments using other languages or computational document standards.
WebAssembly - Compile our R, Python, Julia, C++ etc. into the widely supported WebAssembly language for fast use within a web browser (on- or offline). This is only at a prototype stage in R and Python, and many existing libraries cannot compile to WebAssembly. Possibly in future this may be an option, but current tools for integrating and combining languages provide broad enough consistency/compatibility even if a back-end server is required. Requiring a back-end server can impact offline or native usage in some situations, such as on mobile devices. For instance, R only supports macOS, Linux and Windows; R isn't officially supported to run locally/natively on Android or iOS. On these devices, R must run on a web server and requires a mobile/wifi signal to communicate with the device. Using WebAssembly in future may avoid the need for a web server on mobile/offline devices.
Demo input data format (truncated) and full list of column names
```
  DataHeldBy SiteCode Alternative.site.code Repeat.check SiteName
1         EA   525052                    NA            N    13809
2         EA   507441                    NA            Y    13810
3         EA   525051                    NA            N    34611
4         EA   122881                    NA            Y     5435
  Site.description ...
1            13809 ...
2            13810 ...
3            34611 ...
4             5435 ...

 [1] "DataHeldBy"                 "SiteCode"
 [3] "Alternative.site.code"      "Repeat.check"
 [5] "SiteName"                   "Site.description"
 [7] "Easting"                    "Northing"
 [9] "NGR"                        "SurveyDate"
[11] "WBId"                       "WBName"
[13] "NumberOfRuns"               "SurveyArea"
[15] "WetWidth"                   "Slope"
[17] "BarrierType"                "ImpassableBarriers"
[19] "Sense.check.passed."        "CatchmentAreaUpstream"
[21] "CatchmentDrainageDirection" "GeologyClass"
[23] "Altitude"                   "DistanceFromSource"
[25] "DistanceToSea"              "AnnualMeanFlow"
[27] "AlkalinityValue"            "TotalPValue"
[29] "DOCValue"                   "SuspendedSolidsValue"
[31] "IOH.hydrometric.area"       "HydrometricAreaNo"
[33] "LandUse.AgriculturalAreas"  "LandUse.ConiferousForests"
[35] "LandUse.Wetlands"           "Substrate.Small"
[37] "Substrate.Large"            "Substrate.Bedrock"
[39] "Salmon_fry.Run1Total"       "Salmon_fry.Run2Total"
[41] "Salmon_fry.Run3Total"       "Salmon_fry.Run4Total"
[43] "Salmon_parr.Run1Total"      "Salmon_parr.Run2Total"
[45] "Salmon_parr.Run3Total"      "Salmon_parr.Run4Total"
[47] "Trout_fry.Run1Total"        "Trout_fry.Run2Total"
[49] "Trout_fry.Run3Total"        "Trout_fry.Run4Total"
[51] "Trout_parr.Run1Total"       "Trout_parr.Run2Total"
[53] "Trout_parr.Run3Total"       "Trout_parr.Run4Total"
```
Output
```
  DataHeldBy SurveyDate SiteCode SiteName  WBId                   WBName ...
1         EA 03/09/2015   525052    13809 10675                          ...
2         EA 04/09/2015   507441    13810 10675 White Esk (u/s Rae Burn) ...
3         EA 03/09/2015   525051    34611 10676                          ...
4         EA 01/09/2015   122881     5435 10676            Garwald Water ...

 [1] "DataHeldBy"
 [2] "SurveyDate"
 [3] "SiteCode"
 [4] "SiteName"
 [5] "WBId"
 [6] "WBName"
 [7] "All species WB EQR Bad %"
 [8] "All species WB EQR Poor %"
 [9] "All species WB EQR Moderate %"
[10] "All species WB EQR Good %"
[11] "All species WB EQR High %"
[12] "All species WB EQR mean"
[13] "All species survey EQR Bad %"
[14] "All species survey EQR Poor %"
[15] "All species survey EQR Moderate %"
[16] "All species survey EQR Good %"
[17] "All species survey EQR High %"
[18] "All species survey EQR mean"
[19] "Salmon_fry WB EQR mean"
[20] "Salmon_fry survey EQR mean"
[21] "Salmon_fry observed total catch"
[22] "Salmon_fry probability present"
[23] "Salmon_fry expected total catch if present"
[24] "Salmon_fry expected total catch"
[25] "Salmon_parr WB EQR mean"
[26] "Salmon_parr survey EQR mean"
[27] "Salmon_parr observed total catch"
[28] "Salmon_parr probability present"
[29] "Salmon_parr expected total catch if present"
[30] "Salmon_parr expected total catch"
[31] "Trout_fry WB EQR mean"
[32] "Trout_fry survey EQR mean"
[33] "Trout_fry observed total catch"
[34] "Trout_fry probability present"
[35] "Trout_fry expected total catch if present"
[36] "Trout_fry expected total catch"
[37] "Trout_parr WB EQR mean"
[38] "Trout_parr survey EQR mean"
[39] "Trout_parr observed total catch"
[40] "Trout_parr probability present"
[41] "Trout_parr expected total catch if present"
[42] "Trout_parr expected total catch"
```
Input data for the DARLEQ3 tool is a list of data frames. Here's an example of the input data format (truncated) and the full list of column names:
```r
file <- system.file("extdata/DARLEQ2TestData.xlsx", package = "darleq3")
data <- read_DARLEQ(file, "Rivers TDI Test Data")
test <- data$diatom_data[1:4, 1:8]
test$... <- "..."
test
names(data$diatom_data)
names(data$header)
```
Output (list of data frames)
```r
fn <- system.file("extdata/DARLEQ2TestData.xlsx", package = "darleq3")
d <- read_DARLEQ(fn, "Rivers TDI Test Data")
results <- calc_Metric_EQR(d, metrics = c("TDI4", "TDI5LM"))
head(results$TDI5LM$EQR[, 9:13])
head(results$TDI5LM$Uncertainty[, 9:13])
head(results$TDI5LM$Metric)
head(results$TDI5LM$Job_Summary, 4)
```
Here's an example of the input data format (truncated) and the full list of column names:
```r
test <- rict::demo_observed_values[1:4, 1:8]
test$... <- "..."
test
names(demo_observed_values)
```
Output
```r
test <- rict(demo_ni_observed_values)
example <- head(test[1:6], 4)
example$... <- "..."
example
names(test)
```