knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The RWDataPlyr package provides an interface to read, aggregate, and summarize data from one or more RiverWare^TM^ simulations and then work with those data in a dplyr pipeline. As RiverWare (and RiverSMART) produce data in multiple formats, and these data can be combined using RiverSMART in multiple ways, there are several different workflows that may make sense depending on the goals of the analysis. This vignette provides details on the different workflows that are covered by v0.6.0 of RWDataPlyr.

As of version 0.6.0, the package assumes the following:

  • The data are monthly or annual timestep data.
  • Monthly data begin in January and end in December, i.e., full calendar years of data are provided.

Note that only the summary and aggregation functions include these implicit assumptions, i.e., reading rdfs should work regardless of timestep size, start timestep, and end timestep. However, only data conforming to these standards have been tested.

Background

RiverWare^TM^ is a generalized river system modeling tool that represents physical processes and basin features, while allowing for complex reservoir operation policies to be expressed in rule-based logic. RiverWare is widely used across the U.S. by the Bureau of Reclamation, the Tennessee Valley Authority, and the U.S. Army Corps of Engineers to assist in their water management missions, and has applications worldwide, e.g., the Nile Basin.

One of the strengths of RiverWare is its "multiple run" mode, where many unique realizations, or traces, of hypothetical operations can be simulated with different input assumptions, e.g., hydrology. Multiple traces can be grouped together to form a single scenario, and it is often desirable to compare multiple scenarios together to understand how changes in assumptions affect model results. The RiverSMART tool is a wrapper to RiverWare (and other tools) that helps in the simulation and analysis of multiple scenarios.

RiverWare stores important data in "slots" on "objects" in the RiverWare model file. Any subset of these data can be output and saved to ASCII text files known as RiverWare data files (with a .rdf extension), to csv files, and/or to NetCDF files. When multiple scenarios are run using RiverSMART, these output files are saved in unique scenario folders. RiverSMART can also combine csv output from multiple scenarios into a single csv file to expedite the analysis. This setup is summarized in figure \@ref(fig:rwFlowChart).

knitr::include_graphics("RiverWareFlowChart.png")

When using RiverSMART to simulate multiple scenarios, each with hundreds or thousands of traces of data, it is increasingly easy to generate large amounts of data from RiverWare simulations. This package provides a tool to read these data into R, manipulate and summarize the data, and aggregate multiple scenarios' data together.

As there are multiple ways RiverWare and RiverSMART can generate the data, there are also different workflows that may expedite the analysis process. This vignette documents these different workflows.

RWDataPlyr Workflows

RiverWare and RiverSMART can create data (rdf, csv, nc) for one or more scenarios, and depending on the analysis different workflows will provide advantages to the user. There are currently three different workflows supported by RWDataPlyr that are presented in the following sections.

  1. Reading and manipulating a single scenario
    • Fast
    • Best for inspecting a single slot
    • If comparing scenarios, must manually repeat for each scenario
    • Relies on read_rdf() and read_rw_csv()
  2. Summarizing multiple slots of data from a single scenario
    • Repeatable; allows user to process many slots at once
    • Best for producing "polished" analyses of a single scenario
    • Relies on rdf_aggregate() and a user-specified rwd_agg object
  3. Aggregating and summarizing many scenarios
    • Repeatable; allows user to process many slots for many scenarios at once
    • Repeats summary of a single scenario on multiple scenarios and combines results together
    • Relies on rw_scen_aggregate() and a user-specified rwd_agg object

As of v0.6.0, RWDataPlyr can only read rdf and csv files, and the summary and aggregation can only be performed on rdf files.

Reading and Manipulating a Single Scenario

Both rdf and csv files can be read into R using read_rdf() and read_rw_csv().

read_rdf() returns an rdf object; to be useful, the rdf object should either be converted to a tbl_df using rdf_to_rwtbl2(), or individual slot data can be extracted using rdf_get_slot(). rdf_get_slot() returns a timestep-by-trace matrix for a single slot. Several summary functions are available for a single slot: rwslot_annual_sum(), rwslot_annual_min(), and rwslot_annual_max(). A typical workflow to obtain the minimum annual Mead pool elevation from the included sample data might be:

library(RWDataPlyr)
suppressPackageStartupMessages(library(dplyr))
rdf <- read_rdf(system.file(
  "extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/KeySlots.rdf", 
  package = "RWDataPlyr"
))

# ensure the slot you want is in the rdf:
rdf_slot_names(rdf)

# then get the minimum annual Mead pool elevation for all 5 years and 4 traces
rdf %>% 
  rdf_get_slot("Mead.Pool Elevation") %>%
  rwslot_annual_min()
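
Alternatively, the entire rdf can be converted to a tbl_df with rdf_to_rwtbl2() and manipulated with dplyr directly. A minimal sketch, assuming the returned table includes ObjectSlot, Year, TraceNumber, and Value columns (mirroring the csv workflow below):

# same summary as above, but starting from the full rwtbl
# (assumes the rwtbl includes ObjectSlot, Year, TraceNumber, and Value columns)
rdf %>%
  rdf_to_rwtbl2() %>%
  filter(ObjectSlot == "Mead.Pool Elevation") %>%
  group_by(Year, TraceNumber) %>%
  summarise(Value = min(Value))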

read_rw_csv() returns a tbl_df that can easily be manipulated using the dplyr framework. There are no functions analogous to rdf_get_slot() or rwslot_annual_*(), but dplyr::filter() and dplyr::summarise() work well instead. The following obtains the same minimum annual Mead pool elevation data from the example csv file, with the added benefit of retaining the years and trace numbers (note that the output is converted to a data.frame to avoid the rounding in the current tibble print method):

library(tidyr)
read_rw_csv(system.file(
  "extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/KeySlots.csv",
  package = "RWDataPlyr"
)) %>%
  filter(ObjectSlot == "Mead.Pool Elevation") %>% 
  group_by(Year, TraceNumber) %>% 
  summarise(Value = min(Value)) %>%
  spread(TraceNumber, Value) %>%
  as.data.frame()

The fields that are stored in csv files are controlled by the user in RiverWare. Depending on which fields the user specifies in RiverWare and the keep_cols options provided by the user, read_rw_csv(x) is approximately equal to rdf_to_rwtbl2(read_rdf(y), keep_cols = TRUE) if x is a csv file and y is an rdf file that contain the same RiverWare slots.
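
A rough check of that equivalence with the sample files used above (both contain the same slots); the exact set of columns returned depends on the fields written to the csv:

csv_tbl <- read_rw_csv(system.file(
  "extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/KeySlots.csv",
  package = "RWDataPlyr"
))

rdf_tbl <- rdf_to_rwtbl2(
  read_rdf(system.file(
    "extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/KeySlots.rdf",
    package = "RWDataPlyr"
  )),
  keep_cols = TRUE
)

# columns present in one table but not the other
setdiff(names(csv_tbl), names(rdf_tbl))
setdiff(names(rdf_tbl), names(csv_tbl))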

Summarizing Multiple Slots of Data From a Single Scenario

To summarize and process multiple slots of data from a single scenario at once, the user can define a RiverWare data aggregator object (rwd_agg) and provide that to rdf_aggregate(), which returns a tbl_df of the summarized data. Slots can be from different rdf files, but it is up to the user to know which slots come from which rdf file.

As of v0.6.0, the aggregation and summarization will only work for rdf files.

rwd_agg

rwd_agg objects are data.frames with specific column names and expected values in those columns. The expected column names are file, slot, period, summary, eval, t_s, and variable, in that order. It is up to the user to specify which rdf file (file) contains each slot (slot). In the general case, a summary function (summary) is applied to a subset (period) of the slot's data; the result can then be compared (eval) to a threshold or scaled by a factor (t_s), and is saved as the variable.

rwd_agg objects can be specified in three ways:

  1. By providing a data.frame with the expected column names (see below).
  2. By providing a vector of rdf file names using the rdfs parameter. If specified in this manner, all of the slots in each rdf file will be read into a rwtbl, but they will not be aggregated/summarized (see the sketch after this list).
  3. By reading in a csv file, with the expected column names (see below), using read_rwd_agg(). The user may first use rwd_agg_template() to create a blank template, to help ensure it has the correct column names.
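
For example, a minimal sketch of option 2, which reads every slot in the two sample rdf files without summarizing them:

# read all slots in these rdf files into a rwtbl without summarizing them
rwa_all <- rwd_agg(rdfs = c("KeySlots.rdf", "SystemConditions.rdf"))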

In order to specify how each slot should be aggregated, each column should include specific keywords; these keywords are described in the rwd_agg() help file and illustrated in the example below.

Custom Period Functions

The period in a rwd_agg object specifies the time period that a slot will be summarized over. It can be a function provided by RWDataPlyr, e.g., wy() for summarizing over a water year, or a user-defined custom function that allows for more complex time aggregations. As the examples below show, the custom function should return a list with three elements:

  • fun: a function that modifies the rwtbl before it is grouped, e.g., to re-assign years
  • filter_months: the months to keep in the summary
  • group_tbl: the column(s) to group by, e.g., "Year"

For example, a user might want to aggregate all summer months together (June - August). A relatively simple custom function for this would be:

summer <- function() 
{
  list(
    fun = function(x) x, 
    filter_months = c("June", "July", "August"), 
    group_tbl = c("Year")
  )
}
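
The function name is then supplied (as a string) in the period column of a rwd_agg row. A hedged sketch, assuming custom period functions are referenced by name just like the built-in period keywords; the variable name is hypothetical:

# hypothetical row that sums Powell's June - August release using summer()
rwa_summer <- rwd_agg(data.frame(
  file = "KeySlots.rdf",
  slot = "Powell.Outflow",
  period = "summer",
  summary = "sum",
  eval = NA,
  t_s = NA,
  variable = "powellSummerRel",
  stringsAsFactors = FALSE
))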

The fun in the summer example simply returns the original data, because no special logic is needed to group different years together. A more complex example is summarizing all winter months together (December - February). In this case, December of one year should be grouped with January and February of the following year for that winter's summary. The djf() custom function would be:

djf <- function()
{
  djf_convert <- function(rwtbl)
  {
    rwtbl %>%
      dplyr::mutate_at(
        "Timestep", 
        .funs = list("Year" = zoo::as.yearmon)
      ) %>%
      # can use the ym_get_wateryear b/c djf are all in same water year
      dplyr::mutate_at("Year", list(ym_get_wateryear)) 
  }

  list(
    fun = djf_convert, 
    filter_months = month.name[c(12, 1, 2)], 
    group_tbl = c("Year")
  )

}

In this example, fun is now a function that takes the original data (and expects a Timestep column). The timesteps are converted to yearmon objects, and the years are then re-assigned based on their water year (since December - February fall in the same water year).
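
A quick check of this re-assignment, assuming ym_get_wateryear() maps a yearmon to its October - September water year and is vectorized (as its use in mutate_at() above implies):

# December 2019, January 2020, and February 2020 should all map to
# water year 2020
ym_get_wateryear(zoo::as.yearmon(as.Date(c(
  "2019-12-01", "2020-01-01", "2020-02-01"
))))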

rdf_aggregate() Example

To summarize multiple slots at once, the first step is to define the rwd_agg object. In this example, the following summaries are desired:

  • whether Mead's minimum calendar-year pool elevation is at or below 1,060 feet (meadLt1060)
  • Mead's end-of-calendar-year pool elevation (meadEocy)
  • Powell's total December - February release, scaled by 0.001 (powellDjfRel)
  • Powell's July release (powellJulRel)
  • Powell's total water-year release, scaled by 0.000001 (powellWyRel)
  • the Lower Basin shortage conditions, kept as-is with no summary (short)

For this, we want to create the following table:

knitr::kable(data.frame(
  file = c(rep("KeySlots.rdf", 5), "SystemConditions.rdf"),
  slot = c(
    rep("Mead.Pool Elevation", 2), 
    rep("Powell.Outflow", 3), 
    "SummaryOutputData.LBShortageConditions"
  ), 
  period = c("cy", "eocy", "djf", "July", "wy", "asis"),
  summary = c("min", NA, "sum", NA, "sum", NA),
  eval = c("<=", rep(NA, 5)),
  t_s = formatC(
    c(1060, NA, 0.001, NA, 0.000001, NA), 
    big.mark = ",", 
    drop0trailing = TRUE
  ),
  variable = c("meadLt1060", "meadEocy", "powellDjfRel", "powellJulRel", 
               "powellWyRel", "short")
))

Oftentimes, it will be easier to define the table in Excel and then read it in as a csv using read_rwd_agg(), but the example below defines it directly in R.

rwa1 <- rwd_agg(data.frame(
  file = c(rep("KeySlots.rdf", 5), "SystemConditions.rdf"),
  slot = c(
    rep("Mead.Pool Elevation", 2), 
    rep("Powell.Outflow", 3), 
    "SummaryOutputData.LBShortageConditions"
  ), 
  period = c("cy", "eocy", "djf", "July", "wy", "asis"),
  summary = c("min", NA, "sum", NA, "sum", NA),
  eval = c("<=", rep(NA, 5)),
  t_s = c(1060, NA, 0.001, NA, 0.000001, NA),
  variable = c("meadLt1060", "meadEocy", "powellDjfRel", "powellJulRel", 
               "powellWyRel", "short"),
  stringsAsFactors = FALSE
))

After the rwd_agg object is specified, it is passed to rdf_aggregate() along with a few parameters specifying where the data are stored, and a tbl_df of the summarized data is returned:

rdf_aggregate(
  rwa1,
  rdf_dir = system.file(
    "extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/",
    package = "RWDataPlyr"
  )
)

To keep all of the columns (attribute info), specify keep_cols = TRUE:

rdf_aggregate(
  rwa1,
  rdf_dir = system.file(
    "extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/",
    package = "RWDataPlyr"
  ),
  keep_cols = TRUE
)

Aggregating and Summarizing Many Scenarios

rw_scen_aggregate() will aggregate and summarize multiple scenarios, essentially calling rdf_aggregate() for each scenario. Similar to rdf_aggregate(), it relies on a user-specified rwd_agg object to determine how to summarize and process the scenarios.

rw_scen_aggregate() is set up to work with a directory structure typical of a RiverSMART study. RiverSMART manages the output of multiple scenarios by creating unique scenario folders in a "Scenario" directory. The scen_dir argument should point to this scenario directory in a typical study setup. For example, if the file system is set up as shown in figure \@ref(fig:dirStruct), scen_dir should point to "C:/user/crss/CRSS.Jan2017/Scenario" (or the correct relative path to the same location).

knitr::include_graphics("dir_structure.png")

rw_scen_aggregate() also needs to know the scenario folders; it allows the user to specify scenario names that differ from the folder names so they are easier to work with in R. In this example, the scenarios argument would be c("ISM1988_2014,2007Dems,IG,Most", "ISM1988_2014,2007Dems,IG,2002"). The scenario names can be specified either through the scen_names argument, or by naming the scenarios vector. The following will obtain the same results as defined by the earlier rwd_agg object for both of these scenarios.

my_scens <- c("ISM1988_2014,2007Dems,IG,Most", "ISM1988_2014,2007Dems,IG,2002")
names(my_scens) <- c("most", "yr2002")
scen_res <- rw_scen_aggregate(
  my_scens,
  agg = rwa1,
  scen_dir = system.file("extdata/Scenario", package = "RWDataPlyr")
)
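
The combined tbl_df can then be dropped straight into a dplyr pipeline to compare scenarios. A short sketch, assuming the returned table includes Scenario, Year, Variable, and Value columns, and that meadLt1060 stores the eval comparison as 0/1 values:

# fraction of traces in which Mead's minimum annual pool elevation is at
# or below 1,060 feet, by scenario and year
scen_res %>%
  filter(Variable == "meadLt1060") %>%
  group_by(Scenario, Year) %>%
  summarise(Value = mean(Value))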

To compare results from multiple studies, i.e., the different top level CRSS.JanNNNN folders in figure \@ref(fig:dirStruct), scen_dir should point to "C:/user/crss" and scenarios should be assigned to c("CRSS.Jan2017/Scenario/ISM1988_2014,2007Dems,IG,Most", "CRSS.Jan2018/Scenario/ISM1988_2014,2007Dems,IG,Most").
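
A sketch of that call; the paths mirror figure \@ref(fig:dirStruct) and are placeholders that will need to be adjusted to the local file system:

# hypothetical multi-study comparison (placeholder paths)
multi_study <- rw_scen_aggregate(
  c(
    jan2017 = "CRSS.Jan2017/Scenario/ISM1988_2014,2007Dems,IG,Most",
    jan2018 = "CRSS.Jan2018/Scenario/ISM1988_2014,2007Dems,IG,Most"
  ),
  agg = rwa1,
  scen_dir = "C:/user/crss"
)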

If scenarios has a length of one, rw_scen_aggregate() will return the same results as rdf_aggregate(), provided the scenario parameter in rdf_aggregate() is specified and matches the scen_names parameter in rw_scen_aggregate().
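
A sketch of that equivalence using the sample data; both the scenario argument to rdf_aggregate() and the named scenarios vector come from the descriptions above:

# these two calls should return the same tibble
from_rdf <- rdf_aggregate(
  rwa1,
  rdf_dir = system.file(
    "extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/",
    package = "RWDataPlyr"
  ),
  scenario = "most"
)

from_scen <- rw_scen_aggregate(
  c(most = "ISM1988_2014,2007Dems,IG,Most"),
  agg = rwa1,
  scen_dir = system.file("extdata/Scenario", package = "RWDataPlyr")
)

all.equal(from_rdf, from_scen)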

Legacy Workflow

getDataForAllScens() is the legacy function from RWDataPlyr v0.5.0 and earlier that preceded rw_scen_aggregate() and rdf_aggregate(). It works similarly to rw_scen_aggregate() and relies on a slot_agg_list, which is analogous to the rwd_agg object, though it only allows certain pre-specified aggregation methods. The help pages include information on these functions, though rw_scen_aggregate() is preferred as of v0.6.0.

getDataForAllScens() treats NaNs as 0s in its aggregation methods. The nans_are parameter defaults to "0" in rdf_aggregate() and rw_scen_aggregate() and thus matches the way getDataForAllScens() treats NaNs.

Utilizing RWDataPlyr within RiverSMART

RiverSMART includes a plugin that allows R scripts to be executed from RiverSMART for post-processing scenario(s). The exact configuration of the script and the RiverSMART R plugin is beyond the scope of this vignette, but a user can pass scenario folders and names to rw_scen_aggregate() from within RiverSMART, thus streamlining the post-processing steps.


