prep_metabolism: Prepare StreamPULSE data for metabolism modeling
In streampulse/StreamPULSE: Run Stream Metabolism Models on StreamPULSE data

prep_metabolism

R Documentation

Prepare StreamPULSE data for metabolism modeling

Description

Formats the output of request_data for the stream metabolism model of choice. Filters flagged data and imputes missing data. Acquires or estimates additional variables if necessary. NOTE: support for modeling with BASE is currently in development. Please use streamMetabolizer in the meantime.

Usage

prep_metabolism(
  d,
  model = "streamMetabolizer",
  type = "bayes",
  interval = NA,
  rm_flagged = list("Bad Data", "Questionable"),
  fillgaps = "interpolation",
  maxhours = 3,
  zq_curve = list(sensor_height = NULL, Z = NULL, Q = NULL, a = NULL, b = NULL, fit =
    "power", ignore_oob_Z = TRUE, plot = TRUE),
  estimate_areal_depth = FALSE,
  estimate_PAR = TRUE,
  retrieve_air_pres = FALSE,
  ...
)

Arguments

`d`	the output of `request_data`, or a `list` of `data.frame`s so organized.
`model`	either 'streamMetabolizer' (the default) or 'BASE'. If 'BASE', `type` must be set to `'bayes'`.
`type`	either 'mle' or 'bayes'. If `model='BASE'`, `type` must be set to `'bayes'`.
`interval`	a string specifying the between-sample time interval to which the dataset should be coerced, or NA to determine it automatically. If not NA, must be of the form '<number> <unit>', as in '15 min'. Unit can be 'min' or 'hour'. Non-integer hours are tolerated, but minutes must be specified as integers. See details.
`rm_flagged`	a list containing any of 'Interesting', 'Questionable', and 'Bad Data'. Any data points flagged with these specified tags will be removed (replaced with NA), and then imputed according to `fillgaps`. If data for a selected site and timespan have been cleaned using https://data.streampulse.org/clean, it is a good idea to remove any data points flagged as "Questionable" or "Bad Data". Set this argument to 'none' to keep all flagged data points. Defaults to `list('Questionable', 'Bad Data')`.
`fillgaps`	a string specifying one of the imputation methods available to `imputeTS::na.seasplit`, namely: 'interpolation', 'locf', 'mean', 'random', 'kalman', or 'ma'. May also be 'none'. The imputation method, if specified, will be attempted after seasonal decomposition. Periodicity depends on the between-sample interval, and is determined programmatically (see details for `interval`). If the desired imputation method fails, which sometimes occurs when series consist largely of NAs, basic linear interpolation will be performed instead and the user will be notified. See `maxhours`.
`maxhours`	the maximum number of hours of consecutive NAs to impute.
`zq_curve`	a list containing specifications for a rating curve, used to estimate discharge from level or depth. Elements of this list may include any of the following: Z (a vector of level or depth data), Q (a vector of discharge data), a (the first parameter of an existing rating curve), b (the second parameter of an existing rating curve), sensor_height (the vertical distance between streambed and sensor, in meters), fit (the form of the rating curve to predict discharge from and, if Z and Q supplied, to fit), ignore_oob_Z (if there are depth or level readings that exceed the maximum measured Z value of the rating curve, whether to replace these with NA), and plot (whether to plot the fitted curve, if applicable, as well as predicted discharge). See details for more.
`estimate_areal_depth`	logical; Metabolism models expect that input depth time series represent depth averaged over an area delineated by the width of the stream and the approximate O2 turnover distance. Set to TRUE if you'd like to estimate this average depth, or FALSE if your depth data already approximate it. For example, if your depth data represent average depth over the aforementioned area already, or average depth for a stream cross-section, you'd probably want to use FALSE. If your depth data represent only depth-at-sensor, or worse, level-at-sensor, you might be better off with TRUE, assuming you have discharge data to estimate areal depth from, or a rating curve by which to generate discharge data.
`estimate_PAR`	logical; should Photosynthetically Active Radiation (PAR) be estimated from geographic coordinates and time? Only use light data if you're confident that your light sensors accurately represent light reaching the upstream area defined by O2 turnover distance.
`retrieve_air_pres`	logical; if some AirPres_kPa values are missing, should they be retrieved from NCDC (NOAA)? Retrieval will happen automatically if air pressure data are required and entirely missing.
`...`	additional arguments passed to `imputeTS::na.seasplit`.
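
The interaction between `fillgaps` and `maxhours` can be illustrated with a minimal base R sketch. This is not the package's implementation (which relies on `imputeTS::na.seasplit`); `fill_short_gaps` and its `interval_min` argument are hypothetical names used only for illustration. It shows the general idea of imputing only those runs of consecutive NAs that span no more than `maxhours`, using plain linear interpolation.

```r
# Hypothetical helper (not part of StreamPULSE): linearly interpolate NA runs,
# but only those spanning no more than `maxhours` at the given sample interval.
fill_short_gaps <- function(x, interval_min = 15, maxhours = 3) {
    max_len <- floor(maxhours * 60 / interval_min)  # longest fillable run, in samples
    idx <- seq_along(x)
    ok <- !is.na(x)
    if (sum(ok) < 2) return(x)  # nothing to interpolate between

    # approx() fills interior NAs; rule = 1 leaves leading/trailing NAs as NA
    filled <- approx(idx[ok], x[ok], xout = idx, rule = 1)$y

    # locate runs of NAs in the original series and restore the over-long ones
    runs <- rle(is.na(x))
    ends <- cumsum(runs$lengths)
    starts <- ends - runs$lengths + 1
    too_long <- which(runs$values & runs$lengths > max_len)
    for (i in too_long) filled[starts[i]:ends[i]] <- NA
    filled
}

# Hourly series: the 1-hour gap is filled, the 5-hour gap is left as NA
x <- c(1, NA, 3, NA, NA, NA, NA, NA, 9)
res <- fill_short_gaps(x, interval_min = 60, maxhours = 3)
```

Gaps longer than the limit stay as NA rather than being bridged by a long, possibly misleading straight line.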

Details

BASE and streamMetabolizer, the two metabolism modeling platforms available via StreamPULSE, require different data input formats. Formatting also varies depending on whether one is using a Bayesian framework or MLE. This function supplements and rearranges the raw output of request_data to prepare it for a desired set of model specifications.

Both BASE and streamMetabolizer require dissolved oxygen (DO) concentration, water temperature, and light (PAR) data. If light is missing, it will automatically be estimated based on solar angle. In addition to these variables, streamMetabolizer requires DO % saturation and depth, and BASE requires atmospheric pressure. If DO % saturation is missing, it will be calculated automatically from DO concentration, water temperature, and atmospheric pressure. In turn, atmospheric pressure estimates will be automatically retrieved from NOAA (NCDC), if missing, for sites anywhere on earth.

If streamMetabolizer is being used and type='bayes', discharge time series data are also required. In the absence of such data, they can be estimated from the relationship between discharge and depth (i.e. the vertical distance between streambed and surface) or level (AKA stage; i.e. the vertical distance between some arbitrary datum, such as sensor height, and surface), via the zq_curve parameter. Here, depth or level is referred to as Z, discharge is referred to as Q, and the relationship between them is called a rating curve. In order to fit such a curve, one must collect, sometimes manually, a set of paired data points for Z and Q. Here we assume the user also has time series data for Z, which can then be used to predict Q at each time point. If the sampled Z data used to fit the curve represent level, and the Z time series data represent depth, the sensor_height parameter can be used to make them commensurable.

If Z is supplied, Q must be supplied, and vice-versa. Likewise with a and b. If all four are supplied, Z and Q will be ignored. Rating curves can take many forms. Options here include power, exponential, and linear. A common difficulty in fitting these curves is that it's hard to accurately measure discharge in high flow conditions, yet without accounting for these conditions in the curve, high flow discharge estimates can be far off from reality, especially if the curve's form is power or exponential. In these cases, it's often safest to exclude depth or level readings that exceed the maximum measured Z of the curve (they are replaced with NA) by setting ignore_oob_Z=TRUE. In some cases it makes sense to model the curve with a linear fit, though of course this too will misrepresent reality. Using fit='linear' may also result in negative discharge estimates.
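
As a rough illustration of the rating curve idea, the base R sketch below fits a power curve and uses it to predict discharge. It assumes the power form is Q = a * Z^b and fits it by linear regression in log-log space; the package may parameterize or fit the curve differently. All data values and variable names (Z_obs, Q_obs, Z_series, Q_pred) are made up for the example.

```r
# Hypothetical field measurements: depth (m) and discharge (m^3/s).
# Q_obs is synthetic, generated from known parameters a = 2.5, b = 1.8.
Z_obs <- c(0.2, 0.35, 0.5, 0.8, 1.1)
Q_obs <- 2.5 * Z_obs^1.8

# Assumed power form Q = a * Z^b, which is a straight line in log-log space:
# log(Q) = log(a) + b * log(Z)
fit <- lm(log(Q_obs) ~ log(Z_obs))
a <- exp(coef(fit)[[1]])
b <- coef(fit)[[2]]

# Predict Q for a depth time series. Readings above the largest measured Z
# are set to NA first, mirroring the described effect of ignore_oob_Z = TRUE.
Z_series <- c(0.3, 0.6, 1.0, 1.6)
Z_series[Z_series > max(Z_obs)] <- NA
Q_pred <- a * Z_series^b
```

Here the last reading (1.6 m) lies beyond the measured range, so no discharge is predicted for it rather than extrapolating the curve.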

All single-station models assume that, where applicable, variables represent averages throughout an area delineated by the width of the stream and the approximate oxygen turnover distance. More on this and other considerations can be found by clicking the "Before modeling stream metabolism..." button on https://data.streampulse.org.

The between-sample interval is determined programmatically for each variable within d. It is assumed to be the mode if the between-sample interval varies within a series. If the between-sample interval varies across series, the longest interval is used for the whole dataset, unless interval is specified. If the user-specified interval is a multiple of the programmatically determined longest interval, the dataset will be quietly coerced to the user-specified interval. This is useful for thinning extremely long datasets in order to avoid out-of-memory errors while running models. If intervals vary across series, the user may specify which of the available intervals to coerce all series to. If user-specified and programmatically-determined intervals are identical, no action is taken.
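
The interval logic described above can be sketched in base R as follows. This is an illustrative approximation, not the package's code: modal_interval is a hypothetical helper, and the timestamps are invented. It takes the most common between-sample interval within each series, then the longest of those across series.

```r
# Hypothetical helper: modal between-sample interval of one series, in minutes
modal_interval <- function(times) {
    d <- as.numeric(diff(times), units = "mins")
    as.numeric(names(which.max(table(d))))
}

start <- as.POSIXct("2016-06-10 00:00:00", tz = "UTC")
do_times    <- start + 60 * c(0, 15, 30, 45, 75, 90)  # mostly 15 min, one 30 min gap
depth_times <- start + 60 * c(0, 30, 60, 90)          # 30 min throughout

intervals <- c(DO = modal_interval(do_times), depth = modal_interval(depth_times))

# The longest per-series interval applies to the whole dataset by default
dataset_interval <- max(intervals)

# A user-specified interval (here 60 min) qualifies for thinning only if it
# is a multiple of the programmatically determined one
user_interval <- 60
can_thin <- user_interval %% dataset_interval == 0
```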

Value

Returns an S4 object containing a data.frame formatted for the model specified by model and type.

Author(s)

Mike Vlah, vlahm13@gmail.com

Aaron Berdanier

Examples

query_available_data(region='all')

streampulse_data = request_data(sitecode='NC_Eno',
    startdate='2016-06-10', enddate='2016-10-23')

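# Z_data and Q_data are user-supplied vectors of paired level (or depth) and
# discharge measurements used to fit the rating curve; they are not defined here.
fitdata = prep_metabolism(d=streampulse_data, type='bayes',
    model='streamMetabolizer', interval='15 min',
    rm_flagged=list('Bad Data', 'Questionable'), fillgaps='interpolation',
    zq_curve=list(sensor_height=NULL, Z=Z_data, Q=Q_data,
    fit='power', plot=TRUE), estimate_areal_depth=TRUE)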

streampulse/StreamPULSE documentation built on Nov. 2, 2024, 9:54 p.m.