#' Prepare StreamPULSE data for metabolism modeling
#'
#' Formats the output of \code{\link{request_data}} for stream metabolism model
#' of choice. Filters flagged data and imputes missing data. Acquires/estimates
#' additional variables if necessary. NOTE: support for modeling with
#' \code{BASE} is currently in development. Please use \code{streamMetabolizer}
#' in the meantime.
#'
#' \code{BASE} and \code{streamMetabolizer}, the two metabolism modeling
#' platforms available via StreamPULSE, require different data input formats.
#' Formatting also varies depending on whether one is using a Bayesian framework
#' or MLE. This function supplements and rearranges the raw output of
#' \code{\link{request_data}} to prepare it for a desired set of model
#' specifications.
#'
#' Both \code{BASE} and \code{streamMetabolizer} require dissolved oxygen (DO)
#' concentration, water temperature, and light (PAR) data. If light is missing,
#' it will automatically be estimated based on solar angle. In addition to these
#' variables,
#' \code{streamMetabolizer} requires DO \% saturation and depth, and
#' \code{BASE} requires atmospheric pressure. If DO \% saturation is missing,
#' it will be calculated automatically from DO concentration, water temperature,
#' and atmospheric pressure. In turn, atmospheric pressure estimates will
#' be automatically retrieved from NOAA (NCDC), if missing,
#' for sites anywhere on earth.
#'
#' If \code{streamMetabolizer} is being used and
#' \code{type='bayes'}, discharge time series data are also required.
#' In the absence of such data, they can be estimated from the relationship
#' between discharge and depth (i.e. the vertical distance between streambed and
#' surface) or
#' level (AKA stage; i.e. the vertical distance between some arbitrary datum,
#' such as sensor
#' height, and surface), via the \code{zq_curve} parameter.
#' Here, depth or level is referred to as Z, discharge
#' is reffered to as Q, and the relationship between them is called a rating
#' curve. In order to fit such a curve, one must collect, sometimes
#' manually, a set of data points for both Z and Q. Here we assume the user
#' also has time series data for Z, which can then be used to predict a series
#' of Q at each time point. If the sampled Z data used to fit the curve
#' represent level, and the Z time series data represent depth, the
#' \code{sensor_height} parameter can be used to make them commensurable.
#'
#' If Z is supplied, Q must be supplied, and vice-versa. Likewise with a and b.
#' If all are supplied, Z and Q will be ignored. Rating curves can take many
#' forms. Options here include power, exponential, and linear. A common
#' difficulty of
#' fitting these curves is that it's hard to accurately measure discharge in
#' high flow conditions, yet without accounting for these conditions
#' in the curve,
#' high flow discharge estimates can be far off from reality, especially if
#' the curve's form is power or exponential. In these cases, it's often safest
#' to omit high flow data points from the curve entirely by setting
#' \code{ignore_oob_Z=TRUE}. In some cases it makes sense to model the curve
#' with a linear fit, though of course this too will misrepresent reality.
#' Using \code{fit='linear'} may also result in negative discharge estimates.
#'
#' All single-station models assume that, where applicable, variables
#' represent averages
#' throughout an area delineated by the width of the stream and the approximate
#' oxygen turnover distance. More on this and other considerations can be
#' found by clicking the "Before modeling stream metabolism..." button on
#' \url{https://data.streampulse.org}.
#'
#' The between-sample interval is determined programmatically
#' for each variable within \code{d}. It is assumed to be the mode
#' if the between-sample interval varies within a series. If the
#' between-sample interval varies across series, the longest interval is
#' used for the whole dataset, unless \code{interval} is specified.
#' If the user-specified interval is a multiple of the programmatically
#' determined longest interval, the dataset will be quietly coerced to the
#' user-specified interval.
#' This is useful for thinning extremely long datasets in order to
#' avoid out-of-memory errors while running models. If intervals vary across
#' series, the user may specify which of the available intervals to coerce
#' all series to. If user-specified and
#' programmatically-determined intervals are identical, no action is taken.
#'
#' @author Mike Vlah, \email{vlahm13@gmail.com}
#' @author Aaron Berdanier
#' @param d the output of \code{request_data},
#' or a \code{list} of \code{data.frame}s so organized.
#' @param model either 'streamMetabolizer' (the default) or 'BASE'. If 'BASE',
#' \code{type} must be set to \code{'bayes'}.
#' @param type either 'mle' or 'bayes'. If \code{model='BASE'}, \code{type}
#' must be set to \code{'bayes'}.
#' @param interval a string specifying the between-sample time interval to
#' which the dataset should be coerced, or NA to determine automatically.
#' If not NA, Must be
#' of the form '<number> <unit>', as in '15 min'. Unit can be
#' 'min' or 'hour'. Non-integer hours are tolerated, but minutes must be
#' specified as integers. See details.
#' @param rm_flagged a list containing any of 'Interesting', 'Questionable',
#' and 'Bad Data'. Any data points flagged with these specified tags will be
#' removed (replaced with NA), and then imputed according to \code{fillgaps}.
#' If data for a selected site and timespan have been cleaned using
#' \url{https://data.streampulse.org/clean}, it is a good idea to remove any
#' data points flagged as "Questionable" or "Bad Data". Set this argument to
#' 'none' to keep all flagged data points. Defaults to
#' \code{list('Questionable', 'Bad Data')}.
#' @param fillgaps a string specifying one of the imputation methods available
#' to \code{imputeTS::na.seasplit}, namely: 'interpolation', 'locf', 'mean',
#' 'random', 'kalman', or 'ma'. May also be 'none'. The
#' imputation method, if specified, will be attempted after seasonal decomposition.
#' Periodicity depends on the between-sample interval, and is determined
#' programmatically (see details for \code{interval}). If the desired
#' imputation method
#' fails, which sometimes occurs when series consist largely of NAs, basic
#' linear interpolation will be performed instead and the user will be
#' notified. See \code{maxhours}.
#' @param maxhours the maximum number of hours of consecutive NAs to impute.
#' @param zq_curve a list containing specifications for a rating curve, used
#' to estimate discharge from level or depth. Elements of this list may
#' include any of the following: Z (a vector of level or depth data), Q (a
#' vector of discharge data), a (the first parameter of an existing rating
#' curve), b (the second parameter of an existing rating curve),
#' sensor_height (the vertical distance between streambed and sensor,
#' in meters), fit
#' (the form of the rating curve to predict discharge from and, if Z and Q
#' supplied, to fit), ignore_oob_Z (if there are depth or level readings that
#' exceed the maximum measured Z value of the rating curve, whether to
#' replace these with NA), and plot (whether to plot the fitted curve, if
#' applicable, as well as predicted discharge). See details for more.
#' @param estimate_areal_depth logical;
#' Metabolism models expect that input depth time series represent depth
#' averaged
#' over an area delineated by the width of the stream and the approximate O2
#' turnover distance. Set to TRUE if you'd like to estimate this
#' average depth, or FALSE if your depth data already approximate it. For
#' example, if your depth data represent average depth over the aforementioned
#' area already, or average depth for a stream cross-section, you'd probably
#' want
#' to use FALSE. If your depth data represent only depth-at-sensor, or worse,
#' level-at-sensor, you might be
#' better off with TRUE, assuming you have discharge data to estimate areal
#' depth
#' from, or a rating curve by which to generate discharge data.
#' @param estimate_PAR logical; should Photosynthetically Active Radiation (PAR)
#' be estimated from geographic coordinates and time? Only use light data if
#' you're confident that your light sensors accurately represent light reaching
#' the upstream area defined by O2 turnover distance.
#' @param retrieve_air_pres logical; if some AirPres_kPa values are missing, should
#' they be retrieved from NCDC (NOAA)? Retrieval will happen automatically if air
#' pressure data are required and entirely missing.
#' @param ... additional arguments passed to \code{imputeTS::na.seasplit}.
#' @return returns an S4 object containing a \code{data.frame} formatted for
#' the model specified by \code{model} and \code{type}.
#' @seealso \code{\link{request_data}} for acquiring StreamPULSE data;
#' \code{\link{fit_metabolism}} for fitting models.
#' @export
#' @examples
#' query_available_data(region='all')
#'
#' streampulse_data = request_data(sitecode='NC_Eno',
#' startdate='2016-06-10', enddate='2016-10-23')
#'
#' fitdata = prep_metabolism(d=streampulse_data, type='bayes',
#' model='streamMetabolizer', interval='15 min',
#' rm_flagged=list('Bad Data', 'Questionable'), fillgaps=fillgaps,
#' zq_curve=list(sensor_height=NULL, Z=Z_data, Q=Q_data,
#' fit='power', plot=TRUE), estimate_areal_depth=TRUE)
prep_metabolism = function(d, model="streamMetabolizer", type="bayes",
interval=NA, rm_flagged=list('Bad Data', 'Questionable'),
fillgaps='interpolation', maxhours=3,
zq_curve=list(sensor_height=NULL, Z=NULL, Q=NULL, a=NULL, b=NULL,
fit='power', ignore_oob_Z=TRUE, plot=TRUE),
estimate_areal_depth=FALSE, estimate_PAR=TRUE, retrieve_air_pres=FALSE, ...){
# zq_curve=list(Z=NULL, Q=NULL, a=NULL, b=NULL), ...){
# type is one of "bayes" or "mle"
# model is one of "streamMetabolizer" or "BASE"
# interval is the desired gap between successive observations. should be a
# multiple of your sampling interval.
# rm_flagged is a list containing some combination of 'Bad Data',
# 'Questionable', or 'Interesting'. Values corresponding to flags of
# the specified type(s) will be replaced with NA, then interpolated
# if fillgaps is not 'none'. rm_flagged can also be set to 'none' to
# retain all flag information.
# fillgaps must be one of the imputation methods available to
# imputeTS::na.seasplit or 'none'
# ... passes additional arguments to na.seasplit
# get_windspeed and get_airpressure query NOAA's ESRL-PSD.
# units are m/s and pascals, respectively
# checks
if(model=="BASE"){
stop('BASE not yet supported', call.=FALSE) ###
type="bayes" #can't use mle mode with BASE
}
if(! is.na(interval) &&
! grepl('\\d+ (min|hour)', interval, perl=TRUE)){
stop(paste('Interval (if not NA) must be of the form "length [space] unit"\n\twhere',
'length is numeric and unit is either "min" or "hour".'), call.=FALSE)
}
if(!fillgaps %in% c('interpolation','locf','mean','random','kalman','ma',
'none')){
stop(paste0("fillgaps must be one of 'interpolation', 'locf', 'mean',",
"\n\t'random', 'kalman', 'ma', or 'none'"), call.=FALSE)
}
if(any(! rm_flagged %in% list('Bad Data','Questionable','Interesting')) &
any(rm_flagged != 'none')){
stop(paste0("rm_flagged must either be 'none' or a list containing any",
" of:\n\t'Bad Data', 'Questionable', 'Interesting'."), call.=FALSE)
}
if(any(rm_flagged != 'none') & ! 'flagtype' %in% colnames(d$data)){
stop(paste0('No flag data available.\n\t',
'Please visit https://data.streampulse.org/qaqc_sensordata\n\t',
'to supply flag (QA/QC) information.'), call.=FALSE)
}
if(! 'list' %in% class(zq_curve)){
stop('Argument "zq_curve" must be a list.', call.=FALSE)
}
if(type == 'mle'){
stop('MLE mode is not currently available. Please use type="bayes".',
call.=FALSE)
}
ab_supplied = zq_supplied = FALSE
using_zq_curve = !all(unlist(lapply(zq_curve[c('sensor_height',
'a','b','Z','Q')], is.null)))
if(using_zq_curve){
#unpack arguments supplied to zq_curve
sensor_height = Z = Q = a = b = fit = ignore_oob_Z = zqplot = NULL
if(!is.null(zq_curve$sensor_height)){
sensor_height = zq_curve$sensor_height
}
if(!is.null(zq_curve$Z)) Z = zq_curve$Z
if(!is.null(zq_curve$Q)) Q = zq_curve$Q
if(!is.null(zq_curve$a)) a = zq_curve$a
if(!is.null(zq_curve$b)) b = zq_curve$b
if(!is.null(zq_curve$fit)){
fit = zq_curve$fit
if(! fit %in% c('power', 'exponential', 'linear')){
stop(paste0('Argument to "fit" must be one of: "power", ',
'"exponential", "linear".'), call.=FALSE)
}
}
if(!is.null(zq_curve$ignore_oob_Z)) ignore_oob_Z = zq_curve$ignore_oob_Z
if(!is.null(zq_curve$plot)) zqplot = zq_curve$plot
# message(paste0('NOTE: You have specified arguments to zq_curve.\n\t',
# 'These are only needed if time-series data for discharge cannot',
# '\n\tbe found.'))
if(is.numeric(a) & is.numeric(b)){
ab_supplied = TRUE
}
if(length(Z) > 1 & length(Q) > 1){
zq_supplied = TRUE
}
if(ab_supplied & zq_supplied){
warning(paste0('Parameters (a, b) and data (Z, Q) supplied for ',
'rating curve.\n\tOnly one set needed, so ignoring data.'),
call.=FALSE)
zq_supplied = FALSE
} else {
if(!ab_supplied & !zq_supplied){
stop(paste0('Argument zq_curve must include either Z and Q as ',
'vectors of data\n\tor a and b as parameters of a rating ',
'curve.'), call.=FALSE)
}
}
}
# d2 <<- d
# stop('a')
# d = d2
#### Format data for models
cat(paste("Formatting data for ", model, ".\n", sep=""))
dd = d$data
#check for consistent sample interval (including cases where there are gaps
#between samples and where the underying sample pattern changes)
varz = unique(dd$variable)
ints_by_var = data.frame(var=varz, int=rep(NA, length(varz)))
for(i in 1:length(varz)){
#get lengths and values for successive repetitions of the same
#sample interval (using run length encoding)
dt_by_var = sort(unique(dd$DateTime_UTC[dd$variable == varz[i]]))
run_lengths = rle(diff(as.numeric(dt_by_var)))
if(length(run_lengths$lengths) != 1){
# if gaps or interval change, get mode interval
uniqv = unique(run_lengths$values)
input_int = as.numeric(names(which.max(tapply(run_lengths$lengths,
run_lengths$values, sum)))) / 60
if(any(uniqv %% min(uniqv) != 0)){ #if underlying pattern changes
warning(paste0('Sample interval is not consistent for ', varz[i],
'\n\tGaps will be introduced!\n\t',
'Using the most common interval: ',
as.character(input_int), ' mins.'), call.=FALSE)
} else {
message(paste0(length(run_lengths$lengths)-1,
' sample gap(s) detected in ', varz[i], '.'))
}
#store the (most common) sample interval for each variable
ints_by_var[i,2] = as.difftime(input_int, unit='mins')
} else {
# if consistent, just grab the diff between the first two times
ints_by_var[i,2] = difftime(dt_by_var[2], dt_by_var[1],
units='mins')
}
}
#will later coerce all vars to the longest sample interval
#unless `interval` is specified
if(length(unique(ints_by_var$int)) > 1){
input_int = max(ints_by_var$int)
if(! is.na(interval)){
message(paste0('Multiple sample intervals detected across variables (',
paste(unique(ints_by_var$int), collapse=' min, '),
' min).\n\tWill attempt to coerce all variables to desired ',
'interval: ', interval, '.'))
} else {
interval = paste(input_int, 'min')
message(paste0('Multiple sample intervals detected across variables (',
paste(unique(ints_by_var$int), collapse=' min, '),
' min).\n\tUsing ', input_int, ' so as not to introduce gaps.\n\t',
'You may control this behavior with the "interval" parameter.'))
}
} else {
input_int = ints_by_var$int[1] #all intervals equal
interval = paste(input_int, 'min')
}
#remove (replace with NA) flagged data if desired
if(any(rm_flagged != 'none')){
flags_to_remove = unique(unlist(rm_flagged))
dd$value[dd$flagtype %in% flags_to_remove] = NA
}
if('flagtype' %in% colnames(dd)){
dd = subset(dd, select=-c(flagtype, flagcomment))
}
# Use USGS level and discharge if missing local versions
#level currenty not used, but could be used in the absence of discharge and depth
if("USGSLevel_m" %in% dd$variable && !"Level_m" %in% dd$variable){
dd$variable[dd$variable=="USGSLevel_m"] = "Level_m"
}
if("USGSDischarge_m3s" %in% dd$variable && !"Discharge_m3s" %in% dd$variable){
dd$variable[dd$variable=="USGSDischarge_m3s"] = "Discharge_m3s"
}
vd = unique(dd$variable) # variables
dd = tidyr::spread(dd, variable, value) # spread out data
md = d$sites # metadata
#deal with "level"/"depth" naming issue. These measure the same thing.
#UPDATE: level=stage=gage_height=vertical distance from sensor to surface
#depth = vertical distance from bed to surface.
#can be depth-at-gage OR averaged across width OR
#averaged across area defined by width and O2 turnover distance.
#this last metric is what the model actually needs
#the code below may be party useful depending on what "depth" means for a given site
using_level = 'Level_m' %in% vd && ! 'Depth_m' %in% vd &&
!('Discharge_m3s' %in% vd && estimate_areal_depth)
if(using_level){
dd$Depth_m = dd$Level_m
vd = c(vd, 'Depth_m')
warning(paste0('Using supplied level data in place of missing depth data.',
'\n\tDepth would be more accurate!'), call.=FALSE)
}
# if('Level_m' %in% vd & 'Depth_m' %in% vd){ #use col with more data if both
# na_cnt = sapply(dd[,c('Level_m','Depth_m')], function(x) sum(is.na(x)))
# level_or_depth = names(which.min(na_cnt))
# dd$Depth_m = dd[,level_or_depth]
# warning(paste0('Both level and depth data found. These measure the ',
# 'same thing.\n\tUsing ', level_or_depth,
# ' because it has better coverage.'), call.=FALSE)
# }
#another test, now that depth has been acquired if available
#UPDATE: turns out rating curves can directly relate level and discharge,
#so depth is not needed at this point (though areal depth is needed, ultimately)
#and sensor height is not needed as a result. This would only be useful
#for correcting level to depth, which would be no better for buiding a rating curve
missing_depth = ! 'Depth_m' %in% vd
# missing_sens_height = !is.numeric(zq_curve$sensor_height)
# if(using_zq_curve & missing_depth & missing_sens_height){
# if(using_zq_curve & missing_depth){
# stop(paste0('Need either sensor_height or depth (level) time-series',
# ' data\n\tto compute discharge via rating curve.'),
# call.=FALSE)
# }
# check if desired interval is compatible with sample interval
int_parts = strsplit(interval, ' ')[[1]] #get num and str components
desired_int = as.difftime(as.numeric(int_parts[1]),
units=paste0(int_parts[2], 's')) #get desired interval as difftime
desired_int_num = as.double(desired_int, units='mins')
remainder = desired_int_num %% as.double(input_int)
if(! desired_int_num %in% ints_by_var$int &&
remainder != 0){
#warning doesn't bubble up properly if imputation error occurs,
#so using message instead
# message(paste0('Warning: Desired time interval (', interval,
# ') is not a multiple of sample interval (',
# as.character(as.numeric(input_int)),
# ' min).\n\tGaps will be introduced!'))
stop(paste0('Desired time interval (', interval,
') must conform to at least one\n\tof the supplied variables:\n',
paste0('\t\t', ints_by_var$var, ': ', ints_by_var$int, ' min',
collapse='\n'), '\n\tIt can also be a multiple of the ',
'largest interval (if you want to thin the dataset).'), call.=FALSE)
}
#convert user-specified interval to minutes if it's in hours (because
#fractional hours wont work with seq.POSIXct below)
if(int_parts[2] == 'hour'){
interval = paste(as.numeric(int_parts[1]) * 60, 'min')
}
# #coerce all variables to a particular interval if intervals differ
# if(length(unique(ints_by_var$int)) > 1 || ){
#this chunk thins the dataset if desired, and ensures no missing records.
#(i.e. not just NAs, but entirely missing rows for some dates).
#if some data are offset from the starting row, this may grab NAs instead
#of data, so it iterates until it finds the right starting row.
na_props = 1
starting_row = 1
while( (sum(na_props > 0.8) / length(na_props)) > 0.4){ #heuristic test
if(starting_row > 10){
stop(paste0('Unable to coerce data to desired time interval.',
'\n\tTry specifying a different interval.'), call.=FALSE)
}
alldates = data.frame(DateTime_UTC=seq.POSIXt(dd[starting_row,1],
dd[nrow(dd), 1], by=interval))
dd_temp = left_join(alldates, dd, by='DateTime_UTC')
#get new NA proportions for each column
na_props = apply(dd_temp[,-c(1:3)], 2,
function(x){ sum(is.na(x)) / length(x) })
starting_row = starting_row + 1
}
dd = dd_temp
# #if just one sample interval is found, just ensure no missing records
# } else {
# alldates = data.frame(DateTime_UTC=seq.POSIXt(dd[1,1],
# dd[nrow(dd),1], by=interval))
# dd = left_join(alldates, dd, by='DateTime_UTC')
# }
#acquire air pressure data if necessary or desired
missing_DOsat = all(! c('satDO_mgL','DOsat_pct') %in% vd)
# missing_waterTemp = ! 'WaterTemp_C' %in% vd
missing_airPres = ! 'AirPres_kPa' %in% vd
need_airPres_for_DOsat = missing_DOsat & missing_airPres
need_airPres_for_Q = using_zq_curve & missing_airPres & missing_depth #revisit "depth" here and elsewhere
got_airpres = TRUE
if(need_airPres_for_DOsat || need_airPres_for_Q || retrieve_air_pres){
# airpres = try(retrieve_air_pressure(md, dd), silent=TRUE)
#
# if(class(airpres) == 'try-error' || nrow(airpres) == 0) {
airpres = try(retrieve_air_pressure3(md, dd), silent=TRUE)
if(inherits(airpres, 'try-error')) {
warning(paste('Failed to retrieve air pressure data.'),
call.=FALSE)
got_airpres = FALSE
}
#just join airpres to dataframe if AirPres_kPA entirely missing
#if AirPres_kPA partly missing, fill in the blanks.
if(got_airpres){
if(missing_airPres){
dd = left_join(dd, airpres, by='DateTime_UTC')
} else {
colnames(airpres)[colnames(airpres) == 'AirPres_kPa'] = 'x'
dd = left_join(dd, airpres, by='DateTime_UTC')
missing_airpres = is.na(dd$AirPres_kPa)
dd$AirPres_kPa[missing_airpres] = dd$x[missing_airpres]
dd$x = NULL
}
#then impute any remaining blanks
dd$AirPres_kPa = zoo::na.approx(dd$AirPres_kPa,
na.rm=FALSE, rule=2)
}
}
airpres_coverage = sum(! is.na(dd$AirPres_kPa)) / nrow(dd)
if(! need_airPres_for_DOsat && ! need_airPres_for_Q && ! retrieve_air_pres &&
airpres_coverage < 0.5){
warning(paste0('Air pressure coverage is ',
round(airpres_coverage * 100, 1),
'%.\n\tIf you need it to estimate DO saturation or depth, ',
'streamMetabolizer may fail.\n\tYou may be able to solve this (or improve ',
'your results)\n\tby setting retrieve_air_pres=TRUE.'), call.=FALSE)
}
#correct any negative or 0 depth values (these break streamMetabolizer)
if('Depth_m' %in% vd){
if(any(na.omit(dd$Depth_m) <= 0)){
warning('Depth values <= 0 detected. Replacing with 0.01 m.',
call.=FALSE)
dd$Depth_m[dd$Depth_m <= 0] = 0.01
}
}
#estimate discharge from depth (or eventually level too) using Z-Q rating curve
#if arguments have been supplied to zq_curve.
if('Discharge_m3s' %in% vd & using_zq_curve){
warning(paste0('Arguments supplied to zq_curve, so ignoring available',
'\n\tdischarge time-series data.'), call.=FALSE)
}
if(zq_supplied){
cat(paste0('Modeling discharge from rating curve.\n\tCurve will be ',
'generated from supplied Z and Q samples.\n'))
depth_disch = estimate_discharge(Z=Z, Q=Q, sh=sensor_height,
dd=dd, fit=fit, ignore_oob_Z=ignore_oob_Z, plot=zqplot)
dd$Discharge_m3s = depth_disch$discharge
dd$Depth_m = depth_disch$depth
vd = c(vd, 'Discharge_m3s', 'Depth_m')
} else {
if(ab_supplied){
cat(paste0('Modeling discharge from rating curve determined by',
'\n\tsupplied a and b parameters.\n'))
depth_disch = estimate_discharge(a=a, b=b,
sh=sensor_height, dd=dd, fit=fit, ignore_oob_Z=ignore_oob_Z,
plot=zqplot)
dd$Discharge_m3s = depth_disch$discharge
dd$Depth_m = depth_disch$depth
vd = c(vd, 'Discharge_m3s', 'Depth_m')
}
# if(ab_supplied & missing_depth){
# cat(paste0('Modeling discharge from rating curve determined by',
# '\n\tsupplied a and b parameters.\n\tEstimating depth or level',
# ' from water pressure, air pressure, and sensor height.\n'))
# dd$Discharge_m3s = estimate_discharge(a=zq_curve$a, b=zq_curve$b,
# sh=zq_curve$sensor_height, dd=dd)
# } else {
# cat(paste0('Modeling discharge from rating curve determined by',
# '\n\tsupplied a and b parameters.\n\tUsing available ',
# 'time-series data for depth.\n'))
# dd$Discharge_m3s = estimate_discharge(a=zq_curve$a, b=zq_curve$b,
# dd=dd)
# }
}
# if(zq_supplied | ab_supplied){
# if('Discharge_m3s' %in% vd){
# warning(paste0('Discharge data available, so ignoring ',
# 'zq_curve argument.'), call.=FALSE)
# } else {
# if(any(! c('AirPres_kPa', 'WaterPres_kPa', 'WaterTemp_C') %in% vd)){
# if(! 'Discharge_m3s' %in% vd){
# dd2 <<- dd
# stop('a')
# dd = dd2
# }
#correct any negative or 0 discharge values
if('Discharge_m3s' %in% vd){
if(any(na.omit(dd$Discharge_m3s) <= 0)){
warning('Discharge values <= 0 detected. Replacing with 0.01.',
call.=FALSE)
dd$Discharge_m3s[dd$Discharge_m3s <= 0] = 0.01
}
}
# convert UTC to solar time
dd$solar.time = suppressWarnings(streamMetabolizer::convert_UTC_to_solartime(
date.time=dd$DateTime_UTC, longitude=md$lon[1], time.type="mean solar"))
# estimate par
apparentsolartime = suppressWarnings(
streamMetabolizer::convert_UTC_to_solartime(date.time=dd$DateTime_UTC,
longitude=md$lon[1], time.type="apparent solar"))
dd$light = suppressWarnings(streamMetabolizer::calc_solar_insolation(
app.solar.time=apparentsolartime, latitude=md$lat[1], format="degrees"))
if("Light_PAR" %in% vd && ! estimate_PAR){
message(paste0("Using supplied light data. Unless you're quite ",
"confident in these data,\n\tit may be preferable to set ",
"estimate_PAR=TRUE."))
dd$light[! is.na(dd$Light_PAR)] = dd$Light_PAR[! is.na(dd$Light_PAR)]
} else {
cat("Estimating PAR from latitude and time.\n")
}
dd$Light_PAR = NULL
dd$Light_lux = NULL
# impute missing data. code found in gapfill_functions.R (maxspan_days deprecated)
dd = select(dd, -c(region, site, DateTime_UTC))
if(fillgaps != 'none') dd = gap_fill(dd, maxspan_days=5, knn=3,
sint=desired_int, algorithm=fillgaps, maxhours, ...)
#rename variables
if("DO_mgL" %in% vd) dd$DO.obs = dd$DO_mgL
if("WaterTemp_C" %in% vd) dd$temp.water = dd$WaterTemp_C
if('Discharge_m3s' %in% vd) dd$discharge = dd$Discharge_m3s
#use discharge to estimate mean depth of area defined
#by stream width and O2 turnover distance (if necessary or desired)
if("Discharge_m3s" %in% vd & estimate_areal_depth){
dd$depth = streamMetabolizer::calc_depth(dd$discharge) #estimate mean areal depth
} else {
#otherwise just use depth directly.
if("Depth_m" %in% vd){
dd$depth = dd$Depth_m
# if(! using_level){
# warning(paste0('Passing "Depth_m" values from StreamPULSE database',
# ' directly to streamMetabolizer.\n\tMetabolism estimates will',
# ' be best if "Depth_m" represents mean depth\n\tfor the ',
# 'area defined by the width of the stream and the oxygen\n\t',
# 'turnover distance. "Depth_m" may also represent depth-at-gage',
# ' or,\n\tworse, level-at-gage. These may result in poor ',
# 'metabolism estimates.\n\tYou may want to use zq_curve ',
# 'or estimate_areal_depth.'),
# call.=FALSE)
# }
} else {
if(estimate_areal_depth){
stop(paste0('Missing discharge and depth data.\n\tNot enough ',
'information to proceed.\n\tMight parameter "zq_curve"',
' be of service?'), call.=FALSE)
} else {
stop(paste0('Missing depth data.\n\tNot enough information to ',
'proceed.\n\tTry setting estimate_areal_depth to TRUE.'),
call.=FALSE)
}
}
}
# kPa to atm
if(! 'DO_mgL' %in% vd){
stop(paste0('streamMetabolizer cannot',
' be run without DO_mgL.'), call.=FALSE)
}
if("AirPres_kPa" %in% vd) dd$atmo.pressure = dd$AirPres_kPa / 101.325
if(model == "streamMetabolizer"){
if("satDO_mgL" %in% vd){
dd$DO.sat = dd$satDO_mgL
} else {
if("DOsat_pct" %in% vd){
# define multiplicative factor, to catch if variable is
# fraction or percent... could do this on server first in future
if(quantile(dd$DOsat_pct, 0.9, na.rm=TRUE) > 10){
ff=0.01
} else { ff=1 }
dd$DOsat_pct[dd$DOsat_pct == 0] = 0.000001 #prevent NaNs
dd$DO.sat = dd$DO.obs / (dd$DOsat_pct*ff)
} else {
if(!all(c("temp.water","AirPres_kPa") %in% colnames(dd))){
stop(paste('Insufficient data to fit this model.\n\tNeed',
'either DO sat OR water temp and\n\tair ',
'pressure.'), call.=FALSE)
}
cat('Modeling DO.sat based on water temperature and',
'air pressure.\n')
dd$DO.sat = LakeMetabolizer::o2.at.sat.base(temp=dd$temp.water,
baro=dd$AirPres_kPa*10, salinity=0, model='garcia-benson')
}
}
}
# Select variables for model
if(model == "BASE"){
model_variables = c("solar.time","DO.obs","temp.water","light",
"atmo.pressure")
} else { # streamMetabolizer
model_variables = c("solar.time","DO.obs","DO.sat","depth",
"temp.water","light")
# if(type=="bayes") model_variables = c(model_variables,"discharge")
}
if(!all(model_variables %in% colnames(dd))){
missing = model_variables[which(! model_variables %in% colnames(dd))]
stop(paste0('Insufficient data to fit this model.\n\t',
'Missing variable(s): ', paste0(missing, collapse=', ')),
call.=FALSE)
}
if(model == 'streamMetabolizer' && type == 'bayes' &&
! 'discharge' %in% colnames(dd)){
warning(paste("Without discharge data (or estimates), you'll have to",
"\n\trun fit_metabolism with pool_K600='none'."), call.=FALSE)
}
# Structure data, add class for model name
if(model=="BASE"){ # rename variables for BASE
# fitdata = dd %>% select_(.dots=model_variables) %>%
fitdata = dd %>% select(model_variables) %>%
mutate(Date=as.Date(solar.time),
Time=strftime(solar.time, format="%H:%M:%S"), salinity=0) %>%
rename(I=light, tempC=temp.water, DO.meas=DO.obs) %>%
select(Date, Time, I, tempC, DO.meas, atmo.pressure, salinity)
BASE = setClass("BASE", contains="data.frame")
outdata = as(fitdata, "BASE")
}else if(model=="streamMetabolizer"){
if('discharge' %in% colnames(dd)){
# fitdata = select_(dd, .dots=c(model_variables, 'discharge'))
fitdata = select(dd, c(model_variables, 'discharge'))
} else {
fitdata = select(dd, model_variables)
}
# streamMetabolizer = create_sm_class()
outdata = as(fitdata, "streamMetabolizer")
outdata@type = type
}
# #extract start and end date from input data (will be used to determine
# #whether this model should be considered for
# #inclusion on the data portal as the best model for this site and year)
# mod_startdt = d$data$DateTime_UTC[1]
# mod_enddt = d$data$DateTime_UTC[nrow(d$data)]
# dtbounds = c(mod_startdt, mod_enddt)
# attr(dtbounds, 'tzone') = 'UTC'
#return list of specs passed on from request_date,
#as well as new list of specs passed into this function
out = list(data=outdata,
specs=append(d$specs,
list(model=model, type=type, interval=interval,
rm_flagged=paste0(unlist(rm_flagged), collapse=','),
fillgaps=fillgaps, used_rating_curve=using_zq_curve,
estimate_areal_depth=estimate_areal_depth)))
return(out)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.