#' Calculate summary statistics describing historical flow conditions.
#'
#' @description This function takes a time series of measured or modelled flows and uses a user-defined moving window to calculate a suite of time-varying flow statistics for one or more sites (stations). A smaller set of long-term statistics is also calculated. It is primarily designed to work with daily flows but can also be applied to time series data on a longer (e.g. 10-daily or monthly) time step. The data should be regularly spaced, and a common time step should be used for all sites.
#'
#' @usage calc_flowstats(data, site_col = "flow_site_id", date_col = "date", flow_col = "flow", imputed_col = NULL, win_start = "1995-04-01", win_width = "6 months", win_step = "6 months", date_range = NULL, q_low = 95, q_high = 70, scaling = FALSE, ref_col = NULL)
#'
#' @param data Tibble or data frame containing the flow data to be processed. Must be in long format and have, as a minimum, separate columns containing site id, date and flow (e.g. as output by the import_flow() function). If flow estimates are available for different abstraction or climate scenarios, then these must be in separate columns (see `ref_col` argument).
#' @param site_col Name of column in data containing unique flow site id. Default = "flow_site_id".
#' @param date_col Name of column in data containing date of flow record. Default = "date". Dates must be in "yyyy-mm-dd" format.
#' @param flow_col Name of column in data containing flow data for processing (character). Default = "flow".
#' @param imputed_col Name of optional column in data specifying whether each flow value is measured (0) or imputed (1). Default = NULL.
#' @param win_start Start date of first time window (in yyyy-mm-dd format). Default = "1995-04-01".
#' @param win_width Width of the time window, in days, weeks, months or years (see ?seq.Date for options). Default = "6 months".
#' @param win_step The increment by which the time window moves, in days, weeks, months or years (see ?seq.Date for options). Default = "6 months".
#' @param q_low Qx flow threshold (between 1 and 99, as an integer) defining low flow events. Default = 95 (representing the long-term Q95 flow at each site).
#' @param q_high Qx flow threshold (between 1 and 99, as an integer) defining high flow events. Default = 70 (representing the long-term Q70 flow at each site).
#' @param date_range Optional vector of two dates (in yyyy-mm-dd format) defining the period of flow data to be analysed; flow records outside this range are excluded. Default = NULL. For unbiased calculation of long-term flow statistics, it is advisable that this range spans a whole number of years (e.g. c("2000-01-01", "2020-12-31")).
#' @param scaling Should the time series flow data be scaled by the long-term mean flow at each site? Default = FALSE.
#' @param ref_col Name of column in dataset containing reference flow scenario against which selected flow statistics are z-score standardised. Default = NULL.
#' @details
#'
#' The function uses the `win_start`, `win_width` and `win_step` arguments to define a moving window, which divides the flow time series into a sequence of time periods. These time periods may be contiguous, non-contiguous or overlapping (see examples below). The sequence of time periods continues up to the present date, even when this extends beyond the period covered by the input flow dataset, as this facilitates the subsequent joining of flow statistics and ecology data by the `join_he` function. The sequence does not, however, extend beyond the present date; for example, if calculating flow statistics for each calendar year, the time periods stop at the end of the last complete year.
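#'
#' As a minimal sketch (assuming the lubridate package, which the function uses internally, is attached), the default settings generate the sequence of window start and end dates along these lines:
#'
#' ```
#' win_starts <- seq(as.Date("1995-04-01"), Sys.Date(), by = "6 months")
#' win_ends <- win_starts %m+% months(6) - days(1)
#' ```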
#'
#' For each time period, the function calculates a suite of flow statistics, listed below. With the exception of the 7-day and 30-day minimum flow statistics, which require daily data, all flow statistics are calculated regardless of the time step of the flow data. Caution should be exercised, however, when analysing flow data on a coarser (e.g. monthly) time step, to ensure that the statistics are meaningful and interpretable (especially those relating to the number and timing of low and high flow events, which require reasonably high-frequency flow data in order to discriminate sequences of lower and higher flows).
#'
#' The function requires a minimum number of records to calculate some statistics (detailed below), otherwise an `NA` result is returned. Meeting the minimum requirement does not, however, guarantee that a statistic has been estimated to an appropriate level of precision, and users may wish to manually filter the results to eliminate potentially unreliable estimates based on sparse data.
#'
#' To ensure that estimated statistics are comparable across time periods, the flow time series data should be as complete as possible (gaps can be infilled using the `impute_flows` function). Missing values (`NA`) are ignored when calculating all statistics, including those that count the number of events when flows exceed or fall below a certain flow threshold.
#'
#' To make some statistics more comparable across sites, the `scaling` argument optionally allows the flow time series data to be standardised by dividing by the site's long-term mean flow. Scaling is performed after applying the `date_range` filter, so that the long-term mean flows can be calculated over a specified number of whole years. Because this eliminates absolute differences in mean flow from site to site, scaling is most useful when the focus is on statistics measured in flow units (e.g. sd, Q5, low_magnitude, low_severity, volume, min, min_7day, min_30day, max).
#'
#' Additionally, certain statistics (denoted by a `z` suffix, e.g. `Q5z`, `min_z`) are standardised using the mean and standard deviation of the estimated statistics across time periods at a given site (e.g. `Q5z = (Q5 - Q5mean) / Q5sd`). These standardised statistics are dimensionless, and so comparable across sites. Standardisation provides an alternative to scaling (described above) when one wishes flow statistics for larger and smaller watercourses to be comparable. Be aware that these standardised variables are calculated regardless of whether or not the raw flow data have been scaled (via the `scaling` argument).
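#'
#' As a sketch, for a single site the Q95 statistic is standardised across time periods as:
#'
#' ```
#' Q95z <- (Q95 - mean(Q95)) / sd(Q95)
#' ```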
#'
#' The function also includes the facility to standardise the statistics for one flow scenario (specified via `flow_col`) using mean and standard deviation flow statistics from another scenario (specified via `ref_col`). For example, if `flow_col` = naturalised flows and `ref_col` = historical flows, then the resulting statistics can be input into a hydro-ecological model that has previously been calibrated using standardised historical flow statistics and used to make predictions of ecological status under naturalised flows.
#'
#' @return The function returns a list of two data frames. The first data frame contains a suite of time-varying flow statistics for every time period at every site. The columns are as follows:
#'
#' - `flow_site_id`: a unique site id
#' - `win_no`: an autonumber counting the sequence of flow time periods
#' - `start_date`: start date of the time period (in yyyy-mm-dd format)
#' - `end_date`: end date of the time period (in yyyy-mm-dd format)
#' - `n_data`: the number of records with valid flows (not NA)
#' - `n_missing`: the number of missing flow records (flow = NA)
#' - `n_total`: the total number of flow records (sum of n_data and n_missing)
#' - `prop_missing`: the proportion of missing flow records (n_missing / n_total)
#' - `n_imputed`: the number of flow records that have been imputed (this is calculated only if the imputed_col argument is specified)
#' - `prop_imputed`: the proportion of flow records that have been imputed (calculated only if the imputed_col argument is specified)
#' - `mean`: mean flow (min. records required = 2)
#' - `sd`: the standard deviation of flows (min. records required = 2)
#' - `Q5`: the unstandardised Q5 flow (min. records required = 20)
#' - `Q10`: the unstandardised Q10 flow (min. records required = 10)
#' - `Q20`: the unstandardised Q20 flow (min. records required = 5)
#' - `Q25`: the unstandardised Q25 flow (min. records required = 4)
#' - `Q30`: the unstandardised Q30 flow (min. records required = 4)
#' - `Q50`: the unstandardised Q50 flow (min. records required = 2)
#' - `Q70`: the unstandardised Q70 flow (min. records required = 4)
#' - `Q75`: the unstandardised Q75 flow (min. records required = 4)
#' - `Q80`: the unstandardised Q80 flow (min. records required = 5)
#' - `Q90`: the unstandardised Q90 flow (min. records required = 10)
#' - `Q95`: the unstandardised Q95 flow (min. records required = 20)
#' - `Q99`: the unstandardised Q99 flow (min. records required = 100)
#' - `Q5z`: the Q5 flow, standardised using the mean and sd of the Q5 flows across all time periods for that site, i.e. `Q5z = (Q5 - Q5mean) / Q5sd`. If `ref_col` is not NULL, an additional `Q5z_ref` column is also returned, in which the Q5 estimated from the `flow_col` time series is standardised using the mean and sd parameters from the `ref_col` time series, i.e. `Q5z_ref = (Q5 - Q5mean_ref) / Q5sd_ref`. (min. records required = 20)
#' - `Q10z`: as for Q5z (min. records required = 10)
#' - `Q20z`: as for Q5z (min. records required = 5)
#' - `Q25z`: as for Q5z (min. records required = 4)
#' - `Q30z`: as for Q5z (min. records required = 4)
#' - `Q50z`: as for Q5z (min. records required = 2)
#' - `Q70z`: as for Q5z (min. records required = 4)
#' - `Q75z`: as for Q5z (min. records required = 4)
#' - `Q80z`: as for Q5z (min. records required = 5)
#' - `Q90z`: as for Q5z (min. records required = 10)
#' - `Q95z`: as for Q5z (min. records required = 20)
#' - `Q99z`: as for Q5z (min. records required = 100)
#' - `dry_n`: number of records with zero flow (min. records required = 2)
#' - `dry_e`: number of events when flow drops to zero (min. records required = 28)
#' - `dry_start`: day of year (1-366) of first zero flow record (min. records required = 28)
#' - `dry_end`: day of year (1-366) of last zero flow record (min. records required = 28)
#' - `dry_mid`: mean day of year (1-366) of all zero flow records (min. records required = 28)
#' - `low_n`: number of records when flow is below the q_low threshold (min. records required = 2)
#' - `low_e`: number of events when flow drops below the q_low threshold (min. records required = 28)
#' - `low_start`: day of year (1-366) of first record below the q_low threshold (min. records required = 28)
#' - `low_end`: day of year (1-366) of last record below the q_low threshold (min. records required = 28)
#' - `low_mid`: circular mean day of year (1-366) of all records below the q_low threshold (min. records required = 28)
#' - `low_magnitude`: mean flow deficit below q_low (min. records required = 28)
#' - `low_severity`: cumulative flow deficit below q_low (low_n x low_magnitude) (min. records required = 28)
#' - `high_n`: number of records when flow is above the q_high threshold (min. records required = 2)
#' - `high_e`: number of events when flow exceeds the q_high threshold (min. records required = 28)
#' - `high_start`: day of year (1-366) of first record above the q_high threshold (min. records required = 28)
#' - `high_end`: day of year (1-366) of last record above the q_high threshold (min. records required = 28)
#' - `high_mid`: circular mean day of year (1-366) of all records above the q_high threshold (min. records required = 28)
#' - `e_above3xq50`: number of events when flow exceeds 3 x the median (Q50) flow for that time period (min. records required = 28)
#' - `e_above5xq50`: number of events when flow exceeds 5 x the median (Q50) flow for that time period (min. records required = 28)
#' - `e_above7xq50`: number of events when flow exceeds 7 x the median (Q50) flow for that time period (min. records required = 28)
#' - `volume`: total volume discharged (sum of flows) (min. records required = 3)
#' - `vol_z`: as for Q5z (min. records required = 3)
#' - `min`: minimum flow (min. records required = 3)
#' - `min_z`: as for Q5z (min. records required = 3)
#' - `min_doy`: day of year (1-366) of minimum flow (min. records required = 3)
#' - `min_7day`: minimum 7-day mean flow (min. records required = 90)
#' - `min_7day_z`: as for Q5z (min. records required = 90)
#' - `min_7day_doy`: day of year (1-366) of midpoint of 7-day minimum flow period (min. records required = 90)
#' - `min_30day`: minimum 30-day mean flow (min. records required = 180)
#' - `min_30day_z`: as for Q5z (min. records required = 180)
#' - `min_30day_doy`: day of year (1-366) of midpoint of 30-day minimum flow period (min. records required = 180)
#' - `max`: maximum flow (min. records required = 3)
#' - `max_z`: as for Q5z (min. records required = 3)
#' - `max_doy`: day of year (1-366) of maximum flow (min. records required = 3)
#'
#' The second data frame contains long-term flow statistics. The data are arranged in long format, with the following columns:
#' - `flow_site_id`: a unique site id
#' - `start_date`: start date of the long-term time period (in yyyy-mm-dd format) for which the statistics are calculated
#' - `end_date`: end date of the long-term time period (in yyyy-mm-dd format) for which the statistics are calculated
#' - `parameter`: the statistic: long-term minimum, maximum and mean flow; long-term flow duration curve percentiles (P1 to P99); long-term base flow index (bfi = mean of the 7-day rolling minimum flows / mean flow); and the long-term mean and standard deviation of the time-varying Q5 to Q99, volume, minimum flow, maximum flow and 7-day minimum flow statistics
#' - `value`: the calculated statistic
#'
#' @export
#'
#' @examples
#' ## calculate flow statistics for a contiguous series of summer (April to September) and winter (October to March) time periods:
#' calc_flowstats(data = flow_data,
#'                win_start = "1995-04-01",
#'                win_width = "6 months",
#'                win_step = "6 months")
#'
#' ## calculate flow statistics for a non-contiguous series of summer (April to September) time periods:
#' calc_flowstats(data = flow_data,
#'                win_start = "1995-04-01",
#'                win_width = "6 months",
#'                win_step = "1 year")
#'
#' ## calculate flow statistics for a series of overlapping 24-month time periods:
#' calc_flowstats(data = flow_data,
#'                win_start = "1995-04-01",
#'                win_width = "24 months",
#'                win_step = "1 month")
#'
calc_flowstats <- function(data,
                           site_col = "flow_site_id",
                           date_col = "date",
                           flow_col = "flow",
                           imputed_col = NULL,
                           win_start = "1995-04-01",
                           win_width = "6 months",
                           win_step = "6 months",
                           date_range = NULL,
                           q_low = 95,
                           q_high = 70,
                           scaling = FALSE,
                           ref_col = NULL){
  if(!is.data.frame(data)){stop("Data frame not found")}
  if(!(date_col %in% colnames(data))){
    if(date_col == "date"){
      stop("Default 'date' column was not identified in data")
    } else {
      stop("Specified date column was not identified in data")
    }
  }
  if(!(site_col %in% colnames(data))){
    if(site_col == "flow_site_id"){
      stop("Default 'flow_site_id' column was not identified in data")
    } else {
      stop("Specified site column was not identified in data")
    }
  }
  if(!(flow_col %in% colnames(data))){
    if(flow_col == "flow"){
      stop("Default 'flow' column was not identified in data")
    } else {
      stop("Specified flow column was not identified in data")
    }
  }
  if(!is.null(imputed_col) && !(imputed_col %in% colnames(data)))
  {stop("Specified imputed_col was not identified in data")}
  if(!is.null(ref_col) && !(ref_col %in% colnames(data)))
  {stop("Specified ref_col was not identified in data")}
  if(!is.null(win_start) && is.na(lubridate::parse_date_time(win_start, orders = "Ymd")))
  {stop("win_start should be in YYYY-MM-DD format")}
  if(!is.null(win_start) && isTRUE(as.Date(win_start) > Sys.Date()))
  {stop("win_start is in the future")}
matches <- c("day", "month", "year", "week")
if(is.null(win_width) == FALSE && grepl(matches[1], win_width) == TRUE){""} else {
if(is.null(win_width) == FALSE && grepl(matches[2], win_width) == TRUE){""} else {
if(is.null(win_width) == FALSE && grepl(matches[3], win_width) == TRUE){""} else {
if(is.null(win_width) == FALSE && grepl(matches[4], win_width) == TRUE){""} else {
stop("win_width must be in day, month, or year format")
}
}
}
}
if(is.null(win_step) == FALSE && grepl(matches[1], win_step) == TRUE){""} else {
if(is.null(win_step) == FALSE && grepl(matches[2], win_step) == TRUE){""} else {
if(is.null(win_step) == FALSE && grepl(matches[3], win_step) == TRUE){""} else {
if(is.null(win_step) == FALSE && grepl(matches[4], win_step) == TRUE){""} else {
stop("win_step must be in day, month, or year format")
}
}
}
}
  if(!is.null(date_range) && length(date_range) != 2)
  {stop("date_range must be a vector of two dates")}
  if(!is.null(date_range) && any(is.na(lubridate::parse_date_time(date_range, orders = "Ymd"))))
  {stop("date_range should be in YYYY-MM-DD format")}
  if(!is.null(date_range) && isTRUE(date_range[1] >= date_range[2]))
  {stop("start date exceeds end date, please check date_range")}
  if(!is.null(date_range) && isTRUE(any(as.Date(date_range) > Sys.Date())))
  {stop("date_range must not extend into the future")}
  if(!is.logical(scaling)){stop("'scaling' is not logical")}
  if(!is.null(q_low) && !dplyr::between(q_low, 1, 99))
  {stop("q_low must be a value between 1 and 99")}
  if(!is.null(q_low) && !testInteger(q_low))
  {stop("q_low must be an integer between 1 and 99")}
  if(!is.null(q_high) && !dplyr::between(q_high, 1, 99))
  {stop("q_high must be a value between 1 and 99")}
  if(!is.null(q_high) && !testInteger(q_high))
  {stop("q_high must be an integer between 1 and 99")}
# pull-in data
data_1 <- data
data_1$site <- dplyr::pull(data_1, site_col)
data_1$date <- dplyr::pull(data_1, date_col)
data_1$flow <- dplyr::pull(data_1, flow_col)
data_1 <- data_1 %>% dplyr::arrange(., site, date)
# Check flow is numeric
if(is.numeric(data_1$flow) == FALSE)
{stop("Specified flow_col is not numeric")}
# Check for duplicate dates in each site
duplicates<-data_1$date[duplicated(data.table::data.table(data_1),by=c("site","date"))]
if(length(duplicates) >= 1)
{ print(duplicates)
stop("Duplicate dates identified")}
# filter by date_range
if(is.null(date_range) == FALSE){
data_1 <- data_1 %>% dplyr::filter(date <= date_range[2] & date >= date_range[1])
}
# Create a large data frame with all data windows, for all sites, up to present day
sites <- data.frame(site = unique(data_1$site))
expand_dates <- expand.grid(start_date = seq(as.Date(win_start), Sys.Date(), paste("+", sep = "", win_step)), stringsAsFactors = FALSE)
# get numeric win_width
width <- data.frame(win_width)
width_sep <- tidyr::separate(width, win_width, c("no", "period"))
width_no <- as.numeric(width_sep$no)
  # if win_width is in days
  if(!is.null(win_width) && grepl("day", win_width)){
    all_dates <- expand_dates %>% dplyr::mutate(end_date = start_date %m+% lubridate::days(width_no) - lubridate::days(1),
                                                win_no = 1:nrow(.))
  }
  # if win_width is in weeks
  if(!is.null(win_width) && grepl("week", win_width)){
    all_dates <- expand_dates %>% dplyr::mutate(end_date = start_date %m+% lubridate::weeks(width_no) - lubridate::days(1),
                                                win_no = 1:nrow(.))
  }
  # if win_width is in months
  if(!is.null(win_width) && grepl("month", win_width)){
    all_dates <- expand_dates %>% dplyr::mutate(end_date = start_date %m+% months(width_no) - lubridate::days(1),
                                                win_no = 1:nrow(.))
  }
  # if win_width is in years
  if(!is.null(win_width) && grepl("year", win_width)){
    all_dates <- expand_dates %>% dplyr::mutate(end_date = start_date %m+% lubridate::years(width_no) - lubridate::days(1),
                                                win_no = 1:nrow(.))
  }
# assign full set of moving windows to each site
win_dates <- dplyr::full_join(all_dates, sites, by = character())
win_dates$win_no <- as.character(win_dates$win_no)
all_win_dates <- win_dates %>% dplyr::rowwise() %>%
dplyr::do(data.frame(site =.$site,
date = seq(.$start_date, .$end_date, by="days"),
win_no =.$win_no,
win_start_date = .$start_date,
win_end_date = .$end_date))
  # if scaling = TRUE, standardise flow data by the long-term mean flow (for each site)
  if(scaling == TRUE){
    data_1 <- data_1 %>%
      dplyr::group_by(site) %>%
      dplyr::mutate(mean_f = mean(flow, na.rm = TRUE)) %>%
      dplyr::mutate(flow = flow / mean_f) %>%
      dplyr::select(-mean_f) %>%
      dplyr::ungroup()
  }
# join window dates to flow data for CalcFlowStats
my_data_cf <- dplyr::left_join(all_win_dates, data_1, by = c("site", "date"))
# drop columns not required for flow calcs (these will be re-joined at the end)
my_data_cf_2 <- my_data_cf %>% dplyr::select(site, date, win_no, flow)
my_data_lt <- data_1 %>% dplyr::select(site, date, flow)
# calculate time-varying flow statistics
STATS1 <- suppressWarnings(my_data_cf_2 %>% CalcFlowStats())
# calculate long-term flow statistics
long_data <- suppressWarnings(CreateLongData(flow.data = my_data_lt, statsData = STATS1))
# calculate remaining time-varying flow statistics (durations/events, 7/30-day mins)
flow_stats <- suppressWarnings(CreateFlowStats(STATS1, long_data, my_data_cf_2, data_1, q_high = q_high, q_low = q_low))
# join time-varying statistics with win_dates, rename site column and drop unwanted columns
df1 <- dplyr::left_join(win_dates, flow_stats, by=c("site", "win_no")) %>%
dplyr::rename(flow_site_id = site) %>%
dplyr::select(-Q5mean, -Q5sd, -Q10mean, -Q10sd, -Q20mean, -Q20sd,
-Q25mean, -Q25sd, -Q30mean, -Q30sd, -Q50mean, -Q50sd,
-Q70mean, -Q70sd, -Q75mean, -Q75sd, -Q80mean, -Q80sd, -Q90mean, -Q90sd,
-Q95mean, -Q95sd, -Q99mean, -Q99sd, -min_7day_mean, -min_7day_sd,
-min_30day_mean, -min_30day_sd, -min_mean, -min_sd, -max_mean, -max_sd,
-vol_mean, -vol_sd)
# re-order columns
if(is.null(imputed_col) == TRUE){
col_order <- c("flow_site_id", "win_no", "start_date", "end_date", "n_data", "n_missing", "n_total", "prop_missing", "mean", "sd", "Q5", "Q10", "Q20", "Q25", "Q30", "Q50", "Q70", "Q75", "Q80", "Q90", "Q95", "Q99", "Q5z", "Q10z", "Q20z", "Q25z", "Q30z", "Q50z", "Q70z", "Q75z", "Q80z", "Q90z", "Q95z", "Q99z", "dry_n", "dry_e", "dry_start", "dry_end", "dry_mid", "low_n", "low_e", "low_start", "low_end", "low_mid", "low_magnitude", "low_severity", "high_n", "high_e", "high_start", "high_end", "high_mid", "e_above3xq50", "e_above5xq50", "e_above7xq50", "volume", "vol_z", "min", "min_z", "min_doy", "min_7day", "min_7day_z", "min_7day_doy", "min_30day", "min_30day_z", "min_30day_doy", "max", "max_z", "max_doy")
df1 <- df1[, col_order]
}
if(is.null(imputed_col) == FALSE){
# pull-in imputed col
my_data <- my_data_cf
my_data$imputed <- dplyr::pull(my_data, imputed_col)
test_impute <- data_1
test_impute$imputed <- dplyr::pull(test_impute, imputed_col)
  if(!isTRUE(all(test_impute$imputed %in% c(0, 1))))
  {stop("imputed_col values must be integer value of 0 or 1")}
  # count imputed flow records in each time window
  count_imputed <- my_data %>%
    dplyr::group_by(site, win_no) %>%
    dplyr::summarise(n_imputed = sum(imputed == 1, na.rm = TRUE)) %>%
    dplyr::ungroup() %>%
    dplyr::rename(flow_site_id = site)
  # add in count and proportion of imputed flow records
  df1 <- dplyr::left_join(df1, count_imputed, by = c("flow_site_id", "win_no")) %>%
    dplyr::mutate(prop_imputed = n_imputed / n_data)
# re-order cols
col_order <- c("flow_site_id", "win_no", "start_date", "end_date", "n_data", "n_missing", "n_total", "prop_missing", "n_imputed", "prop_imputed", "mean", "sd", "Q5", "Q10", "Q20", "Q25", "Q30", "Q50", "Q70", "Q75", "Q80", "Q90", "Q95", "Q99", "Q5z", "Q10z", "Q20z", "Q25z", "Q30z", "Q50z", "Q70z", "Q75z", "Q80z", "Q90z", "Q95z", "Q99z", "dry_n", "dry_e", "dry_start", "dry_end", "dry_mid", "low_n", "low_e", "low_start", "low_end", "low_mid", "low_magnitude", "low_severity", "high_n", "high_e", "high_start", "high_end", "high_mid", "e_above3xq50", "e_above5xq50", "e_above7xq50", "volume", "vol_z", "min", "min_z", "min_doy", "min_7day", "min_7day_z", "min_7day_doy", "min_30day", "min_30day_z", "min_30day_doy", "max", "max_z", "max_doy")
df1 <- df1[, col_order]
}
# add min & max dates to long_data, to create df2 for final function output
# get start and end dates for long term stats
minmax_dates <- data_1 %>% dplyr::group_by(site) %>%
dplyr::summarise(start_date = min(date),
end_date = max(date))
# join with long_data
long_data_2 <- long_data %>% dplyr::full_join(minmax_dates, by = "site") %>%
dplyr::rename(flow_site_id = site) %>%
dplyr::select(-win_no)
# re-order cols
cols_order <- c("flow_site_id", "start_date", "end_date", "parameter", "value")
long_data_2 <- long_data_2[, cols_order]
# rename
df2 <- long_data_2
# Remove flow stats where the minimum data requirements have not been met
df1.1 <- df1 %>%
dplyr::mutate(mean = ifelse(n_data >= 2, mean, NA),
sd = ifelse(n_data >= 2, sd, NA),
Q5 = ifelse(n_data >= 20, Q5, NA),
Q10 = ifelse(n_data >= 10, Q10, NA),
Q20 = ifelse(n_data >= 5, Q20, NA),
Q25 = ifelse(n_data >= 4, Q25, NA),
Q30 = ifelse(n_data >= 4, Q30, NA),
Q50 = ifelse(n_data >= 2, Q50, NA),
Q70 = ifelse(n_data >= 4, Q70, NA),
Q75 = ifelse(n_data >= 4, Q75, NA),
Q80 = ifelse(n_data >= 5, Q80, NA),
Q90 = ifelse(n_data >= 10, Q90, NA),
Q95 = ifelse(n_data >= 20, Q95, NA),
Q99 = ifelse(n_data >= 100, Q99, NA),
Q5z = ifelse(n_data >= 20, Q5z, NA),
Q10z = ifelse(n_data >= 10, Q10z, NA),
Q20z = ifelse(n_data >= 5, Q20z, NA),
Q25z = ifelse(n_data >= 4, Q25z, NA),
Q30z = ifelse(n_data >= 4, Q30z, NA),
Q50z = ifelse(n_data >= 2, Q50z, NA),
Q70z = ifelse(n_data >= 4, Q70z, NA),
Q75z = ifelse(n_data >= 4, Q75z, NA),
Q80z = ifelse(n_data >= 5, Q80z, NA),
Q90z = ifelse(n_data >= 10, Q90z, NA),
Q95z = ifelse(n_data >= 20, Q95z, NA),
Q99z = ifelse(n_data >= 100, Q99z, NA),
dry_n = ifelse(n_data >= 2, dry_n, NA),
dry_e = ifelse(n_data >= 28, dry_e, NA),
dry_start = ifelse(n_data >= 28, dry_start, NA),
dry_end = ifelse(n_data >= 28, dry_end, NA),
dry_mid = ifelse(n_data >= 28, dry_mid, NA),
low_n = ifelse(n_data >= 2, low_n, NA),
low_e = ifelse(n_data >= 28, low_e, NA),
low_start = ifelse(n_data >= 28, low_start, NA),
low_end = ifelse(n_data >= 28, low_end, NA),
low_mid = ifelse(n_data >= 28, low_mid, NA),
low_magnitude = ifelse(n_data >= 28, low_magnitude, NA),
low_severity = ifelse(n_data >= 28, low_severity, NA),
high_n = ifelse(n_data >= 2, high_n, NA),
high_e = ifelse(n_data >= 28, high_e, NA),
high_start = ifelse(n_data >= 28, high_start, NA),
high_end = ifelse(n_data >= 28, high_end, NA),
high_mid = ifelse(n_data >= 28, high_mid, NA),
e_above3xq50 = ifelse(n_data >= 28, e_above3xq50, NA),
e_above5xq50 = ifelse(n_data >= 28, e_above5xq50, NA),
e_above7xq50 = ifelse(n_data >= 28, e_above7xq50, NA),
volume = ifelse(n_data >= 3, volume, NA),
vol_z = ifelse(n_data >= 3, vol_z, NA),
min = ifelse(n_data >= 3, min, NA),
min_z = ifelse(n_data >= 3, min_z, NA),
min_doy = ifelse(n_data >= 3, min_doy, NA),
min_7day = ifelse(n_data >= 90, min_7day, NA),
min_7day_z = ifelse(n_data >= 90, min_7day_z, NA),
min_7day_doy = ifelse(n_data >= 90, min_7day_doy, NA),
min_30day = ifelse(n_data >= 180, min_30day, NA),
min_30day_z = ifelse(n_data >= 180, min_30day_z, NA),
min_30day_doy = ifelse(n_data >= 180, min_30day_doy, NA),
max = ifelse(n_data >= 3, max, NA),
max_z = ifelse(n_data >= 3, max_z, NA),
max_doy = ifelse(n_data >= 3, max_doy, NA))
df1.1$win_no <- as.numeric(df1.1$win_no)
df1.1 <- dplyr::arrange(df1.1, flow_site_id, win_no)
### create final data to return
final_list <- list(df1.1, df2)
### if a ref_col is specified
# calculate ref values using ref_col
if(is.null(ref_col) == FALSE){
# pull-in data
data_ref <- data
data_ref$site <- dplyr::pull(data_ref, site_col)
data_ref$date <- dplyr::pull(data_ref, date_col)
data_ref$flow <- dplyr::pull(data_ref, ref_col)
if(is.numeric(data_ref$flow) == FALSE)
{stop("Specified ref_col is not numeric")}
# filter by date_range
if(is.null(date_range) == FALSE){
data_ref <- data_ref %>% dplyr::filter(date <= date_range[2] & date >= date_range[1])
}
# join window dates to ref data
my_data_ref <- dplyr::left_join(all_win_dates, data_ref, by = c("site", "date"))
# drop columns not required for flow calcs
my_data_ref_2 <- my_data_ref %>% dplyr::select(site, date, win_no, flow)
# calculate time varying flow statistics
STATS1_ref <- suppressWarnings(CalcFlowStats(my_data_ref_2))
# Get QXz_adj
# select the data we need from _ref dataset and rename
ref_data <- STATS1_ref %>%
dplyr::select(site, win_no, Q5mean, Q5sd, Q10mean, Q10sd, Q20mean, Q20sd,
Q25mean, Q25sd, Q30mean, Q30sd, Q50mean, Q50sd, Q70mean, Q70sd,
Q75mean, Q75sd, Q80mean, Q80sd, Q90mean, Q90sd, Q95mean, Q95sd,
Q99mean, Q99sd, min_mean, min_sd, max_mean, max_sd,
min_30day_mean, min_30day_sd, min_7day_mean, min_7day_sd,
vol_mean, vol_sd)
ref_data <- ref_data %>% dplyr::rename(flow_site_id = site,
Q5mean_ref = Q5mean,
Q5sd_ref = Q5sd,
Q10mean_ref = Q10mean,
Q10sd_ref = Q10sd,
Q20mean_ref = Q20mean,
Q20sd_ref = Q20sd,
Q25mean_ref = Q25mean,
Q25sd_ref = Q25sd,
Q30mean_ref = Q30mean,
Q30sd_ref = Q30sd,
Q50mean_ref = Q50mean,
Q50sd_ref = Q50sd,
Q70mean_ref = Q70mean,
Q70sd_ref = Q70sd,
Q75mean_ref = Q75mean,
Q75sd_ref = Q75sd,
Q80mean_ref = Q80mean,
Q80sd_ref = Q80sd,
Q90mean_ref = Q90mean,
Q90sd_ref = Q90sd,
Q95mean_ref = Q95mean,
Q95sd_ref = Q95sd,
Q99mean_ref = Q99mean,
Q99sd_ref = Q99sd,
minmean_ref = min_mean,
minsd_ref = min_sd,
maxmean_ref = max_mean,
maxsd_ref = max_sd,
volmean_ref = vol_mean,
volsd_ref = vol_sd,
min_day30mean_ref = min_30day_mean,
min_day30sd_ref = min_30day_sd,
min_day7mean_ref = min_7day_mean,
min_day7sd_ref = min_7day_sd)
# calculate _adj values
df1_ALL <- dplyr::left_join(x = df1, y = ref_data,
by = c("flow_site_id", "win_no"))
  # Get QXz_ref: standardise the flow_col statistics using the ref_col mean and sd
  z_ref_stats <- c("Q5", "Q10", "Q20", "Q25", "Q30", "Q50",
                   "Q70", "Q75", "Q80", "Q90", "Q95", "Q99")
  for(q in z_ref_stats){
    df1_ALL[[paste0(q, "z_ref")]] <-
      (df1_ALL[[q]] - df1_ALL[[paste0(q, "mean_ref")]]) / df1_ALL[[paste0(q, "sd_ref")]]
  }
  # Get vol_z_ref, min_z_ref, max_z_ref and the 7-day and 30-day minimum z_refs
  df1_ALL$vol_z_ref <- (df1_ALL$volume - df1_ALL$volmean_ref) / df1_ALL$volsd_ref
  df1_ALL$min_z_ref <- (df1_ALL$min - df1_ALL$minmean_ref) / df1_ALL$minsd_ref
  df1_ALL$max_z_ref <- (df1_ALL$max - df1_ALL$maxmean_ref) / df1_ALL$maxsd_ref
  df1_ALL$min_7day_z_ref <- (df1_ALL$min_7day - df1_ALL$min_day7mean_ref) / df1_ALL$min_day7sd_ref
  df1_ALL$min_30day_z_ref <- (df1_ALL$min_30day - df1_ALL$min_day30mean_ref) / df1_ALL$min_day30sd_ref
df1_ALL <- df1_ALL %>%
dplyr::select(-Q5mean_ref, -Q5sd_ref, -Q10mean_ref, -Q10sd_ref, -Q20mean_ref,
-Q20sd_ref, -Q25mean_ref, -Q25sd_ref, -Q30mean_ref, -Q30sd_ref,
-Q50mean_ref, -Q50sd_ref, -Q70mean_ref, -Q70sd_ref, -Q75mean_ref,
-Q75sd_ref, -Q80mean_ref, -Q80sd_ref, -Q90mean_ref, -Q90sd_ref,
-Q95mean_ref, -Q95sd_ref,-Q99mean_ref, -Q99sd_ref, -minmean_ref,
-minsd_ref, -maxmean_ref, -maxsd_ref, -volmean_ref, -volsd_ref,
-min_day30mean_ref, -min_day30sd_ref, -min_day7mean_ref, -min_day7sd_ref)
df1_ALL.1 <- df1_ALL %>%
dplyr::mutate(mean = ifelse(n_data >= 2, mean, NA),
sd = ifelse(n_data >= 2, sd, NA),
Q5 = ifelse(n_data >= 20, Q5, NA),
Q10 = ifelse(n_data >= 10, Q10, NA),
Q20 = ifelse(n_data >= 5, Q20, NA),
Q25 = ifelse(n_data >= 4, Q25, NA),
Q30 = ifelse(n_data >= 4, Q30, NA),
Q50 = ifelse(n_data >= 2, Q50, NA),
Q70 = ifelse(n_data >= 4, Q70, NA),
Q75 = ifelse(n_data >= 4, Q75, NA),
Q80 = ifelse(n_data >= 5, Q80, NA),
Q90 = ifelse(n_data >= 10, Q90, NA),
Q95 = ifelse(n_data >= 20, Q95, NA),
Q99 = ifelse(n_data >= 100, Q99, NA),
Q5z = ifelse(n_data >= 20, Q5z, NA),
Q10z = ifelse(n_data >= 10, Q10z, NA),
Q20z = ifelse(n_data >= 5, Q20z, NA),
Q25z = ifelse(n_data >= 4, Q25z, NA),
Q30z = ifelse(n_data >= 4, Q30z, NA),
Q50z = ifelse(n_data >= 2, Q50z, NA),
Q70z = ifelse(n_data >= 4, Q70z, NA),
Q75z = ifelse(n_data >= 4, Q75z, NA),
Q80z = ifelse(n_data >= 5, Q80z, NA),
Q90z = ifelse(n_data >= 10, Q90z, NA),
Q95z = ifelse(n_data >= 20, Q95z, NA),
Q99z = ifelse(n_data >= 100, Q99z, NA),
dry_n = ifelse(n_data >= 2, dry_n, NA),
dry_e = ifelse(n_data >= 28, dry_e, NA),
dry_start = ifelse(n_data >= 28, dry_start, NA),
dry_end = ifelse(n_data >= 28, dry_end, NA),
dry_mid = ifelse(n_data >= 28, dry_mid, NA),
low_n = ifelse(n_data >= 2, low_n, NA),
low_e = ifelse(n_data >= 28, low_e, NA),
low_start = ifelse(n_data >= 28, low_start, NA),
low_end = ifelse(n_data >= 28, low_end, NA),
low_mid = ifelse(n_data >= 28, low_mid, NA),
low_magnitude = ifelse(n_data >= 28, low_magnitude, NA),
low_severity = ifelse(n_data >= 28, low_severity, NA),
high_n = ifelse(n_data >= 2, high_n, NA),
high_e = ifelse(n_data >= 28, high_e, NA),
high_start = ifelse(n_data >= 28, high_start, NA),
high_end = ifelse(n_data >= 28, high_end, NA),
high_mid = ifelse(n_data >= 28, high_mid, NA),
e_above3xq50 = ifelse(n_data >= 28, e_above3xq50, NA),
e_above5xq50 = ifelse(n_data >= 28, e_above5xq50, NA),
e_above7xq50 = ifelse(n_data >= 28, e_above7xq50, NA),
volume = ifelse(n_data >= 3, volume, NA),
vol_z = ifelse(n_data >= 3, vol_z, NA),
min = ifelse(n_data >= 3, min, NA),
min_z = ifelse(n_data >= 3, min_z, NA),
min_doy = ifelse(n_data >= 3, min_doy, NA),
min_7day = ifelse(n_data >= 90, min_7day, NA),
min_7day_z = ifelse(n_data >= 90, min_7day_z, NA),
min_7day_doy = ifelse(n_data >= 90, min_7day_doy, NA),
min_30day = ifelse(n_data >= 180, min_30day, NA),
min_30day_z = ifelse(n_data >= 180, min_30day_z, NA),
min_30day_doy = ifelse(n_data >= 180, min_30day_doy, NA),
max = ifelse(n_data >= 3, max, NA),
max_z = ifelse(n_data >= 3, max_z, NA),
max_doy = ifelse(n_data >= 3, max_doy, NA),
Q5z_ref = ifelse(n_data >= 20, Q5z_ref, NA),
Q10z_ref = ifelse(n_data >= 10, Q10z_ref, NA),
Q20z_ref = ifelse(n_data >= 5, Q20z_ref, NA),
Q25z_ref = ifelse(n_data >= 4, Q25z_ref, NA),
Q30z_ref = ifelse(n_data >= 4, Q30z_ref, NA),
Q50z_ref = ifelse(n_data >= 2, Q50z_ref, NA),
Q70z_ref = ifelse(n_data >= 4, Q70z_ref, NA),
Q75z_ref = ifelse(n_data >= 4, Q75z_ref, NA),
Q80z_ref = ifelse(n_data >= 5, Q80z_ref, NA),
Q90z_ref = ifelse(n_data >= 10, Q90z_ref, NA),
Q95z_ref = ifelse(n_data >= 20, Q95z_ref, NA),
Q99z_ref = ifelse(n_data >= 100, Q99z_ref, NA),
vol_z_ref = ifelse(n_data >= 3, vol_z_ref, NA),
min_z_ref = ifelse(n_data >= 3, min_z_ref, NA),
max_z_ref = ifelse(n_data >= 3, max_z_ref, NA),
min_7day_z_ref = ifelse(n_data >= 90, min_7day_z_ref, NA),
min_30day_z_ref = ifelse(n_data >= 180, min_30day_z_ref, NA)
)
df1_ALL.1$win_no <- as.numeric(df1_ALL.1$win_no)
df1_ALL.1 <- dplyr::arrange(df1_ALL.1, flow_site_id, win_no)
# Create a list of output_1 (with adj QXz values) and output_2
final_list_ref <- list(df1_ALL.1, df2)
}
if(is.null(ref_col) == TRUE){return(final_list)}
if(is.null(ref_col) == FALSE){return(final_list_ref)}
}
####################################################################
## CalcFlowStats
# Required as part of calc_flowstats (main function)
# calculates Qstats, min, max, volume, min_7day, min_30day
# calculates mean & sd for all parameters, then calculates z-stats
# Output passed to CreateLongData and CreateFlowStats
CalcFlowStats <- function (flowts) {
# calculate missing data before NAs are removed
MISSING <- flowts %>%
dplyr::group_by(site, win_no) %>%
dplyr::summarise(n_data = sum(!is.na(flow)),
n_missing = sum(is.na(flow)),
n_total = n_missing + n_data,
prop_missing = n_missing / n_total)
# remove NAs and calculate flow stats for each site by time period (win_no)
flowts <- flowts %>%
dplyr::group_by(site, win_no) %>%
dplyr::filter(!is.na(flow)) %>%
dplyr::summarise(n_data = length(flow),
mean = mean(flow),
sd = sd(flow),
Q5 = quantile(flow, probs=0.95, na.rm=TRUE),
Q10 = quantile(flow, probs=0.9, na.rm=TRUE),
Q20 = quantile(flow, probs=0.8, na.rm=TRUE),
Q25 = quantile(flow, probs=0.75, na.rm=TRUE),
Q30 = quantile(flow, probs=0.7, na.rm=TRUE),
Q50 = quantile(flow, probs=0.5,na.rm=TRUE),
Q70 = quantile(flow, probs=0.3,na.rm=TRUE),
Q75 = quantile(flow, probs=0.25, na.rm=TRUE),
Q80 = quantile(flow, probs=0.2, na.rm=TRUE),
Q90 = quantile(flow, probs=0.1,na.rm=TRUE),
Q95 = quantile(flow, probs=0.05,na.rm=TRUE),
Q99 = quantile(flow, probs=0.01,na.rm=TRUE),
volume = sum(flow),
min = min(flow, na.rm = TRUE),
min_7day = min(zoo::rollmean(flow, 7), na.rm = TRUE),
min_30day = min(zoo::rollmean(flow, 30), na.rm = TRUE),
max = max(flow, na.rm = TRUE),
e_above3xq50 = riisbiggs2(flow, Q50, 3),
e_above5xq50 = riisbiggs2(flow, Q50, 5),
e_above7xq50 = riisbiggs2(flow, Q50, 7)) %>%
dplyr::ungroup() %>%
dplyr::group_by(site) %>%
dplyr::mutate(Q5mean = mean(Q5),
Q10mean = mean(Q10),
Q20mean = mean(Q20),
Q25mean = mean(Q25),
Q30mean = mean(Q30),
Q50mean = mean(Q50),
Q70mean = mean(Q70),
Q75mean = mean(Q75),
Q80mean = mean(Q80),
Q90mean = mean(Q90),
Q95mean = mean(Q95),
Q99mean = mean(Q99),
vol_mean = mean(volume),
min_mean = mean(min),
min_7day_mean = mean(min_7day),
min_30day_mean = mean(min_30day),
max_mean = mean(max),
Q5sd = sd(Q5),
Q10sd = sd(Q10),
Q20sd = sd(Q20),
Q25sd = sd(Q25),
Q30sd = sd(Q30),
Q50sd = sd(Q50),
Q70sd = sd(Q70),
Q75sd = sd(Q75),
Q80sd = sd(Q80),
Q90sd = sd(Q90),
Q95sd = sd(Q95),
Q99sd = sd(Q99),
vol_sd = sd(volume),
min_sd = sd(min),
min_7day_sd = sd(min_7day),
min_30day_sd = sd(min_30day),
max_sd = sd(max),
Q5z = (Q5-Q5mean)/Q5sd,
Q10z = (Q10-Q10mean)/Q10sd,
Q20z = (Q20-Q20mean)/Q20sd,
Q25z = (Q25-Q25mean)/Q25sd,
Q30z = (Q30-Q30mean)/Q30sd,
Q50z = (Q50-Q50mean)/Q50sd,
Q70z = (Q70-Q70mean)/Q70sd,
Q75z = (Q75-Q75mean)/Q75sd,
Q80z = (Q80-Q80mean)/Q80sd,
Q90z = (Q90-Q90mean)/Q90sd,
Q95z = (Q95-Q95mean)/Q95sd,
Q99z = (Q99-Q99mean)/Q99sd,
vol_z = (volume - vol_mean) / vol_sd,
min_z = (min - min_mean) / min_sd,
min_7day_z = (min_7day - min_7day_mean) / min_7day_sd,
min_30day_z = (min_30day - min_30day_mean) / min_30day_sd,
max_z = (max - max_mean) / max_sd) %>%
dplyr::full_join(MISSING, by = c("site", "win_no", "n_data"))
return(flowts)
}
###########################################################
## CreateLongData
# Required as part of calc_flowstats (main function)
# output then used in CreateFlowStats
# calculates BFI
# calculates flow duration curve
# calculates 'long_data' required for CreateFlowStats
CreateLongData <- function(flow.data, statsData) {
# rename data
flow_data <- flow.data
QSTATS1 <- statsData
  # calculate base flow index using calc_bfi
  # (bfi = mean of the 7-day rolling minimum flows / overall mean flow)
  bfi <- flow_data %>%
    dplyr::filter(!is.na(flow)) %>%
    dplyr::mutate(x = flow) %>%
    dplyr::group_by(site) %>%
    dplyr::summarise(bfi = calc_bfi(x)) %>%
dplyr::mutate(win_no = "Annual") %>%
dplyr::ungroup() %>%
tidyr::gather(-site, -win_no, key = parameter, value = value)
  # calculate long-term flow duration curve percentiles 1:99
  # note: fasstr::calc_longterm_percentile() is grouped via its 'groups' argument, because
  # wrapping it in group_by()/group_modify() did not apply the calculation to each site
  # (it warns that the calculation is ignoring missing values, even when NAs are removed)
  FlowDurationCurve <- flow_data %>%
    dplyr::filter(!is.na(flow)) %>%
    fasstr::calc_longterm_percentile(groups = site, dates = date, values = flow, percentiles = c(1:99), transpose = TRUE) %>%
    dplyr::mutate(win_no = "Annual") %>%
    setNames(., c("site", "parameter", "value", "win_no")) %>%
    dplyr::mutate(parameter = as.character(parameter))
  # create a long dataframe containing the means and sds for each Q statistic, plus BFI and flow duration curve
  long_data1 <- QSTATS1 %>%
    dplyr::filter(!is.na(Q30)) %>%
    # make dataframe long
    tidyr::gather(-site, -win_no, key = stat, value = stat_value) %>%
    # select just the stats we want
    dplyr::filter(stat %in% c("Q5", "Q10", "Q20", "Q25", "Q30", "Q50", "Q70", "Q75",
                              "Q80", "Q90", "Q95", "Q99", "volume", "min", "min_7day", "max")) %>%
    dplyr::group_by(stat, site) %>%
    # calculate the mean and sd for each statistic by site
    dplyr::summarise(mean = mean(stat_value, na.rm = TRUE), sd = sd(stat_value, na.rm = TRUE)) %>%
    dplyr::ungroup() %>%
    # gather the mean and sd columns into a value column plus a key column (i.e. mean or sd)
    tidyr::gather(sd, mean, key = id, value = value) %>%
    # concatenate stat (e.g. Q10) and id (mean or sd) columns (e.g. Q10mean, Q10sd)
    dplyr::mutate(parameter = paste0(stat, id)) %>%
    dplyr::select(-stat, -id) %>%
    dplyr::mutate(win_no = "Q_Annual")
  # create a long dataframe containing overall mean, min and max flows, and merge with above datasets
  long_data <- QSTATS1 %>%
    dplyr::filter(!is.na(Q30)) %>%
    # make dataframe long
    tidyr::gather(-site, -win_no, key = stat, value = stat_value) %>%
    # select just the stats we want
    dplyr::filter(stat %in% c("min", "max", "mean")) %>%
    dplyr::group_by(stat, site) %>%
    # calculate the mean of each statistic by site
    dplyr::summarise(mean = mean(stat_value, na.rm = TRUE)) %>%
    dplyr::ungroup() %>%
    # gather the mean column into a value column plus a key column
    tidyr::gather(mean, key = id, value = value) %>%
    dplyr::rename(parameter = stat) %>%
    dplyr::select(-id) %>%
dplyr::mutate(win_no = "Q_Annual") %>%
dplyr::bind_rows(long_data1) %>%
dplyr::bind_rows(bfi) %>%
dplyr::bind_rows(FlowDurationCurve)
return(long_data)
}
#######################################################################################
## CreateFlowStats
# Required for calc_flowstats (main function)
# calculates durations above/below qhigh/qlow, and zero flow durations
# Calculates the minimum/maximum flow date as DOY, plus the DOY of the 7-day and 30-day minimum flow periods
CreateFlowStats <- function(stats_data, long.data, station_data, original_data, q_high, q_low) {
# rename inputs
QSTATS1 <- stats_data
long_data <- long.data
flow_data <- station_data
q_high <- as.numeric(q_high); q_low <- as.numeric(q_low)
data_1 <- original_data
# filter data for calculating zero events, events above/below q_high/q_low
thres_data <- flow_data %>%
dplyr::filter(!is.na(flow))
# find zero flow days
zeros <- thres_data %>%
dplyr::group_by(site, win_no) %>%
dplyr::do(zero_eventsDuration(.))
colnames(zeros) <- c("site", "win_no", "dry_n", "dry_e", "dry_start", "dry_end", "dry_mid") # rename columns
# calculate QHigh and QLow
# needed to calculate durations above and below threshold value
high <- (100 - q_high)
low <- (100 - q_low)
Q_high <- paste("P", sep = "", high)
Q_low <- paste("P", sep = "", low)
Qhigh <- long_data %>%
dplyr::filter(parameter == Q_high)
Qlow <- long_data %>%
dplyr::filter(parameter == Q_low)
Qvals <- rbind(Qhigh, Qlow)
  Qvals2 <- Qvals %>%
    tidyr::spread(., key = parameter, value = value) %>%
    dplyr::select(-win_no)
  # match thresholds to percentile columns by name (spread orders columns
  # alphabetically, so positional renaming could swap the two thresholds)
  Qvals2 <- data.frame(site = Qvals2$site,
                       qLow = Qvals2[[Q_low]],
                       qHigh = Qvals2[[Q_high]])
thres_dataQ <- dplyr::left_join(thres_data, Qvals2, by = "site")
# calculate duration/events above q_low and q_high
duration_above <- thres_dataQ %>%
dplyr::group_by(site, win_no, qHigh) %>%
dplyr::do(find_eventDuration(., unique(.$qHigh), "high")) %>%
dplyr::ungroup() %>%
dplyr::select(-qHigh)
colnames(duration_above) <- c("site", "win_no", "high_n", "high_e", "high_start", "high_end", "high_mid") # rename columns
duration_below <- thres_dataQ %>%
dplyr::group_by(site, win_no, qLow) %>%
dplyr::do(find_eventDuration(., unique(.$qLow), "low")) %>%
dplyr::ungroup() %>%
dplyr::select(-qLow)
colnames(duration_below) <- c("site", "win_no", "low_n", "low_e", "low_start", "low_end", "low_mid", "low_magnitude", "low_severity") # rename columns
Durations <- dplyr::full_join(duration_above, duration_below, by=c("site", "win_no"))
# only calculate DOY statistics when the data are on a daily time step
if(isTRUE(1 %in% diff.Date(data_1$date))){
# calculate min, max, min 7/30day rolling mean DOY
# min plus 7day_min
doy1 <- thres_data %>%
dplyr::group_by(site, win_no) %>%
dplyr::do(find_doy(., "low", 7))
colnames(doy1) <- c("site", "win_no", "min_doy", "min_7day_doy")
# max plus 30day_min
doy2 <- thres_data %>%
dplyr::group_by(site, win_no) %>%
dplyr::do(find_doy(., "high", 30))
colnames(doy2) <- c("site", "win_no", "max_doy", "min_30day_doy")
minmax_doy <- dplyr::full_join(doy1, doy2, by=c("site", "win_no"))
} else {
minmax_doy <- thres_data %>% dplyr::group_by(site, win_no) %>%
dplyr::summarise(min_doy = NA,
min_7day_doy = NA,
max_doy = NA,
min_30day_doy = NA)
}
# Join QSTATS1 with Duration (dataframe with durations/events above/below)
QSTATS1.2<- dplyr::full_join(QSTATS1, Durations, by=c("site", "win_no"))
# Then join with zeros
QSTATS1.3<- dplyr::full_join(QSTATS1.2, zeros, by=c("site", "win_no"))
# Then join with minmax_doy
QSTATS1.4<- dplyr::full_join(QSTATS1.3, minmax_doy, by=c("site", "win_no"))
# create final dataset
QSTATS_final<- QSTATS1.4
return(QSTATS_final)
}
##############################################################################
############### FUNCTIONS USED IN CalcFlowStats ##############################
## riisbiggs2
# Required for CalcFlowStats
# Calculates the number of times a flow threshold is exceeded
riisbiggs2 <- function(datavector, threshold, multiplier) {
# code checked 26/6/2008. Numbers of times threshold exceeded are all turned into annual averages. Lets leave it like that for now
# original riis / biggs work was all based on annual values
# if we are looking at different time windows in the same analysis, this would need some more thought
# difference between 182 and 183 days probably not worth worrying about
# in the future, absolute values might be more useful in some cases.
# re-checked 21/10/2013
nyears <- length(datavector)/365.25
medvec5 <- as.numeric(datavector > multiplier*threshold) # create vector of TRUE / FALSE depending on whether flow is > threshold
medvec5 <- medvec5[!is.na(medvec5)] # NAs should have been removed already
test <- rle(medvec5)
# length(test$lengths[test$values==1])/nyears
length(test$lengths[test$values==1])
}
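# illustrative example (not run): with threshold = 1 and multiplier = 3,
# riisbiggs2(c(1, 5, 1, 6, 1), 1, 3) returns 2, as the series rises above
# 3 x the threshold on two separate occasions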
############### FUNCTIONS USED IN CreateLongData ##############################
###############################################
## calc_bfi
# Required for CreateLongData
calc_bfi <- function(x) {
## original function calculates rolling 7-day means then takes the minimum of these
#day7mean <- zoo::rollmean(x, 7, align = "right")
#min7day <- min(day7mean)
#meanflow <- mean(x)
#calc_bfi <- min7day/meanflow
## but it should probably be the other way around: deriving rolling minima to get an estimate of underlying baseflows then taking the mean
day7min <- RcppRoll::roll_min(x, 7, align = "right")
mean_BF <- mean (day7min)
meanflow <- mean(x)
calc_bfi <- mean_BF/meanflow
return(calc_bfi)
}
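# illustrative example (not run): a constant flow series gives bfi = 1,
# e.g. calc_bfi(rep(1, 365)), since the 7-day rolling minima all equal the
# mean flow; flashier series give values closer to 0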
###############################################
############### FUNCTIONS USED IN CreateFlowStats ##############################
###############################################
## find_eventDuration
# Required for CreateFlowStats
# Calculates the duration in days above or below a flow threshold, plus magnitude and severity of flow deficits below a given threshold
find_eventDuration <- function(x, threshold, type) {
# find events above/below a given threshold
# find high or low flow events
flowEvents <- find_events(x, threshold=threshold, type=type)
flowEvents <- stats::na.omit(flowEvents) # remove NAs
# find number of records
n_records <- length(flowEvents$event)
# find number of events
n_events <- length(unique(flowEvents$event))
# if n_events > 0, find start, end & mid dates, otherwise skip & assign NAs
if(n_events > 0){
flowEvents$yday <- lubridate::yday(flowEvents$date)
# first low/high flow day
n_start <- flowEvents %>% dplyr::arrange(date) %>% dplyr::slice(1)
n_start <- n_start$yday
# last low/high flow day
n_end <- flowEvents %>% dplyr::arrange(date)
n_end <- utils::tail(n_end, n=1)
n_end <- n_end$yday
# average low flow day
conv <- 2*pi/365 ## doy -> radians
n_mid1 <- circ.mean(conv*(flowEvents$yday-1))/conv
n_mid <- round((n_mid1 + 365) %% 365,0)
} else {
n_start <- NA; n_end <- NA; n_mid <- NA
}
find_eventDurations <- data.frame(n_records, n_events, n_start, n_end, n_mid)
  # calculate mean and cumulative deficit for low flow events
  # (always return the deficit columns for type = "low", so that windows with
  # zero low-flow events can be combined with the other windows)
  if(type == "low"){
    if(n_events > 0){
      flowEvents$deficit <- (threshold - flowEvents$flow)
      mean_deficit <- mean(flowEvents$deficit, na.rm = TRUE)
      cumulative_deficit <- sum(flowEvents$deficit, na.rm = TRUE)
    } else {
      mean_deficit <- NA
      cumulative_deficit <- NA
    }
    find_eventDurations <- data.frame(n_records, n_events, n_start, n_end, n_mid, mean_deficit, cumulative_deficit)
  }
return(find_eventDurations)
}
#################################################
## find_events
# Required for find_eventDuration
# Finds events above/below qhigh/qlow
# find_events(x,threshold)
find_events <- function(flow_data, threshold, type="high") {
x <- flow_data
x <- x %>% dplyr::filter(!is.na(flow))
if(type=="high")
{
    # flag records above the threshold as high-flow events
    x$event <- ifelse(x$flow > threshold, T, F)
  } else {
    # flag records below the threshold as low-flow events
    x$event <- ifelse(x$flow < threshold, T, F)
}
#Calculate run lengths of T of F values to classify events
runLengths <- rle(x$event)
runLengths <- data.frame(lengths = runLengths$lengths,
values = runLengths$values,
eventNum = NA)
#Make sequence of numbers to number events
events <- 1:sum(runLengths$values==T)
#Number events
runLengths$eventNum[runLengths$values==T] <- events
eventVector <- rep(runLengths$eventNum,runLengths$lengths)
#Bind events to X
x$event <- eventVector
flowEvents <- x
return(flowEvents)
}
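# illustrative example (not run):
# find_events(data.frame(date = as.Date("2020-01-01") + 0:4,
#                        flow = c(1, 5, 6, 1, 7)),
#             threshold = 2, type = "high")
# numbers the records NA, 1, 1, NA, 2: two distinct high-flow events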
#################################################
## zero_eventsDuration
# same as find_eventDuration, but for zero flows
# Calculates the duration in days that flow are 0
# used in CreateFlowStats
zero_eventsDuration <- function(flow_data) {
# find zero-flow events
zeroEvents <- zero_events(flow_data)
zeroEvents <- stats::na.omit(zeroEvents) # remove NAs
# find number of records
n_records <- length(zeroEvents$event)
# find number of events
n_events <- length(unique(zeroEvents$event))
  zeroEvents$yday <- lubridate::yday(zeroEvents$date)
  # first zero flow day
  dry_start <- zeroEvents %>% dplyr::arrange(date) %>% dplyr::slice(., 1)
  dry_start <- dry_start$yday
  # last zero flow day
  dry_end <- zeroEvents %>% dplyr::arrange(date)
  dry_end <- utils::tail(dry_end, n=1)
  dry_end <- dry_end$yday
  # mean zero flow day
  dry_mid <- mean(zeroEvents$yday)
if(n_events == 0){
dry_start <- NA
dry_end <- NA
dry_mid <- NA
}
find_zeroDurations <- data.frame(n_records, n_events, dry_start, dry_end, dry_mid)
return(find_zeroDurations)
}
#####################################################################
## zero_events
# effectively the same as find_events, but for 0 flows
# finds zero-flow events
# required for zero_eventsDuration
zero_events <- function(flow_data) {
x <- flow_data
x <- x %>% dplyr::filter(!is.na(flow))
x$Obs_0 <- ifelse(x$flow == 0, T, F)
#Calculate run lengths of T of F values to classify events
runLengths <- rle(x$Obs_0)
runLengths <- data.frame(lengths = runLengths$lengths,
values = runLengths$values,
eventNum = NA)
#Make sequence of numbers to number events
events <- 1:sum(runLengths$values==T)
#Number events
runLengths$eventNum[runLengths$values==T] <- events
eventVector <- rep(runLengths$eventNum,runLengths$lengths)
  # bind event numbers to x (the 'event' column is used by zero_eventsDuration)
  x$event <- eventVector
zeroEvents <- x
return(zeroEvents)
}
#####################################################################
## find_doy
# used in CreateFlowStats
# finds the minimum or maximum flow date as DOY
# finds day of year (1-366, as midpoint) of 7-day or 30-day minimum flow period
find_doy <- function(flow_data, type, nday) {
# pull-in data
x <- flow_data
x2 <- x %>% dplyr::filter(!is.na(flow))
## find min or max flow
if(type == "low"){
flow_day <- x2 %>% dplyr::filter(flow == min(x2$flow))}
if(type == "high"){
flow_day <- x2 %>% dplyr::filter(flow == max(x2$flow))}
# find first low or high flow day
flow_day2 <- flow_day %>% dplyr::arrange(date) %>% head(., n = 1)
# convert to DOY
flow_day2$yday <- lubridate::yday(flow_day2$date)
# first low or high flow day
m_day <- as.numeric(unique(flow_day2$yday))
  ## find 7-day or 30-day rolling mean (right-aligned, padded with leading NAs)
  if(nday == 7){
    if(nrow(x2) >= 7) {
      x2$roll_mean <- c(rep(NA, 6), zoo::rollmean(x2$flow, 7))
    } else {
      min_day <- NA
    }
  }
  if(nday == 30){
    if(nrow(x2) >= 30) {
      x2$roll_mean <- c(rep(NA, 29), zoo::rollmean(x2$flow, 30))
    } else {
      min_day <- NA
    }
  }
  if(!is.null(x2$roll_mean)) {
# filter to find date of lowest mean flow
low_flow <- x2 %>% dplyr::filter(roll_mean == min(roll_mean, na.rm = TRUE)) %>%
dplyr::arrange(date) %>% head(., n = 1)
# convert to DOY
if(nday == 7){
low_flow$yday <- lubridate::yday(low_flow$date - lubridate::days(3))
}
if(nday == 30){
low_flow$yday <- lubridate::yday(low_flow$date) - 14.5
if(isTRUE(-1 %in% sign(low_flow$yday)) == TRUE) {
low_flow$yday <- 365 - abs(low_flow$yday)
}
}
# first mean low flow day
min_day <- as.numeric(low_flow$yday)
}
# combine low/high flow DOY and 7/30day mean min flow DOY
find_doy <- data.frame(m_day, min_day)
return(find_doy)
}
##############################################################
## find circular mean
circ.mean <- function (x) {
sinr <- sum(sin(x))
cosr <- sum(cos(x))
circmean <- atan2(sinr, cosr)
circmean
}
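# illustrative example (not run): days of year are first converted to radians
# (conv <- 2 * pi / 365), so the circular mean of days 360 and 5,
# circ.mean(conv * c(360, 5)) / conv, is ~0 (the year boundary),
# not the arithmetic mean of 182.5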
##############################################################
## test if integer
testInteger <- function(x){
  isTRUE(all.equal(x, as.integer(x), check.attributes = FALSE))
}