calc_flowstats: Calculating summary statistics describing historical flow...

View source: R/calc_flowstats.R

calc_flowstats R Documentation

Calculating summary statistics describing historical flow conditions.

Description

This function takes a time series of measured or modelled flows and uses a user-defined moving window to calculate a suite of time-varying flow statistics for one or more sites (stations). A smaller set of long-term statistics is also calculated. It is primarily designed to work with daily flows but can also be applied to time series data on a longer (e.g. 10-daily or monthly) time step. The data should be regularly spaced, and a common time step should be used for all sites.

Usage

calc_flowstats(data, site_col = "flow_site_id", date_col = "date", flow_col = "flow", imputed_col = NULL, win_start = "1995-04-01", win_width = "6 months", win_step = "6 months", date_range = NULL, q_low = 95, q_high = 70, scaling = FALSE, ref_col = NULL)

Arguments

data

Tibble or data frame containing the flow data to be processed. Must be in long format and have, as a minimum, separate columns containing site id, date and flow (e.g. as output by the import_flow() function). If flow estimates are available for different abstraction or climate scenarios, then these must be in separate columns (see ref_col argument).

site_col

Name of column in data containing unique flow site id. Default = "flow_site_id".

date_col

Name of column in data containing date of flow record. Default = "date". Dates must be in “yyyy-mm-dd” format.

flow_col

Name of column in data containing flow data for processing (character). Default = "flow".

imputed_col

Name of optional column in data specifying whether each flow value is measured (0) or imputed (1). Default = NULL.

win_start

Start date of first time window (in yyyy-mm-dd format). Default = "1995-04-01".

win_width

Width of the time window, in days, weeks, months or years (see ?seq.Date for options). Default = "6 months".

win_step

The increment by which the time window moves, in days, weeks, months or years (see ?seq.Date for options). Default = "6 months".

date_range

Optional vector of two dates (in yyyy-mm-dd format) defining the period of flow data to be analysed. Default = NULL. Flow records outside this range are excluded. For unbiased calculation of long-term flow statistics, it is advisable that this range spans a whole number of years (e.g. c("2000-01-01", "2020-12-31")).

q_low

Qx flow threshold (between 1 and 99, as an integer) defining low flow events. Default = 95 (representing the long-term Q95 flow at each site).

q_high

Qx flow threshold (between 1 and 99, as an integer) defining high flow events. Default = 70 (representing the long-term Q70 flow at each site).

scaling

Should the time series flow data be scaled by the long-term mean flow at each site? Default = FALSE.

ref_col

Name of column in data containing the reference flow scenario against which selected flow statistics are z-score standardised. Default = NULL.

Details

The function uses the win_start, win_width and win_step arguments to define a moving window, which divides the flow time series into a sequence of time periods. These time periods may be contiguous, non-contiguous or overlapping (see examples below). The sequence of time periods continues up to and including the present date, even when this extends beyond the period covered by the input flow dataset, as this facilitates the subsequent joining of flow statistics and ecology data by the join_he function. The sequence of time periods does not extend beyond the present date, however; for example, if calculating flow statistics for each calendar year, the time periods would stop at the end of the last complete year.
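
The way the three window arguments interact can be illustrated with a minimal base R sketch. The helper make_windows() and its last_date argument are hypothetical and are not part of hetoolkit; the sketch assumes that each window ends the day before start + width, and that only windows ending on or before last_date are kept. The package's internal logic may differ.

```r
# Hypothetical helper (not part of hetoolkit): builds a table of moving
# windows from a start date, a width and a step, using seq() on Dates.
make_windows <- function(win_start, win_width, win_step, last_date) {
  last_date <- as.Date(last_date)
  # Window start dates advance by win_step
  starts <- seq(as.Date(win_start), last_date, by = win_step)
  # Assumption: each window ends the day before (start + win_width)
  ends <- starts
  for (i in seq_along(starts)) {
    ends[i] <- seq(starts[i], by = win_width, length.out = 2)[2] - 1
  }
  windows <- data.frame(start_date = starts, end_date = ends)
  # Assumption: drop windows that run past last_date
  windows <- windows[windows$end_date <= last_date, , drop = FALSE]
  windows$win_no <- seq_len(nrow(windows))
  windows
}

# Contiguous summer/winter periods
contiguous <- make_windows("1995-04-01", "6 months", "6 months", "1996-12-31")

# Non-contiguous summer-only periods
summers <- make_windows("1995-04-01", "6 months", "1 year", "1996-12-31")

# Overlapping 24-month periods, advancing one month at a time
overlapping <- make_windows("1995-04-01", "24 months", "1 month", "1998-12-31")
```

When win_step equals win_width the periods are contiguous, when win_step is larger they are separated by gaps, and when win_step is smaller they overlap.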

For each time period, the function calculates a suite of flow statistics, listed below. With the exception of 7-day and 30-day minimum flow statistics that require daily data, all flow statistics are calculated regardless of the time step of the flow data. Caution should be exercised, however, when analysing flow data on a coarser (e.g. monthly) time step to ensure that the statistics are meaningful and interpretable (especially those relating to the number and timing of low and high flow events, which require reasonably high frequency flow data in order to discriminate sequences of lower and higher flows).

The function requires a minimum number of records to calculate some statistics (detailed below), otherwise an NA result is returned. Meeting the minimum requirement does not, however, guarantee that a statistic has been estimated to an appropriate level of precision, and users may wish to manually filter the results to eliminate potentially unreliable estimates based on sparse data.

To ensure that estimated statistics are comparable across time periods, the flow time series data should be as complete as possible (gaps can be infilled using the impute_flows function). Missing values (NA) are ignored when calculating all statistics, including those that count the number of events when flows exceed or fall below a certain flow threshold.

To make some statistics more comparable across sites, the scaling argument optionally allows the flow time series data to be standardised by dividing by the site’s long-term mean flow. Scaling is performed after applying the date_range filter, so that the long-term mean flows can be calculated over a specified number of whole years. Because this eliminates absolute differences in mean flow from site to site, scaling is most useful when the focus is on statistics measured in flow units (e.g. sd, q5, low_magnitude, low_severity, volume, min, min_7day, min_30day, max).
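
As a rough illustration of what scaling does, the sketch below divides each flow by the long-term mean flow of its site. The toy data frame and the flow_scaled column name are assumptions made for this example; only the flow_site_id and flow column names come from the function defaults.

```r
# Toy data (illustrative values only): two sites of very different size
flows <- data.frame(
  flow_site_id = rep(c("A", "B"), each = 4),
  flow = c(1, 2, 3, NA, 10, 20, 30, 40)
)

# Long-term mean flow per site, ignoring missing values
site_mean <- ave(flows$flow, flows$flow_site_id,
                 FUN = function(x) mean(x, na.rm = TRUE))

# Scaled flows are dimensionless and average 1 at every site
flows$flow_scaled <- flows$flow / site_mean
```

After scaling, both sites vary around 1, so statistics expressed in flow units become directly comparable between the small and the large site.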

Additionally, certain statistics (denoted by a _z suffix) are standardised using the mean and standard deviation of the estimated statistics across time periods at a given site (e.g. q5_z = (q5 - q5_mean) / q5_sd). These standardised statistics are dimensionless, and so comparable across sites. Standardisation provides an alternative to scaling (described above) when one wishes flow statistics for larger and smaller watercourses to be comparable. Be aware that these standardised variables are calculated regardless of whether or not the raw flow data have been scaled (via the scaling argument).

The function also includes the facility to standardise the statistics for one flow scenario (specified via flow_col) using mean and standard deviation flow statistics from another scenario (specified via ref_col). For example, if flow_col = naturalised flows and ref_col = historical flows, then the resulting statistics can be input into a hydro-ecological model that has previously been calibrated using standardised historical flow statistics and used to make predictions of ecological status under naturalised flows.
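
The two forms of standardisation described above can be sketched as follows. The helper z_standardise() and the example values are hypothetical; the sketch only assumes the z-score formula given in the text, with the mean and standard deviation taken either from the series itself or from a reference series.

```r
# Hypothetical helper (not part of hetoolkit): z-score standardisation of a
# statistic across time periods, optionally against a reference scenario.
z_standardise <- function(x, ref = x) {
  (x - mean(ref, na.rm = TRUE)) / sd(ref, na.rm = TRUE)
}

# Q5 flow in four time periods at one site (illustrative values only)
q5_historical <- c(8, 10, 12, 10)
q5_naturalised <- c(11, 13, 15, 13)

# Standardised against its own mean and sd (the ref_col = NULL case)
q5_z_historical <- z_standardise(q5_historical)

# Naturalised flows standardised using the historical mean and sd
q5_z_naturalised <- z_standardise(q5_naturalised, ref = q5_historical)
```

Because both series share the historical mean and standard deviation, the naturalised values sit on the same scale as the statistics used to calibrate a model on historical flows.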

Value

The function returns a list of two data frames. The first data frame contains a suite of time-varying flow statistics for every time period at every site. The columns are as follows:

  • flow_site_id: a unique site id

  • win_no: an autonumber counting the sequence of flow time periods

  • start_date: start date of the time period (in yyyy-mm-dd format)

  • end_date: end date of the time period (in yyyy-mm-dd format)

  • n_data: the number of records with valid flows (not NA)

  • n_missing: the number of missing flow records (flow = NA)

  • n_total: the total number of flow records (sum of n_data and n_missing)

  • prop_missing: the proportion of missing flow records (n_missing / n_total)

  • n_imputed: the number of flow records that have been imputed (this is calculated only if the imputed_col argument is specified)

  • prop_imputed: the proportion of flow records that have been imputed (calculated only if the imputed_col argument is specified)

  • mean: mean flow (min. records required = 2)

  • sd: the standard deviation of flows (min. records required = 2)

  • Q5: the unstandardised Q5 flow (min. records required = 20)

  • Q10: the unstandardised Q10 flow (min. records required = 10)

  • Q20: the unstandardised Q20 flow (min. records required = 5)

  • Q25: the unstandardised Q25 flow (min. records required = 4)

  • Q30: the unstandardised Q30 flow (min. records required = 4)

  • Q50: the unstandardised Q50 flow (min. records required = 2)

  • Q70: the unstandardised Q70 flow (min. records required = 4)

  • Q75: the unstandardised Q75 flow (min. records required = 4)

  • Q80: the unstandardised Q80 flow (min. records required = 5)

  • Q90: the unstandardised Q90 flow (min. records required = 10)

  • Q95: the unstandardised Q95 flow (min. records required = 20)

  • Q99: the unstandardised Q99 flow (min. records required = 100)

  • Q5z: the Q5 flow, standardised using the mean and sd of the Q5 flows across all time periods for that site, i.e. q5_z = (q5 - q5mean) / q5sd. If ref_col is not NULL, then the Q5 is estimated for the flow_col time series, but standardised using the mean and sd parameters for the ref_col time series (i.e. q5_z = (q5 - q5mean_ref) / q5sd_ref). (min. records required = 20)

  • Q10z: as for Q5z (min. records required = 10)

  • Q20z: as for Q5z (min. records required = 5)

  • Q25z: as for Q5z (min. records required = 4)

  • Q30z: as for Q5z (min. records required = 4)

  • Q50z: as for Q5z (min. records required = 2)

  • Q70z: as for Q5z (min. records required = 4)

  • Q75z: as for Q5z (min. records required = 4)

  • Q80z: as for Q5z (min. records required = 5)

  • Q90z: as for Q5z (min. records required = 10)

  • Q95z: as for Q5z (min. records required = 20)

  • Q99z: as for Q5z (min. records required = 100)

  • dry_n: number of records with zero flow (min. records required = 2)

  • dry_e: number of events when flow drops to zero (min. records required = 28)

  • dry_start: day of year (1-366) of first zero flow record (min. records required = 28)

  • dry_end: day of year (1-366) of last zero flow record (min. records required = 28)

  • dry_mid: mean day of year (1-366) of all zero flow records (min. records required = 28)

  • low_n: number of records when flow is below the q_low threshold (min. records required = 2)

  • low_e: number of events when flow drops below the q_low threshold (min. records required = 28)

  • low_start: day of year (1-366) of first record below the q_low threshold (min. records required = 28)

  • low_end: day of year (1-366) of last record below the q_low threshold (min. records required = 28)

  • low_mid: circular mean day of year (1-366) of all records below the q_low threshold (min. records required = 28)

  • low_magnitude: mean flow deficit below q_low (min. records required = 28)

  • low_severity: cumulative flow deficit below q_low (low_n x low_magnitude) (min. records required = 28)

  • high_n: number of records when flow is above the q_high threshold (min. records required = 2)

  • high_e: number of events when flow exceeds the q_high threshold (min. records required = 28)

  • high_start: day of year (1-366) of first record above the q_high threshold (min. records required = 28)

  • high_end: day of year (1-366) of last record above the q_high threshold (min. records required = 28)

  • high_mid: circular mean day of year (1-366) of all records above the q_high threshold (min. records required = 28)

  • e_above3xq50: number of events when flow exceeds 3 x the long-term median (Q50) flow (min. records required = 28)

  • e_above5xq50: number of events when flow exceeds 5 x the long-term median (Q50) flow (min. records required = 28)

  • e_above7xq50: number of events when flow exceeds 7 x the long-term median (Q50) flow (min. records required = 28)

  • volume: total volume discharged (sum of flows) (min. records required = 3)

  • volume_z: as for Q5z (min. records required = 3)

  • min: minimum flow (min. records required = 3)

  • min_z: as for Q5z (min. records required = 3)

  • min_doy: day of year (1-366) of minimum flow (min. records required = 3)

  • min_7day: minimum 7-day mean flow (min. records required = 90)

  • min_7day_z: as for Q5z (min. records required = 90)

  • min_7day_doy: day of year (1-366) of midpoint of 7-day minimum flow period (min. records required = 90)

  • min_30day: minimum 30-day mean flow (min. records required = 180)

  • min_30day_z: as for Q5z (min. records required = 180)

  • min_30day_doy: day of year (1-366) of midpoint of 30-day minimum flow period (min. records required = 180)

  • max: maximum flow (min. records required = 3)

  • max_z: as for Q5z (min. records required = 3)

  • max_doy: day of year (1-366) of maximum flow (min. records required = 3)
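
To make the event-based and timing statistics in the list above more concrete, here is a minimal base R sketch for a single time period. It assumes that missing values are dropped before runs are identified, that "below" means strictly less than the threshold, and that the circular mean uses a 366-day cycle; the package may handle these details differently. The threshold value and the helper circular_mean_doy() are hypothetical.

```r
# Flows within one time period (illustrative values only)
flow <- c(5, 4, 1, 1, NA, 7, 0.5, 3, 1, 1, 1, 8)
threshold <- 2  # stand-in for the long-term Q95 flow at the site

# Assumption: missing values are dropped before counting
valid <- flow[!is.na(flow)]
below <- valid < threshold

low_n <- sum(below)                                 # records below threshold
low_e <- sum(rle(below)$values)                     # distinct runs below threshold
low_magnitude <- mean(threshold - valid[below])     # mean deficit
low_severity <- low_n * low_magnitude               # cumulative deficit

# Hypothetical helper: circular mean of days of the year, so that days
# either side of the new year average to a date near the new year.
circular_mean_doy <- function(doy, period = 366) {
  angle <- 2 * pi * doy / period
  mean_angle <- atan2(mean(sin(angle)), mean(cos(angle)))
  (mean_angle %% (2 * pi)) * period / (2 * pi)
}

# Days 360 and 10 straddle the new year; the circular mean is close to
# day 2, whereas the arithmetic mean would be day 185.
mid <- circular_mean_doy(c(360, 10))
```

A run of consecutive records below the threshold counts as one event, which is why coarse (e.g. monthly) data can merge events that daily data would separate.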

The second data frame contains long-term flow statistics. The data are arranged in long format, with the following columns:

  • flow_site_id: a unique site id

  • start_date: start date of the long-term time period (in yyyy-mm-dd format) for which the statistics are calculated

  • end_date: end date of the long-term time period (in yyyy-mm-dd format) for which the statistics are calculated

  • parameter: the name of the statistic (long-term minimum, maximum and mean flow; long-term flow duration curve percentiles (p1 to p99); long-term base flow index (bfi = 7-day minimum flow / mean flow); and long-term mean and standard deviation of the time-varying q5 to q99, minimum flow, maximum flow and 7-day minimum flow statistics)

  • value: the calculated statistic
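
Two of the long-term parameters can be sketched in base R for a single site. The sketch assumes daily data, the usual exceedance convention (Qx is the flow equalled or exceeded x percent of the time, so Q95 is a low flow), R's default quantile method, and a centred 7-day moving average; the helper q_exceed() is hypothetical and the package's own calculations may differ.

```r
# Daily flows at one site (illustrative values only)
flow <- c(4, 5, 6, 3, 2, 2, 1, 1, 2, 3, 5, 8, 9, 7)

# Hypothetical helper: Qx as the flow exceeded x% of the time
q_exceed <- function(flow, x) {
  unname(quantile(flow, probs = 1 - x / 100, na.rm = TRUE))
}
q95 <- q_exceed(flow, 95)  # low flow
q50 <- q_exceed(flow, 50)  # median flow
q5 <- q_exceed(flow, 5)    # high flow

# 7-day moving average (NA where a full 7-day window is unavailable)
mean_7day <- stats::filter(flow, rep(1 / 7, 7), sides = 2)
min_7day <- min(mean_7day, na.rm = TRUE)

# Base flow index as defined above: 7-day minimum flow / mean flow
bfi <- min_7day / mean(flow, na.rm = TRUE)
```

A bfi close to 1 indicates a stable regime in which the lowest sustained flows are similar to the mean, while values near 0 indicate a flashy regime.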

Examples

## calculate flow statistics for a contiguous series of summer (April to September) and winter (October to March) time periods:
calc_flowstats(data = flow_data,
               win_start =  "1995-04-01",
               win_width = "6 months",
               win_step =  "6 months")

## calculate flow statistics for a non-contiguous series of summer (April to September) time periods:
calc_flowstats(data = flow_data,
               win_start =  "1995-04-01",
               win_width = "6 months",
               win_step =  "1 year")

## calculate flow statistics for a series of overlapping 24 month time periods:
calc_flowstats(data = flow_data,
               win_start =  "1995-04-01",
               win_width = "24 months",
               win_step =  "1 month")


APEM-LTD/hetoolkit documentation built on Feb. 8, 2025, 9:16 a.m.