join_he: Linking biology samples with time-varying flow statistics for...

View source: R/join_he.R

join_he R Documentation

Linking biology samples with time-varying flow statistics for paired biology and flow sites.

Description

This function joins biology sample data with time-varying flow statistics for one or more antecedent (lagged) time periods (as calculated by the calc_flowstats function) to create a combined dataset for hydro-ecological modelling.

Usage

join_he(biol_data, flow_stats, mapping = NULL, method = "A", lags = 0, join_type = "add_flows")

Arguments

biol_data

Data frame or tibble containing the processed biology data. Must contain the following columns: biol_site_id and date (stored as dates, not character strings).

flow_stats

Data frame or tibble containing the calculated time-varying flow statistics, by site, time period and win_no (as produced by the calc_flowstats function). Must contain the following columns: flow_site_id, start_date and end_date. The function joins all the variables in flow_stats, so it is advisable to drop any flow statistics that are not of interest before applying the function.

mapping

Data frame or tibble containing paired biology site IDs and flow site IDs. Must contain columns named biol_site_id and flow_site_id. These columns must not contain any NAs. Default = NULL, which assumes that paired biology and flow sites have identical IDs, so mapping is not required.

method

Choice of method for linking biology samples to flow statistics for antecedent time periods. Using method = "A" (default), lag 0 is defined for each biology sample as the most recently finished flow time period; using method = "B", lag 0 is defined as the most recently started flow time period. See below for details.

lags

Vector of lagged flow time periods of interest. Values must be zero or positive, with larger values representing longer time lags (i.e. an increasing time gap between the flow time period and the biology sample date). Default = 0. See below for details.

join_type

To add flow statistics to each biology sample, choose "add_flows" (default); this produces a dataset of biology metrics (response variables) and flow statistics (predictor variables) for hydro-ecological modelling. To add biology sample data to flow statistics for each time period, choose "add_biol"; this produces a time series of flow statistics with associated biological metrics which can be used, for example, to assess the coverage of historical flow conditions using the plot_rngflows function.

Details

biol_data and flow_stats may contain more sites than are listed in mapping, but any sites not listed in mapping will be filtered out. If mapping = NULL, then biology and flow sites with matching IDs will be paired automatically.
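As a minimal sketch of the mapping requirements described above, the following base R code builds a mapping table, checks it for the required columns and for NAs, and previews which biology rows would survive the filtering. The site IDs are hypothetical and the filtering step only imitates the documented behaviour; it is not the code used inside join_he.

```r
# Hypothetical site IDs, for illustration only
mapping <- data.frame(
  biol_site_id = c("B001", "B002"),
  flow_site_id = c("F010", "F020"),
  stringsAsFactors = FALSE
)

# mapping must contain both ID columns, with no NAs in either
stopifnot(all(c("biol_site_id", "flow_site_id") %in% names(mapping)))
stopifnot(!anyNA(mapping$biol_site_id), !anyNA(mapping$flow_site_id))

# Biology data with one site ("B999") that is absent from mapping
biol_data <- data.frame(
  biol_site_id = c("B001", "B002", "B999"),
  date = as.Date(c("2021-04-15", "2021-09-15", "2021-09-17")),
  stringsAsFactors = FALSE
)

# Preview the rows that remain once unmapped sites are dropped
kept <- biol_data[biol_data$biol_site_id %in% mapping$biol_site_id, ]
kept
```

Running a check like this before calling join_he makes it easier to spot sites that would silently drop out of the joined dataset.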

The calc_flowstats function uses a moving window approach to calculate time-varying flow statistics for a sequence of time periods, which can be: (i) contiguous (i.e. each time period is followed immediately by the next one), (ii) non-contiguous (i.e. there is a gap between one time period and the next), or (iii) overlapping (i.e. the next time period starts before the previous one has finished).
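The three window arrangements can be illustrated with a short base R sketch. The helpers make_windows and classify are hypothetical names introduced here; they only generate and compare start and end dates and do not reproduce the calc_flowstats implementation. The sketch assumes that end dates are inclusive and that window widths and steps are given as strings such as "1 month", as in the Examples.

```r
# Sketch only: build window start/end dates with base R.
# End dates are assumed to be inclusive.
make_windows <- function(win_start, win_width, win_step, n) {
  start_date <- seq(as.Date(win_start), by = win_step, length.out = n)
  # each window ends the day before the date one win_width after its start
  end_num <- vapply(seq_along(start_date), function(i) {
    as.numeric(seq(start_date[i], by = win_width, length.out = 2)[2]) - 1
  }, numeric(1))
  data.frame(win_no = seq_len(n),
             start_date = start_date,
             end_date = as.Date(end_num, origin = "1970-01-01"))
}

# Classify a window sequence by the gap (in days) between the end of one
# window and the start of the next
classify <- function(w) {
  gap <- as.numeric(w$start_date[-1] - w$end_date[-nrow(w)])
  if (all(gap == 1)) {
    "contiguous"
  } else if (all(gap > 1)) {
    "non-contiguous"
  } else if (all(gap < 1)) {
    "overlapping"
  } else {
    "mixed"
  }
}

contiguous     <- make_windows("2021-01-01", "1 month",  "1 month",  3)
non_contiguous <- make_windows("2021-01-01", "1 month",  "2 months", 3)
overlapping    <- make_windows("2021-01-01", "6 months", "1 month",  3)

classify(contiguous)
classify(non_contiguous)
classify(overlapping)
```

In this sketch the arrangement depends only on how the step compares with the width: equal gives contiguous windows, a longer step leaves gaps, and a shorter step produces overlap.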

To describe the antecedent flow conditions prior to each biology sample, the time periods are labelled relative to the date of the biology sample, with Lag 0 representing either the most recently finished (method = "A") or most recently started (method = "B") flow time period. The time period immediately prior to the Lag 0 period is the Lag 1 period, the period immediately prior to that is the Lag 2 period, and so on.

As an example, suppose we have a biology sample dated 15 September 2020 and that flow statistics are available for a sequence of contiguous 1 month periods (each one a calendar month). Using method = "A", the Lag 0 period for that biology sample would be August 2020 (the most recently finished time period), the Lag 1 period would be July 2020, the Lag 2 period would be June 2020, and so on. Similarly, using method = "B", the Lag 0 period for that biology sample would be September 2020 (the most recently started time period), the Lag 1 period would be August 2020, the Lag 2 period would be July 2020, and so on.

As a second example, suppose we again have a biology sample dated 15 September 2020 and that flow statistics are available for a sequence of overlapping 6 month periods (i.e. February to July 2020, March to August 2020, April to September 2020, and so on). Using method = "A", the Lag 0 period for that biology sample would be March to August 2020 (the most recently finished time period), the Lag 1 period would be February to July 2020, the Lag 2 period would be January to June 2020, and so on. Similarly, using method = "B", the Lag 0 period for that biology sample would be September 2020 to February 2021 (the most recently started time period), the Lag 1 period would be August 2020 to January 2021, the Lag 2 period would be July to December 2020, and so on.
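The two worked examples above can be reproduced with a small base R sketch. The helper label_lags is a hypothetical name introduced for illustration and is not part of the package. It assumes that windows are evenly stepped, that "finished" means the window's end date falls strictly before the sample date, and that "started" means the start date falls on or before the sample date. The handling of samples that fall exactly on a window boundary is an assumption, not documented behaviour.

```r
# Sketch only: label flow windows relative to a biology sample date.
# Assumes evenly stepped windows with inclusive end dates.
label_lags <- function(sample_date, windows, method = c("A", "B"), lags = 0) {
  method <- match.arg(method)
  windows <- windows[order(windows$start_date), ]
  cand <- if (method == "A") {
    which(windows$end_date < sample_date)    # windows already finished
  } else {
    which(windows$start_date <= sample_date) # windows already started
  }
  stopifnot(length(cand) > 0)
  lag0 <- max(cand)                          # most recent candidate window
  idx <- lag0 - lags                         # Lag k is k windows earlier
  stopifnot(all(idx >= 1))
  data.frame(lag = lags, windows[idx, ], row.names = NULL)
}

sample_date <- as.Date("2020-09-15")
starts <- seq(as.Date("2020-01-01"), by = "1 month", length.out = 12)

# Example 1: contiguous calendar months
monthly <- data.frame(
  start_date = starts,
  end_date = seq(as.Date("2020-02-01"), by = "1 month", length.out = 12) - 1
)

# Example 2: overlapping 6 month windows, stepped by 1 month
six_month <- data.frame(
  start_date = starts,
  end_date = seq(as.Date("2020-07-01"), by = "1 month", length.out = 12) - 1
)

monthly_A   <- label_lags(sample_date, monthly,   "A", lags = 0:2)
monthly_B   <- label_lags(sample_date, monthly,   "B", lags = 0:2)
six_month_A <- label_lags(sample_date, six_month, "A", lags = 0:2)
six_month_B <- label_lags(sample_date, six_month, "B", lags = 0:2)

monthly_A
six_month_B
```

With these inputs, monthly_A lists August, July and June 2020 for Lags 0 to 2, and six_month_B starts with the window running from September 2020 to February 2021, matching the text above.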

Note that when using join_type = "add_biol", a flow period is replicated if two or more biology samples fall within it. To avoid this, summarise (e.g. average) the replicate biology samples within each time window before applying join_he. See below for an example.

Value

join_he returns a tibble containing the linked biology data and flow statistics.

Examples



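# attach hetoolkit (for calc_flowstats and join_he) and dplyr (for the %>% pipe)
library(hetoolkit)
library(dplyr)

# create flow stats from synthetic flow data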
set.seed(123)
flow_data <- data.frame(flow_site_id = rep("A0001", 365),
                        date = seq(as.Date("2021-01-01"), as.Date("2021-12-31"), by = "1 day"),
                        flow = rnorm(365, 10, 2))
flow_stats <- calc_flowstats(data = flow_data,
                             site_col = "flow_site_id",
                             date_col = "date",
                             flow_col = "flow",
                             win_start =  "2021-01-01",
                             win_width = "1 month",
                             win_step =  "1 month")[[1]] %>%
  dplyr::select(flow_site_id, win_no, start_date, end_date, Q95z)

# create synthetic biology data
biol_data <- data.frame(biol_site_id = rep("A0001", 2),
                        date = as.Date(c("2021-04-15", "2021-09-15")),
                        metric = c(0.8, 0.7))

# view data
flow_stats; biol_data

# add flow statistics to each biology sample using method A
# mapping = NULL because biology and flow sites have identical ids
join_he(biol_data = biol_data,
        flow_stats = flow_stats,
        mapping = NULL,
        method = "A",
        lags = c(0,1),
        join_type = "add_flows")

# add flow statistics to each biology sample using method B
# mapping = NULL because biology and flow sites have identical ids
join_he(biol_data = biol_data,
        flow_stats = flow_stats,
        mapping = NULL,
        method = "B",
        lags = c(0,1),
        join_type = "add_flows")

# add biology sample data to flow statistics for each time period using method A
join_he(biol_data = biol_data,
        flow_stats = flow_stats,
        mapping = NULL,
        method = "A",
        lags = c(0,1),
        join_type = "add_biol")

# add biology sample data to flow statistics for each time period using method B
join_he(biol_data = biol_data,
        flow_stats = flow_stats,
        mapping = NULL,
        method = "B",
        lags = c(0,1),
        join_type = "add_biol")

# using join_type = "add_biol", a flow period becomes replicated if it has 2+ biology samples
biol_data2 <- data.frame(biol_site_id = rep("A0001", 3),
                         date = as.Date(c("2021-04-15", "2021-09-15", "2021-09-17")),
                         metric = c(0.8, 0.7, 0.6))
join_he(biol_data = biol_data2,
        flow_stats = flow_stats,
        mapping = NULL,
        method = "A",
        lags = c(0,1),
        join_type = "add_biol")

# average replicate biology samples within each time window before using join_type = "add_biol"
# (windows are calendar months here, so samples are grouped by site and month)
biol_data3 <- biol_data2 %>%
  dplyr::mutate(month = lubridate::month(date)) %>%
  dplyr::group_by(biol_site_id, month) %>%
  dplyr::summarise_all(mean)
join_he(biol_data = biol_data3,
        flow_stats = flow_stats,
        mapping = NULL,
        method = "A",
        lags = c(0,1),
        join_type = "add_biol")


APEM-LTD/hetoolkit documentation built on Feb. 8, 2025, 9:16 a.m.