apply_gam: Apply GAM to time series and assess emerging status

View source: R/apply_gam.R

apply_gamR Documentation

Apply GAM to time series and assess emerging status

Description

This function applies generalized additive models (GAM) to assess emerging status for a certain time window.

Usage

apply_gam(
  df,
  y_var,
  eval_years,
  year = "year",
  taxonKey = "taxonKey",
  type_indicator = "observations",
  baseline_var = NULL,
  p_max = 0.1,
  taxon_key = NULL,
  name = NULL,
  df_title = NULL,
  x_label = "year",
  y_label = "Observations",
  saveplot = FALSE,
  dir_name = NULL,
  verbose = FALSE
)

Arguments

df

df. A dataframe containing temporal data.

y_var

character. Name of column containing variable to model. It has to be passed as string, e.g. "occurrences".

eval_years

numeric. Temporal value(s) when emerging status has to be evaluated.

year

character. Name of column containing temporal values. It has to be passed as string, e.g. "time". Default: "year".

taxonKey

character. Name of column containing taxon IDs. It has to be passed as string, e.g. "taxon". Default: "taxonKey".

type_indicator

character. One of "observations", "occupancy". Used in title of the output plot. Default: "observations".

baseline_var

character. Name of the column containing values to use as additional covariate. Such covariate is introduced in the model to correct research effort bias. Default: NULL. If NULL internal variable method_em = "basic", otherwise method_em = "correct_baseline". Value of method_em will be part of title of output plot.

p_max

numeric. A value between 0 and 1. Default: 0.1.

taxon_key

numeric, character. Taxon key the timeseries belongs to. Used exclusively in graph title and filename (if saveplot = TRUE). Default: NULL.

name

character. Species name the timeseries belongs to. Used exclusively in graph title and filename (if saveplot = TRUE). Default: NULL.

df_title

character. Any string you would like to add to graph titles and filenames (if saveplot = TRUE). The title is always composed of: "GAM" + type_indicator + method_em + taxon_key + name + df_title separated by underscore ("_"). Default: NULL.

x_label

character. x-axis label of output plot. Default: "year".

y_label

character. y-axis label of output plot. Default: "number of observations".

saveplot

logical. If TRUE the plots are authomatically saved. Default: FALSE.

dir_name

character. Path of directory where saving plots. If path doesn't exists, directory will be created. Example: "./output/graphs/". If NULL, plots are saved in current directory. Default: NULL.

verbose

logical. If TRUE status of processing and possible issues are returned. Default: FALSE.

Details

The GAM modelling is performed using the mgcvb::gam(). To use this function, we pass:

  • a formula

  • a family object specifying the distribution

  • a smoothing parameter estimation method

    For more information about all other arguments, see ⁠[mgcv::gam()]⁠.

    If no covariate is used (baseline_var = NULL), the GAM formula is: n ~ s(year, k = maxk, m = 3, bs = "tp"). Otherwise the GAM formula has a second term, s(n_covariate) and so the GAM formula is n ~ s(year, k = maxk, m = 3, bs = "tp") + s(n_covariate).

    Description of the parameters present in the formula above:

  • k: dimension of the basis used to represent the smooth term, i.e. the number of knots used for calculating the smoother. We #' set k to maxk, which is the number of decades in the time series. If less than 5 decades are present in the data, maxk is #' set to 5.

  • bs indicates the basis to use for the smoothing: we uses the default penalized thin plate regression splines.

  • m specifies the order of the derivatives in the thin plate spline penalty. We use m = 3, the default value.

    We use ⁠[mgcv::nb()]⁠, a negative binomial family to perform the GAM.

    The smoothing parameter estimation method is set to REML (Restricted maximum likelihood approach). If the P-value of the GAM smoother(s) is/are above threshold value p_max, GAM is not performed and the next warning is returned: "GAM output cannot be used: p-values of all GAM smoothers are above {p_max}" where p_max is the P-value used as threshold as defined by argument p_max.

    If the mgcv::gam() returns an error or a warning, the following message is returned to the user: "GAM ({method_em}) cannot be performed or cannot converge.", where method_em is one of "basic" or "correct_baseline". See argument baseline_var.

    The first and second derivatives of the smoother is calculated using function gratia::derivatives() with the following hard coded arguments:

  • type: the type of finite difference used. Set to "central".

  • order: 1 for the first derivative, 2 for the second derivative

  • level: the confidence level. Set to 0.8

  • eps: the finite difference. Set to 1e-4.

    For more details, please check derivatives.

    The sign of the lower and upper confidence levels of the first and second derivatives are used to define a detailed emergency status (em) which is internally used to return the emergency status, em_status, which is a column of the returned data.frame em_summary.

    | ucl-1 | lcl-1 | ucl-2 | lcl-2 | em | em_status | | — | — | — | — | — | — | | + | + | + | + | 4 | 3 (emerging) | | + | + | + | - | 3 | 3 (emerging) | | + | + | - | - | 2 | 2 (potentially emerging) | | - | + | + |

    • | 1 | 2 (potentially emerging) | | + | - | + | - | 0 | 1 (unclear) | | + | - | - | - | -1 | 0 (not emerging) | | - | - | + | + | -2 | 0 (not emerging) | |

  • | - | + | - | -3 | 0 (not emerging) | | - | - | - | - | -4 | 0 (not emerging) |

Value

list with six slots:

  1. em_summary: df. A data.frame summarizing the emerging status outputs. em_summary contains as many rows as the length of input variable eval_year. So, if you evaluate GAM on three years, em_summary will contain three rows. It contains the following columns:

    • "taxonKey": column containing taxon ID. Column name equal to value of argument taxonKey.

    • "year": column containing temporal values. Column name equal to value of argument year. Column itself is equal to value of argument eval_years. So, if you evaluate GAM on years 2017, 2018 (eval_years = c(2017, 2018)), you will get these two values in this column.

    • em_status: numeric. Emerging statuses, an integer between 0 and 3.

    • growth: numeric. Lower limit of GAM confidence interval for the first derivative, if positive. It represents the lower guaranteed growth.

    • method: character. GAM method, One of: "correct_baseline" and "basic". See details above in description of argument use_baseline.

  2. model: gam object. The model as returned by gam() function. NULL if GAM cannot be applied.

  3. output: df. Complete data.frame containing more details than the summary em_summary. It contains the following columns:

    • all columns in df.

    • method: character. GAM method, One of: "correct_baseline" and "basic". See details above in description of argument use_baseline.

    • fit: numeric. Fit values.

    • ucl: numeric. The upper confidence level values.

    • lcl: numeric. The lower confidence level values.

    • em1: numeric. The emergency value for the 1st derivative. -1, 0 or +1.

    • em2: numeric. The emergency value for the 2nd derivative: -1, 0 or +1.

    • em: numeric. The emergency value: from -4 to +4, based on em1 and em2. See Details.

    • em_status: numeric. Emerging statuses, an integer between 0 and 3. See Details.

    • growth: numeric. Lower limit of GAM confidence interval for the first derivative, if positive. It represents the lower guaranteed growth.

  4. first_derivative: df. Data.frame with details of first derivatives. It contains the following columns:

    • smooth: smoooth identifier. Example: s(year).

    • derivative: numeric. Value of first derivative.

    • se: numeric. Standard error of derivative.

    • crit: numeric. Critical value required such that derivative + (se * crit) and derivative - (se * crit) form the upper and lower bounds of the confidence interval on the first derivative of the estimated smooth at the specific confidence level. In our case the confidence level is hard-coded: 0.8. Then crit <- qnorm(p = (1-0.8)/2, mean = 0, sd = 1, lower.tail = FALSE).

    • lower_ci: numeric. Lower bound of the confidence interval of the estimated smooth.

    • upper_ci: numeric. Upper bound of the confidence interval of the estimated smooth.

    • value of argument year: column with temporal values.

    • value of argument baseline_var: column with the fitted values for the baseline. If baseline_var is NULL, this column is not present.

  5. second_derivative: df. Data.frame with details of second derivatives. Same columns as first_derivatives.

  6. plot: a ggplot2 object. Plot of observations with GAM output and emerging status. If emerging status cannot be assessed only observations are plotted.

Examples

## Not run: 
library(dplyr)
df_gam <- tibble(
  taxonKey = rep(3003709, 24),
  canonicalName = rep("Rosa glauca", 24),
  year = seq(1995, 2018),
  n = c(
    1, 1, 0, 0, 0, 2, 0, 0, 1, 3, 1, 2, 0, 5, 0, 5, 4, 2, 1,
    1, 3, 3, 8, 10
  ),
  n_class = c(
    229, 555, 1116, 939, 919, 853, 442, 532, 623, 1178, 732, 371, 1053,
    1001, 1550, 1142, 1076, 1310, 922, 1773, 1637,
    1866, 2234, 2013
  )
)
# apply GAM to n without baseline as covariate
apply_gam(df_gam,
  y_var = "n",
  eval_years = 2018,
  taxon_key = 3003709,
  name = "Rosa glauca",
  verbose = TRUE
)
# apply GAM using baseline data in column n_class as covariate
apply_gam(df_gam,
  y_var = "n",
  eval_years = 2018,
  baseline_var = "n_class",
  taxon_key = 3003709,
  name = "Rosa glauca",
  verbose = TRUE
)
# apply GAM using n as occupancy values, evaluate on two years. No baseline
apply_gam(df_gam,
  y_var = "n",
  eval_years = c(2017, 2018),
  taxon_key = 3003709,
  type_indicator = "occupancy",
  name = "Rosa glauca",
  y_label = "occupancy",
  verbose = TRUE
)
# apply GAM using n as occupancy values and n_class as covariate (baseline)
apply_gam(df_gam,
  y_var = "n",
  eval_years = c(2017, 2018),
  baseline_var = "n_class",
  taxon_key = 3003709,
  type_indicator = "occupancy",
  name = "Rosa glauca",
  y_label = "occupancy",
  verbose = TRUE
)
# How to use other arguments
apply_gam(df_gam,
  y_var = "n",
  eval_years = c(2017, 2018),
  baseline_var = "n_class",
  p_max = 0.3,
  taxon_key = "3003709",
  type_indicator = "occupancy",
  name = "Rosa glauca",
  df_title = "Belgium",
  x_label = "time (years)",
  y_label = "area of occupancy (km2)",
  saveplot = TRUE,
  dir_name = "./data/",
  verbose = TRUE
)
# warning returned if GAM cannot be applied and plot with only observations
df_gam <- tibble(
taxonKey = rep(3003709, 24),
canonicalName = rep("Rosa glauca", 24),
year = seq(1995, 2018),
obs = c(
  1, 1, 0, 0, 0, 2, 0, 0, 1, 3, 1, 2, 0, 5, 0, 5, 4, 2, 1,
  1, 3, 3, 8, 10
),
cobs = rep(0, 24)
)

# if GAM cannot be applied a warning is returned and the plot mention it
\dontrun{
no_gam_applied <- apply_gam(df_gam,
                            y_var = "obs",
                            eval_years = 2018,
                            taxon_key = 3003709,
                            name = "Rosa glauca",
                            baseline_var = "cobs",
                            verbose = TRUE
)
no_gam_applied$plot
}

## End(Not run)

trias-project/trias documentation built on April 20, 2024, 8:27 p.m.