Nothing
#' Modeling of indicator responses to single pressures with GAMs
#'
#' \code{model_gam} applies Generalized Additive Models (GAMs) to each IND~pressure
#' combination created in \code{\link{ind_init}} and returns a tibble with
#' IND~pressure-specific GAM outputs.
#'
#' @param init_tbl The output tibble of the \code{\link{ind_init}} function.
#' @param k Choice of knots (for the smoothing function \code{\link{s}}); the default is 5.
#' @param family A description of the error distribution and link to be used in the GAM.
#' This needs to be defined as a family function (see also \code{\link{family}}). All
#' standard family functions can be used as well some of the distribution families in
#' the mgcv package (see \code{\link[mgcv]{family.mgcv}}; e.g.\code{\link[mgcv]{negbin}}
#' or \code{\link[mgcv]{nb}}).
#' @param excl_outlier A list of values identified as outliers in specific
#' IND~pressure GAMs, which should be excluded in this modeling step
#' (the output tibble of this function includes the variable
#' `pres_outlier`, which is a column-list containing
#' all indices of values with cook`s distance > 1 (see below). The function
#' can be re-run again, then excluding all these outliers provided in
#' \code{$pres_outlier} from the the first run (see example)).
#'
#' @details
#' To evaluate the IND`s sensitivity and robustness time series of the IND are
#' modeled as a smoothing function of one single pressure variable (using a subset
#' of the data as training dataset, e.g. excluding the years of the annual time series).
#' The GAMs are build using the default settings in the \code{gam} function and
#' the smooth term function \code{\link[mgcv]{s}}). However, the user can adjust
#' the distribution and link by modifying the family argument as well as the
#' maximum level of non-linearity by setting the number of knots:
#'
#' \code{gam(ind ~ s(press, k = k), family = family, data = training_data)}
#'
#' In the presence of significant temporal auto-correlation, GAMs should be extended to
#' Generalized Additive Mixed Models (GAMMs) by including auto-regressive error structures
#' to correct for the auto-correlation (Pinheiro and Bates, 2000). This is implemented in
#' the function \code{\link{model_gamm}}.
#'
#' The returned tibble contains various model outputs needed for scoring the sensitivity
#' and robustness subcriteria:
#' \itemize{
#' \item \code{p_val} to identify whether an IND responds to a specific pressure
#' \item \code{r_sq} for the strength of the IND response
#' \item \code{edf} for the non-linearity of the IND response
#' \item \code{nrmse} for the robustness of the established IND~pressure relationship
#' }
#'
#' The robustness of the modeled pressure relationship based on the training data
#' is evaluated by measuring how well the model prediction matches the test dataset,
#' e.g. the last years. This is quantified by computing the absolute value of the
#' normalized root mean square error (NRMSE) on the test dataset. The normalization
#' to the mean of the observed test data allows for comparisons and a general scoring
#' of the model robustness across INDs with different scales or units.
#'
#'
#' @return
#' The function returns a \code{\link[tibble]{tibble}}, which is a trimmed down version of
#' the data.frame(), including the following elements:
#' \describe{
#' \item{\code{id}}{Numerical IDs for the IND~press combinations.}
#' \item{\code{ind}}{Indicator names.}
#' \item{\code{press}}{Pressure names.}
#' \item{\code{model_type}}{Specification of the model type; at this stage containing only
#' "gam" (Generalized Additive Model).}
#' \item{\code{corrstruc}}{Specification of the correlation structure; at this stage
#' containing only "none".}
#' \item{\code{aic}}{AIC of the fitted models}
#' \item{\code{edf}}{Estimated degrees of freedom for the model terms.}
#' \item{\code{p_val}}{The p values for the smoothing term (the pressure).}
#' \item{\code{signif_code}}{The significance codes for the p-values.}
#' \item{\code{r_sq}}{The adjusted r-squared for the models. Defined as the proportion
#' of variance explained, where original variance and residual variance are
#' both estimated using unbiased estimators. This quantity can be negative
#' if your model is worse than a one parameter constant model, and can be
#' higher for the smaller of two nested models.}
#' \item{\code{expl_dev}}{The proportion of the null deviance explained by the models.}
#' \item{\code{nrmse}}{Absolute values of the root mean square error normalized by the
#' standard deviation (NRMSE).}
#' \item{\code{ks_test}}{The p-values from a Kolmogorov-Smirnov Test applied on the model
#' residuals to test for normal distribution. P-values > 0.05 indicate
#' normally distributed residuals.}
#' \item{\code{tac}}{logical; indicates whether temporal autocorrelation (TAC) was detected
#' in the residuals. TRUE if model residuals show TAC. NAs in the time series
#' due to real missing values, test data extraction or exclusion of outliers
#' are explicitly considered. The test is based on the following condition:
#' if any of the acf \strong{and} pacf values of lag 1 - 5 are greater than 0.4
#' or lower than -0.4, a TRUE is returned.}
#' \item{\code{pres_outlier}}{A list-column with all indices of values identified as outliers
#' in each model (i.e.cook`s distance > 1). The indices present the position in
#' the training data, including NAs.}
#' \item{\code{excl_outlier}}{A list-column listing all outliers per model that have been
#' excluded in the GAM fitting}
#' \item{\code{model}}{A list-column of IND~press-specific gam objects that contain additionally
#' the logical vector indicating missing values (\code{$train_na}).}
#' }
#'
#' @references
#' Pinheiro, J.C., Bates, D.M. (2000) Mixed-Effects Models in S and S-Plus.
#' Springer, New York, 548pp.
#'
#' @seealso \code{\link[tibble]{tibble}} and the \code{vignette("tibble")} for more
#' informations on tibbles,
#' \code{\link[mgcv]{gam}} for more information on GAMs, and
#' \code{\link{plot_diagnostics}} for assessing the model diagnostics
#' @family IND~pressure modeling functions
#'
#' @export
#'
#' @examples
#' # Using the Baltic Sea demo data in this package
#' dat_init <- ind_init(
#' ind_tbl = ind_ex[, c("Sprat", "Cod")],
#' press_tbl = press_ex[, c("Tsum", "Swin", "Fcod", "Fher")],
#' time = ind_ex[ ,1])
#' gam_tbl <- model_gam(dat_init)
#' # Any outlier?
#' gam_tbl$pres_outlier
#' # Exclude outliers by passing this list as input:
#' gam_tbl_out <- model_gam(dat_init, excl_outlier = gam_tbl$pres_outlier)
#'
#' \donttest{
#' # Using another error distribution
#' ind_sub <- round(exp(ind_ex[ ,c(2,8,9)]),0) # to unlog data and convert to integers
#' ind_tbl2 <- ind_init(ind_sub, press_ex, time = ind_ex$Year)
#' model_gam(ind_tbl2, family = poisson(link="log"))
#' }
model_gam <- function(init_tbl, k = 5, family = stats::gaussian(),
excl_outlier = NULL) {
# Data input validation -----------------------
if (missing(init_tbl)) {
stop("Argument init_tbl is missing.")
}
# Check input tibble
init_tbl <- check_input_tbl(init_tbl, tbl_name = "init_tbl",
parent_func = "ind_init()", var_to_check = c("id",
"ind", "press", "ind_train", "press_train",
"time_train", "ind_test", "press_test",
"time_test", "train_na"), dt_to_check = c("integer",
"character", "character", rep("list", 7)))
# Check family class
if (!"family" %in% class(family)) {
stop("The specified family is not a family object. You need to provide the family function, e.g. family = poisson()")
}
# ----------------
# Exclude outliers given in excl_outliers
if (!is.null(excl_outlier)) {
ind_train_sub <- purrr::map2(init_tbl$ind_train,
excl_outlier, ~replace(.x, .y, NA)) # (NULLs will not be replaced)
press_train_sub <- purrr::map2(init_tbl$press_train,
excl_outlier, ~replace(.x, .y, NA))
# consider excluded outliers also as NAs (=TRUE) in
# the train_na vector (if randomly selected test
# data, the correct years need to be chosen):
train_na <- init_tbl$train_na
for (i in seq_along(excl_outlier)) {
if (!is.null(excl_outlier[[i]])) {
train_na[[i]][match(init_tbl$time_train[[i]][excl_outlier[[i]]],
as.numeric(names(init_tbl$train_na[[i]])))] <- TRUE
}
}
} else {
ind_train_sub <- init_tbl$ind_train
press_train_sub <- init_tbl$press_train
train_na <- init_tbl$train_na
}
gam_tab <- init_tbl[, 1:3]
# Add gamm relevant columns for later merges
gam_tab$model_type <- "gam"
gam_tab$corrstruc <- "none"
# Add models
gam_tab$model <- vector(length = nrow(init_tbl),
mode = "list")
# Compute the IND~Press GAMs using this helper
# function:
apply_gam <- function(ind_name, press_name, ind_ts,
press_ts, train_na_ts, k, family) {
dat <- data.frame(ind = ind_ts, press = press_ts)
names(dat) <- c(ind_name, press_name)
model <- mgcv::gam(stats::as.formula(paste0(names(dat)[1],
" ~ 1 + s(", names(dat)[2], ", k = ", k,
")")), na.action = "na.omit", family = family,
data = dat)
# Save train_na in model
model$train_na <- train_na_ts
return(model)
}
apply_gam_safe <- purrr::safely(apply_gam, otherwise = NA)
temp_mod <- purrr::pmap(.l = list(ind_name = init_tbl$ind,
press_name = init_tbl$press, ind_ts = ind_train_sub,
press_ts = press_train_sub, train_na_ts = train_na),
.f = apply_gam_safe, k = k, family = family) %>%
purrr::transpose()
gam_tab$model <- temp_mod$result
if (all(is.na(temp_mod$result))) {
stop("No IND~pressure GAM could be fitted! Check if you chose the correct error distribution (default is gaussian()).")
} else {
# Save summary for each gam (cannot handle NAs,
# hence use of possibly())
gam_smy <- suppressWarnings(gam_tab$model %>%
purrr::map(.f = purrr::possibly(mgcv::summary.gam,
NA_real_)))
# Get some output from the summary
gam_tab$edf <- get_sum_output(sum_list = gam_smy,
varname = "edf")
gam_tab$r_sq <- get_sum_output(sum_list = gam_smy,
varname = "r.sq")
gam_tab$expl_dev <- get_sum_output(sum_list = gam_smy,
varname = "dev.expl")
gam_tab$p_val <- get_sum_output(sum_list = gam_smy,
varname = "s.table", cell = 4)
# Apply the significant code using external helper
# function
gam_tab$signif_code <- get_signif_code(gam_tab$p_val)
# Save AIC value for each gam (cannot handle NAs,
# hence use of possibly())
gam_tab$aic <- gam_tab$model %>% purrr::map_dbl(.f = purrr::possibly(stats::AIC,
NA_real_))
# Calculate nrmse using external helper function
gam_tab$nrmse <- calc_nrmse(
press = init_tbl$press_test, ind = init_tbl$ind_test,
model = gam_tab$model)
# Get residuals (cannot handle NAs, hence use of
# possibly())
res <- gam_tab$model %>% purrr::map(.f = purrr::possibly(mgcv::residuals.gam,
NA_real_), type = "deviance")
# Test for normality in residual distribution
# (requires self-made safely function if res = NA)
norm_test <- function(x, family) {
p <- stats::ks.test(x = x, "pnorm", mean(x,
na.rm = TRUE), stats::sd(x, na.rm = TRUE))$p.value
return(p)
}
norm_test_safe <- purrr::safely(norm_test,
otherwise = NA_real_)
suppressWarnings(gam_tab$ks_test <- res %>%
purrr::map(.f = norm_test_safe) %>% purrr::transpose() %>%
.$result %>% unlist() %>% round(., 4))
# Test for TAC - needs NA`s in residuals!
res_new <- vector(mode = "list", length = nrow(init_tbl))
for (i in seq_along(res_new)) {
res_new[[i]] <- rep(NA, length(train_na[[i]]))
res_new[[i]][!train_na[[i]]] <- res[[i]]
}
gam_tab$tac <- test_tac(res_new)$tac
# Check for outlier (cook`s distance > 1) in
# residuals (cannot handle NAs, hence use of
# possibly())
cooks_dist <- purrr::map(gam_tab$model, .f = purrr::possibly(stats::cooks.distance,
NA))
warn <- purrr::map_lgl(cooks_dist, ~any(. >
1, na.rm = TRUE))
outlier <- purrr::map2(warn, cooks_dist, ~if (.x ==
TRUE)
which(.y > 1))
# Get correct position of outlier in ind_train
# (important if NAs present) and save in tibble
gam_tab$pres_outlier <- purrr::map(1:length(ind_train_sub),
~if (warn[[.]] == TRUE)
which(!is.na(ind_train_sub[[.]]))[outlier[[.]]])
gam_tab$excl_outlier <- excl_outlier
# end of if-statement (temp_mod$result not NA)
}
# Sort variables
gam_tab <- sort_output_tbl(gam_tab)
# Sort rows
gam_tab <- dplyr::arrange(gam_tab, !!rlang::sym("id"))
# Warning if some models were not fitted
if (any(!purrr::map_lgl(temp_mod$error, .f = is.null))) {
sel <- !purrr::map_lgl(temp_mod$error, .f = is.null)
miss_mod <- gam_tab[sel, 1:3]
miss_mod$error_message <- purrr::map(temp_mod$error,
.f = as.character) %>% purrr::flatten_chr()
message("NOTE: For the following IND~pressure GAMs fitting procedure failed:")
print(miss_mod, n = Inf, tibble.width = Inf)
}
### END OF FUNCTION
return(gam_tab)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.