knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The influ2
package is based on the original influ
package which was
developed to generate influence statistics and associated plots. The influ
package has functions that use outputs from the glm
function. The influ2
package was developed to use outputs from brms
and relies heavily on the R
packages in the tidyverse
suite of packages. This vignette showcases the
influ2
package.
In this vignette I use a simulated catch per unit effort (CPUE) data set to
model the number of lobsters caught per pot. This data set has a response
variable labelled lobsters
and includes the dependent variables year
(from
2000 to 2017), month
, depth
(in meters), and soak
(the soak time of each
pot in hours). No step-wise variable selection or model selection is done in
this vignette.
The main function for exploring data in influ2
is plot_bubble
which presents
the number of records by two user defined groups within a data set. The bubble
size in plot_bubble
represents the number of records across the entire data
set, and bubbles can either be a single colour or coloured by a third variable
in the data set.
library(knitr) library(tidyverse) library(reshape2) library(brms) library(influ2) library(bayesplot) theme_set(theme_bw()) options(mc.cores = 2)
The simulated data set:
data(lobsters_per_pot) nrow(lobsters_per_pot) head(lobsters_per_pot)
plot_bubble(df = lobsters_per_pot, group = c("year", "month")) + labs(x = "Month", y = "Year")
plot_bubble(df = lobsters_per_pot, group = c("year", "month"), fill = "month") + labs(x = "Month", y = "Year") + theme(legend.position = "none")
Two different models are fitted in brms
. The first model assumes a Poisson
distributioin and includes the
explanatory variables year
, month
, and a 3rd order polynomial for soak
time. The second model assumes a negative binomial distribution and
includes the explanatory variables year
, month
as a
random effect, and a spline for depth
with 3 degrees of freedom.
fit1 <- brm(lobsters ~ year + month + poly(soak, 3), data = lobsters_per_pot, family = poisson, chains = 2, iter = 1500, refresh = 0, seed = 42, file = "fit1", file_refit = "never") fit2 <- brm(lobsters ~ year + (1 | month) + s(depth, k = 3), data = lobsters_per_pot, family = negbinomial(), control = list(adapt_delta = 0.99), chains = 2, iter = 3000, refresh = 0, seed = 1, file = "fit2", file_refit = "never")
criterion <- c("loo", "waic", "bayes_R2") fit1 <- add_criterion(x = fit1, criterion = criterion, file = "fit1") fit2 <- add_criterion(x = fit2, criterion = criterion, file = "fit2")
Fit a model using glm and generate influence statistics using the original influ
package
fit_glm <- glm(lobsters ~ year + month + poly(soak, 3), data = lobsters_per_pot, family = poisson)
myInfl <- Influence$new(model = fit_glm) myInfl$init() myInfl$calc() myInfl$cdiPlot(term = "month")
cdi_month <- plot_bayesian_cdi(fit = fit1, xfocus = "month", yfocus = "year", xlab = "Month", ylab = "Year") cdi_month
myInfl$stepPlot()
The influ2
package cannot generate steps plots automatically like influ
. Each step must
be modelled separately and then a list of models defined as below.
fits <- list(fit1, fit2) plot_step(fits = fits)
plot_index(fit = fit2)
myInfl$stanPlot()
plot_influ(fit = fit1)
myInfl$influPlot()
i1 <- myInfl$influences %>% pivot_longer(cols = -level, names_to = "variable") %>% rename(year = level) i_month <- get_influ(fit = fit1, group = c("year", "month")) %>% group_by(year) %>% summarise(value = mean(delta)) %>% mutate(variable = "month") # i_soak <- get_influ(fit = fit1, group = c("year", "soak")) %>% # group_by(year) %>% # summarise(value = mean(delta)) %>% # mutate(variable = "poly(soak, 3)") # i2 <- bind_rows(i_month, i_soak) # # ggplot(data = i2, aes(x = year, y = exp(value))) + # geom_point(data = i1) + # geom_line(aes(colour = variable, group = variable)) + # facet_wrap(variable ~ ., ncol = 1) + # labs(x = "Year", y = "Influence") + # theme(legend.position = "none")
Here I evaluate model fit using loo and waic. Other options include kfold, loo_subsample, bayes_R2, loo_R2, and marglik.
fit1$criteria$loo fit1$criteria$waic
loo_compare(fit1, fit2, criterion = "loo") %>% kable(digits = 1)
loo_compare(fit1, fit2, criterion = "waic") %>% kable(digits = 1)
Bayesian R2 bayes_R2
for the three model runs
# get_bayes_R2(fits)
# cdi_month
plot_bayesian_cdi(fit = fit2, xfocus = "month", yfocus = "year", xlab = "Month", ylab = "Year")
# plot_bayesian_cdi(fit = fit1, xfocus = "soak", yfocus = "year", # xlab = "Soak time (hours)", ylab = "Year")
# plot_bayesian_cdi(fit = fit2, xfocus = "depth", yfocus = "year", # xlab = "Depth (m)", ylab = "Year")
fits <- list(fit1, fit2) plot_compare(fits = fits)
plot_index(fit = fit1)
The influ2
package is based on the original influ
package which was developed to generate influence plots. The influ
package has functions that use outputs from the glm
function. The influ2
package was developed to use outputs from brms
and relies heavily on the R packages ggplot2
and dplyr
. This vignette showcases the influ2
package with a focus on model diagnostics including posterior predictive distributions, residuals, and quantile-quantile plots.
yrep <- posterior_predict(object = fit1, ndraws = 100)
# ppc_dens_overlay(y = lobsters_per_pot$lobsters, yrep = yrep) + # labs(x = "CPUE", y = "Density") ppc_bars(y = lobsters_per_pot$lobsters, yrep = yrep) + labs(x = "CPUE", y = "Density")
ppc_ecdf_overlay(y = lobsters_per_pot$lobsters, yrep = yrep, discrete = TRUE) + coord_cartesian(xlim = c(0, 10))
# loo <- loo(fit1, save_psis = TRUE) # psis <- loo$psis_object # lw <- weights(psis) # yrep <- posterior_predict(fit1) # ppc_loo_pit_qq(y = lobsters_per_pot$lobsters, yrep = yrep, lw = lw)
# plot_qq(fit = fit1)
# plot_predicted_residuals(fit = fit1, trend = "loess")
# plot_predicted_residuals(fit = fit1, trend = "lm") + # facet_wrap(year ~ .)
# A new style of residual plot fit <- fit1 # Extract predicted values pred <- fitted(fit) %>% data.frame() names(pred) <- paste0("pred.", names(pred)) # Extract residuals resid <- residuals(fit) %>% data.frame() names(resid) <- paste0("resid.", names(resid)) df <- bind_cols(resid, pred, lobsters_per_pot) ggplot(data = df, aes(x = year, y = .data$resid.Estimate)) + geom_pointrange(aes(ymin = .data$resid.Q2.5, ymax = .data$resid.Q97.5), alpha = 0.75) + facet_wrap(month ~ .) + geom_hline(yintercept = 0, linetype = "dashed") + # geom_errorbarh(aes(xmax = .data$pred.Q2.5, xmin = .data$pred.Q97.5, height = 0), alpha = 0.75) + # labs(x = "Predicted values", y = "Residuals") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Implied coefficients (points) are calculated as the normalised fishing year coefficient (grey line) plus the mean of the standardised residuals in each year for each category of a variable. These values approximate the coefficients obtained when an area x year interaction term is fitted, particularly for those area x year combinations which have a substantial proportion of the records. The error bars indicate one standard error of the standardised residuals. The information at the top of each panel identifies the plotted category, provides the correlation coefficient (rho) between the category year index and the overall model index, and the number of records supporting the category.
# plot_implied_residuals(fit = fit2, data = lobsters_per_pot, # year = "year", groups = "month") + # theme(axis.text.x = element_text(angle = 45, hjust = 1))
# plot_implied_residuals(fit = fit2, data = lobsters_per_pot, # year = "year", groups = "depth") + # theme(axis.text.x = element_text(angle = 45, hjust = 1))
Finally, in order to fit to the data in a stock assessment model then indices
can be produced using the get_index
function. The Estimate
and Est.Error
columns are what should be used in stock assessment.
unstd <- get_unstandarsied(fit = fit2) cpue0 <- get_index(fit = fit2) cpue0 %>% mutate(Unstandardised = unstd$Median) %>% select(Year, Unstandardised, Mean, SD, Qlower, Median, Qupper) %>% kable(digits = 3)
ggplot(data = cpue0, aes(x = Year)) + geom_pointrange(aes(y = Median, ymin = Qlower, ymax = Qupper)) + geom_line(aes(y = Mean, group = 1), colour = "red") + scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) + labs(x = NULL, y = "CPUE") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
cpue1 <- cpue0 %>% mutate(Type = "Estimate") cpue2 <- cpue1 %>% mutate(Type = "Simulated wrong") cpue3 <- cpue1 %>% mutate(Type = "Simulated") for (ii in 1:nrow(cpue1)) { sdd <- cpue1$SD[ii] # this is wrong r1 <- rlnorm(n = 5000, log(cpue1$Mean[ii]), sdd) cpue2$Median[ii] <- median(r1) cpue2$Qlower[ii] <- quantile(r1, probs = 0.025) cpue2$Qupper[ii] <- quantile(r1, probs = 0.975) sdd <- log(1 + cpue1$SD[ii] / cpue1$Mean[ii]) r1 <- rlnorm(n = 5000, log(cpue1$Median[ii]), sdd) cpue3$Median[ii] <- median(r1) cpue3$Qlower[ii] <- quantile(r1, probs = 0.025) cpue3$Qupper[ii] <- quantile(r1, probs = 0.975) } ggplot(data = bind_rows(cpue1, cpue2, cpue3), aes(x = Year, y = Median, colour = Type)) + geom_pointrange(aes(ymin = Qlower, ymax = Qupper), position = position_dodge(width = 0.5)) + scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) + labs(x = NULL, y = "CPUE") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.