In inbo/inlatools: Diagnostic Tools for INLA Models

library(knitr)
opts_chunk$set(
  fig.width = 6,
  collapse = TRUE,
  comment = "#>"
)
set.seed(20181011)
no_stored_data <- interactive() ||
  Sys.getenv("WERCKER") == "true" ||
  !file.exists("distribution.Rda")
if (!no_stored_data) {
  load("distribution.Rda")
}

Generate data

We created generate_data() to manufacture a basic data set with variables from different distributions: Poisson, negative binomial, binomial, zero inflated Poisson and zero inflated negative binomial. All variables share the common latent variable eta which is defined as $\eta_{ij} = a + b_i$ where $b_i \sim \mathcal{N}(0, \sigma_r ^ 2)$.

library(tibble)
library(dplyr)
library(ggplot2)
library(INLA)
library(inlatools)

intercept <- 2
sigma_random <- 0.75
nb_size <- 0.5

dataset <- generate_data(
  a = intercept, sigma_random = sigma_random, nb_size = nb_size
)

glimpse(dataset)
summary(dataset)

Overdispersion

Fitting the models

Data sets in ecology often exhibit overdispersion. Let's take the negbin variable from the simulate data. This one has a negative binomial distribution. First we fit the models. Each model has the same covariate structure and the covariate structure is a perfect match with the data generating process. Note that we have to add control.compute = list(config = TRUE) to the inla() call for distribution_check() and control.predictor = list(compute = TRUE) for the fast_distribution_check(). More details on the distinction between those functions will be in the following section.

We fit the model with three different distributions: 1) a negative binomial distribution, 2) a Poisson distribution, and 3) a Poisson distribution and a so called "observation level random effect" (OLRE).

negbin_negbin_model <- inla(
  negbin ~ f(group_id, model = "iid"),
  data = dataset,
  family = "nbinomial",
  control.predictor = list(compute = TRUE)
)
negbin_poisson_model <- inla(
  negbin ~ f(group_id, model = "iid"),
  data = dataset,
  family = "poisson",
  control.compute = list(config = TRUE),
  control.predictor = list(compute = TRUE)
)
bind_rows(
  negbin_negbin_model$summary.fixed %>%
    rownames_to_column("parameter") %>%
    mutate(distribution = "negative binomial"),
  negbin_poisson_model$summary.fixed %>%
    rownames_to_column("parameter") %>%
    mutate(distribution = "Poisson")
) %>%
  select(
    distribution, parameter, mean, lcl = `0.025quant`, ucl = `0.975quant`
  ) -> summary_fixed
bind_rows(
  negbin_negbin_model$summary.hyperpar %>%
    rownames_to_column("parameter") %>%
    mutate(distribution = "negative binomial"),
  negbin_poisson_model$summary.hyperpar %>%
    rownames_to_column("parameter") %>%
    mutate(distribution = "Poisson")
) %>%
  select(
    distribution, parameter, mean, lcl = `0.025quant`, ucl = `0.975quant`
  ) -> summary_hyper

Once a model is fit, we can look at the summary. Both the negative binomial and the Poisson model do a good job when we look at the fixed effects estimates. Note that the true intercept is r intercept. Adding an observation level random effect to the Poisson model to compensate for overdispersion badly affects the fixed effects estimates.

kable(
  summary_fixed,
  digits = 2,
  caption = "Estimates of the fixed effect for the different models"
)

We have set $\sigma_r = r sigma_random$. This is equivalent with a precision of r signif(sigma_random ^ -2, 4). All models seem to recover this precision reasonably well.

kable(
  summary_hyper,
  digits = 2,
  caption = "Hyperparameters of the different models"
)

Distribution checks

The main difference between distribution_check() and fast_distribution_check() is that the latter is based on the fitted values while the former takes the marginal distribution of the fitted values into account. Hence, distribution_check() is more accurate but takes about 10 to 20 times slower. Often the output of fast_distribution_check() is sufficient to highlight the worst problems.

system.time(
  negbin_poisson_dc <- distribution_check(negbin_poisson_model)
)
system.time(
  negbin_poisson_fdc <- fast_distribution_check(negbin_poisson_model)
)

Both distribution_check() and fast_distribution_check return a data.frame with the empirical cumulative distribution of the raw data (ecdf) and the envelope of empirical cumulative distribution from data simulated from the model (mean, lcl and ucl). n is the number of observations with response x.

glimpse(negbin_poisson_fdc)

There is a plot() method for the distribution check data. It shows the ratio of the ecdf of the raw data divided by the ecdf of the simulated data based on the model. If the model makes sense, then the observed ecdf is within the envelope of the simulated ecdf. Hence the ratio observed / expected would be near to 1. This is the reference line depicted in the plot.

plot(negbin_poisson_dc)

The plot from the distribution_check() is very similar to that of the fast_distribution_check(). Notice that the ratio observed / expected is (much) larger than 1 for low values of x. The model with Poisson distribution doesn't capture the low response values very well.

plot(negbin_poisson_fdc)

Let's see what happens if we apply fast_distribution_check() on the model with negative binomial distribution and Poisson distribution with OLRE. The distribution check of the model with negative binomial distribution yields a plot where the envelope of the ratio observed versus expected is always very close to the reference (1).

negbin_negbin_fdc <- fast_distribution_check(negbin_negbin_model)

plot(negbin_negbin_fdc)

Dispersion checks

The classics dispersion measure is to sum the squared Pearson residuals and divide them by the number of observations minus the degrees of freedom. This value should ideally be close to one. Values above one indicate overdispersion, while values below one indicate underdispersion.

The only agreement there is on number the degrees of freedom of mixed models, is that is hard to define them in a generic way. Therefore we take the simulation approach. Suppose we calculate the dispersion $D$ as the sum of the squared Pearson residuals $r_i$ divided by the number of observations $N$ and some degrees of freedom $x$. We calculate this value for the observed data ($D|data$).

$$D = \frac{\sum r_i^2}{N - x}$$

Both $D|data$ and $D|model$ use the same values for $N$ and $x$. Redefining $D$ as the average squared Pearson residuals $D = \sum r_i^2/N$, has no effect on the value of $P(D|data > D|model)$. Hence can omit $x$ from the formula and no longer need to worry about its "true" value.

dispersion_check() returns a list with two items: the dispersion value based on the data (a single value in data) and a vector of dispersion values for a number of simulated data sets (model). The plot() function as a method to handle this dispersion check object. It displays the density of the dispersion of the simulated data sets. The vertical dashed line shows the dispersion of the original data. The title contains $P(D|data > D|model)$.

negbin_negbin_disp <- dispersion_check(negbin_negbin_model)

glimpse(negbin_negbin_disp)
plot(negbin_negbin_disp)

The dispersion check of the model with the negative binomial distribution indicates no over- or underdispersion. The model with the Poisson distribution has clearly overdispersion.

negbin_poisson_disp <- dispersion_check(negbin_poisson_model)

plot(negbin_poisson_disp)

Zero inflation

Let's see what happens if we do the same things for a zero inflated Poisson variable. We'll fit the variable using a Poisson, negative binomial, zero inflated Poisson and zero inflated negative binomial distribution.

pcprior <- list(theta = list(prior = "pc.prec", param = c(1, 0.01)))
zip_poisson_model <- inla(
  zipoisson ~ f(group_id, model = "iid", hyper = pcprior),
  data = dataset,
  family = "poisson",
  control.predictor = list(compute = TRUE)
)
zip_negbin_model <- inla(
  zipoisson ~ f(group_id, model = "iid", hyper = pcprior),
  data = dataset,
  family = "nbinomial",
  control.predictor = list(compute = TRUE)
)
zip_zipoisson_model <- inla(
  zipoisson ~ f(group_id, model = "iid", hyper = pcprior),
  data = dataset,
  family = "zeroinflatedpoisson1",
  control.predictor = list(compute = TRUE)
)
zip_zinegbin_model <- inla(
  zipoisson ~ f(group_id, model = "iid", hyper = pcprior),
  data = dataset,
  family = "zeroinflatednbinomial1",
  control.predictor = list(compute = TRUE)
)
bind_rows(
  zip_negbin_model$summary.fixed %>%
    rownames_to_column("parameter") %>%
    mutate(
      distribution = "negative binomial",
      zeroinflation = FALSE
    ),
  zip_poisson_model$summary.fixed %>%
    rownames_to_column("parameter") %>%
    mutate(
      distribution = "Poisson",
      zeroinflation = FALSE
    ),
  zip_zinegbin_model$summary.fixed %>%
    rownames_to_column("parameter") %>%
    mutate(
      distribution = "negative binomial",
      zeroinflation = TRUE
    ),
  zip_zipoisson_model$summary.fixed %>%
    rownames_to_column("parameter") %>%
    mutate(
      distribution = "Poisson",
      zeroinflation = TRUE
    )
) %>%
  select(
    distribution, zeroinflation, parameter,
    mean, lcl = `0.025quant`, ucl = `0.975quant`
  ) -> summary_zip_fixed
bind_rows(
  zip_negbin_model$summary.hyperpar %>%
    rownames_to_column("parameter") %>%
    mutate(
      distribution = "negative binomial",
      zeroinflation = FALSE
    ),
  zip_poisson_model$summary.hyperpar %>%
    rownames_to_column("parameter") %>%
    mutate(
      distribution = "Poisson",
      zeroinflation = FALSE
    ),
  zip_zinegbin_model$summary.hyperpar %>%
    rownames_to_column("parameter") %>%
    mutate(
      distribution = "negative binomial",
      zeroinflation = TRUE
    ),
  zip_zipoisson_model$summary.hyperpar %>%
    rownames_to_column("parameter") %>%
    mutate(
      distribution = "Poisson",
      zeroinflation = TRUE
    )
) %>%
  select(
    distribution, zeroinflation, parameter,
    mean, lcl = `0.025quant`, ucl = `0.975quant`
  ) %>%
  arrange(parameter, distribution, zeroinflation) -> summary_zip_hyper

Using a distribution without zero inflation adds a downward bias to the intercept.

kable(
  summary_zip_fixed,
  digits = 2,
  caption = "Estimates of the fixed effect for the different models"
)

We have set $\sigma_r = r sigma_random$. This is equivalent with a precision of r signif(sigma_random ^ -2, 4). All distributions lead to a large precision, so they all shrink the group effects. Note that the analyses are based on a very small dataset of only 200 observations and about half of them stem from the zero inflation part. The zero inflation probability is recovered very well. The size parameter of the negative binomial is very small when assuming no zero inflation. This indicates that it handles the zero inflation by increasing the variance. The zero inflated negative binomial has a very large size indicating that there is no overdispersion. Which is the case with a zero inflated Poisson response.

kable(
  summary_zip_hyper,
  digits = 2,
  caption = "Hyperparameters of the different models"
)

Dispersion checks

dispersion_check(zip_poisson_model) -> zip_poisson_disp
dispersion_check(zip_negbin_model) -> zip_negbin_disp
dispersion_check(zip_zipoisson_model) -> zip_zipoisson_disp
dispersion_check(zip_zinegbin_model) -> zip_zinegbin_disp

Using the Poisson distribution we get strong indications for overdispersion.

plot(zip_poisson_disp)

The negative binomial distribution has a weak indication for underdispersion. More important, the range of dispersion values based on the simulation data is quite large.

plot(zip_negbin_disp)

The dispersion checks on both the zero inflation Poisson and the zero inflated negative binomial models give a go-ahead. The observed dispersion is well within the range of the dispersion of the simulated data and that range is quite narrow.

plot(zip_zipoisson_disp)

plot(zip_zinegbin_disp)

Distribution checks

Another feature of the distribution checks is that they can handle a list of models. The resulting data.frame gains a model variables which contains the name of the list (or their index in case of an unnamed list). Plotting the resulting check will create a subplot for each model (using facet_wrap()). By default all subplots will have the same scale. The strong ratio observed / expected for the Poisson model will swamp all information of the other models. Therefore we used scales = "free", resulting in a dedicated x and y axis for each subplot.

list(
  poisson = zip_poisson_model,
  negbin = zip_negbin_model,
  zipoisson = zip_zipoisson_model,
  zinegbin = zip_zinegbin_model
) %>%
  fast_distribution_check() -> zip_fdc

The plot for the Poisson model clearly indicates a problem with the (near) zero values, the data has much more of those than the model predicts. The negative binomial models is better at modelling the large number of zero's but it fails on the small, non-zero values which are overrepresented by the model. Both the zero inflated Poisson and the zero inflated negative binomial models pass the distribution check with flying colours.

glimpse(zip_fdc)
plot(zip_fdc, scales = "free")

Underdispersion

generate_data() has a binomial variable. We use this as an example for underdispersed data. The dispersion check clearly indicates underdispersion. The distribution check has a ratio smaller than 1 has low values.

binom_poisson_model <- inla(
  binom ~ f(group_id, model = "iid"),
  data = dataset,
  family = "nbinomial",
  control.predictor = list(compute = TRUE)
)
binom_poisson_disp <- dispersion_check(binom_poisson_model)
binom_poisson_fdc <- fast_distribution_check(binom_poisson_model)

plot(binom_poisson_disp)

plot(binom_poisson_fdc)

save(
  dataset, summary_fixed, summary_hyper, sigma_random, intercept, nb_size,
  negbin_negbin_fdc, negbin_negbin_disp,
  negbin_poisson_dc, negbin_poisson_fdc, negbin_poisson_disp,
  summary_zip_fixed, summary_zip_hyper, zip_fdc,
  zip_poisson_disp, zip_negbin_disp, zip_zipoisson_disp, zip_zinegbin_disp,
  binom_poisson_fdc, binom_poisson_disp,
  file = "distribution.Rda")

inbo/inlatools documentation built on Sept. 17, 2022, 2:13 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com