Analyzing overdispersed count data with the mixpoissonreg package

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Description of the dataset

We will consider the dataset studied in Barreto-Souza and Simas (2016). The motivation for this study was to assess the attendance of high school students with respect to their gender, math score and which program they are enrolled. The data consists on 314 high school juniors from two urban high schools. The response variable, denoted by y, and the covariates of interest are:

Model fitting and inference

It is well-known (see, for instance, https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/) that the Poisson regression is not adequate for fitting this dataset. They also conclude that, when compared to the Poisson regression model, the Negative Binomial is more adequate. Therefore, this indicates that the data is overdispersed.

We will assume that the response variables $Y_i\sim MP(\mu_i, \phi_i)$, that is, that each $Y_i$ follows a mixed possion distribution with mean $\mu_i$ and precision parameter $\phi_i$.

We assume the following regression structures for the mean and precision parameters:

$$\log \mu_i = \beta_0 + \beta_1 {\tt gender}_i + \beta_2 {\tt math}_i + \beta_3 {\tt academic}_i + \beta_4 {\tt vocational}_i $$ and

$$\log \phi_i = \alpha_0 + \alpha_1 {\tt gender}_i + \alpha_2 {\tt math}_i + \alpha_3 {\tt academic}_i + \alpha_4 {\tt vocational}_i.$$ Let us fit this model under the assumption of Negative Binomial (NB) and Poisson Inverse Gaussian (PIG) distributions. In this vignette (for CRAN), we fit using direct maximization of the likelihood to reduce computational cost.

library(mixpoissonreg)

fit_nb_full <- mixpoissonregML(daysabs ~ gender + math + prog | gender + math + prog,
                             model = "NB", data = Attendance)

fit_pig_full <- mixpoissonregML(daysabs ~ gender + math + prog | gender + math + prog,
                             model = "PIG", data = Attendance)

The summary for the NB-regression fitting is:

summary(fit_nb_full)

For the PIG-regressionm, the summary is

summary(fit_pig_full)

These summaries suggest that gender and math covariates are not significant to explain the precision parameter. So we will obtain fits for the NB and PIG regressions without these variables for the precision paramters, then we will perform Wald and likelihood-ratio tests to confirm that indeed they are not significant.

fit_nb_red <- mixpoissonregML(daysabs ~ gender + math + prog | prog,
                             model = "NB", data = Attendance)

fit_pig_red <- mixpoissonregML(daysabs ~ gender + math + prog | prog,
                             model = "PIG", data = Attendance)

Let us check the summaries for these regressions:

summary(fit_nb_red)

and

summary(fit_pig_red)

Now, let us perform a Wald test:

lmtest::waldtest(fit_nb_red, fit_nb_full)

and

lmtest::waldtest(fit_pig_red, fit_pig_full)

Now, let us perform likelihood tests:

lmtest::lrtest(fit_nb_red, fit_nb_full)

and

lmtest::lrtest(fit_pig_red, fit_pig_full)

The above tests indicate that, indeed, gender and math are not significant for the precision parameter. So we will work with the reduced models.

Let us also build a small data frame containing the estimated mean, $\widehat{\mu}$, for some combinations of the covariates. First for the NB regression model:

library(tidyr)
library(dplyr)

gender <- c("female", "male")
prog <- c("Academic", "General", "Vocational")
math <- c(0, 99)

new_cov <- crossing(gender, prog, math)

pred_values_nb <- predict(fit_nb_red, newdata = new_cov)

bind_cols(new_cov, "Predicted_Means_NB" = pred_values_nb)

Now for the PIG regression model:

pred_values_pig <- predict(fit_pig_red, newdata = new_cov)

bind_cols(new_cov, "Predicted_Means_PIG" = pred_values_pig)

Residual analysis

Let us fit the reduced models with envelopes. We will fit these models with the Pearson residual.

set.seed(2021) 
# We are fixing the seed for reproducibility
fit_nb_red_env <- mixpoissonregML(daysabs ~ gender + math + prog | prog,
                             model = "NB", data = Attendance, envelope = 20)

fit_pig_red_env <- mixpoissonregML(daysabs ~ gender + math + prog | prog,
                             model = "PIG", data = Attendance, envelope = 20)

Since this is an illustration, we are using envelope = 50 to reduce computational cost.

Let us check the coverage of the envelopes for the NB fit:

summary(fit_nb_red_env)

and for the PIG fit:

summary(fit_pig_red_env)

By looking at both coverages, the assumption of the response following a mixed Poisson distribution seems reasonable. Furthermore, both regressions seem to provide adequate fit.

Also, notice that, even though the fits seem adequate, their Efron's $R^2$ are low, around 18\%. This is due to the high overdispersion of the responses, together with the fact that the mean is not a good predictor for the response when using counting data. Therefore, for datasets with low Efron's pseudo-$R^2$, such as this one, we recommend to use the mode. We plan on providing a function from prediction based on the mode in the future.

Let us check the diagnostic plots of the NB fit

autoplot(fit_nb_red_env, which = c(1,2))

and for the PIG fit

autoplot(fit_pig_red_env, which = c(1,2))

Notice that in Residuals vs. Obs. number, the residuals appear to be randomly distributed, skewed, and with no noticeable trend. Thus, also suggesting an adequate fit, together with the quantile-quantile plot with simulated envelopes.

Influence analysis

Let us begin with the global influence analysis. First, for the NB regression model:

autoplot(fit_nb_red, which = c(3,4,5))

Second, for the PIG regression model:

autoplot(fit_pig_red, which = c(3,4,5))

Let us check the impact on the estimates of the removal, for example, of case # 94, which was significant for both models, as well as it was the most significant considering both the Cook's distance and generalized Cook's distance. First, for NB regression:

# Relative change for mean-related coefficients
(influence(fit_nb_red)$coefficients.mean[94,] - 
   coefficients(fit_nb_red, parameters = "mean"))/
  coefficients(fit_nb_red, parameters = "mean")

# Relative change for precision-related coefficients
(influence(fit_nb_red)$coefficients.precision[94,] - 
    coefficients(fit_nb_red, parameters = "precision"))/
  coefficients(fit_nb_red, parameters = "precision")

Now, for the PIG regression:

# Relative change for mean-related coefficients
(influence(fit_pig_red)$coefficients.mean[94,] - 
   coefficients(fit_pig_red, parameters = "mean"))/
  coefficients(fit_pig_red, parameters = "mean")

# Relative change for precision-related coefficients
(influence(fit_pig_red)$coefficients.precision[94,] - 
    coefficients(fit_pig_red, parameters = "precision"))/
  coefficients(fit_pig_red, parameters = "precision")

We see that case # 94 mainly impacts the precision parameter estimates. Notice that for PIG regression models, the impact is less severe.

Let us check now the local influence.

First for the NB regression:

local_influence_autoplot(fit_nb_red)

Now, for PIG regression:

local_influence_autoplot(fit_pig_red)

Let us check the influential observations along with their associated covariates and response values. First for the NB regression:

inf_nb_tbl <- tidy_local_influence(fit_nb_red) %>% mutate(.index = row_number()) %>% 
  pivot_longer(!.index, names_to = "perturbation", values_to = "curvature") 
bench_nb_tbl <- local_influence_benchmarks(fit_nb_red) %>% 
  pivot_longer(everything(), names_to = "perturbation", values_to = "benchmark")

inf_nb_bench_tbl <- left_join(inf_nb_tbl, bench_nb_tbl, by = "perturbation") %>% 
  mutate(influential = curvature > benchmark) %>% filter(influential == TRUE) %>%
  select(-influential, -benchmark, -curvature)

data_nb_tbl <- augment(fit_nb_red) %>% mutate(.index = row_number()) %>% 
  select(.index, daysabs, gender, math, prog)

influential_nb <- left_join(inf_nb_bench_tbl, data_nb_tbl, by = ".index")

influential_nb

Let us check the number of unique influential observations:

influential_nb %>% select(.index) %>% unique() %>% count()

Now for the PIG regression:

inf_pig_tbl <- tidy_local_influence(fit_pig_red) %>% mutate(.index = row_number()) %>% 
  pivot_longer(!.index, names_to = "perturbation", values_to = "curvature") 
bench_pig_tbl <- local_influence_benchmarks(fit_pig_red) %>% 
  pivot_longer(everything(), names_to = "perturbation", values_to = "benchmark")

inf_pig_bench_tbl <- left_join(inf_pig_tbl, bench_pig_tbl, by = "perturbation") %>% 
  mutate(influential = curvature > benchmark) %>% filter(influential == TRUE) %>%
  select(-influential, -benchmark, -curvature)

data_pig_tbl <- augment(fit_pig_red) %>% mutate(.index = row_number()) %>% 
  select(.index, daysabs, gender, math, prog)

influential_pig <- left_join(inf_pig_bench_tbl, data_pig_tbl, by = ".index")

influential_pig

Now, the number of unique influential observations for the PIG regression:

influential_pig %>% select(.index) %>% unique() %>% count()

Let us check the informations of case # 94 for the NB and PIG regressions:

influential_nb %>% filter(.index == 94)

influential_pig %>% filter(.index == 94)

So, notice that case # 94 is a female, enrolled in the general program, with a high number of absences but with an unexpected high math score for this number of absences.

It is also noteworthy that case # 94 was considered influential for all local perturbation schemes.

Let us check the observations that were found influential for both models:

ind_nb <- influential_nb %>% select(.index) %>% unique()

ind_pig <- influential_pig %>% select(.index) %>% unique()

ind_common <- intersect(ind_nb, ind_pig)

influential_nb %>% filter(.index %in% ind_common$.index) %>% select(-perturbation) %>% unique()

Now, let us check the influential observations for the NB regression that were not influential for the PIG regression:

ind_nb_pig <- setdiff(ind_nb, ind_pig)

influential_nb %>% filter(.index %in% ind_nb_pig$.index) %>% select(-perturbation) %>% unique()

Finally, the influential observations for the PIG regression that were not influential for the NB regression:

ind_pig_nb <- setdiff(ind_pig, ind_nb)

influential_pig %>% filter(.index %in% ind_pig_nb$.index) %>% select(-perturbation) %>% unique()


Try the mixpoissonreg package in your browser

Any scripts or data that you put into this service are public.

mixpoissonreg documentation built on March 11, 2021, 1:07 a.m.