In thomas-fung/mpcmp: Mean-Parametrized Conway-Maxwell Poisson (COM-Poisson) Regression

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction {#intro}

Conway–Maxwell–Poisson (CMP or COM-Poisson) distributions have seen a recent resurgence in popularity for the analysis of dispersed counts (e.g., @Shmueli2005; @Lord2008; @Lord2010a; @Sellers2010; @sellersConwayMaxwellPoisson). Key features of CMP distributions include the ability to handle both over- and under-dispersion while containing the classical Poisson distribution as a special case. See the corresponding Wikipedia page for a quick summary or @Shmueli2005 for a more detailed overview of the history, features and applications of CMP distributions.

The R (@R-base) mpcmp package of @R-mpcmp provides the functionality for estimating a mean-parametrized Conway-Maxwell-Poisson generalized linear models for dispersed count data. The package is available from the Comprehensive R Archive Network (CRAN) here. @Huang2017b provides the theoretical development of the model that this package bases on.

Conway-Maxwell-Poisson Distributions {#cmpintro}

The CMP distribution was first used by @Conway1962 as a model for a queuing system with dependent service times. A random variable is said to have a (standard) CMP distribution with rate parameter $\lambda$ and dispersion parameter $\nu$ if its probability mass function (pmf) is given by $$ P(Y =y|\lambda,\nu)= \frac{\lambda^y}{(y!)^{\nu}}\frac{1}{Z(\lambda,\nu)}, \quad y =0,1,2,..., $$ where $$ Z(\lambda,\nu)= \sum^{\infty}_{y=0}\frac{\lambda^y}{(y!)^{\nu}}, $$ is a normalizing constant. The CMP includes Poisson ($\nu = 1$), geometric ($\nu = 0$, $\lambda < 1$) and Bernoulli ($\nu \to \infty$ with probability $\lambda/(1+\lambda)$) as its special cases.

One of the major limitations of CMP distributions is that it is not directly parametrized via the mean as it does not have closed-form expression for its moments in terms of the parameters $\lambda$ and $\nu$. For the first two moments, approximations can be obtained as $$ \begin{aligned} E(Y) &\approx \lambda^{\frac{1}{\nu}}-\frac{\nu-1}{2\nu}\quad \text{and} \quad Var(Y) \approx \frac{1}{\nu}E(Y), \end{aligned} $$ and they can be particularly accurate for $\nu\leq 1$ or $\lambda>10^{\nu}$ (see @Shmueli2005). If $\nu < 1$, CMP is overdispersed. In reverse, CMP is underdispersed when $\nu>1$. Here is a plot of the density for a few CMP distributions with mean $\mu = 5$:

library(ggplot2)
dat <- data.frame(x = rep(0:20, 3), 
                  density = c(mpcmp::dcomp(0:20, 5, 0.2), mpcmp::dcomp(0:20, 5, 1), 
                              mpcmp::dcomp(0:20, 5, 2)),
                  nu = rep(c("nu = 0.5", "nu = 1", "nu = 1.5"), c(21, 21, 21)),
                  yend = 0)
ggplot(dat, aes(x=x, y= density, xend = x, yend= yend)) + 
  geom_point()+
  geom_segment() + 
  facet_wrap(~nu)

Conway-Maxwell-Poisson Generalized Linear Models {#glm}

As CMP is one of a few distributions that can handle both under- and over-dispersion, the aim is to extend the GLM formulation to the CMP case so that one can model the relationship between $Y$ and the predictors $X$. Given a set of covariates $X \in R^q$, @Sellers2010 proposed a GLM for count response $Y$ that can be specified via $$ Y|X ∼ CMP(\lambda,\nu), \quad \text{s.t.} \quad \log\lambda = X^{\top}\beta $$ where $\beta \in R^q$ is a vector of regression coefficients. This model, however, does not provide a closed-form relationship between $E(Y)$ and the linear predictor, making it incompatible with other commonly used log-linear models.

As it is more convenient and interpretable to model the mean $\mu = E(Y)>0$ of the distribution directly, @Huang2017b proposed to parametrize the CMP distribution via the mean: $$ P(Y = y|\mu, \nu) = \frac{\lambda(\mu,\nu)^{y}}{(y!)^{\nu}}\frac{1}{Z(\lambda(\mu,\nu),\nu)}, \quad y = 0,1,2,\ldots, $$ where the rate $\lambda(\mu,\nu)$ is defined as the solution to the mean constraint: $$ \mu= \sum^{\infty}{y=0} y\frac{\lambda^y}{(y!)^{\nu}}\frac{1}{Z(\lambda,\nu)}. $$ We shall denote this as CMP${\mu}(\mu, \nu)$ distribution to distinguish it from the original/standard one.

A GLM that based on CMP${\mu}$ can then be specified via $$ Y|X ∼ CMP{\mu}(\mu(X^{\top}\beta),\nu), $$ where $$ E(Y|X) = \mu(X^{\top}\beta) = \exp(X^{\top}\beta). $$ Note that GLM that based on CMP$_{\mu}$ is a genuine GLM, so all the familiar key features of GLMs (e.g., @mccullagh1989generalized, Chapter 2) are retained.

The mean-dispersion specification makes CMP$_{\mu}$ directly comparable and compatible other commonly used log-linear regression models for counts. In particular, the mean $\mu = \exp(X^{\top}\beta)$ is functionally independent of the dispersion parameter $\nu$, making it similar in structure to the familiar Negative Binomial regression model for overdispersed counts.

The model can also be extended to allow varying dispersion, i.e. $\nu$ itself is modelled via a regression: $$ \nu = \exp(\tilde{X}^{\top}\gamma), $$ where $\tilde{X}$ is some covariates.

The `mpcmp` package

The main modelling function in our mpcmp package is glm.cmp(). The optimisation is done using Fisher Scoring updates to take advantage of the fact the CMP$_{\mu}$ belongs to the exponential family. Notice that this is a constrained optimisation problem as we have to maintain mean constraints at all times, which makes the implementation is a bit more challenging.

The package implements similar methods to those related to glm objects. If you used glm() previously, you should feel right at home. Once a fitted model object has been obtained, there are assessor functions available to extract the coefficients (coef(), or its alias coefficients()), the fitted values (fitted() or its alias fitted.values()), the residuals (residuals() or its alias resid()), the model frame (model.frame()), the number of observations (nobs()), the log-likelihood (logLik()), and the AIC (AIC()).

You can also use plot() or autoplot() to obtain diagnostic plots of the fitted model.

Let's go through some examples to show you what our package can do!

Constant Overdispersion Example: `attendance`

The attendance dataset (originally from https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/) examines the relationship between the number of days absent from school and the gender, maths score and academic program of 314 students from two urban high schools.

library(mpcmp)
data(attendance)
library(ggplot2)
ggplot(attendance, aes(daysabs, fill = prog)) + geom_histogram(binwidth = 1) + 
  facet_grid(prog ~ ., margins = TRUE, scales = "free")

It appears students enrolled into the General program tend to miss more days of school than those enrolled into the Academic or the Vocational program. There is some overdispersion in this dataset. We can use glm.cmp() to fit the mean-parametrized CMP model to try to account for that overdispersion.

M.attendance <- glm.cmp(daysabs~ gender+math+prog, data=attendance)
summary(M.attendance)

beta <- abs(round(coef(M.attendance)[5], 3))
mul_factor <- round(exp(beta),1)

As mpcmp is directly comparable and compatible with other commonly used log-linear regression models for counts, interpreting parameters is straight forward. Our model estimates that students in the General program (the reference level) are expected to miss exp(+r beta) = r mul_factor times more days of school compared to students in the Vocational program.

To see whether a simpler Poisson model, i.e. $\nu = 1$ is adequate, we can run the model through the LRTnu() function:

LRTnu(M.attendance)

As the P-value is small, we conclude that dispersion is present in this data set.

One of the key features of the mpcmp package is that it provides a range of diagnostic plots.

autoplot(M.attendance)

Please refer to ?autoplot to see what diagnostic plots are available for a cmp class object.

The mpcmp package also supports broom tidier method.

tidy(M.attendance)

This means "cmp" objects can take advantage of any packages that required broom support such as the modelsummary package:

M_poisson <- glm(daysabs~ gender+math+prog, data=attendance, family = poisson)
M_nb <- MASS::glm.nb(daysabs~ gender+math+prog, data=attendance)
modelsummary::modelsummary(list(mpcmp = M.attendance, 
                                Poisson = M_poisson, 
                                Neg.Bin = M_nb))

We also implemented a bunch of commonly used methods for "cmp" objects:

methods(class = "cmp")

Constant Underdispersion Example: `takeoverbids`

One of the key strength of the CMP distribution is that it can handle underdispersion. Here is an example demonstrating that. A dataset from @Cameron1997a that gives the number of bids received by 126 US firms that were successful targets of tender offers during the period 1978-85. The dataset comes with a set of explanatory variables such as defensive actions taken by management of target firm, firm-specific characteristics and intervention by federal regulators.

If we fit a Poisson distribution to the marginal distribution of the number of bids, we have:

data("takeoverbids")
ggplot(takeoverbids) + 
  geom_bar(aes(x= numbids, y = ..count../sum(..count..)))+
  geom_pointrange(aes(x = numbids,  
                        y = dpois(numbids, mean(numbids)), 
                        ymin = 0, 
                        ymax = dpois(numbids, mean(numbids))),
                  colour = "red") + 
  labs(title = "Number of bids with Poisson fit", y = "Probability")

We can see that the response variable is more concentrated than a Poisson fit, which is a sign of underdispersion.

Here, we can use glm.cmp() to fit the mean-parametrized CMP model to try to account for that underdispersion.

M.bids <- glm.cmp(numbids ~ leglrest + rearest + finrest + whtknght
    + bidprem + insthold + size + sizesq + regulatn, data=takeoverbids)
tidy(M.bids)
LRTnu(M.bids)

From the likelihood ratio test, we can see that the dispersion parameter is significantly different to 1, suggesting the model is trying to account for that underdispersion.

We can also compare the results with a Poisson model.

M.bids.pois <- glm(numbids ~ leglrest + rearest + finrest + whtknght
    + bidprem + insthold + size + sizesq + regulatn, data=takeoverbids,
    family = poisson)
modelsummary::modelsummary(list(mpcmp = M.bids, Poisson = M.bids.pois),
                           statistic_vertical = FALSE, stars = TRUE)

If we compare the two models, we can see that the CMP${\mu}$ model fitted the data better. We can also see that all the estimated standard errors under CMP${\mu}$ model are smaller which in turn makes the p-values for both leglrest and bidprem to drop below 0.05 instead of 0.1 under the Poisson model.

Varying Dispersion Example: `sitophilus`

@Ribeiro2013 carried out an experiment to assess the bioactivity of extracts from different parts (seeds, leaves and branches) of Annona mucosa (Annonaceae) to control Sitophilus zeamaus (Coleoptera: Curculionidae), a major pest of stored maize/corn in Brazil. This dataset can be found as part of the cmpreg package of @R-cmpreg.

In this experiment:

10g of corn and 20 animals adults were placed in each Petri dish.
Extracts prepared with different parts of mucosa or just water (control) were completely randomised with 10 replicates.
The numbers of emerged insects (progeny) after 60 days were recorded and some descriptive statistics can be found below.

library(tidyverse)
data(sitophilus)
sitophilus %>% group_by(extract) %>% summarise(mu = mean(ninsect), 
                                               sigma2 = var(ninsect))

It appears that the treatment may have some impact on both the mean and the variance of the number of progeny. Recall that non-constant variance is the norm in a GLM, non-constant variances do not necessarily mean varying dispersions. Nonetheless, we will try to use the treatment to explain the dispersion in the data set. To do that we specify the formula_nu argument in the glm.cmp() function.

data(sitophilus)
M.sit <- glm.cmp(formula = ninsect ~ extract, formula_nu = ~ extract, data = sitophilus)
summary(M.sit)

From the model summary, we can see that there is not much evidence to suggest the dispersions are varying across treatment groups. We can also formally test whether a constant dispersion model is sufficient here using a likelihood ratio test as the two models are nested:

M.sit2 <- update(M.sit, formula_nu = NULL)
cmplrtest(M.sit, M.sit2)

As a result, our final model is

summary(M.sit2)

Other packages for fitting Conway-Maxwell Poisson Model

There are other R packages to deal with CMP models, and they all contribute to the writing and construction of this package.

compoisson: Routines for density and moments of the CMP distribution under original parametrization by @R-compoisson
CompGLM: Fit CMP models under original parametrization (includes dispersion modelling) by @R-CompGLM.
COMPoissonReg: Fit CMP models under original parametrization (includes zero-inflation and dispersion modelling) by @R-COMPoissonReg.
cmpreg: Fit Mean-type parametrized CMP models (includes dispersion modelling) by @R-cmpreg. The authors purposed to regress on the approximated mean.
DGLMExtPois: Fit mean-parametrized CMP model (includes dispersion modelling) using nloptr by @R-DGLMExtPois.
glmmTMB: Fit (among other) CMP models under a different mean-parametrization (includes zero-inflation, dispersion modelling and random effects) by @R-glmmTMB.