knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(gammacount)

library(ggplot2)
library(gridExtra)
library(foreach)
library(dplyr)

Gamma-Count (GC) Distribution

The gamma-count distribution models the count of event arrivals in an interval when the times between events are distributed according to a gamma distribution.

This distribution was first derived in @winkelmann1995duration. @zeviani2013gammacount also provides a good overview of the distribution and its potential uses. It is a generalization of a Poisson distribution, since the exponentially distributed arrival times of a Poisson process are special cases of the gamma distributed arrival times in this process.

It is also a renewal process and @karlin1975renewal provides a valuable overview of the theory of such processes. In this vignette I use the phrase "gamma-count process" to refer to the renewal process with gamma inter-renewal times.

Distribution Summary \label{sec:gc-definition}

Let $\delta_i \sim \mathrm{gamma}(\alpha,\beta)$, $i \geq 1$ be independent and identically distributed gamma wait times between events, so that $\tau_n = \sum_{i=1}^n \delta_i$ is the arrival time of the $n^\text{th}$ event. I say $X \sim \mathrm{gc}(\lambda, \alpha, \beta)$ if $X$ is the count of these events in the interval $(0, \lambda]$.

$$ X \leq n \iff \tau_{n+1} > \lambda \ P(X \leq n) = P(\tau_{n+1} > \lambda) \ \tau_{n+1} = \sum_{i=1}^{n+1} \delta_i \sim \mathrm{gamma}((n+1)\alpha, \beta) $$

Then the CDF of the gamma-count distribution can be expressed in terms of the CDF of the gamma distribution, which is available in base R with pgamma. Equivalently, $$P(X \leq n) = Q((n+1)\alpha, \beta\lambda),$$ where $Q(s,x) = \Gamma(s)^{-1}\int_x^\infty t^{s-1}e^{-t} \mathrm{d}t$ is the upper regularized incomplete gamma function.

Confounding of $\beta$ and $\lambda$

Notice that the distribution of $X$ depends only on the product $\beta\lambda$ and not on either term individually. The parameters are thus redundant. In this package, by default we set $\beta = \alpha$ so that the inter-arrival times have mean 1 and $\lambda$ approximately models the mean of $X$.

Under and Over-dispersion; Clustering and Regularity

The effect of $\alpha$ on the gamma-count distribution is changing its dispersion. The value $\alpha=1$ results in the equidispersed Poisson distribution. (The variance of a Poisson distributed random variable always equals its mean.) Values of $\alpha > 1$ result in an underdispersed distribution (lower variance), and values of $\alpha < 1$ result in an overdispersed distribution (higher variance).

lambda <- 15
x <- seq(0, 30)

equi <- dgc(x, lambda, 1)
over <- dgc(x, lambda, 0.25)
under <- dgc(x, lambda, 4)

ymax <- max(c(equi, over, under))
par(mfrow=c(1,3))
barplot(equi,  names.arg=x, main="Equidispersion", ylim=c(0, ymax))
barplot(over,  names.arg=x, main="Overdispersion", ylim=c(0, ymax))
barplot(under, names.arg=x, main="Underdispersion", ylim=c(0, ymax))

The parameter $\alpha$ can also be thought of as affecting clustering and regularity in the underlying gamma-count process (i.e. the event arrivals themselves). For $\alpha < 1$ there is a high probability that an event will immediately follow the previous one but then a long tail of large arrival times, resulting in clusters of events with larger gaps between the clusters. On the other hand for $\alpha > 1$, there is a lower probability of an arrival time being very far from the expected value of 1, resulting in more regularly spaced event arrivals. @winkelmann1995duration describes this in terms of hazard functions: for shape parameters greater than 1, gamma distributions have increasing hazard functions while for shape parameters less than 1, gamma distributions have decreasing hazard functions.

set.seed(20)

lambda <- 30

n <- 100
delta.equi  <- rgamma(n, shape=1, rate=1)
delta.over  <- rgamma(n, shape=0.25, rate=0.25)
delta.under <- rgamma(n, shape=4, rate=4)
tau.equi  <- cumsum(delta.equi)
tau.over  <- cumsum(delta.over)
tau.under <- cumsum(delta.under)

plot(x=tau.equi, y=rep(0.5, n), pch="x",  col="black",
     main="Event Arrival Times", xlab="Time (t)", ylab="", yaxt="n",
     ylim=c(0,2.5), xlim=c(0, lambda))
points(x=tau.over, y=rep(1,n), col="red", pch="x")
points(x=tau.under, y=rep(1.5,n), col="blue", pch="x")
legend("top", legend=c("Equidispersed", "Clustered","Regular"),
       pch="x", col=c("black", "red", "blue"), horiz=TRUE)

In the figure above, roughly the same number of events have arrived between 0 and $\lambda=r lambda$: r sum(tau.equi <= lambda) in the equidispersed case, r sum(tau.over <= lambda) in the overdispersed case, and r sum(tau.under <= lambda) in the underdispersed case. But $\alpha$ has affected the appearance and the spacing of the events. If this simulation were repeated, this would result in the overdispersed count having a high variance and the underdispersed count having a low variance as seen earlier.

Early process behavior

There is a problem that occurs very early in this process when $\alpha$ is small (overdispersion) or large (underdispersion). Because the clock begins ticking for the first arrival $\delta_1$ at $t=0$, it is as if there were an uncounted event which arrived exactly at $t=0$. Because we also start counting events at $t=0$, the behavior that occurs here influences our count. If $\alpha \gg 1$ then the first counted event will be unlikely to occur soon due to regularity, and for $\alpha \approx 0$ the process is very likely to start with an immediate cluster of early events before the next longer gap.

For a Poisson process (special case of $\alpha = 1$) this doesn't matter because the exponential distribution is memoryless and the behavior near $t=0$ is the same as the behavior elsewhere. If $\lambda$ is large this also has less effect because the early times account for less of the total watch time. But if $\lambda$ is small, then this early behavior dominates the whole process.

This issue is illustrated by the mean and variance plots below for varying values of $\alpha$ and small $\lambda$.

lambda.vals <- seq(0.02, 3, by=0.02)
alpha.vals <- c(1, 2, 10, 20, 0.5, 0.1)
max.x <- 200

# Create data frame of means and variances for each lambda and alpha
meanvar <- foreach(alpha=alpha.vals, .combine="rbind") %do% {
  foreach(lambda=lambda.vals, .combine="rbind") %do% {
    x <- c(0, seq_len(max.x))
    probs <- dgc(x, lambda, alpha)
    data.frame(lambda=lambda, alpha=alpha, under=(alpha >= 1), over=(alpha <= 1),
               mean=sum(probs * x),
               var=sum(probs * x^2) - sum(probs * x)^2)
  }
}
meanvar$alpha <- factor(meanvar$alpha, levels=alpha.vals)

meanplot.under <- ggplot(meanvar %>% filter(under == TRUE), aes(x=lambda, y=mean, color=alpha)) + 
  geom_line() + 
  ggtitle("Underdispersed Means")
varplot.under <- ggplot(meanvar %>% filter(under == TRUE), aes(x=lambda, y=var, color=alpha)) + 
  geom_line() + 
  ggtitle("Underdispersed Variances")
meanplot.over <- ggplot(meanvar %>% filter(over == TRUE), aes(x=lambda, y=mean, color=alpha)) + 
  geom_line() + 
  ggtitle("Overdispersed Means")
varplot.over <- ggplot(meanvar %>% filter(over == TRUE), aes(x=lambda, y=var, color=alpha)) + 
  geom_line() + 
  ggtitle("Overdispersed Variances")
grid.arrange(meanplot.under, varplot.under, meanplot.over, varplot.over, ncol=2, nrow=2)

In particular, notice that the mean for $\alpha < 1$ increases very quickly at first. This behavior would make it difficult to model counts with low mean and high dispersion using this distribution. For $\alpha > 1$, notice that the mean lags behind $\lambda$ and sometimes has somewhat step function-like behavior, while the variance does not increase monotonically with $\lambda$. This makes modeling low mean underdispersed counts difficult as well, since small changes in the parameters may result in large and unintuitive changes to the distribution.

Stationary Gamma-count Process

To remedy this behavior for small $\lambda$ (low mean) counts, we will move to a stationary gamma-count process. A stationary renewal process is one where the process is assumed to start in the infinite past before the window where events are counted. The general derivation of a stationary renewal process is available in @karlin1975renewal. In summary, we give the first arrival time, $\delta_1 = \tau_1$, the limiting excess life distribution of the gamma-count process. The times $\delta_i \sim gamma(\alpha, \beta)$ for $i \geq 2$ as before.

The stationary gamma-count distribution, specifically, was derived in @baker2018flexible who also shows its usefulness in data analysis. @baker2018flexible calls this the ERP-$\gamma$ distribution and expresses its probability mass function via a different approach to what is used here.

If $X$ is the resulting count random variable, we say $X \sim \mathrm{sgc}(\lambda, \alpha, \beta)$. "SGC" stands for stationary gamma count.

Gamma Limiting Excess Life Distribution

The limiting excess life distribution has density $$f(t) = \frac{\beta}{\alpha}Q(\alpha, \beta t).$$ See @karlin1975renewal for the general case of limiting excess life distributions.

In this package, the functions dglel, pglel, qglel and rgelel refer to the gamma limiting excess life distribution. In this vignette, we say $\tau_1 \sim \mathrm{glel}(\alpha, \beta)$.

Limiting Excess Life CDF

To find the CDF of $\tau_1$, we use the antiderivative identity $\int x^{b-1}\Gamma(s, x) \mathrm{d}x = (x^b \Gamma(s, x) - \Gamma(s+b, x))/b,$ where $\Gamma(s,x)$ is the upper incomplete gamma function (not regularized). We then get $$F(t) = 1 - \left(Q(\alpha+1, \beta t) - \frac{\beta t}{\alpha}Q(\alpha, \beta t)\right).$$

Later Arrival Time Distributions

Now that we have the distribution for $\delta_1 = \tau_1 \sim \mathrm{glel}(\alpha, \beta)$ defined above, the natural next question is if we can similarly know the distribution of later arrival times, $\tau_n = \sum_{i=1}^n \delta_i$ for $n > 1$. Recall that $\delta_i \sim \mathrm{gamma}(\alpha, \beta)$ for $i > 1$.

We will say that $\tau_n \sim \mathrm{sga}(n, \alpha, \beta)$, so that $\mathrm{glel}(\alpha, \beta) = \mathrm{sga}(1, \alpha, \beta)$. "SGA" stands for stationary gamma arrival. In this package, the functions dsga, psga, qsga and rsga relate to these distributions.

Arrival time densities

We want the density of $\tau_n = \tau_1 + \sum_{i=2}^n \delta_i$ for $n > 1$. For this derivation, define $\Delta = \sum_{i=2}^n \delta_i$, and note that $\Delta \sim \mathrm{gamma}((n-1)\alpha, \beta)$. We can do the convolution of the densities $f_{\tau_1}$ of $\tau_1$, and $f_\Delta$ of $\Delta$ to find the density of their sum.

$$ f_{\tau_n}(x) = \int_0^{x} f_{\tau_1}(x - t)f_\Delta(t) \mathrm{d}t $$ If $Y$ is a random variable such that $Y \sim \mathrm{gamma}(\alpha, \beta)$, then we can rewrite the integral above as \begin{align} f_{\tau_n}(x) &= \int_0^{x} \mathrm{P}(Y > x - t)f_\Delta(t) \mathrm{d}t \ &= \int_0^\infty \mathrm{P}(Y > x - t)f_\Delta(t) \mathrm{d}t - \int_x^\infty f_\Delta(t) \mathrm{d}t \ &= \mathrm{P}(Y+\Delta > x) - \mathrm{P}(\Delta > x), \end{align} because $\mathrm{P}(Y > x - t) = 1$ for $t > x$, and noticing that the first integral in the next step is a convolution of two gamma random variables. Since $Y + \Delta \sim \mathrm{gamma}(n\alpha, \beta)$ we have this density also in terms of gamma CDFs.

$$ f_{\tau_n}(t) = \frac{\beta}{\alpha}\left(Q(n\alpha, \beta t) - Q((n-1)\alpha, \beta t)\right), $$ where $Q(s,x) = \Gamma(s)^{-1}\int_x^\infty t^{s-1} e^{-t} \mathrm{d}t$ is the regularized upper incomplete gamma function.

Arrival time CDFs

To find the CDF of $\tau_n$, we can use the identity $\int \Gamma(s, x) \mathrm{d}x = x\Gamma(s, x) - \Gamma(s+1, x) + C$ for $\Gamma(s, x) = \int_x^\infty t^{s-1} e^{-t} \mathrm{d}t$ the upper incomplete gamma function. \begin{align} \int f_{\tau_n}(t) \mathrm{d}t &= \int \frac{\beta}{\alpha}\Gamma(n\alpha, \beta t)/\Gamma(n\alpha) \mathrm{d}t - \int \frac{\beta}{\alpha}\Gamma((n-1)\alpha, \beta t)/\Gamma((n-1)\alpha) \mathrm{d}t \ &= \frac{1}{\alpha\Gamma(n\alpha)}(\beta t \Gamma(n\alpha, \beta t) - \Gamma(n\alpha + 1, \beta t)) - \frac{1}{\alpha\Gamma((n-1)\alpha)}(\beta t \Gamma((n-1)\alpha, \alpha t) - \Gamma((n-1)\alpha + 1, \beta t)) + C\ &= \left(\frac{\beta t}{\alpha} Q(n\alpha, \beta t) - n Q(n\alpha + 1, \beta t)\right) - \left(\frac{\beta t}{\alpha} Q((n-1)\alpha, \beta t) - (n-1)Q((n-1)\alpha + 1, \beta t)\right) + C \end{align}

At $t=0$, this antiderivative evaluates to $-1$, which gives us the constant of integration.

Stationary Gamma-Count distribution

The CDF and PMF for the stationary gamma-count distribution can be derived through the relation to the $\mathrm{sga}(n, \alpha, \beta)$ distribution. If $X \sim \mathrm{sgc}(\lambda, \alpha, \beta)$, then $$ X \leq n \iff \tau_{n+1} > \lambda $$ where $\tau_{n+_1} \sim \mathrm{sga}(n+1, \alpha, \beta)$.

Simulations and demonstration of SGC distribution

Mean and Variance at small $\lambda$

lambda.vals <- seq(0.02, 3, by=0.02)
alpha.vals <- c(1, 2, 10, 20, 0.5, 0.1)
max.x <- 200

# Create data frame of means and variances for each lambda and alpha
meanvar <- foreach(alpha=alpha.vals, .combine="rbind") %do% {
  foreach(lambda=lambda.vals, .combine="rbind") %do% {
    x <- c(0, seq_len(max.x))
    probs <- dsgc(x, lambda, alpha)
    data.frame(lambda=lambda, alpha=alpha, under=(alpha >= 1), over=(alpha <= 1),
               mean=sum(probs * x),
               var=sum(probs * x^2) - sum(probs * x)^2)
  }
}
meanvar$alpha <- factor(meanvar$alpha, levels=alpha.vals)

meanplot.under <- ggplot(meanvar %>% filter(under == TRUE), aes(x=lambda, y=mean, color=alpha)) + 
  geom_line() + 
  ggtitle("Underdispersed Means")
varplot.under <- ggplot(meanvar %>% filter(under == TRUE), aes(x=lambda, y=var, color=alpha)) + 
  geom_line() + 
  ggtitle("Underdispersed Variances")
meanplot.over <- ggplot(meanvar %>% filter(over == TRUE), aes(x=lambda, y=mean, color=alpha)) + 
  geom_line() + 
  ggtitle("Overdispersed Means")
varplot.over <- ggplot(meanvar %>% filter(over == TRUE), aes(x=lambda, y=var, color=alpha)) + 
  geom_line() + 
  ggtitle("Overdispersed Variances")
grid.arrange(meanplot.under, varplot.under, meanplot.over, varplot.over, ncol=2, nrow=2)

Notice now, compared to the same plots for the standard gamma-count distribution, that $\lambda$ directly parameterizes the mean of the distribution and the waviness in the variance of underdispersed $\alpha > 1$ is reduced, although not entirely eliminated.

References



ianmtaylor1/gammacount documentation built on April 2, 2021, 8:31 p.m.