In ianmtaylor1/gammacount: Gamma-Count Distribution Functions

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(gammacount)

General Guidelines

In this package, the guiding principles for ensuring numerical accuracy of computations are

Offload as much as possible to base R functions such as pgamma.
Work with probabilities and densities in the log scale, and exponentiate as a last step if required.
When given the choice, work with the complement of a probability if the complement is smaller.

This vignette describes how these principles are applied to some specific computations in this package.

Log-scale helper functions

This package contains two helper functions (not exported), logsumexp and logdiffexp which can perform operations without leaving the log scale: \begin{align} \mathrm{logsumexp}(a, b) &= \log(e^a + e^b) \ \mathrm{logdiffexp}(a, b) &= \log(e^a - e^b) \end{align}

Each works by factoring out the largest value ($\max(a, b)$ in the logsumexp case, and $a$ in the logdiffexp case). \begin{align} \mathrm{logsumexp}(a, b) &= \max(a,b) + \log(1 + e^{-|a-b|}) \ \mathrm{logdiffexp}(a, b) &= a + \log(1 - e^{b - a}) \end{align}

In this form, logsumexp can take advantage of R's log1p function, and logdiffexp uses either log1p(-exp(b-a)) or log(-expm1(b-a)) depending on the magnitude of $b-a$. Doing this minimizes the loss of accuracy relative to a naive implementation of this operation.

Arrival Time Densities

Recall that the density of the $n^{th}$ arrival time after a random start time has the density $$f_{\tau_n}(t) = \frac{\beta}{\alpha}(Q(n\alpha, \beta t) - Q((n-1)\alpha, \beta t)).$$ For $n \geq 2$, $\lim_{t\to 0} f_{\tau_n}(t) = \lim_{t\to \infty} f_{\tau_n}(t) = 0$. On the left tail this is because both $Q$ functions approach one, while on the right tail this is because both $Q$ functions approach zero.

On the log scale, finding the difference between two probabilities $p$ and $p+\epsilon$ is subject to less loss of accuracy when $p$ is small than large. So the density as written will be more accurate for large $t$ than for small $t$. To improve accuracy for small $t$, we notice that the density can also be written in terms of regularized lower incomplete gamma functions $$f_{\tau_n}(t) = \frac{\beta}{\alpha}(P((n-1)\alpha, \beta t) - P(n\alpha, \beta t)).$$ This form of the density has the opposite behavior for small $t$, as both $P$ functions approach zero.

We choose the distribution's mode as a convenient cutoff point to switch between using each representation and use the $P$ form for $t$ less than the mode, and use the $Q$ form for $t$ greater than the mode.

Arrival Time CDFs

The CDF of the arrival time distribution is given by: $$\int_0^t f_{\tau_n}(u) \mathrm{d}u = (t Q(n\alpha, \alpha t) - n Q(n\alpha + 1, \alpha t)) - (t Q((n-1)\alpha, \alpha t) - (n-1)Q((n-1)\alpha + 1, \alpha t)) + 1$$ There is a lot of potential for loss of accuracy in the final $+1$, so we use equivalent forms for either the lower CDF $F_{\tau_n}(t)$ or the upper CDF $1 - F_{\tau_n}(t)$.

First, the upper CDF: \begin{align} 1 - F_{\tau_n}(t) &= 1 - \left[\left(\frac{\beta t}{\alpha} Q(n\alpha, \beta t) - n Q(n\alpha + 1, \beta t)\right) - \left(\frac{\beta t}{\alpha} Q((n-1)\alpha, \beta t) - (n-1)Q((n-1)\alpha + 1, \beta t)\right) + 1\right] \ &= \left(n Q(n\alpha + 1, \beta t) - \frac{\beta t}{\alpha} Q(n\alpha, \beta t)\right) - \left((n-1)Q((n-1)\alpha + 1, \beta t) - \frac{\beta t}{\alpha} Q((n-1)\alpha, \beta t)\right) \ &= \left(n Q(n\alpha + 1, \beta t) + \frac{\beta t}{\alpha} Q((n-1)\alpha, \beta t)\right) - \left((n-1)Q((n-1)\alpha + 1, \beta t) + \frac{\beta t}{\alpha} Q(n\alpha, \beta t)\right) \end{align}

This form, computed on the log scale, preserves accuracy in the right tail. The only call to logdiffexp will be with arguments which are both large negative log probabilities.

For the lower CDF, we need to start with the alternate representation of the density, $$f_{\tau_n}(t) = \frac{\beta}{\alpha}(P((n-1)\alpha, \alpha t) - P(n\alpha, \alpha t))$$

Integrating this gives the CDF: \begin{align} F_{\tau_n}(t) &= \int_0^t \left[\frac{\beta}{\alpha}P((n-1)\alpha, \beta u) - \frac{\beta}{\alpha}P(n\alpha, \beta u)\right] \mathrm{d}u \ &= \left(\frac{\beta t}{\alpha} P((n-1)\alpha, \beta t) - (n-1)P((n-1)\alpha + 1, \beta t)\right) - \left(\frac{\beta t}{\alpha} P(n\alpha, \beta t) - n P(n\alpha + 1, \beta t)\right) \ &= \left(\frac{\beta t}{\alpha} P((n-1)\alpha, \beta t) + n P(n\alpha + 1, \beta t)\right) - \left(\frac{\beta t}{\alpha} P(n\alpha, \beta t) + (n-1)P((n-1)\alpha + 1, \beta t)\right) \end{align}

This form preserves accuracy on the left tail when computed on the log scale.

Both of these forms, however, lose accuracy in the opposite tail. There are three strategies for how to handle this:

Do nothing. Use the upper CDF form when the user sets lower.tail=F and the lower CDF form otherwise. It's not my fault if the user requests a lower tail probability in the far right tail.
Active tail choice. Use a central switchover point to initially compute whatever tail is closer and convert using logdiffexp(0, logp) if the user requested the complementary probability. The median would be a natural choice for this point, but the median isn't available directly in this distribution. So we can use the mean or the mode instead.
Passive tail choice. Initially do nothing, compute using whichever form corresponds to the user's choice of tail. Then if the resulting probability is above some threshold, recompute using the opposite tail and convert using logdiffexp(0, logp). This can be used to passively compute using a median switchover point, but at the cost of some extra computation. The switchover point can also be made much higher, e.g. $\log(0.99)$, to only recompute if we are fairly certain there was loss of accuracy.

Note: the median can be bounded with medians of gamma distributions, but qgamma is several times as expensive as pgamma. So if the median is desired as the switchover point, it's faster to use the passive option anyway.

In order to avoid double computation and opporunity for error, I chose the active tail choice option with the mean as a switchover point.

ianmtaylor1/gammacount documentation built on April 2, 2021, 8:31 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com