overdispersion_mle: Estimate the Overdispersion for a Vector of Counts
In glmGamPoi: Fit a Gamma-Poisson Generalized Linear Model

Description Usage Arguments Details Value See Also Examples

Estimate the Overdispersion for a Vector of Counts

overdispersion_mle(
  y,
  mean,
  model_matrix = NULL,
  do_cox_reid_adjustment = !is.null(model_matrix),
  global_estimate = FALSE,
  subsample = FALSE,
  max_iter = 200,
  verbose = FALSE
)

`y`	a numeric or integer vector or matrix with the counts for which the overdispersion is estimated
`mean`	a numeric vector of either length 1 or `length(y)` or if `y` is a matrix, a matrix with the same dimensions. Contains the predicted value for that sample. If missing: `mean(y)` / `rowMeans(y)`
`model_matrix`	a numeric matrix that specifies the experimental design. It can be produced using `stats::model.matrix()`. Default: `NULL`
`do_cox_reid_adjustment`	the classical maximum likelihood estimator of the `overdisperion` is biased towards small values. McCarthy et al. (2012) showed that it is preferable to optimize the Cox-Reid adjusted profile likelihood. `do_cox_reid_adjustment` can be either be `TRUE` or `FALSE` to indicate if the adjustment is added during the optimization of the `overdispersion` parameter. Default: `TRUE` if a model matrix is provided, otherwise `FALSE`
`global_estimate`	flag to decide if a single overdispersion for a whole matrix is calculated instead of one estimate per row. This parameter has no affect if `y` is a vector. Default: `FALSE`
`subsample`	the estimation of the overdispersion is the slowest step when fitting a Gamma-Poisson GLM. For datasets with many samples, the estimation can be considerably sped up without loosing much precision by fitting the overdispersion only on a random subset of the samples. Default: `FALSE` which means that the data is not subsampled. If set to `TRUE`, at most 1,000 samples are considered. Otherwise the parameter just specifies the number of samples that are considered for each gene to estimate the overdispersion.
`max_iter`	the maximum number of iterations for each gene
`verbose`	a boolean that indicates if information about the individual steps are printed while fitting the GLM. Default: `FALSE`.

The function is optimized to be fast on many small counts. To achieve this, the frequency table of the counts is calculated and used to avoid repetitive calculations. If there are probably many unique counts the optimization is skipped. Currently the heuristic is to skip if more than half of the counts are expected to be unique. The estimation is based on the largest observed count in y.

An earlier version of this package (< 1.1.1) used a separate set of functions for the case of many small counts based on a paper by Bandara et al. (2019). However, this didn't bring a sufficient performance increase and meant an additional maintenance burden.

The function returs a list with the following elements:

estimate: the numerical estimate of the overdispersion.
iterations: the number of iterations it took to calculate the result.
message: additional information about the fitting process.

glm_gp()

 set.seed(1)
 # true overdispersion = 2.4
 y <- rnbinom(n = 10, mu = 3, size = 1/2.4)
 # estimate = 1.7
 overdispersion_mle(y)


 # true overdispersion = 0
 y <- rpois(n = 10, lambda = 3)
 # estimate = 0
 overdispersion_mle(y)
 # with different mu, overdispersion estimate changes
 overdispersion_mle(y, mean = 15)
 # Cox-Reid adjustment changes the result
 overdispersion_mle(y, mean = 15, do_cox_reid_adjustment = FALSE)


 # Many very small counts, true overdispersion = 50
 y <- rnbinom(n = 1000, mu = 0.01, size = 1/50)
 summary(y)
 # estimate = 31
 overdispersion_mle(y, do_cox_reid_adjustment = TRUE)

 # Function can also handle matrix input
 Y <- matrix(rnbinom(n = 10 * 3, mu = 4, size = 1/2.2), nrow = 10, ncol = 3)
 Y
 as.data.frame(overdispersion_mle(Y))