overdispersion_mle: Estimate the Overdispersion for a Vector of Counts

View source: R/overdispersion.R

overdispersion_mleR Documentation

Estimate the Overdispersion for a Vector of Counts

Description

Estimate the Overdispersion for a Vector of Counts

Usage

overdispersion_mle(
  y,
  mean,
  model_matrix = NULL,
  do_cox_reid_adjustment = !is.null(model_matrix),
  global_estimate = FALSE,
  subsample = FALSE,
  max_iter = 200,
  verbose = FALSE
)

Arguments

y

a numeric or integer vector or matrix with the counts for which the overdispersion is estimated

mean

a numeric vector of either length 1 or length(y) or if y is a matrix, a matrix with the same dimensions. Contains the predicted value for that sample. If missing: mean(y) / rowMeans(y)

model_matrix

a numeric matrix that specifies the experimental design. It can be produced using stats::model.matrix(). Default: NULL

do_cox_reid_adjustment

the classical maximum likelihood estimator of the overdisperion is biased towards small values. McCarthy et al. (2012) showed that it is preferable to optimize the Cox-Reid adjusted profile likelihood.
do_cox_reid_adjustment can be either be TRUE or FALSE to indicate if the adjustment is added during the optimization of the overdispersion parameter. Default: TRUE if a model matrix is provided, otherwise FALSE

global_estimate

flag to decide if a single overdispersion for a whole matrix is calculated instead of one estimate per row. This parameter has no affect if y is a vector. Default: FALSE

subsample

the estimation of the overdispersion is the slowest step when fitting a Gamma-Poisson GLM. For datasets with many samples, the estimation can be considerably sped up without loosing much precision by fitting the overdispersion only on a random subset of the samples. Default: FALSE which means that the data is not subsampled. If set to TRUE, at most 1,000 samples are considered. Otherwise the parameter just specifies the number of samples that are considered for each gene to estimate the overdispersion.

max_iter

the maximum number of iterations for each gene

verbose

a boolean that indicates if information about the individual steps are printed while fitting the GLM. Default: FALSE.

Details

The function is optimized to be fast on many small counts. To achieve this, the frequency table of the counts is calculated and used to avoid repetitive calculations. If there are probably many unique counts the optimization is skipped. Currently the heuristic is to skip if more than half of the counts are expected to be unique. The estimation is based on the largest observed count in y.

An earlier version of this package (< 1.1.1) used a separate set of functions for the case of many small counts based on a paper by Bandara et al. (2019). However, this didn't bring a sufficient performance increase and meant an additional maintenance burden.

Value

The function returns a list with the following elements:

estimate

the numerical estimate of the overdispersion.

iterations

the number of iterations it took to calculate the result.

message

additional information about the fitting process.

See Also

glm_gp()

Examples

 set.seed(1)
 # true overdispersion = 2.4
 y <- rnbinom(n = 10, mu = 3, size = 1/2.4)
 # estimate = 1.7
 overdispersion_mle(y)


 # true overdispersion = 0
 y <- rpois(n = 10, lambda = 3)
 # estimate = 0
 overdispersion_mle(y)
 # with different mu, overdispersion estimate changes
 overdispersion_mle(y, mean = 15)
 # Cox-Reid adjustment changes the result
 overdispersion_mle(y, mean = 15, do_cox_reid_adjustment = FALSE)


 # Many very small counts, true overdispersion = 50
 y <- rnbinom(n = 1000, mu = 0.01, size = 1/50)
 summary(y)
 # estimate = 31
 overdispersion_mle(y, do_cox_reid_adjustment = TRUE)

 # Function can also handle matrix input
 Y <- matrix(rnbinom(n = 10 * 3, mu = 4, size = 1/2.2), nrow = 10, ncol = 3)
 Y
 as.data.frame(overdispersion_mle(Y))


const-ae/glmGamPoi documentation built on Feb. 13, 2024, 1:35 a.m.