glm_gp_impl: Internal Function to Fit a Gamma-Poisson GLM

View source: R/glm_gp_impl.R

glm_gp_impl    R Documentation

Internal Function to Fit a Gamma-Poisson GLM

Description

Internal Function to Fit a Gamma-Poisson GLM

Usage

glm_gp_impl(
  Y,
  model_matrix,
  offset = 0,
  size_factors = c("normed_sum", "deconvolution", "poscounts", "ratio"),
  overdispersion = TRUE,
  overdispersion_shrinkage = TRUE,
  ridge_penalty = 0,
  do_cox_reid_adjustment = TRUE,
  subsample = FALSE,
  verbose = FALSE
)

Arguments

Y

any matrix-like object (e.g. matrix(), DelayedArray(), HDF5Matrix()) with one column per sample and one row per gene.

model_matrix

a numeric matrix that specifies the experimental design. It can be produced using stats::model.matrix().
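As a minimal sketch of how such a design matrix might be built with stats::model.matrix(): the sample annotation below (col_data and its condition column) is made up for illustration and is not part of the package.

```r
# Hypothetical sample annotation: six samples in two conditions.
col_data <- data.frame(
  condition = factor(c("ctrl", "ctrl", "ctrl", "treated", "treated", "treated"))
)

# One row per sample, one column per coefficient
# (intercept plus the effect of "treated" relative to "ctrl").
model_matrix <- stats::model.matrix(~ condition, data = col_data)

dim(model_matrix)
colnames(model_matrix)
```

The number of rows of the design matrix has to match the number of samples, i.e. the number of columns of Y.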

offset

Constant offset in the model in addition to log(size_factors). It can be a single number, a vector of length ncol(Y), or a matrix with the same dimensions as Y. Note that if Y is a DelayedArray or HDF5Matrix, offset must be one as well. Default: 0.
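The three accepted shapes can be illustrated by expanding each of them to a genes-by-samples matrix and adding log(size_factors) to every row. This is only a sketch: expand_offset() is a hypothetical helper, and the way the package combines the two terms internally is assumed from the description above.

```r
# Hypothetical helper (not part of glmGamPoi): expand `offset` to the shape
# of Y and add log(size_factors) to every row.
expand_offset <- function(Y, offset = 0, size_factors = rep(1, ncol(Y))) {
  if (is.matrix(offset)) {
    stopifnot(all(dim(offset) == dim(Y)))
    offset_matrix <- offset
  } else if (length(offset) == 1) {
    offset_matrix <- matrix(offset, nrow = nrow(Y), ncol = ncol(Y))
  } else {
    stopifnot(length(offset) == ncol(Y))
    # byrow = TRUE repeats the per-sample vector for every gene.
    offset_matrix <- matrix(offset, nrow = nrow(Y), ncol = ncol(Y), byrow = TRUE)
  }
  # Add the per-sample log size factor to each column.
  sweep(offset_matrix, 2, log(size_factors), FUN = "+")
}

Y <- matrix(1:6, nrow = 2, ncol = 3)
total_offset <- expand_offset(Y, offset = c(0, 1, 2), size_factors = c(1, 2, 4))
```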

size_factors

in large-scale experiments, samples typically differ in size (for example, in sequencing depth). A size factor is an internal mechanism of GLMs to correct for this effect.
size_factors is either a numeric vector with positive entries, one per column of the data, that gives the size factors to use, or a string that specifies the method used to estimate them (one of c("normed_sum", "deconvolution", "poscounts", "ratio")). Note that "normed_sum" and "poscounts" are fairly simple methods and can lead to suboptimal results. For the best performance on data with many zeros, I recommend size_factors = "deconvolution", which calls scran::calculateSumFactors(). However, you need to install the scran package from Bioconductor separately for this method to work. For the small datasets common in bulk RNA-seq experiments, I recommend size_factors = "ratio", which uses the same procedure as DESeq2 and edgeR. Also note that size_factors = 1 and size_factors = FALSE are equivalent. If only a single gene is given, no size factor is estimated (i.e. size_factors = 1). Default: "normed_sum".
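To give a feeling for what a ratio-based estimate does, here is a sketch of the median-of-ratios idea known from DESeq2. median_of_ratios() is a hypothetical helper; whether the "ratio" method is implemented exactly like this (including any rescaling of the result) is an assumption.

```r
# Sketch of the median-of-ratios idea (hypothetical helper, not package code).
median_of_ratios <- function(Y) {
  log_geo_means <- rowMeans(log(Y))    # -Inf for genes with a zero count
  usable <- is.finite(log_geo_means)   # keep genes observed in every sample
  stopifnot(any(usable))
  apply(Y[usable, , drop = FALSE], 2, function(counts) {
    # Median over genes of the log ratio to the per-gene geometric mean.
    exp(stats::median(log(counts) - log_geo_means[usable]))
  })
}

# Toy data: sample s2 is sequenced twice as deep as s1, s3 four times as deep.
base_counts <- c(10, 20, 50, 100)
Y <- cbind(s1 = base_counts, s2 = 2 * base_counts, s3 = 4 * base_counts)
size_factors <- median_of_ratios(Y)
```

In this toy example the estimated size factors differ by the same factors of two as the simulated depths.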

overdispersion

the simplest count model is the Poisson model. However, the Poisson model assumes that variance = mean. For many applications this is too rigid and the Gamma-Poisson allows a more flexible mean-variance relation (variance = mean + mean^2 * overdispersion).
overdispersion can either be

  • a single boolean that indicates if an overdispersion is estimated for each gene.

  • a numeric vector of length nrow(Y) fixing the overdispersion to those values.

  • the string "global" to indicate that one dispersion is fit across all genes.

Note that overdispersion = 0 and overdispersion = FALSE are equivalent and both reduce the Gamma-Poisson to the classical Poisson model. Default: TRUE.
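The mean-variance relation can be checked with a short simulation in base R. The values of mean_expr and overdispersion are arbitrary choices for illustration; stats::rnbinom() draws from the Gamma-Poisson (negative binomial) distribution when called with size = 1 / overdispersion.

```r
set.seed(1)
mean_expr <- 10
overdispersion <- 0.5

# Draw Gamma-Poisson counts with the chosen mean and overdispersion.
counts <- stats::rnbinom(n = 1e5, mu = mean_expr, size = 1 / overdispersion)

# variance = mean + mean^2 * overdispersion
expected_variance <- mean_expr + mean_expr^2 * overdispersion
observed_variance <- stats::var(counts)

c(expected = expected_variance, observed = observed_variance)
```

With overdispersion = 0 the second term vanishes and the relation reduces to the Poisson case, variance = mean.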

overdispersion_shrinkage

the overdispersion can be difficult to estimate with few replicates. To improve the overdispersion estimates, we can share information across genes and shrink each individual overdispersion estimate towards a global overdispersion estimate. Empirical studies show, however, that the overdispersion varies with the mean expression level (lower expression level => higher dispersion). If overdispersion_shrinkage = TRUE, a median trend of dispersion against expression level is fit and used to estimate the variances of a quasi-Gamma-Poisson model (Lund et al. 2012). Default: TRUE.

ridge_penalty

to avoid overfitting, we can penalize fits with large coefficient estimates. Instead of directly minimizing the deviance per gene (sum_i dev(y_i, X_i b)), we will minimize sum_i dev(y_i, X_i b) + N * sum_p (penalty_p * b_p)^2.
ridge_penalty can be

  • a scalar in which case all parameters except the intercept are penalized.

  • a vector with one entry per column of the model matrix.

  • a matrix with the same number of columns as columns in the model matrix. This gives maximum flexibility for expert users and allows for full Tikhonov regularization.

Default: ridge_penalty = 0, which is internally replaced with a small positive number for numerical stability.
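The penalized objective for a single gene can be sketched as follows. To keep the example short, the Poisson deviance stands in for the Gamma-Poisson deviance, and penalized_objective() is a hypothetical helper. Representing a scalar penalty as a vector with a zero for the intercept is likewise an assumption based on the description above.

```r
# Poisson deviance, used here as a stand-in for the Gamma-Poisson deviance
# (the term y * log(y / mu) is taken as 0 when y == 0).
poisson_deviance <- function(y, mu) {
  2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
}

# Hypothetical helper: deviance plus N * sum((penalty_p * b_p)^2) for one gene.
penalized_objective <- function(y, X, b, penalty) {
  mu <- exp(drop(X %*% b))   # log link
  N <- length(y)
  poisson_deviance(y, mu) + N * sum((penalty * b)^2)
}

X <- cbind(intercept = 1, treated = c(0, 0, 1, 1))
y <- c(3, 5, 12, 9)
b <- c(log(4), 1)

# Penalize only the treatment coefficient, not the intercept.
penalty <- c(0, 0.5)
unpenalized <- penalized_objective(y, X, b, penalty = c(0, 0))
penalized <- penalized_objective(y, X, b, penalty = penalty)
penalty_term <- penalized - unpenalized
```

For a fixed b, the penalty term grows with the squared coefficient, so the minimizer is pulled towards smaller treatment effects.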

do_cox_reid_adjustment

the classical maximum likelihood estimator of the overdispersion is biased towards small values. McCarthy et al. (2012) showed that it is preferable to optimize the Cox-Reid adjusted profile likelihood.
do_cox_reid_adjustment can be either TRUE or FALSE to indicate if the adjustment is added during the optimization of the overdispersion parameter. Default: TRUE.
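The effect of the adjustment can be sketched for a single gene in base R. profile_loglik() and fit_theta() are hypothetical helpers, and the form of the adjustment term, -0.5 * log det(X^T W X) with W = diag(1 / (1 / mu + theta)), is the one commonly used for Gamma-Poisson GLMs with a log link; that the package computes it in exactly this way is an assumption.

```r
# Hypothetical helper: Gamma-Poisson log-likelihood of one gene as a function
# of the overdispersion theta, optionally with the Cox-Reid adjustment term.
profile_loglik <- function(theta, y, mu, X, cox_reid = TRUE) {
  ll <- sum(stats::dnbinom(y, mu = mu, size = 1 / theta, log = TRUE))
  if (!cox_reid) return(ll)
  w <- 1 / (1 / mu + theta)   # working weights for the log link
  # w * X scales row i of X by w[i], so t(X) %*% (w * X) equals X^T W X.
  log_det <- as.numeric(determinant(t(X) %*% (w * X), logarithm = TRUE)$modulus)
  ll - 0.5 * log_det
}

X <- cbind(intercept = 1, treated = c(0, 0, 0, 1, 1, 1))
y <- c(2, 7, 4, 15, 30, 9)
# Fitted means of a two-group design are the group means.
mu <- c(rep(mean(y[1:3]), 3), rep(mean(y[4:6]), 3))

fit_theta <- function(cox_reid) {
  stats::optimize(profile_loglik, interval = c(1e-4, 10), y = y, mu = mu,
                  X = X, cox_reid = cox_reid, maximum = TRUE)$maximum
}
theta_mle <- fit_theta(cox_reid = FALSE)
theta_cr <- fit_theta(cox_reid = TRUE)
```

Comparing theta_mle and theta_cr on data like this shows how much the adjustment shifts the estimate when there are few replicates per coefficient.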

subsample

the estimation of the overdispersion is the slowest step when fitting a Gamma-Poisson GLM. For datasets with many samples, the estimation can be sped up considerably without losing much precision by fitting the overdispersion only on a random subset of the samples. Default: FALSE, which means that the data is not subsampled. If set to TRUE, at most 1,000 samples are considered. Otherwise the parameter specifies the number of samples that are considered for each gene to estimate the overdispersion.
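The three cases (FALSE, TRUE, or a number) can be sketched with a small helper that returns the sample indices to use. pick_subsample() is hypothetical and draws one subset for all genes, which is a simplification; the package's own sampling scheme is not shown here.

```r
# Hypothetical helper: choose the sample (column) indices used for the
# overdispersion fit, following the description of `subsample`.
pick_subsample <- function(n_samples, subsample = FALSE) {
  if (isFALSE(subsample)) return(seq_len(n_samples))   # no subsampling
  n_keep <- if (isTRUE(subsample)) 1000 else subsample # TRUE means at most 1,000
  n_keep <- min(n_keep, n_samples)
  sort(sample.int(n_samples, size = n_keep))
}

set.seed(42)
idx_all <- pick_subsample(5000)
idx_true <- pick_subsample(5000, subsample = TRUE)
idx_num <- pick_subsample(5000, subsample = 250)
idx_small <- pick_subsample(300, subsample = TRUE)
```

The overdispersion would then be estimated on Y[, idx] only, while the coefficients are still fit on all samples.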

verbose

a boolean that indicates if information about the individual steps is printed while fitting the GLM. Default: FALSE.

Value

a list with five elements

  • Beta the coefficient matrix

  • overdispersion the vector with the estimated overdispersions

  • Mu a matrix with the corresponding means for each gene and sample

  • size_factors a vector with the size factor for each sample

  • ridge_penalty a vector with the ridge penalty

See Also

glm_gp() and overdispersion_mle()


const-ae/glmGamPoi documentation built on Dec. 13, 2024, 3:56 p.m.