glmfit: Genewise Negative Binomial Generalized Linear Models
In edgeR: Empirical Analysis of Digital Gene Expression Data in R

Description Usage Arguments Details Value Author(s) References See Also Examples

Fit a negative binomial generalized log-linear model to the read counts for each gene. Conduct genewise statistical tests for a given coefficient or coefficient contrast.

## S3 method for class 'DGEList'
glmFit(y, design=NULL, dispersion=NULL, prior.count=0.125, start=NULL, ...)
## S3 method for class 'SummarizedExperiment'
glmFit(y, design=NULL, dispersion=NULL, prior.count=0.125, start=NULL, ...)
## Default S3 method:
glmFit(y, design=NULL, dispersion=NULL, offset=NULL, lib.size=NULL, weights=NULL,
       prior.count=0.125, start=NULL, ...)
glmLRT(glmfit, coef=ncol(glmfit$design), contrast=NULL)

`y`	an object that contains the raw counts for each library (the measure of expression level); alternatively, a matrix of counts, or a `DGEList` object with (at least) elements `counts` (table of unadjusted counts) and `samples` (data frame containing information about experimental group, library size and normalization factor for the library size), or a `SummarizedExperiment` object with (at least) an component `counts` in its `assays`.
`design`	numeric matrix giving the design matrix for the genewise linear models. Must be of full column rank. Defaults to a single column of ones, equivalent to treating the columns as replicate libraries.
`dispersion`	numeric scalar, vector or matrix of negative binomial dispersions. Can be a common value for all genes, a vector of dispersion values with one for each gene, or a matrix of dispersion values with one for each observation. If `NULL` will be extracted from `y`, with order of precedence: genewise dispersion, trended dispersions, common dispersion.
`offset`	numeric matrix of same size as `y` giving offsets for the log-linear models. Can be a scalar or a vector of length `ncol(y)`, in which case it is expanded out to a matrix.
`weights`	optional numeric matrix giving prior weights for the observations (for each library and gene) to be used in the GLM calculations.
`lib.size`	numeric vector of length `ncol(y)` giving library sizes. Only used if `offset=NULL`, in which case `offset` is set to `log(lib.size)`. Defaults to `colSums(y)`.
`prior.count`	average prior count to be added to observation to shrink the estimated log-fold-changes towards zero.
`start`	optional numeric matrix of initial estimates for the linear model coefficients.
`...`	other arguments are passed to lower level fitting functions.
`glmfit`	a `DGEGLM` object, usually output from `glmFit`.
`coef`	integer or character vector indicating which coefficients of the linear model are to be tested equal to zero. Values must be columns or column names of `design`. Defaults to the last coefficient. Ignored if `contrast` is specified.
`contrast`	numeric vector or matrix specifying one or more contrasts of the linear model coefficients to be tested equal to zero. Number of rows must equal to the number of columns of `design`. If specified, then takes precedence over `coef`.

glmFit and glmLRT implement generalized linear model (glm) methods developed by McCarthy et al (2012).

glmFit fits genewise negative binomial glms, all with the same design matrix but possibly different dispersions, offsets and weights. When the design matrix defines a one-way layout, or can be re-parametrized to a one-way layout, the glms are fitting very quickly using mglmOneGroup. Otherwise the default fitting method, implemented in mglmLevenberg, uses a Fisher scoring algorithm with Levenberg-style damping.

Positive prior.count cause the returned coefficients to be shrunk in such a way that fold-changes between the treatment conditions are decreased. In particular, infinite fold-changes are avoided. Larger values cause more shrinkage. The returned coefficients are affected but not the likelihood ratio tests or p-values.

glmLRT conducts likelihood ratio tests for one or more coefficients in the linear model. If coef is used, the null hypothesis is that all the coefficients indicated by coef are equal to zero. If contrast is non-null, then the null hypothesis is that the specified contrasts of the coefficients are equal to zero. For example, a contrast of c(0,1,-1), assuming there are three coefficients, would test the hypothesis that the second and third coefficients are equal.

glmFit produces an object of class DGEGLM containing components counts, samples, genes and abundance from y plus the following new components:

`design`	design matrix as input.
`weights`	matrix of weights as input.
`df.residual`	numeric vector of residual degrees of freedom, one for each gene.
`offset`	numeric matrix of linear model offsets.
`dispersion`	vector of dispersions used for the fit.
`coefficients`	numeric matrix of estimated coefficients from the glm fits, on the natural log scale, of size `nrow(y)` by `ncol(design)`.
`unshrunk.coefficients`	numeric matrix of estimated coefficients from the glm fits when no log-fold-changes shrinkage is applied, on the natural log scale, of size `nrow(y)` by `ncol(design)`. It exists only when `prior.count` is not 0.
`fitted.values`	matrix of fitted values from glm fits, same number of rows and columns as `y`.
`deviance`	numeric vector of deviances, one for each gene.

glmLRT produces objects of class DGELRT with the same components as for glmFit plus the following:

`table`	data frame with the same rows as `y` containing the log2-fold-changes, likelhood ratio statistics and p-values, ready to be displayed by `topTags`.
`comparison`	character string describing the coefficient or the contrast being tested.

The data frame table contains the following columns:

`logFC`	log2-fold change of expression between conditions being tested.
`logCPM`	average log2-counts per million, the average taken over all libraries in `y`.
`LR`	likelihood ratio statistics.
`PValue`	p-values.

Davis McCarthy and Gordon Smyth

McCarthy, DJ, Chen, Y, Smyth, GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40, 4288-4297. https://doi.org/10.1093/nar/gks042

Low-level computations are done by mglmOneGroup or mglmLevenberg.

topTags displays results from glmLRT.

nlibs <- 3
ngenes <- 100
dispersion.true <- 0.1

# Make first gene respond to covariate x
x <- 0:2
design <- model.matrix(~x)
beta.true <- cbind(Beta1=2,Beta2=c(2,rep(0,ngenes-1)))
mu.true <- 2^(beta.true %*% t(design))

# Generate count data
y <- rnbinom(ngenes*nlibs,mu=mu.true,size=1/dispersion.true)
y <- matrix(y,ngenes,nlibs)
colnames(y) <- c("x0","x1","x2")
rownames(y) <- paste("gene",1:ngenes,sep=".")
d <- DGEList(y)

# Normalize
d <- calcNormFactors(d)

# Fit the NB GLMs
fit <- glmFit(d, design, dispersion=dispersion.true)

# Likelihood ratio tests for trend
results <- glmLRT(fit, coef=2)
topTags(results)

# Estimate the dispersion (may be unreliable with so few genes)
d <- estimateGLMCommonDisp(d, design, verbose=TRUE)

Loading required package: limma
Coefficient:  x 
             logFC   logCPM        LR       PValue         FDR
gene.1    1.733482 16.34315 17.342550 3.121193e-05 0.003121193
gene.20   1.542862 14.06644  6.214616 1.266998e-02 0.560938077
gene.49  -2.247175 13.10519  5.696575 1.699806e-02 0.560938077
gene.4    2.146185 13.20478  5.028515 2.493327e-02 0.560938077
gene.37   1.424791 13.78269  4.595435 3.205720e-02 0.560938077
gene.38  -1.432042 13.55518  4.399668 3.594592e-02 0.560938077
gene.26  -1.401936 13.65517  4.214001 4.009168e-02 0.560938077
gene.71   1.924877 13.08241  3.835577 5.017572e-02 0.560938077
gene.100  1.637247 13.32255  3.825296 5.048443e-02 0.560938077
gene.68  -1.156318 13.90806  3.351414 6.714730e-02 0.620429081
Disp = 0.1744 , BCV = 0.4176