DESeq: Differential expression analysis based on the Negative...

View source: R/core.R

DESeqR Documentation

Differential expression analysis based on the Negative Binomial (a.k.a. Gamma-Poisson) distribution

Description

This function performs a default analysis through the steps:

  1. estimation of size factors: estimateSizeFactors

  2. estimation of dispersion: estimateDispersions

  3. Negative Binomial GLM fitting and Wald statistics: nbinomWaldTest

For complete details on each step, see the manual pages of the respective functions. After the DESeq function returns a DESeqDataSet object, results tables (log2 fold changes and p-values) can be generated using the results function. Shrunken LFC can then be generated using the lfcShrink function. All support questions should be posted to the Bioconductor support site: http://support.bioconductor.org.

Usage

DESeq(
  object,
  test = c("Wald", "LRT"),
  fitType = c("parametric", "local", "mean", "glmGamPoi"),
  sfType = c("ratio", "poscounts", "iterate"),
  betaPrior,
  full = design(object),
  reduced,
  quiet = FALSE,
  minReplicatesForReplace = 7,
  modelMatrixType,
  useT = FALSE,
  minmu = if (fitType == "glmGamPoi") 1e-06 else 0.5,
  parallel = FALSE,
  BPPARAM = bpparam()
)

Arguments

object

a DESeqDataSet object, see the constructor functions DESeqDataSet, DESeqDataSetFromMatrix, DESeqDataSetFromHTSeqCount.

test

either "Wald" or "LRT", which will then use either Wald significance tests (defined by nbinomWaldTest), or the likelihood ratio test on the difference in deviance between a full and reduced model formula (defined by nbinomLRT)

fitType

either "parametric", "local", "mean", or "glmGamPoi" for the type of fitting of dispersions to the mean intensity. See estimateDispersions for description.

sfType

either "ratio", "poscounts", or "iterate" for the type of size factor estimation. See estimateSizeFactors for description.

betaPrior

whether or not to put a zero-mean normal prior on the non-intercept coefficients See nbinomWaldTest for description of the calculation of the beta prior. In versions >=1.16, the default is set to FALSE, and shrunken LFCs are obtained afterwards using lfcShrink.

full

for test="LRT", the full model formula, which is restricted to the formula in design(object). alternatively, it can be a model matrix constructed by the user. advanced use: specifying a model matrix for full and test="Wald" is possible if betaPrior=FALSE

reduced

for test="LRT", a reduced formula to compare against, i.e., the full formula with the term(s) of interest removed. alternatively, it can be a model matrix constructed by the user

quiet

whether to print messages at each step

minReplicatesForReplace

the minimum number of replicates required in order to use replaceOutliers on a sample. If there are samples with so many replicates, the model will be refit after these replacing outliers, flagged by Cook's distance. Set to Inf in order to never replace outliers. It set to Inf for fitType="glmGamPoi".

modelMatrixType

either "standard" or "expanded", which describe how the model matrix, X of the GLM formula is formed. "standard" is as created by model.matrix using the design formula. "expanded" includes an indicator variable for each level of factors in addition to an intercept. for more information see the Description of nbinomWaldTest. betaPrior must be set to TRUE in order for expanded model matrices to be fit.

useT

logical, passed to nbinomWaldTest, default is FALSE, where Wald statistics are assumed to follow a standard Normal

minmu

lower bound on the estimated count for fitting gene-wise dispersion and for use with nbinomWaldTest and nbinomLRT. If fitType="glmGamPoi", then 1e-6 will be used (as this fitType is optimized for single cell data, where a lower minmu is recommended), otherwise the default value as evaluated on bulk datasets is 0.5

parallel

if FALSE, no parallelization. if TRUE, parallel execution using BiocParallel, see next argument BPPARAM. Two notes on running in parallel using BiocParallel: 1) it is recommended to filter out genes where all samples have low counts before running DESeq2 in parellel: this improves efficiency as otherwise you will be sending data to child processes, though those have little power for detection of differences, and will likely be removed by independent filtering anyway; 2) it may be advantageous to remove large, unneeded objects from your current R environment before calling DESeq, as it is possible that R's internal garbage collection will copy these files while running on worker nodes.

BPPARAM

an optional parameter object passed internally to bplapply when parallel=TRUE. If not specified, the parameters last registered with register will be used.

Details

The differential expression analysis uses a generalized linear model of the form:

K_{ij} \sim \textrm{NB}( \mu_{ij}, \alpha_i)

\mu_{ij} = s_j q_{ij}

\log_2(q_{ij}) = x_{j.} \beta_i

where counts K_{ij} for gene i, sample j are modeled using a Negative Binomial distribution with fitted mean \mu_{ij} and a gene-specific dispersion parameter \alpha_i. The fitted mean is composed of a sample-specific size factor s_j and a parameter q_{ij} proportional to the expected true concentration of fragments for sample j. The coefficients \beta_i give the log2 fold changes for gene i for each column of the model matrix X. The sample-specific size factors can be replaced by gene-specific normalization factors for each sample using normalizationFactors.

For details on the fitting of the log2 fold changes and calculation of p-values, see nbinomWaldTest if using test="Wald", or nbinomLRT if using test="LRT".

Experiments without replicates do not allow for estimation of the dispersion of counts around the expected value for each group, which is critical for differential expression analysis. Analysis without replicates was deprecated in v1.20 and is no longer supported since v1.22.

The argument minReplicatesForReplace is used to decide which samples are eligible for automatic replacement in the case of extreme Cook's distance. By default, DESeq will replace outliers if the Cook's distance is large for a sample which has 7 or more replicates (including itself). Outlier replacement is turned off entirely for fitType="glmGamPoi". This replacement is performed by the replaceOutliers function. This default behavior helps to prevent filtering genes based on Cook's distance when there are many degrees of freedom. See results for more information about filtering using Cook's distance, and the 'Dealing with outliers' section of the vignette. Unlike the behavior of replaceOutliers, here original counts are kept in the matrix returned by counts, original Cook's distances are kept in assays(dds)[["cooks"]], and the replacement counts used for fitting are kept in assays(dds)[["replaceCounts"]].

Note that if a log2 fold change prior is used (betaPrior=TRUE) then expanded model matrices will be used in fitting. These are described in nbinomWaldTest and in the vignette. The contrast argument of results should be used for generating results tables.

Value

a DESeqDataSet object with results stored as metadata columns. These results should accessed by calling the results function. By default this will return the log2 fold changes and p-values for the last variable in the design formula. See results for how to access results for other variables.

Author(s)

Michael Love

References

Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. https://doi.org/10.1186/s13059-014-0550-8

For fitType="glmGamPoi":

Ahlmann-Eltze, C., Huber, W. (2020) glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btaa1009

See Also

link{results}, lfcShrink, nbinomWaldTest, nbinomLRT

Examples


# see vignette for suggestions on generating
# count tables from RNA-Seq data
cnts <- matrix(rnbinom(n=1000, mu=100, size=1/0.5), ncol=10)
cond <- factor(rep(1:2, each=5))

# object construction
dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~ cond)

# standard analysis
dds <- DESeq(dds)
res <- results(dds)

# moderated log2 fold changes
resultsNames(dds)
resLFC <- lfcShrink(dds, coef=2, type="apeglm")

# an alternate analysis: likelihood ratio test
ddsLRT <- DESeq(dds, test="LRT", reduced= ~ 1)
resLRT <- results(ddsLRT)


mikelove/DESeq2 documentation built on Nov. 18, 2024, 1:37 p.m.