estimateDispersions: Estimate and fit dispersions for a CountDataSet.
In DESeq: Differential gene expression analysis based on the negative binomial distribution

Description Usage Arguments Details Value Author(s) Examples

This function obtains dispersion estimates for a count data set. For each condition (or collectively for all conditions, see 'method' argument below) it first computes for each gene an empirical dispersion value (a.k.a. a raw SCV value), then fits by regression a dispersion-mean relationship and finally chooses for each gene a dispersion parameter that will be used in subsequent tests from the empirical and the fitted value according to the 'sharingMode' argument.

## S4 method for signature 'CountDataSet'
estimateDispersions( object,
   method = c( "pooled", "pooled-CR", "per-condition", "blind" ),
   sharingMode = c( "maximum", "fit-only", "gene-est-only" ),
   fitType = c("parametric", "local"),
   locfit_extra_args=list(), lp_extra_args=list(),
   modelFrame = NULL, modelFormula = count ~ condition, ... )

`object`	a `CountDataSet` with size factors.
`method`	There are three ways how the empirical dispersion can be computed: `pooled` - Use the samples from all conditions with replicates to estimate a single pooled empirical dispersion value, called "pooled", and assign it to all samples. `pooled-CR` - Fit models according to `modelFormula` and estimate the dispersion by maximizing a Cox-Reid adjusted profile likelihood (CR-APL). This method is much slower than `method=="pooled"` but works also with crossed factors (as may occur, e.g., in designs with paired samples). Usually, you will need to specify the model formula, which should be the same as the one used later in the call to `nbinomFitGLMs` for fitting the full model. Note: The method of using CR-APL maximization for this application has been developed by McCarthy, Chen and Smyth [Nucl. Acid Res., 2012 and been first implemented in edgeR (in 2010). DESeq optimizes the expression for the CR-APL given in McCarthy et al.'s paper, but does not use the weigthed maximum likelihood scheme proposed there. `per-condition` - For each condition with replicates, compute a gene's empirical dispersion value by considering the data from samples for this condition. For samples of unreplicated conditions, the maximum of empirical dispersion values from the other conditions is used. If `object` has a multivariate design (i.e., if a data frame was passed instead of a factor for the `condition` argument in `newCountDataSet`), this method is not available. (Note: This method was called “normal” in previous versions.) `blind` - Ignore the sample labels and compute a gene's empirical dispersion value as if all samples were replicates of a single condition. This can be done even if there are no biological replicates. This method can lead to loss of power; see the vignette for details. The single estimated dispersion condition is called "blind" and used for all samples.
`sharingMode`	After the empirical dispersion values have been computed for each gene, a dispersion-mean relationship is fitted for sharing information across genes in order to reduce variability of the dispersion estimates. After that, for each gene, we have two values: the empirical value (derived only from this gene's data), and the fitted value (i.e., the dispersion value typical for genes with an average expression similar to those of this gene). The `sharingMode` argument specifies which of these two values will be written to the `featureData`'s `disp_` columns and hence will be used by the functions `nbinomTest` and `fitNbinomGLMs`. `fit-only` - use only the fitted value, i.e., the empirical value is used only as input to the fitting, and then ignored. Use this only with very few replicates, and when you are not too concerned about false positives from dispersion outliers, i.e. genes with an unusually high variability. `maximum` - take the maximum of the two values. This is the conservative or prudent choice, recommended once you have at least three or four replicates and maybe even with only two replicates. `gene-est-only` - No fitting or sharing, use only the empirical value. This method is preferable when the number of replicates is large and the empirical dispersion values are sufficiently reliable. If the number of replicates is small, this option may lead to many cases where the dispersion of a gene is accidentally underestimated and a false positive arises in the subsequent testing.
`fitType`	`parametric` - Fit a dispersion-mean relation of the form `dispersion = asymptDisp + extraPois / mean` via a robust gamma-family GLM. The coefficients `asymptDisp` and `extraPois` are given in the attribute `coefficients` of the `dispFunc` in the `fitInfo` (see below). `local` - Use the locfit package to fit a dispersion-mean relation, as described in the DESeq paper.
`locfit_extra_args, lp_extra_args`	(only for `fitType=local`) Options to be passed to the `locfit` and to the `lp` function of the locfit package. Use this to adjust the local fitting. For example, you may pass a value for `nn` different from the default (0.7) if the fit seems too smooth or too rough by setting `lp_extra_agrs=list(nn=0.9)`. As another example, you can set `locfit_extra_args=list(maxk=200)` if you get the error that locfit ran out of nodes. See the documentation of the `locfit` package for details. In most cases, you will not need to provide these parameters, as the defaults seem to work quite well.
`modelFrame`	By default, the information in `conditions(object)` or `pData(object)` is used to determine which samples are replicates (see `newCountDataSet`). For `method="pooled"`, a data frame can be passed here, and all rows that are identical in this data frame are considered to indicate replicate samples in `object`. For `method="pooled-CR"`, the data frame is used in the fits. For the other methods, this argument is ignored.
`modelFormula`	For `method="pooled-CR"`, this is the formual used for the dispersion fits. For all other methods, this argument is ignored.
`...`	extra arguments are ignored

Behaviour for method="per-condition": For each replicated condition, a list, named with the condition's name, is placed in the environment object@fitInfo. This list has five named elements: The vector perGeneDispEsts contains the empirical dispersions. The function dispFunc is the fitted function, i.e., it takes as its argument a normalized mean expression value and returns the corresponding fitted dispersion. The values fitted according to this function are in the third element fittedDispEst, a vector of the same length as perGeneDispEsts. The fourt element, df, is an integer, indicating the number of degrees of freedom of the per-gene estimation. The fifth element, sharingMode, stores the value of the sharingMode argument to esimateDispersions.

Behaviour for method="blind" and method="pooled": Only one list is produced, named "blind" or "pooled" and placed in object@fitInfo.

For each list in the fitInfo environment, the dispersion values that are intended to be used in subsequent testing are computed according to the value of sharingMode and are placed in the featureData data frame, in a column named with the same name, prefixed with "disp_".

Then, the dispTable (see there) is filled to assign to each condition the appropriate dispersion column in the phenoData frame.

Note: Up to DESeq version 1.4.x (Bioconductor release 2.8), this function was called estimateVarianceFunctions, stored its result differently and did not have the arguments sharingMode and fitType. estimatevarianceFunction's behaviour corresponded to the settings sharingMode="fit-only" and fitType="local". Note that these are not the default, because the new defaults sharingMode="maximum" and fitType="parametric" are more robust and tend to give better results.

The CountDataSet cds, with the slots fitInfo and featureData updated as described in Details.

Simon Anders, sanders@fs.tum.de

cds <- makeExampleCountDataSet()
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds )
str( fitInfo( cds ) )
head( fData( cds ) )

DESeq documentation built on April 28, 2020, 6:37 p.m.

DESeq index

Analysing RNA-Seq data with the "DESeq" package

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

DESeq
Differential gene expression analysis based on the negative binomial distribution

estimateDispersions: Estimate and fit dispersions for a CountDataSet.
In DESeq: Differential gene expression analysis based on the negative binomial distribution

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Related to estimateDispersions in DESeq...

R Package Documentation

Browse R Packages

We want your feedback!

DESeq Differential gene expression analysis based on the negative binomial distribution

estimateDispersions: Estimate and fit dispersions for a CountDataSet. In DESeq: Differential gene expression analysis based on the negative binomial distribution

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Related to estimateDispersions in DESeq...

R Package Documentation

Browse R Packages

We want your feedback!

DESeq
Differential gene expression analysis based on the negative binomial distribution

estimateDispersions: Estimate and fit dispersions for a CountDataSet.
In DESeq: Differential gene expression analysis based on the negative binomial distribution