estimateDispersions: Estimate and fit dispersions for a CountDataSet.

Description Usage Arguments Details Value Author(s) Examples

Description

This function obtains dispersion estimates for a count data set. For each condition (or collectively for all conditions, see 'method' argument below) it first computes for each gene an empirical dispersion value (a.k.a. a raw SCV value), then fits by regression a dispersion-mean relationship and finally chooses for each gene a dispersion parameter that will be used in subsequent tests from the empirical and the fitted value according to the 'sharingMode' argument.

Usage

1
2
3
4
5
6
7
## S4 method for signature 'CountDataSet'
estimateDispersions( object,
   method = c( "pooled", "pooled-CR", "per-condition", "blind" ),
   sharingMode = c( "maximum", "fit-only", "gene-est-only" ),
   fitType = c("parametric", "local"),
   locfit_extra_args=list(), lp_extra_args=list(),
   modelFrame = NULL, modelFormula = count ~ condition, ... )

Arguments

object

a CountDataSet with size factors.

method

There are three ways how the empirical dispersion can be computed:

  • pooled - Use the samples from all conditions with replicates to estimate a single pooled empirical dispersion value, called "pooled", and assign it to all samples.

  • pooled-CR - Fit models according to modelFormula and estimate the dispersion by maximizing a Cox-Reid adjusted profile likelihood (CR-APL). This method is much slower than method=="pooled" but works also with crossed factors (as may occur, e.g., in designs with paired samples). Usually, you will need to specify the model formula, which should be the same as the one used later in the call to nbinomFitGLMs for fitting the full model.

    Note: The method of using CR-APL maximization for this application has been developed by McCarthy, Chen and Smyth [Nucl. Acid Res., 2012 and been first implemented in edgeR (in 2010). DESeq optimizes the expression for the CR-APL given in McCarthy et al.'s paper, but does not use the weigthed maximum likelihood scheme proposed there.

  • per-condition - For each condition with replicates, compute a gene's empirical dispersion value by considering the data from samples for this condition. For samples of unreplicated conditions, the maximum of empirical dispersion values from the other conditions is used. If object has a multivariate design (i.e., if a data frame was passed instead of a factor for the condition argument in newCountDataSet), this method is not available. (Note: This method was called “normal” in previous versions.)

  • blind - Ignore the sample labels and compute a gene's empirical dispersion value as if all samples were replicates of a single condition. This can be done even if there are no biological replicates. This method can lead to loss of power; see the vignette for details. The single estimated dispersion condition is called "blind" and used for all samples.

sharingMode

After the empirical dispersion values have been computed for each gene, a dispersion-mean relationship is fitted for sharing information across genes in order to reduce variability of the dispersion estimates. After that, for each gene, we have two values: the empirical value (derived only from this gene's data), and the fitted value (i.e., the dispersion value typical for genes with an average expression similar to those of this gene). The sharingMode argument specifies which of these two values will be written to the featureData's disp_ columns and hence will be used by the functions nbinomTest and fitNbinomGLMs.

  • fit-only - use only the fitted value, i.e., the empirical value is used only as input to the fitting, and then ignored. Use this only with very few replicates, and when you are not too concerned about false positives from dispersion outliers, i.e. genes with an unusually high variability.

  • maximum - take the maximum of the two values. This is the conservative or prudent choice, recommended once you have at least three or four replicates and maybe even with only two replicates.

  • gene-est-only - No fitting or sharing, use only the empirical value. This method is preferable when the number of replicates is large and the empirical dispersion values are sufficiently reliable. If the number of replicates is small, this option may lead to many cases where the dispersion of a gene is accidentally underestimated and a false positive arises in the subsequent testing.

fitType
  • parametric - Fit a dispersion-mean relation of the form dispersion = asymptDisp + extraPois / mean via a robust gamma-family GLM. The coefficients asymptDisp and extraPois are given in the attribute coefficients of the dispFunc in the fitInfo (see below).

  • local - Use the locfit package to fit a dispersion-mean relation, as described in the DESeq paper.

locfit_extra_args, lp_extra_args

(only for fitType=local) Options to be passed to the locfit and to the lp function of the locfit package. Use this to adjust the local fitting. For example, you may pass a value for nn different from the default (0.7) if the fit seems too smooth or too rough by setting lp_extra_agrs=list(nn=0.9). As another example, you can set locfit_extra_args=list(maxk=200) if you get the error that locfit ran out of nodes. See the documentation of the locfit package for details. In most cases, you will not need to provide these parameters, as the defaults seem to work quite well.

modelFrame

By default, the information in conditions(object) or pData(object) is used to determine which samples are replicates (see newCountDataSet). For method="pooled", a data frame can be passed here, and all rows that are identical in this data frame are considered to indicate replicate samples in object. For method="pooled-CR", the data frame is used in the fits. For the other methods, this argument is ignored.

modelFormula

For method="pooled-CR", this is the formual used for the dispersion fits. For all other methods, this argument is ignored.

...

extra arguments are ignored

Details

Behaviour for method="per-condition": For each replicated condition, a list, named with the condition's name, is placed in the environment object@fitInfo. This list has five named elements: The vector perGeneDispEsts contains the empirical dispersions. The function dispFunc is the fitted function, i.e., it takes as its argument a normalized mean expression value and returns the corresponding fitted dispersion. The values fitted according to this function are in the third element fittedDispEst, a vector of the same length as perGeneDispEsts. The fourt element, df, is an integer, indicating the number of degrees of freedom of the per-gene estimation. The fifth element, sharingMode, stores the value of the sharingMode argument to esimateDispersions.

Behaviour for method="blind" and method="pooled": Only one list is produced, named "blind" or "pooled" and placed in object@fitInfo.

For each list in the fitInfo environment, the dispersion values that are intended to be used in subsequent testing are computed according to the value of sharingMode and are placed in the featureData data frame, in a column named with the same name, prefixed with "disp_".

Then, the dispTable (see there) is filled to assign to each condition the appropriate dispersion column in the phenoData frame.

Note: Up to DESeq version 1.4.x (Bioconductor release 2.8), this function was called estimateVarianceFunctions, stored its result differently and did not have the arguments sharingMode and fitType. estimatevarianceFunction's behaviour corresponded to the settings sharingMode="fit-only" and fitType="local". Note that these are not the default, because the new defaults sharingMode="maximum" and fitType="parametric" are more robust and tend to give better results.

Value

The CountDataSet cds, with the slots fitInfo and featureData updated as described in Details.

Author(s)

Simon Anders, sanders@fs.tum.de

Examples

1
2
3
4
5
cds <- makeExampleCountDataSet()
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds )
str( fitInfo( cds ) )
head( fData( cds ) )

DESeq documentation built on April 28, 2020, 6:37 p.m.