estimateSCV: Estimate squared coefficient of variation for each gene

Description

Estimates the squared coefficient of variation (SCV) for each gene, using an approach similar to the dispersion estimation method used in DESeq.

Usage

## S4 method for signature 'XBSeqDataSet'
estimateSCV( object, method = c( "pooled", "per-condition", "blind" ),
   sharingMode = c( "maximum", "fit-only", "gene-est-only" ),
   fitType = c("local","parametric"),
   locfit_extra_args=list(), lp_extra_args=list(), ... )

Arguments

object

An XBSeqDataSet object with size factors.

method

The empirical dispersion can be computed in three ways:

  • pooled - Use the samples from all conditions with replicates to estimate a single pooled empirical dispersion value, called "pooled", and assign it to all samples.

  • per-condition - For each condition with replicates, compute a gene's empirical dispersion value by considering the data from samples for this condition. For samples of unreplicated conditions, the maximum of empirical dispersion values from the other conditions is used.

  • blind - Ignore the sample labels and compute a gene's empirical dispersion value as if all samples were replicates of a single condition. This can be done even if there are no biological replicates. This method can lead to loss of power.
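
The difference between the three methods can be sketched on a single toy gene in base R. This is a simplified illustration under stated assumptions, not the XBSeq or DESeq implementation: the counts are invented, size factors are taken to be 1, and only the raw variance that feeds into the empirical dispersion is computed.

```r
# Toy counts for one gene: three replicates in each of two conditions
# (values are invented for illustration).
x <- c(10, 12, 14, 20, 22, 24)
conditions <- factor(c(rep("C1", 3), rep("C2", 3)))

# per-condition: one variance per condition, from that condition's samples only
perCondVar <- tapply(x, conditions, var)

# pooled: within-condition sums of squares combined into a single variance
withinSS <- tapply(x, conditions, function(v) sum((v - mean(v))^2))
pooledVar <- sum(withinSS) / (length(x) - nlevels(conditions))

# blind: all samples treated as replicates of a single condition
blindVar <- var(x)

print(perCondVar)
print(pooledVar)
print(blindVar)
```

In this toy case the blind variance (33.2) is much larger than the pooled variance (4), because the difference between the two conditions is counted as noise. This illustrates why the blind method can lead to loss of power.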

sharingMode

After the empirical dispersion values have been computed for each gene, a dispersion-mean relationship is fitted for sharing information across genes in order to reduce variability of the dispersion estimates. After that, for each gene, we have two values: the empirical value (derived only from this gene's data), and the fitted value (i.e., the dispersion value typical for genes with an average expression similar to that of this gene). The sharingMode argument specifies which of these two values will be written to dispEst and hence used by the function XBSeqTest.

  • fit-only - use only the fitted value, i.e., the empirical value is used only as input to the fitting, and then ignored. Use this only with very few replicates, and when you are not too concerned about false positives from dispersion outliers, i.e. genes with an unusually high variability.

  • maximum - take the maximum of the two values. This is the conservative or prudent choice, recommended once you have at least three or four replicates and maybe even with only two replicates.

  • gene-est-only - No fitting or sharing, use only the empirical value. This method is preferable when the number of replicates is large and the empirical dispersion values are sufficiently reliable. If the number of replicates is small, this option may lead to many cases where the dispersion of a gene is accidentally underestimated and a false positive arises in the subsequent testing.
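
The choice among the three sharing modes can be sketched as a selection between two per-gene vectors. The helper shareDispersion and the gene values below are hypothetical and exist only for illustration; they are not part of XBSeq, and the package's internal code may differ.

```r
# Hypothetical helper, not part of XBSeq: choose the value to keep per gene.
shareDispersion <- function(empirical, fitted,
                            sharingMode = c("maximum", "fit-only", "gene-est-only")) {
  sharingMode <- match.arg(sharingMode)
  switch(sharingMode,
         "fit-only"      = fitted,                    # ignore the empirical value
         "maximum"       = pmax(empirical, fitted),   # conservative choice
         "gene-est-only" = empirical)                 # no sharing across genes
}

# Invented values: geneB is a dispersion outlier with unusually high variability.
empirical <- c(geneA = 0.05, geneB = 0.40, geneC = 0.10)
fitted    <- c(geneA = 0.10, geneB = 0.12, geneC = 0.10)

print(shareDispersion(empirical, fitted, "maximum"))
print(shareDispersion(empirical, fitted, "fit-only"))
print(shareDispersion(empirical, fitted, "gene-est-only"))
```

With "fit-only", the outlier geneB is assigned the fitted value 0.12 instead of its empirical 0.40, which is how false positives from dispersion outliers can arise. With "maximum", geneB keeps 0.40 while geneA is raised to the fitted 0.10.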

fitType

  • parametric - Fit a dispersion-mean relation of the form dispersion = asymptDisp + extraPois / mean via a robust gamma-family GLM. The coefficients asymptDisp and extraPois are given in the attribute coefficients of the dispFunc in the fitInfo.

  • local - Use the locfit package to fit a dispersion-mean relation, as described in the DESeq paper.
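
The shape of the parametric relation can be seen by evaluating it at a few mean values. This is a minimal sketch: the coefficient values and the function name parametricDisp are hypothetical, chosen only for illustration, and no fitting is performed here.

```r
# Hypothetical coefficients, chosen only for illustration.
asymptDisp <- 0.1
extraPois  <- 2

# dispersion = asymptDisp + extraPois / mean
parametricDisp <- function(mu) asymptDisp + extraPois / mu

means <- c(10, 100, 1000)
disp <- parametricDisp(means)
print(disp)
```

For these values the dispersion falls from 0.3 at a mean of 10 to 0.102 at a mean of 1000, approaching asymptDisp as the mean grows.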

locfit_extra_args, lp_extra_args

(only for fitType="local") Options to be passed to the locfit function and to the lp function of the locfit package. Use this to adjust the local fitting. For example, you may pass a value for nn different from the default (0.7) if the fit seems too smooth or too rough by setting lp_extra_args=list(nn=0.9). As another example, you can set locfit_extra_args=list(maxk=200) if you get the error that locfit ran out of nodes. See the documentation of the locfit package for details. In most cases, you will not need to provide these parameters, as the defaults seem to work quite well.

...

Extra arguments; these are ignored.

Details

Details on which option to choose can be found in the DESeq help page. Generally speaking, if you have few replicates (<= 3), set method="pooled"; otherwise, try method="per-condition". Compared with DESeq, the code was revised to estimate the variance of the true signal using the variance sum law rather than calculating the variance directly.
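
For reference, the variance sum law for a difference of two variables states that Var(X - B) = Var(X) + Var(B) - 2 Cov(X, B). The sketch below checks this identity numerically in base R, assuming the true signal is taken as observed counts minus background counts. The values are invented, and the exact way XBSeq combines these terms is not described on this page.

```r
# Invented values for one gene across six samples.
observed   <- c(120, 135, 110, 150, 142, 128)
background <- c(20, 26, 18, 30, 25, 22)

# Direct calculation: variance of the background-subtracted signal.
directVar <- var(observed - background)

# Variance sum law for a difference of two (possibly correlated) variables.
sumLawVar <- var(observed) + var(background) - 2 * cov(observed, background)

print(all.equal(directVar, sumLawVar))
```

Both routes give the same value on the same samples; the sum-law form makes the separate contributions of the observed and background variances explicit.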

Value

The XBSeqDataSet object, with the slots fitInfo and dispEst updated.

Author(s)

Yuanhang Liu

References

H. I. Chen, Y. Liu, Y. Zou, Z. Lai, D. Sarkar, Y. Huang, et al., "Differential expression analysis of RNA sequencing data by incorporating non-exonic mapped reads," BMC Genomics, vol. 16 Suppl 7, p. S14, Jun 11 2015.

See Also

XBSeqDataSet

Examples

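   library(XBSeq)
   # Two conditions with three replicates each
   conditions <- factor(c(rep('C1', 3), rep('C2', 3)))
   # ExampleData provides the Observed and Background count matrices
   data(ExampleData)
   XB <- XBSeqDataSet(Observed, Background, conditions)
   XB <- estimateRealCount(XB)
   # Size factors are required before estimating the SCV
   XB <- estimateSizeFactors(XB)
   XB <- estimateSCV(XB, fitType='local')
   # Inspect the fitted dispersion-mean relationship
   str(fitInfo(XB))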

XBSeq documentation built on Nov. 8, 2020, 11:12 p.m.