evaluateDist: Model Diagnostics for RNAseq Data
In bvieth/powsimR: Power Simulations for RNA-sequencing

evaluateDist

R Documentation

Model Diagnostics for RNAseq Data

Description

With this function, the user can determine goodness of fit for each gene.

Usage

evaluateDist(countData, batchData = NULL,
spikeData = NULL, spikeInfo = NULL,
Lengths = NULL, MeanFragLengths = NULL,
RNAseq = "singlecell", Protocol,
Normalisation,
GeneFilter = 0.25, SampleFilter = 3,
FracGenes = 1,
verbose = TRUE)

Arguments

`countData`	is a count matrix (row=gene, column=sample). Please provide the measurements of one group only, e.g. the control group.
`batchData`	is a `data.frame` for batch annotation. Rows correspond to samples. The first column should contain the batches, e.g. 'a', 'b', 'c', etc.. This is only used for the normalisation.
`spikeData`	is a count `matrix`. Rows correspond to spike-ins, columns to samples. The order of columns should be the same as in the `countData`. This is only needed for spike-in aware normalisation methods ('MR', 'Linnorm', 'scran', 'SCnorm', 'bayNorm', 'Census'), see details section of `estimateParam`.
`spikeInfo`	is a molecule count `matrix` of spike-ins. Rows correspond to spike-ins. The order of rows should be the same as in the `spikeData`. The column names should be 'SpikeID' and 'SpikeInput' for molecule counts of spike-ins. This is only needed for spike-in aware normalisation methods ('MR', 'Linnorm', 'scran', 'SCnorm', 'bayNorm', 'Census'), see details section of `estimateParam`.
`Lengths`	is a numeric vector of transcript lengths with the same length and order as the rows in countData. This variable is only needed for internal gene length corrections (TPM), see details section of `estimateParam`.
`MeanFragLengths`	is a numeric vector of mean fragment lengths with the same length as columns in countData. This variable is only needed for internal gene length corrections (TPM), see details section of `estimateParam`.
`RNAseq`	is a character value: "bulk" or "singlecell". We recommended to use this evaluaiton for single cells only.
`Protocol`	is a character value defining the type of counts given in `countData`. Options are "UMI" (e.g. 10X Genomics, CEL-seq2) or "Read" (e.g. Smart-seq2).
`Normalisation`	is a character value: 'TMM', 'MR', 'PosCounts', 'UQ', 'scran', 'Linnorm', 'SCnorm', 'Census', 'depth', 'none'. For more information, please consult the details section of `estimateParam`.
`GeneFilter`	is a numeric vector indicating the minimal proportion of nonzero expression values for a gene across all samples to be considered expressed and used for estimating normalisation size factors. The default is `0.25`, i.e. at least 25% of the expression values per gene need to be nonzero.
`SampleFilter`	is a numeric vector indicating the minimal number of MADs (median absolute deviation) away from the median number of features detected as well as sequencing depth across all samples so that outlying samples are removed prior to normalisation. The default is `5`, i.e. at least 5 MADs away from the median. Choose higher values if you wsant to filter out less samples. This parameter is particularly important for single cells to ensure reliable parameter estimation. For more information, please consult `isOutlier`.
`FracGenes`	The fraction of genes to calculate goodness of fit statistics, default is 1, i.e. for all genes.
`verbose`	Logical value to indicate whether to print function information. Default is `TRUE`.

Value

List object with the results of goodness of fit and estimated parameters:

`edgeR`	Goodness-of-fit statistic, degrees of freedom and associated p-value using the deviance and residual degrees of freedom from `glmFit`. Furthermore, the AIC of the edgeR model fit using the residuals of `zscoreNBinom`.
`GOF`	The fitting results per distribution, including loglikelihood, goodness-of-fit statistics, AIC and predicted number of zeroes. The following distributions were considered: Poisson, negative binomial, zero-inflated poisson and negative binomial following the 'standard' (i.e. `glm`, `glm.nb` and `zeroinfl` implementation) and fitdist approach (see `fitdist`) and Beta-Poisson following Marioni or Hemberg parameterisation. Furthermore, model fit comparison by LRT for nested and Vuong Test for non-nested models.
`Estimates`	The estimated parameters of distribution fitting.
`ObservedZeros`	The number of zeroes and dropout rate per gene.

Author(s)

Beate Vieth

Examples

## Not run: 
## using example data set, but run it for fraction of genes
data("CELseq2_Gene_UMI_Counts")
evalDistRes <- evaluateDist(countData = CELseq2_Gene_UMI_Counts, batchData = NULL,
                            spikeData = NULL, spikeInfo = NULL,
                            Lengths = NULL, MeanFragLengths = NULL,
                            RNAseq = "singlecell", Protocol = "UMI",
                            Normalisation = "scran",
                            GeneFilter = 0.1, SampleFilter = 3,
                            FracGenes = 0.1,
                            verbose = TRUE)
plotEvalDist(evalDistRes)

## End(Not run)

bvieth/powsimR documentation built on Aug. 19, 2023, 7:48 p.m.