Dino: Normalize scRNAseq data

View source: R/Dino.R

Dino    R Documentation

Normalize scRNAseq data

Description

Dino removes cell-to-cell variation in observed counts that is due to sequencing depth in single-cell mRNA sequencing experiments. Dino was designed primarily with UMI-based protocols in mind, but it is also applicable to non-UMI-based chemistries in the library preparation stage of sequencing.

Usage

Dino(counts, nCores = 2, prec = 3, minNZ = 10,
    nSubGene = 1e4, nSubCell = 1e4, depth = NULL, slope = NULL,
    minSlope = 1/2, maxSlope = 2, clusterSlope = TRUE,
    returnMeta = FALSE, doRQS = FALSE,
    emPar = list(maxIter = 100, tol = 0.1, conPar = 15, maxK = 100), ...)

Arguments

counts

A numeric matrix of expression counts, usually in dgCMatrix format for memory efficiency. Column names denote cells (samples or droplets) and row names denote genes.

nCores

A non-negative integer scalar denoting the number of cores to use. Setting nCores to 0 uses all cores, as determined by parallel::detectCores().
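As an illustration of the nCores = 0 convention, here is a minimal sketch of how such an argument could be resolved. The helper resolve_cores is hypothetical and is not part of Dino; the package's internal handling may differ.

```r
# Hypothetical helper (not a Dino function): interpret an nCores-style
# argument, where 0 means "use every core that parallel::detectCores() finds".
resolve_cores <- function(nCores) {
  stopifnot(is.numeric(nCores), length(nCores) == 1, nCores >= 0)
  if (nCores == 0) {
    detected <- parallel::detectCores()
    # detectCores() can return NA on some platforms; fall back to one core
    if (is.na(detected)) 1L else as.integer(detected)
  } else {
    as.integer(nCores)
  }
}

resolve_cores(2)  # an explicit request is passed through
resolve_cores(0)  # 0 resolves to the number of detected cores
```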

prec

A positive integer denoting the number of decimal places to which depth (if estimated internally via depth = NULL) and the normalized counts are rounded for computational efficiency.

minNZ

A positive integer denoting the minimum number of non-zero counts for a gene to be normalized by the Dino algorithm. It is recommended to pre-filter the counts matrix such that all genes meet this threshold. Otherwise, genes with fewer than minNZ non-zeros will be scaled by depth for normalization.
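The recommended pre-filtering can be sketched as follows. The toy matrix and the object names (counts, nonZero, countsFiltered) are made up for illustration; a base R matrix is used to keep the sketch self-contained, and for a dgCMatrix the equivalent row sums would come from the Matrix package.

```r
set.seed(1)

# Toy counts: 6 genes x 20 cells (hypothetical data, not from the package)
counts <- matrix(
  rpois(120, lambda = 1), nrow = 6,
  dimnames = list(paste0("gene", 1:6), paste0("cell", 1:20))
)
counts["gene1", ] <- 0  # force one gene to have no non-zero counts

minNZ <- 10

# Number of cells with a non-zero count, per gene
nonZero <- rowSums(counts > 0)

# Keep only genes that meet the minNZ threshold
countsFiltered <- counts[nonZero >= minNZ, , drop = FALSE]
dim(countsFiltered)
```

The filtered matrix could then be passed to Dino in place of the raw counts.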

nSubGene

A positive integer denoting the number of genes to subset for calculation of slope.

nSubCell

A positive integer denoting the number of samples to subset for calculation of slope and the EM algorithm.

depth

A numeric vector of length equal to the number of columns of counts. depth denotes a median-centered, log-scale measure of cell-wise sequencing depth. By default (depth = NULL), Dino defines depth as the within-cell sum of counts across genes, followed by a log transformation and median centering.
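A minimal sketch of the default depth definition described above, using a made-up toy matrix. The exact order of the rounding step inside Dino is an assumption here; only the sum, log, and median-centering steps are taken from the description.

```r
set.seed(1)

# Toy counts: 10 genes x 20 cells (hypothetical data)
counts <- matrix(
  rpois(200, lambda = 5), nrow = 10,
  dimnames = list(paste0("gene", 1:10), paste0("cell", 1:20))
)

prec <- 3

libSize <- colSums(counts)      # within-cell sum of counts across genes
depth <- log(libSize)           # log scale
depth <- depth - median(depth)  # median centering
depth <- round(depth, prec)     # rounding to prec decimals (assumed placement)

length(depth) == ncol(counts)
summary(depth)
```

A vector built this way could be supplied through the depth argument instead of letting Dino estimate it.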

slope

A numeric scalar denoting the count-depth relationship on the log-log scale. Typical values are close to 1 (implying a unit increase in depth corresponds to a unit increase in expected counts on the log-log scale), but may be higher, particularly in the case of non-UMI protocols. Dino defaults to estimating slope internally.
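To make the log-log interpretation concrete, the sketch below simulates counts for a single gene whose expected value scales with depth at a known slope of 1 and recovers that slope with a Poisson regression. This is only an illustration of the count-depth relationship under assumed simulation settings; it is not Dino's internal estimation procedure.

```r
set.seed(1)

nCells <- 2000
trueSlope <- 1

# Median-centered, log-scale depth for simulated cells (hypothetical values)
depth <- rnorm(nCells, mean = 0, sd = 0.5)
depth <- depth - median(depth)

# Expected counts follow log(E[y]) = intercept + slope * depth
y <- rpois(nCells, lambda = exp(1 + trueSlope * depth))

# With a log link, the coefficient on depth is the log-log slope
fit <- glm(y ~ depth, family = poisson(link = "log"))
slopeHat <- unname(coef(fit)["depth"])
slopeHat
```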

minSlope

A numeric scalar denoting the minimum slope. Fitted slopes below this value will raise a warning and be set to 1.

maxSlope

A numeric scalar denoting the maximum slope. Fitted slopes above this value will raise a warning and be set to 1.
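The behavior of minSlope and maxSlope can be sketched as a simple bounds check. The function check_slope is hypothetical and only mirrors the description above; the wording of the warning and the handling inside Dino are assumptions.

```r
# Hypothetical helper (not a Dino function): reset out-of-range slopes to 1
check_slope <- function(slope, minSlope = 1/2, maxSlope = 2) {
  if (slope < minSlope || slope > maxSlope) {
    warning(
      "Fitted slope ", slope, " is outside [", minSlope, ", ", maxSlope,
      "]; setting slope to 1"
    )
    slope <- 1
  }
  slope
}

check_slope(1.1)                  # within bounds, returned unchanged
suppressWarnings(check_slope(3))  # above maxSlope, reset to 1
```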

clusterSlope

A logical indicating whether cells should be pre-clustered prior to calculation of slope. By default (TRUE), cells are pre-clustered and cluster is used as a factor in the regression.

returnMeta

A logical indicating whether metadata (sequencing depth and slope) should be returned.

doRQS

A logical indicating how normalization resampling is to be done. By default (FALSE), normalization is done by resampling from the full posterior distribution. Alternatively, restricted quantile sampling (RQS) can be performed to enforce stronger preservation of expression ranks in the normalized data. RQS is currently considered experimental.

emPar

A list of parameters to send to the EM algorithm. maxIter denotes the maximum number of model updates. tol denotes the cutoff threshold for reductions in the log likelihood function. conPar denotes the concentration parameter for the resampling. conPar = 1 implies full resampling from the fitted distribution. As conPar increases, the normalized expression converges to the scale-factor normalized values. maxK denotes the maximum number of mixture components in the mixture model.
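One way to change a single EM setting while keeping the other defaults from the Usage section is sketched below. Whether Dino requires all four elements to be present in emPar is not stated here, so supplying the complete list is a cautious assumption; the new values are arbitrary examples.

```r
# Defaults copied from the Usage section
emParDefault <- list(maxIter = 100, tol = 0.1, conPar = 15, maxK = 100)

# Override selected entries and keep the rest (values chosen for illustration)
emParCustom <- utils::modifyList(emParDefault, list(maxIter = 200, conPar = 25))
str(emParCustom)

# The customized list would then be passed as, for example:
# Dino(counts, emPar = emParCustom)
```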

...

Additional parameters to pass to scran::quickCluster.

Value

By default, Dino returns a matrix of normalized expression with the same dimensions as counts. If returnMeta = TRUE, Dino instead returns a list containing the normalized expression, sequencing depth, and slope.

Author(s)

Jared Brown

References

Brown, J., Ni, Z., Mohanty, C., Bacher, R. and Kendziorski, C. (2020) "Normalization by distributional resampling of high throughput single-cell RNA-sequencing data." bioRxiv. https://doi.org/10.1101/2020.10.28.359901

Examples

# load the package and the raw example data
library(Dino)
data("pbmcSmall")
str(pbmcSmall)

# run Dino on raw expression matrix
pbmcSmall_Norm <- Dino(pbmcSmall)
str(pbmcSmall_Norm)


JBrownBiostat/Dino documentation built on June 11, 2022, 1:27 p.m.