scDD: scDD

View source: R/scDD.R

scDDR Documentation



Find genes with differential distributions (DD) across two conditions


scDD(SCdat, prior_param = list(alpha = 0.1, mu0 = 0, s0 = 0.01, a0 = 0.01, b0
  = 0.01), permutations = 0, testZeroes = TRUE, adjust.perms = FALSE,
  param = bpparam(), parallelBy = c("Genes", "Permutations"),
  condition = "condition", min.size = 3, min.nonzero = NULL,
  level = 0.05, categorize = TRUE)



An object of class SingleCellExperiment that contains normalized single-cell expression and metadata. The assays slot contains a named list of matrices, where the normalized counts are housed in the one named normcounts. This matrix should have one row for each gene and one sample for each column. The colData slot should contain a data.frame with one row per sample and columns that contain metadata for each sample. This data.frame should contain a variable that represents biological condition, which is in the form of numeric values (either 1 or 2) that indicates which condition each sample belongs to (in the same order as the columns of normcounts). Optional additional metadata about each cell can also be contained in this data.frame, and additional information about the experiment can be contained in the metadata slot as a list.


A list of prior parameter values to be used when modeling each gene as a mixture of DP normals. Default values are given that specify a vague prior distribution on the cluster-specific means and variances.


The number of permutations to be used in calculating empirical p-values. If set to zero (default), the full Bayes Factor permutation test will not be performed. Instead, a fast procedure to identify the genes with significantly different expression distributions will be performed using the nonparametric Kolmogorov-Smirnov test, which tests the null hypothesis that the samples are generated from the same continuous distribution. This test will yield slightly lower power than the full permutation testing framework (this effect is more pronounced at smaller sample sizes, and is more pronounced in the DB category), but is orders of magnitude faster. This option is recommended when compute resources are limited. The remaining steps of the scDD framework will remain unchanged (namely, categorizing the significant DD genes into patterns that represent the major distributional changes, as well as the ability to visualize the results with violin plots using the sideViolin function).


Logical indicating whether or not to test for a difference in the proportion of zeroes. This will only be done for genes that have at least one zero value (genes where all cells have a nonzero value will have a 'zero.pvalue' of NA).


Logical indicating whether or not to adjust the permutation tests for the sample detection rate (proportion of nonzero values). If true, the residuals of a linear model adjusted for detection rate are permuted, and new fitted values are obtained using these residuals.


a MulticoreParam or SnowParam object of the BiocParallel package that defines a parallel backend. The default option is BiocParallel::bpparam() which will automatically creates a cluster appropriate for the operating system. Alternatively, the user can specify the number of cores they wish to use by first creating the corresponding MulticoreParam (for Linux-like OS) or SnowParam (for Windows) object, and then passing it into the scDD function. This could be done to specify a parallel backend on a Linux-like OS with, say 12 cores by setting param=BiocParallel::MulticoreParam(workers=12)


For the permutation test (if invoked), the manner in which to parallelize. The default option is "Genes" which will spawn processes that divide up the genes across all cores defined in param cores, and then loop through the permutations. The alternate option is "Permutations" which loop through each gene and spawn processes that divide up the permutations across all cores defined in param. The default option is recommended when analyzing more genes than the number of permutations.


A character object that contains the name of the column in colData that represents the biological group or condition of interest (e.g. treatment versus control). Note that this variable should only contain two possible values since scDD can currently only handle two-group comparisons. The default option assumes that there is a column named "condition" that contains this variable.


a positive integer that specifies the minimum size of a cluster (number of cells) for it to be used during the classification step. Any clusters containing fewer than min.size cells will be considered an outlier cluster and ignored in the classfication algorithm. The default value is three.


a positive integer that specifies the minimum number of nonzero cells in each condition required for the test of differential distributions. If a gene has fewer nonzero cells per condition, it will still be tested for DZ (if testZeroes is TRUE). Default value is NULL (no minimum value is enforced).


numeric value between 0 and 1 that specifies the alpha level for significance of a differential gene test (default value 0.05). This is used to decide whether to classify a gene into one of the differential patterns. If 'testZeroes' is FALSE and the adjusted p-value for a given gene is below 'level', then the gene is categorized. Alternatively, if 'testZeroes' is TRUE, then the adjusted p-value must be below 'level/2' in order to be considered significant and categorized. This is done to control for multiple testing since 'testZeroes=TRUE' means that each gene is tested for a difference in nonzeroes and zeroes separately.


a logical indicating whether to determine which categories (DE, DP, DM, DB) each gene belongs to (default = TRUE). This can only be set to FALSE if 'permutations' is set to zero, since the full model fitting will automatically be carried out if permutations are run.


Find genes with differential distributions (DD) across two conditions. Models each log-transformed gene as a Dirichlet Process Mixture of normals and uses a permutation test to determine whether condition membership is independent of sample clustering. The FDR adjusted (Benjamini-Hochberg) permutation p-value is returned along with the classification of each significant gene (with p-value less than 0.05 (or 0.025 if also testing for a difference in the proportion of zeroes)) into one of four categories (DE, DP, DM, DB). For genes that do not show significant influence, of condition on clustering, an optional test of whether the proportion of zeroes (dropout rate) is different across conditions is performed (DZ).


A SingleCellExperiment object that contains the data and sample information from the input object, but where the results objects are now added to the metadata slot. The metadata slot is now a list with four items: the first (main results object) is a data.frame with the following columns:

  • 'gene': gene name (matches rownames of SCdat)

  • 'DDcategory': name of the DD (DE, DP, DM, DB, DZ) pattern (or NS = not significant)

  • 'Clusters.combined': the number of clusters identified overall

  • 'Clusters.C1': the number of clusters identified in condition 1 alone

  • 'Clusters.C2': the number of clusters identified in condition 2 alone

  • 'nonzero.pvalue': permutation (or KS) p-value for testing difference in nonzero expression values

  • 'nonzero.pvalue.adj': Benjamini-Hochberg adjusted version of the 'nonzero.pvalue'column

  • 'zero.pvalue': p-value for test of difference in dropout rate (only if 'testZeroes' is TRUE)

  • 'zero.pvalue': Benjamini-Hochberg adjusted version of the previous column (only if 'testZeroes' is TRUE)

  • ‘combined.pvalue': Fisher’s combined p-value for a difference in nonzero or zero values (only if 'testZeroes' is TRUE).

  • 'combined.pvalue.adj': Benjamini-Hochberg adjusted version of the previous column (only if 'testZeroes' is TRUE)

The remaining three elements are matrices (first for condition 1 and 2 combined, then condition 1 alone, then condition 2 alone) that contains the cluster memberships for each sample (cluster 1,2,3,...) in columns and genes in rows. Zeroes, which are not involved in the clustering, are labeled as zero. See the results function for a convenient way to extract these results objects.


Korthauer KD, Chu LF, Newton MA, Li Y, Thomson J, Stewart R, Kendziorski C. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biology. 2016 Oct 25;17(1):222.


# load toy simulated example SingleCellExperiment object to find DD genes


# check that this object is a member of the SingleCellExperiment class
# and that it contains 200 samples and 30 genes


# set arguments to pass to scDD function
# we will perform 100 permutations on each of the 30 genes

prior_param=list(alpha=0.01, mu0=0, s0=0.01, a0=0.01, b0=0.01)
nperms <- 100

# call the scDD function to perform permutations, classify DD genes, 
# and return results
# we won't perform the test for a difference in the proportion of zeroes  
# since none exists in this simulated toy example data
# this step will take significantly longer with more genes and/or 
# more permutations

scDatExSim <- scDD(scDatExSim, prior_param=prior_param, permutations=nperms, 

kdkorthauer/scDD documentation built on March 27, 2022, 5:11 a.m.