diem: Run DIEM pipeline
In diem: Debris-Containing Droplet Identification using EM

Description Usage Arguments Details Value Examples

Run DIEM pipeline on an SCE object. This function takes as input raw data from a single-nucleus or single-cell experiment and filters out droplets that show evidence of contamination from debris/ambient RNA. The input data is an sce object, returned by the function create_SCE. The function outputs the SCE object with meta data specifying the filtered and clean droplets.

diem(sce, top_n = 10000, min_counts = 100, min_genes = 100,
  fix_debris = NULL, cpm_thresh = 10, cluster_n = 1000,
  order_by = "gene", use_var = TRUE, n_var = 2000, lss = 0.3,
  sf = "median", nn = 30, min_size = 20, eps = 1e-08,
  max_iter = 100, psc = 1e-04, pp_thresh = 0.95, gene_thresh = 200,
  verbose = TRUE)

`sce`	An SCE object.
`top_n`	A numeric value giving the total number of test droplets and take the counts of the `top_n` ranked droplet. Set to `NULL` to disable and set all droplets above `min_counts` and `min_genes` to the test set.
`min_counts`	A numeric value that specifies the minimum total read/UMI counts a test droplet can have.
`min_genes`	A numeric value that specifies the minimum total genes detected a test droplet can have.
`fix_debris`	A character vector of droplet IDs that will be assigned to the debris set, regardless of its total counts or genes detected.
`cpm_thresh`	The minimum CPM threshold for removing genes.
`cluster_n`	Numeric value specifying the number of droplets to use in the test set.
`order_by`	Whether to order the droplets by total number of total counts or total number of genes detected in calculating the cluster set.
`use_var`	A logical indicating whether to subset the data to include only variable genes when initializing the clusters. The default is TRUE as it may better identify cell types.
`n_var`	Number of variable genes to use when initializing the clusters.
`lss`	Numeric value of the span parameter of the loess regression when identifying variable genes.
`sf`	The scaling factor for the normalization. Either a numeric scaling factor to multiply counts after division by column sums, or "median" indicating to multiply by the median number of total read/UMI counts in droplets (default).
`nn`	Number of nearest neighbors to calculate in constructing the kNN graph for cluster initialization.
`min_size`	Numeric value giving the minimum number of droplets in cluster for it to be used for initialization as a cell type for EM.
`eps`	Numeric threshold. The EM algorithm converges when the percent change in log likihood is less than `eps`.
`max_iter`	The maximum number of iterations allowed to run.
`psc`	Pseudocount to add to multinomial parameters to avoid collapsing likelihood to 0.
`pp_thresh`	Numeric threshold, where clean droplets must have a posterior probability of at least `pp_thresh`.
`gene_thresh`	Numeric threshold, where clean droplets must have at least `min_genes` genes detected.
`verbose`	Logical indicating verbosity.

The diem function takes the sce object sce and first partitions the droplets into the debris, test, and cluster sets. min_counts and min_genes specify the lower limit that droplets in the test set must have. In addition, the parameter top_n specifies that at most the top top_n should be included in the test set. The default is 10,000 in accordance with the typical maximum number of droplets generated during a single-cell experiment, but should be changed accordingly.

If the read/UMI count of the top_n count-ranked droplet is greater than min_counts, then the min_counts is increased to this limit. The fix_debris parameter allows additional flexibility in specifying those droplets that are included in the debris set. For example, it may be useful to set droplets above a pre-determined MT regardless of the total number of read/UMI counts present. In order to remove lowly expressed genes, the Counts Per Million mapped reads (CPM) are calculated for the both the test set and debris set. Then, only genes that have a CPM of at least cpm_thresh in both the test set and debris set are kept.

Instead of randomly initializing the EM, cell types are estimated from droplets that are expected to contain cells/nuclei. The top cluster_n droplets are ranked by "gene" or "count", given by the parameter order_by. Then, droplets with at least those ranked counts/genes are included in the cluster set. The data is then normalized by first calculating the variable genes. A loess regression line is fit between the log counts and log variance, and the only top genes ranked by residual are used to initialize the clusters. The number of genes is specified with n_var. Optionally, one can use all genes by setting use_var to FALSE. The span of the loess regression line is given by lss (default is 0.3). The data is normalized by dividing counts by the total counts per droplet. Then, the counts are multiplied by a scaling factor, given by sf (the median of total counts by default). Finally, the data is log transformed after adding a constant value of 1. After normalization, the k-nearest neighbors are identified in the cluster set. The number of nearest neighbors is specified by nn. Clusters are identified from the KNN graph using the Louvain algorithm. Finally, only clusters with at least min_size (20 by default) droplets are considered cell types.

The number of clusters and their initialized multinomial means are taken from the initial clustering assignments calculated with initialize_clusters. The EM algorithm is run by repeatedly updating the membership probabilities and then estimating the MLE parameters of the multinomial mixture model. The algorithm converges when the percent change in the log likihood is less than eps. If the algorithm doesn't converge by max_iter, it breaks off. The posterior probability is the calculated by taking the sum of the likelihood fractions across the cell types.

Given the posterior probabilities, the droplets with those less than pp_thresh and with less than n_genes genes detected are marked as debris.

SCE object.

counts <- read_10x("mouse_nuclei_2k/raw_gene_bc_matrices/mm10/")
mb_sce <- create_SCE(x=counts, name="Mouse Brain")
mb_sce <- diem(mb_sce)
mm_seur <- convert_to_seurat(x=mb_sce, min.features = 200, min.cells = 3, project=mb_sce@name)