Description Usage Arguments Details Value Examples
Run DIEM pipeline on an SCE object. This function takes as input raw data
from a single-nucleus or single-cell experiment and filters out droplets
that show evidence of contamination from debris/ambient RNA. The input
data is an sce object, returned by the function create_SCE
.
The function outputs the SCE object with meta data specifying the
filtered and clean droplets.
1 2 3 4 5 6 | diem(sce, top_n = 10000, min_counts = 100, min_genes = 100,
fix_debris = NULL, cpm_thresh = 10, cluster_n = 1000,
order_by = "gene", use_var = TRUE, n_var = 2000, lss = 0.3,
sf = "median", nn = 30, min_size = 20, eps = 1e-08,
max_iter = 100, psc = 1e-04, pp_thresh = 0.95, gene_thresh = 200,
verbose = TRUE)
|
sce |
An SCE object. |
top_n |
A numeric value giving the total number of test droplets
and take the counts of the |
min_counts |
A numeric value that specifies the minimum total read/UMI counts a test droplet can have. |
min_genes |
A numeric value that specifies the minimum total genes detected a test droplet can have. |
fix_debris |
A character vector of droplet IDs that will be assigned to the debris set, regardless of its total counts or genes detected. |
cpm_thresh |
The minimum CPM threshold for removing genes. |
cluster_n |
Numeric value specifying the number of droplets to use in the test set. |
order_by |
Whether to order the droplets by total number of total counts or total number of genes detected in calculating the cluster set. |
use_var |
A logical indicating whether to subset the data to include only variable genes when initializing the clusters. The default is TRUE as it may better identify cell types. |
n_var |
Number of variable genes to use when initializing the clusters. |
lss |
Numeric value of the span parameter of the loess regression when identifying variable genes. |
sf |
The scaling factor for the normalization. Either a numeric scaling factor to multiply counts after division by column sums, or "median" indicating to multiply by the median number of total read/UMI counts in droplets (default). |
nn |
Number of nearest neighbors to calculate in constructing the kNN graph for cluster initialization. |
min_size |
Numeric value giving the minimum number of droplets in cluster for it to be used for initialization as a cell type for EM. |
eps |
Numeric threshold. The EM algorithm converges when the
percent change in log likihood is less than |
max_iter |
The maximum number of iterations allowed to run. |
psc |
Pseudocount to add to multinomial parameters to avoid collapsing likelihood to 0. |
pp_thresh |
Numeric threshold, where clean droplets must have a
posterior probability of at least |
gene_thresh |
Numeric threshold, where clean droplets must have at least
|
verbose |
Logical indicating verbosity. |
The diem function takes the sce object sce
and first
partitions the droplets into the debris, test, and cluster sets.
min_counts
and min_genes
specify the lower limit
that droplets in the test set must have. In addition,
the parameter top_n
specifies that at most the top top_n
should be included in the test set. The default is 10,000
in accordance with the typical maximum number of droplets
generated during a single-cell experiment, but should be changed
accordingly.
If the read/UMI count of the top_n
count-ranked droplet is greater
than min_counts
, then the min_counts
is
increased to this limit. The fix_debris
parameter allows
additional flexibility in specifying those droplets that are
included in the debris set. For example, it may be useful to
set droplets above a pre-determined MT
regardless of the total number of read/UMI counts present.
In order to remove lowly expressed genes, the Counts
Per Million mapped reads (CPM) are calculated for the
both the test set and debris set. Then, only genes
that have a CPM of at least cpm_thresh
in both the test set and
debris set are kept.
Instead of randomly initializing the EM, cell types are estimated from
droplets that are expected to contain cells/nuclei. The top
cluster_n
droplets are ranked by "gene" or "count", given by the
parameter order_by
. Then, droplets with at least those ranked
counts/genes are included in the cluster set. The data is then normalized
by first calculating the variable genes. A loess regression line
is fit between the log counts and log variance, and the only top genes
ranked by residual are used to initialize the clusters. The number of
genes is specified with n_var
. Optionally, one can use all genes
by setting use_var
to FALSE. The span of the loess regression line
is given by lss
(default is 0.3).
The data is normalized by dividing counts by the total counts per droplet.
Then, the counts are multiplied by a scaling factor, given by
sf
(the median of total counts by default). Finally, the data is
log transformed after adding a constant value of 1.
After normalization, the k-nearest neighbors are identified in the
cluster set. The number of nearest neighbors is specified by
nn
. Clusters are identified from the KNN graph
using the Louvain algorithm. Finally, only clusters with at least
min_size
(20 by default) droplets are considered cell types.
The number of clusters and their initialized multinomial means
are taken from the initial clustering assignments calculated with
initialize_clusters
. The EM algorithm is run
by repeatedly updating the membership probabilities and then
estimating the MLE parameters of the multinomial mixture model.
The algorithm converges when the percent change in
the log likihood is less than eps
. If the
algorithm doesn't converge by max_iter
, it breaks off.
The posterior probability is the calculated by taking the
sum of the likelihood fractions across the cell types.
Given the posterior probabilities, the droplets with those
less than pp_thresh
and with less than
n_genes
genes detected are marked as debris.
SCE object.
1 2 3 4 | counts <- read_10x("mouse_nuclei_2k/raw_gene_bc_matrices/mm10/")
mb_sce <- create_SCE(x=counts, name="Mouse Brain")
mb_sce <- diem(mb_sce)
mm_seur <- convert_to_seurat(x=mb_sce, min.features = 200, min.cells = 3, project=mb_sce@name)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.