knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The diem
package provides a lightweight solution to clean up single-cell
data during the pre-processing step. The purpose
of diem
is to remove droplets that contain extranuclear or extracellular
RNA. diem
is designed for droplet-based single-cell RNA-seq and has been
tested on 10X Chromium data.
This tutorial will use the publicly available 10X single-nucleus RNA-seq data that was generated from fresh mouse brain tissue. We will use a subset of the data that is available with the package. The full data set is available here.
diem
takes as input the raw counts from a droplet-based single-cell
experiment such as from 10X. Since the raw data consists of roughly
20,000 genes and over 100,000 droplets, the counts data has to be
stored as a sparse matrix from the R package
Matrix.
Furthermore, diem
takes advantage of the matrix manipulation
functions in the Matrix
package to speed up computations. The R
object that wraps data access and the diem
functions is the
SCE
object. It is important to note here that diem
works best
when all droplets generated by the experiment are included, even
those with only 1 read mapping to a gene. They provide extensive
information on the RNA profile of the background distribution.
If you have 10X data that has been aligned with
CellRanger,
then the raw data will be stored in outs/raw_feature_bc_matrix
in
the output directory (for CellRanger v3). We provide a convenience function
read_10x
that can read the 10X output into a sparse Matrix.
library(diem) # In the diem directory counts <- read_10x("../tests/testdata/") dim(counts) class(counts)
This function reads in the counts present in the tests/testdata
directory into a sparse matrix. This subset of the mouse brain raw data
from 10X contains 1,001 genes and 2,457 droplets.
The raw counts in a sparse matrix forms the starting point for
pre-processing with diem
. To create the SCE
object:
mb_small <- create_SCE(counts, name="MouseBrain") dim(mb_small) class(mb_small)
The SCE object mb_small
stores the counts as well as the meta data
on the droplets:
drop_data <- droplet_data(mb_small) head(drop_data) summary(drop_data)
Useful diagnostics for assessing droplets include the percent of reads that align to the mitochondrial genome, in addition to percent that align the nuclear-localized MALAT1. We can generate this using the following:
mt_genes <- grep(pattern = "^mt-", x = rownames(mb_small@gene_data), ignore.case = TRUE, value = TRUE) mb_small <- get_gene_pct(x = mb_small, genes = mt_genes, name = "pct.mt") genes <- grep(pattern = "^malat1$", x = rownames(mb_small@gene_data), ignore.case = TRUE, value = TRUE) mb_small <- get_gene_pct(x = mb_small, genes = genes, name = "MALAT1") drop_data <- droplet_data(mb_small) summary(drop_data)
Before filtering, it's good plot the distribution of reads and MT%.
datf <- drop_data par(mfrow=c(2,2)) plot(datf$total_counts, datf$n_genes, xlab="Total Counts", ylab="Number Genes") plot(datf$n_genes, datf$pct.mt, xlab="Number Genes", ylab="MT%") plot(datf$n_genes, datf$MALAT1, xlab="Number Genes", ylab="MALAT1%") plot(datf$pct.mt, datf$MALAT1, xlab="MT%", ylab="MALAT1%")
The data suggests we can separate out 2 clusters, consisting of one low count/high MT% cluster, and one high count/low MT% cluster.
The following outlines the steps involved in diem
:
To run diem
, we first have to specify which droplets we would like to:
1 and 2 are mutually exclusive, while 3 is a subset of 1. It is helpful to use the above plots, as well as a barcode-rank plot, to determine the droplet sets.
barcode_rank_plot(mb_small, title = "MouseBrain")
Since this is only a subset, the tail of the low-count droplets in the
barcode-rank plot is much smaller. Typical data sets will have
tens to hundreds of thousands of droplets more than what is shown here.
The choice of the cutoff to fix debris and classify droplets is somewhat
arbitrary and depends on subjective interpretation of what will likely
be included in the debris. We should select the cutoff to fix as much
of the debris droplets into the debris group, while allowing to classify
droplets that may contain cells or nuclei. To set the groups, we use
the function set_debris_test_set
:
mb_small <- set_debris_test_set(mb_small, top_n = 2000, min_counts = 100, min_genes = 100) length(mb_small@test_set) length(mb_small@bg_set)
This results in 457 test droplets and 2,000 debris droplets. The function
call states that the test set (droplets to classify) must have at least
100 total counts and 100 genes detected. Furthermore, only consider the
top 2,000 droplets ranked by counts. Anything below this, regardless
of total counts and genes detected, will be fixed to debris. This option
isn't necessary here but we put it here to illustrate it. This can
help with variable total counts when we have a good approximation of the
maximum number of nuclei or cells that can be captured. To turn this
off, set top_n = Inf
.
Now that we set the test and debris droplets, we remove lowly expressed genes. We use counts per million mapped reads (CPM) to set the filtering criteria.
mb_small <- filter_genes(mb_small, cpm_thresh = 10) genes <- gene_data(mb_small) summary(genes)
All genes are expressed here (since this is a subset including the expressed genes).
Before running EM on the multinomial mixture model, we need to
estimate the number of mixtures and initialize the mean parameters
for each multinomial. We do this by non-parametric clustering of
of droplets that are likely to contain nuclei. Here we use the top
200 ranked by gene to estimate the cell types. Alternatively, you
can order_by = "count"
.
mb_small <- set_cluster_set(mb_small, cluster_n = 200, order_by = "gene") length(mb_small@cluster_set)
The number of cell types are estimated by running Louvain clustering
on the cluster set defined above. Briefly, this involves extracting
the variable genes, normalizing the counts, generating a weighted
k-nearest graph, and clustering the graph. This is done with a
single function initialize_clusters
. You can use variable genes
or all genes by setting the use_var
parameter, along with
setting the number of variable genes with n_var
.
mb_small <- initialize_clusters(mb_small, use_var = TRUE, n_var = 200, verbose = TRUE)
This identified 3 distinct cell types in addition to the debris set. Now that we have the number of mixtures and their initializations, we can run EM to estimate the parameters and classify the droplets.
Use the run_em
function to estimate the parameters of the multinomial
mixture and calculate droplet probabilities:
mb_small <- run_em(mb_small, verbose = TRUE) drop_data <- droplet_data(mb_small) summary(drop_data)
This provides the probability that a droplet belongs to one of the k cell
types or debris. The column CleanProb
gives 1 minus the probability that
the droplet belongs to the debris group. The Cluster
column gives the
group with the highest probability, while ClusterProb
gives the probability
of that cluster.
To classify a droplet as debris or nucleus/cell, set a threshold on
CleanProb
.
mb_small <- call_targets(mb_small, pp_thresh = 0.95, min_genes = 100) drop_data <- droplet_data(mb_small) summary(drop_data)
This gives 371 clean droplets and 2,086 debris. The majority of the 2,086 droplets had very low counts and would have been excluded anyways. To see how many of the droplets with at least 100 genes detected were removed as debris, we can run the following:
clean <- get_clean_ids(mb_small) removed <- setdiff(mb_small@test_set, clean) length(clean) length(removed) length(removed) / (length(removed) + length(clean))
This removed nearly 19% of the 457 test droplets. To see which droplets which were removed, we can plot the droplets colored by their probability.
plot_umi_gene_call(mb_small, alpha = 1)
We can also plot with MT% on one of the axes.
library(ggplot2) drop_data <- droplet_data(mb_small) ggplot(drop_data, aes(x = n_genes, y = pct.mt, color = CleanProb)) + geom_point() + theme_minimal() + xlab("Number Genes") + ylab("MT%")
The droplets removed tend to be those with high MT% and low UMI counts/ genes detected.
diem
filtering provided a set of droplets we can use for downstream
analysis. We can extract the droplets passing diem
filtering as well
as the counts.
counts <- raw_counts(mb_small) counts_clean <- counts[,clean] dim(counts_clean)
We also provide a convenience function for converting to a Seurat object:
seur <- convert_to_seurat(mb_small, targets = TRUE, meta = TRUE, min.features = 100) dim(seur) head(seur@meta.data)
Since we only used a subset of the entire raw data that covers the transcriptome for this example, we modified many of the parameters to fit this. We recommend using the entire raw data set with default parameters, as this should be appropriate for most data sets.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.