knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The diem
package provides a lightweight solution to clean up single-nucleus or
single-cell RNA-seq data during the pre-processing step. The purpose
of diem
is to remove droplets that contain extranuclear/extracellular
RNA. diem
is designed for droplet-based single-nucleus and
single-cell RNA-seq and has been tested on 10X Chromium data.
This tutorial will use the publicly available 10X single-nucleus RNA-seq data that was generated from fresh mouse brain tissue. We will use a subset of the data that is available with the package. The full data set is available here.
diem
takes as input the raw counts from a droplet-based single-cell
experiment such as from 10X. The raw counts are stored in a
column-wise sparse matrix using the implementation from the R package
Matrix.
Furthermore, diem
takes advantage of the matrix manipulation
functions in the Matrix
package to speed up computations. The R
object that wraps data access and the diem
functions is the
SCE
object. It is important to note here that diem
works best
when all droplets generated by the experiment are included, including
those with only 1 read mapping to a gene. They provide extensive
information on the RNA profile of the background distribution.
If you have 10X data that has been aligned with
CellRanger,
then the raw data will be stored in outs/raw_feature_bc_matrix
in
the output directory (for CellRanger v3). We provide a convenience function
read_10x
that can read the 10X output into a sparse Matrix.
library(diem) # In the diem directory counts <- read_10x("../tests/testdata/") dim(counts) class(counts)
This function reads in the counts present in the tests/testdata
directory into a sparse matrix. This subset of the mouse brain raw data
from 10X contains 501 genes and 1,200 droplets. The raw counts in a
sparse matrix forms the starting point for pre-processing with diem
.
The SCE object is created first: require(gridExtra)
mb_small <- create_SCE(counts, name="MouseBrain") dim(mb_small) class(mb_small)
The SCE object mb_small
stores the counts as well as the meta data
on the droplets:
drop_data <- droplet_data(mb_small) head(drop_data) summary(drop_data)
Useful diagnostics for assessing droplets include the percent
of reads that align to the mitochondrial genome, in addition to
percent that align to the nuclear-localized MALAT1. We can generate
this using the get_gene_pct
function:
mt_genes <- grep(pattern = "^mt-", x = rownames(mb_small@gene_data), ignore.case = TRUE, value = TRUE) mb_small <- get_gene_pct(x = mb_small, genes = mt_genes, name = "pct.mt") genes <- grep(pattern = "^malat1$", x = rownames(mb_small@gene_data), ignore.case = TRUE, value = TRUE) mb_small <- get_gene_pct(x = mb_small, genes = genes, name = "MALAT1") drop_data <- droplet_data(mb_small, type="all") summary(drop_data)
We plot the distribution of reads and MT% using the plot_data
function:
require(gridExtra) p1 <- plot_data(mb_small, feat_x = "total_counts", feat_y = "n_genes", log_x = TRUE, log_y = TRUE, ret = TRUE, data_type = "all") p2 <- plot_data(mb_small, feat_x = "n_genes", feat_y = "pct.mt", log_x = TRUE, log_y = FALSE, ret = TRUE, data_type = "all") p3 <- plot_data(mb_small, feat_x = "n_genes", feat_y = "MALAT1", log_x = TRUE, log_y = FALSE, ret = TRUE, data_type = "all") p4 <- plot_data(mb_small, feat_x = "pct.mt", feat_y = "MALAT1", log_x = FALSE, log_y = FALSE, ret = TRUE, data_type = "all") grid.arrange(p1, p2, p3, p4, ncol = 2)
The data suggests we can separate out 2 clusters, consisting of one low count/high MT% cluster, and one high count/low MT% cluster.
The following outlines the steps involved in diem
:
We first specify which droplets we would like to fix as debris to guide the semi-supervised clustering. We specify these droplets as those that have fewer UMIs than a specified count threshold. We then classify droplets with at least this many counts. We use 100 counts by default but it may be helpful to use a barcode rank plot to determine the cutoff for fixing the debris
barcode_rank_plot(mb_small, title = "MouseBrain")
Since this is only a subset, the tail of the low-count droplets in the barcode-rank plot is much smaller. Typical data sets can have tens to hundreds of thousands of droplets more than what is shown here.
The choice of the cutoff to fix debris and classify droplets is somewhat arbitrary and depends on subjective interpretation of what will likely be included in the debris. Setting the cutoff too high can result in excluding droplets that contain nuclei or cells, while setting the cutoff too low can result in losing information and misclassifying debris as a cell type. Here we set it to 10 since this is a subset.
mb_small <- set_debris_test_set(mb_small, min_counts = 10) length(mb_small@test_set) length(mb_small@bg_set)
In order to initialize the parameters, we run PCA and then k-means on these PCs. The goal is to best initialize the centers of the cell types for the mixture model so that DIEM can accurately classify droplets.
We subset both the droplets and the genes to best identify clusters of cell types in the data. For identifying cell types, we only include droplets with at least 200 genes by default. This further avoids initializing clutsers of low-count debris droplets and improves results. For subsetting genes, we select the top 2,000 variable genes as is common practice.
First, we removed non-expressed genes.
mb_small <- filter_genes(mb_small, cpm_thresh = 0) genes <- gene_data(mb_small) summary(genes)
To reduce the dimensionality for k-means, we run PCA. This function also calculates the top varaible genes and sets the droplets used for initialization. Here, we select the top 250 most variable genes and droplets with at least 50 genes detected. Since this is a smaller data set, we only use 10 PCs (in contrast the default of 30)
mb_small <- get_pcs(mb_small, n_var_genes = 50, n_pcs = 10)
Variable genes are identified after taking into account the relationship between the mean gene expression and the variance. We do so by running loess regression of the log variance against the log mean. The 'lss' parameter of the 'get_pcs' function controls the span for the 'loess' function used. The variances are standardized by taking the residuals of the observed variance subtracted by the fitted variance.
For the initializatoin, DIEM runs k-means on the principal components. By default, we choose k = 20 cell type clusters. Then, small clusters are merged, defined as those with less than 10 droplets by default. We set this value here to 2.
mb_small <- init(mb_small, k_init = 20, nstart_init = 30, min_size_init = 2)
We see that 4 clusters were merged as they contained less than 2 droplets, resulting in 16 cell types. The total number of clusters used in the mixture model is then 17, including the debris cluster. Now that we have the number of mixtures and their initializations, we can run EM to estimate the parameters and classify the droplets.
We call the run_em
function to estimate the parameters of the
multinomial mixture model.
mb_small <- run_em(mb_small) mb_small <- assign_clusters(mb_small) drop_data <- droplet_data(mb_small, type="test") table(drop_data[,"Cluster"]) tapply(drop_data[,"n_genes"], drop_data[,"Cluster"], mean) tapply(drop_data[,"pct.mt"], drop_data[,"Cluster"], mean) tapply(drop_data[,"MALAT1"], drop_data[,"Cluster"], mean)
We see some refinement in the clustering of droplets using the multinomial mixture model. Particularly, more droplets were assigned to the fixed debris cluster. We also see evidence of contamination in particular clusters that are separate from the debris cluster, which are clusters 8 and 13.
You can also take advantage of parallel processing if your system allows for
OpenMP threading. Specify the number of threads with the threads
parameter.
Removing droplets that belong only to the debris cluster would not account for the different levels of contamination within individual droplets and clusters. To overcome this, we use a droplet score based on the genes up-regulated in the debris droplets. These genes are specified after running differential expression with a basic Welch's t-test. We call DE genes using an FDR threshold p < 0.5 for illustrative purposes, since there are very few droplets in this subset. We further specify that any clusters with an average number of genes detected less than 50 are also considered debris droplets (the default value is 200 for full snRNA-seq data sets). This helps in estimating the debris score by specifying adding additional droplets that are likely debris.
We estimate the debris score by ruunning the function estimate_dbr_score
and can pull the DE genes with the `de_genes function.
mb_small <- estimate_dbr_score(mb_small, thresh_genes = 50, thresh_p = 0.5) de_genes <- debris_genes(mb_small, p_adj = 0.5) head(de_genes)
The debris up-regulated genes are those we would expect (the
mitochondria-encoded genes). With larger data sets, we can recover hundreds
of genes up-regulated in the debris cluster to help estimate the scores.
If you would like to use on the most specific genes in calculating the
debris score, you can cap the number by specifying the max_genes
parameter.
We can explore the relationship of the debris score with total counts, genes detected, MT%, and MALAT1%. It is helpful to look at the correlation of the debris score with counts/genes to evaluate the accuracy of this measure. There should be a general negative correlation with counts and genes detected.
require(gridExtra) p1 <- plot_clust(mb_small, feat_x = "n_genes", feat_y = "score.debris", log_x = TRUE, ret = TRUE) p2 <- plot_clust(mb_small, feat_x = "total_counts", feat_y = "score.debris", log_x = TRUE, ret = TRUE) p3 <- plot_clust(mb_small, feat_x = "pct.mt", feat_y = "score.debris", log_x = FALSE, ret = TRUE) p4 <- plot_clust(mb_small, feat_x = "MALAT1", feat_y = "score.debris", log_x = FALSE, ret = TRUE) grid.arrange(p1, p2, p3, p4, ncol = 2)
In this case, the score is basically the same as the percent of reads that align to the mitochondria. In our experience, full snRNA-seq data sets will show more genes enriched in the debris droplets, and provide more information.
We also look into the clusters formed by DIEM. Specifically, we can look at the average counts, genes, debris score, and some genes up-regulated in each.
sm <- summarize_clusters(mb_small, top_n = 20) par(mfrow=c(1,2)) plot(sm[,"avg_n_genes"], sm[,"avg_dbr_score"], pch= NA, xlab="Average number Genes", ylab="Avergae debris score") text(sm[,"avg_n_genes"], sm[,"avg_dbr_score"], sm[,"Cluster"]) plot(sm[,"avg_n_counts"], sm[,"avg_dbr_score"], pch= NA, xlab="Average total counts", ylab="Avergae debris score") text(sm[,"avg_n_counts"], sm[,"avg_dbr_score"], sm[,"Cluster"])
The summarize_clusters
function returns a data frame with summary statistics
on the clusters identified. The genes given are the top up-regulated genes
ranked by difference in proportion relative to all droplets in all the other
clusters.
head(sm)
To run a differential expression analysis for each cluster, we can
run the de_ttest_all
function to get statistically significant genes
up-regulated in each cluster.
clust <- clusters(mb_small) counts <- raw_counts(mb_small) deg <- de_ttest_all(counts = counts, labels = clust) head(deg)
This returns a data frame with genes in each row that are up-regualted in the indicated cluster.
We have the choice of either removing droplets that belong to specific clusters, or removing droplets above a certain debris score. We recommend filtering by debris score with snRNA-seq data, as droplets within clusters can show a range of contamination. For single-cell RNA-seq, we recommend filtering out droplets that belong to the debris cluster(s). Use caution when filtering by the debris score, however, since there are no gaurantees that the given debris genes can accurately classify contaminated and non-contaminated dropelts. It is recommended to make sure that the debris score and the debris genes make sense.
The default threshold for the debris score is 0.5. This represents the middle value between the average debris score of the debris droplets, and the average debris score of the least contaminated cluster. We recommend using the above plots to help select the most appropriate score that reflects the desired tolerance for contamination. For example, one could increase the threshold to include as many droplets as possible and then use the score as a covariate in downstream analysis. Here, we stick with the default value of 0.5. We also allow droplets with any number of genes detected from the test set (the default value is 200). The test set of droplets are those that we specified to have at least 10 UMIs.
In either case, we remove contaminated droplets with the
call_targets
function. The thresh_score
parameter specifies that
droplets with a score above this are removed.
mb_small <- call_targets(mb_small, thresh_score = 0.5, min_genes = 0)
The characteristics of the kept and removed droplets are as expected
drop_data <- droplet_data(mb_small, type="test") tapply(drop_data[,"n_genes"], drop_data[,"Call"], mean) tapply(drop_data[,"score.debris"], drop_data[,"Call"], mean) tapply(drop_data[,"pct.mt"], drop_data[,"Call"], mean) tapply(drop_data[,"MALAT1"], drop_data[,"Call"], mean) tapply(drop_data[,"Cluster"], drop_data[,"Call"], table)
We also show how to remove droplets that belong to the debris cluster.
We set the clusters
parameter to debris
. We can also remove
additional clusters by giving a numeric vector of the clusters
we would like to remove.
mb_small <- call_targets(mb_small, clusters = "debris", thresh_score = NULL, min_genes = 0) drop_data <- droplet_data(mb_small) tapply(drop_data[,"n_genes"], drop_data[,"Call"], mean) tapply(drop_data[,"score.debris"], drop_data[,"Call"], mean) tapply(drop_data[,"pct.mt"], drop_data[,"Call"], mean) tapply(drop_data[,"MALAT1"], drop_data[,"Call"], mean) tapply(drop_data[,"Cluster"], drop_data[,"Call"], table)
This shows the differences in removing droplets by cluster instead of by debris score. Filtering by debris score removes the contaminated droplets themselves and thus cleans up the resulting clusters. One can also use a higher threshold to include more droplets and then use the debris score as a covariate with the aim of maximizing the number of droplets while still accounting for extranuclear contamination.
While diem
uses a multinomial mixture model, it is possible to model the
mixtures with a Dirichlet-multinomial. This helps model overdispersion in the
data for more accurate clustering. In out data sets, we found the results are
highly similar, if not the same. Fitting the parameters of the
Dirichlet-multinomial is much slower, so we use a multinomial by default. We
still provide the option to model with a Dirichlet-multinomial. One option is to
first estimate parameters using the multinomial, and then run the estimation
of the Dirichlet-multinomial.
mb_small <- init(mb_small, k_init = 20, nstart_init = 30, min_size_init = 2) mb_small <- run_em(mb_small) mb_small <- assign_clusters(mb_small) drop_data <- droplet_data(mb_small, type="test") clusters_mult <- drop_data[,"Cluster"] mb_small <- run_em(mb_small, model = "DM") mb_small <- assign_clusters(mb_small) drop_data <- droplet_data(mb_small, type="test") clusters_dm <- drop_data[,"Cluster"] table(clusters_mult, clusters_dm)
We see that only a handful of droplets have changed clusters assignments. While this option is slower and provides practically the same results here, we include this as an option for cases where modeling overdispersion in the read counts is desirable.
We can extract the droplets that pass diem
filtering as well
as the counts for downstream analysis such as clustering.
counts <- raw_counts(mb_small) dim(counts) clean_drops <- get_clean_ids(mb_small) counts <- counts[,clean_drops] dim(counts) drop_data <- droplet_data(mb_small, type="test") drop_data <- drop_data[clean_drops,]
As diem
also clusters the droplets, one can use this for analyses of
cell types. They are stored in the data frame return by
droplet_data
as shown above.
We also provide a convenience function convert_to_seurat
for converting to
a Seurat object. Arguments to the CreateSeuratObject
function can be passed
at the end of the function call, such as min.features
here. For example
seur <- convert_to_seurat(mb_small, min.features = 50)
This will return a Seurat
object of the clean droplets. This additionally
adds the droplet data such as the debris score from diem
to the
meta data of the Seurat
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.