Cleaning snRNA-seq data using DIEM

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

1. Introduction

The diem package provides a lightweight solution to clean up single-cell data during the pre-processing step. The purpose of diem is to remove droplets that contain extranuclear or extracellular RNA. diem is designed for droplet-based single-cell RNA-seq and has been tested on 10X Chromium data.

This tutorial will use the publicly available 10X single-nucleus RNA-seq data that was generated from fresh mouse brain tissue. We will use a subset of the data that is available with the package. The full data set is available here.

2. Data input

diem takes as input the raw counts from a droplet-based single-cell experiment such as from 10X. Since the raw data consists of roughly 20,000 genes and over 100,000 droplets, the counts data has to be stored as a sparse matrix from the R package Matrix. Furthermore, diem takes advantage of the matrix manipulation functions in the Matrix package to speed up computations. The R object that wraps data access and the diem functions is the SCE object. It is important to note here that diem works best when all droplets generated by the experiment are included, even those with only 1 read mapping to a gene. They provide extensive information on the RNA profile of the background distribution.

If you have 10X data that has been aligned with CellRanger, then the raw data will be stored in outs/raw_feature_bc_matrix in the output directory (for CellRanger v3). We provide a convenience function read_10x that can read the 10X output into a sparse Matrix.

library(diem)

# In the diem directory
counts <- read_10x("../tests/testdata/")
dim(counts)
class(counts)

This function reads in the counts present in the tests/testdata directory into a sparse matrix. This subset of the mouse brain raw data from 10X contains 1,001 genes and 2,457 droplets.

The raw counts in a sparse matrix forms the starting point for pre-processing with diem. To create the SCE object:

2.1 Creating an SCE object

mb_small <- create_SCE(counts, name="MouseBrain")
dim(mb_small)
class(mb_small)

The SCE object mb_small stores the counts as well as the meta data on the droplets:

drop_data <- droplet_data(mb_small)
head(drop_data)
summary(drop_data)

2.2 Plotting quality metrics

Useful diagnostics for assessing droplets include the percent of reads that align to the mitochondrial genome, in addition to percent that align the nuclear-localized MALAT1. We can generate this using the following:

mt_genes <- grep(pattern = "^mt-", x = rownames(mb_small@gene_data), 
                 ignore.case = TRUE, value = TRUE)
mb_small <- get_gene_pct(x = mb_small, genes = mt_genes, name = "pct.mt")
genes <- grep(pattern = "^malat1$", x = rownames(mb_small@gene_data), 
              ignore.case = TRUE, value = TRUE)
mb_small <- get_gene_pct(x = mb_small, genes = genes, name = "MALAT1")
drop_data <- droplet_data(mb_small)
summary(drop_data)

Before filtering, it's good plot the distribution of reads and MT%.

datf <- drop_data
par(mfrow=c(2,2))
plot(datf$total_counts, datf$n_genes, 
     xlab="Total Counts", ylab="Number Genes")
plot(datf$n_genes, datf$pct.mt,
     xlab="Number Genes", ylab="MT%")
plot(datf$n_genes, datf$MALAT1, 
     xlab="Number Genes", ylab="MALAT1%")
plot(datf$pct.mt, datf$MALAT1,
     xlab="MT%", ylab="MALAT1%")

The data suggests we can separate out 2 clusters, consisting of one low count/high MT% cluster, and one high count/low MT% cluster.

3. Running DIEM

The following outlines the steps involved in diem:

  1. Estimating the initial number of clusters, as well droplets within these clusters in order to initialize the multinomial means.
  2. Run EM to estimate the parameters of the multinomial mixture model.
  3. Classify droplets based on their likelihood.

3.1 Specifying test and debris droplets

To run diem, we first have to specify which droplets we would like to:

  1. classify
  2. fix as debris
  3. initialize clustering

1 and 2 are mutually exclusive, while 3 is a subset of 1. It is helpful to use the above plots, as well as a barcode-rank plot, to determine the droplet sets.

barcode_rank_plot(mb_small, title = "MouseBrain")

Since this is only a subset, the tail of the low-count droplets in the barcode-rank plot is much smaller. Typical data sets will have tens to hundreds of thousands of droplets more than what is shown here. The choice of the cutoff to fix debris and classify droplets is somewhat arbitrary and depends on subjective interpretation of what will likely be included in the debris. We should select the cutoff to fix as much of the debris droplets into the debris group, while allowing to classify droplets that may contain cells or nuclei. To set the groups, we use the function set_debris_test_set:

mb_small <- set_debris_test_set(mb_small, 
                                top_n = 2000, 
                                min_counts = 100, 
                                min_genes = 100)
length(mb_small@test_set)
length(mb_small@bg_set)

This results in 457 test droplets and 2,000 debris droplets. The function call states that the test set (droplets to classify) must have at least 100 total counts and 100 genes detected. Furthermore, only consider the top 2,000 droplets ranked by counts. Anything below this, regardless of total counts and genes detected, will be fixed to debris. This option isn't necessary here but we put it here to illustrate it. This can help with variable total counts when we have a good approximation of the maximum number of nuclei or cells that can be captured. To turn this off, set top_n = Inf.

3.2 Filtering genes

Now that we set the test and debris droplets, we remove lowly expressed genes. We use counts per million mapped reads (CPM) to set the filtering criteria.

mb_small <- filter_genes(mb_small, cpm_thresh = 10)
genes <- gene_data(mb_small)
summary(genes)

All genes are expressed here (since this is a subset including the expressed genes).

3.3 Initializing clusters

Before running EM on the multinomial mixture model, we need to estimate the number of mixtures and initialize the mean parameters for each multinomial. We do this by non-parametric clustering of of droplets that are likely to contain nuclei. Here we use the top 200 ranked by gene to estimate the cell types. Alternatively, you can order_by = "count".

mb_small <- set_cluster_set(mb_small, 
                            cluster_n = 200, 
                            order_by = "gene")
length(mb_small@cluster_set)

The number of cell types are estimated by running Louvain clustering on the cluster set defined above. Briefly, this involves extracting the variable genes, normalizing the counts, generating a weighted k-nearest graph, and clustering the graph. This is done with a single function initialize_clusters. You can use variable genes or all genes by setting the use_var parameter, along with setting the number of variable genes with n_var.

mb_small <- initialize_clusters(mb_small, 
                                use_var = TRUE, 
                                n_var = 200, 
                                verbose = TRUE)

This identified 3 distinct cell types in addition to the debris set. Now that we have the number of mixtures and their initializations, we can run EM to estimate the parameters and classify the droplets.

3.4 Running EM

Use the run_em function to estimate the parameters of the multinomial mixture and calculate droplet probabilities:

mb_small <- run_em(mb_small, verbose = TRUE)
drop_data <- droplet_data(mb_small)
summary(drop_data)

This provides the probability that a droplet belongs to one of the k cell types or debris. The column CleanProb gives 1 minus the probability that the droplet belongs to the debris group. The Cluster column gives the group with the highest probability, while ClusterProb gives the probability of that cluster.

To classify a droplet as debris or nucleus/cell, set a threshold on CleanProb.

mb_small <- call_targets(mb_small, 
                         pp_thresh = 0.95, 
                         min_genes = 100)
drop_data <- droplet_data(mb_small)
summary(drop_data)

This gives 371 clean droplets and 2,086 debris. The majority of the 2,086 droplets had very low counts and would have been excluded anyways. To see how many of the droplets with at least 100 genes detected were removed as debris, we can run the following:

clean <- get_clean_ids(mb_small)
removed <- setdiff(mb_small@test_set, clean)
length(clean)
length(removed)
length(removed) / (length(removed) + length(clean))

This removed nearly 19% of the 457 test droplets. To see which droplets which were removed, we can plot the droplets colored by their probability.

plot_umi_gene_call(mb_small, alpha = 1)

We can also plot with MT% on one of the axes.

library(ggplot2)
drop_data <- droplet_data(mb_small)
ggplot(drop_data, aes(x = n_genes, y = pct.mt, color = CleanProb)) + 
geom_point() + 
theme_minimal() + 
xlab("Number Genes") + ylab("MT%")

The droplets removed tend to be those with high MT% and low UMI counts/ genes detected.

4. Downstream analysis

diem filtering provided a set of droplets we can use for downstream analysis. We can extract the droplets passing diem filtering as well as the counts.

counts <- raw_counts(mb_small)
counts_clean <- counts[,clean]
dim(counts_clean)

We also provide a convenience function for converting to a Seurat object:

seur <- convert_to_seurat(mb_small, 
                          targets = TRUE, 
                          meta = TRUE, 
                          min.features = 100)
dim(seur)
head(seur@meta.data)

5. Considerations

Since we only used a subset of the entire raw data that covers the transcriptome for this example, we modified many of the parameters to fit this. We recommend using the entire raw data set with default parameters, as this should be appropriate for most data sets.



Try the diem package in your browser

Any scripts or data that you put into this service are public.

diem documentation built on Nov. 16, 2019, 1:08 a.m.