knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

1. Introduction

The diem package provides a lightweight solution to clean up single-nucleus or single-cell RNA-seq data during the pre-processing step. The purpose of diem is to remove droplets that contain extranuclear/extracellular RNA. diem is designed for droplet-based single-nucleus and single-cell RNA-seq and has been tested on 10X Chromium data.

This tutorial will use the publicly available 10X single-nucleus RNA-seq data that was generated from fresh mouse brain tissue. We will use a subset of the data that is available with the package. The full data set is available here.

2. Data input

diem takes as input the raw counts from a droplet-based single-cell experiment such as from 10X. The raw counts are stored in a column-wise sparse matrix using the implementation from the R package Matrix. Furthermore, diem takes advantage of the matrix manipulation functions in the Matrix package to speed up computations. The R object that wraps data access and the diem functions is the SCE object. It is important to note here that diem works best when all droplets generated by the experiment are included, including those with only 1 read mapping to a gene. They provide extensive information on the RNA profile of the background distribution.

If you have 10X data that has been aligned with CellRanger, then the raw data will be stored in outs/raw_feature_bc_matrix in the output directory (for CellRanger v3). We provide a convenience function read_10x that can read the 10X output into a sparse Matrix.

library(diem)

# In the diem directory
counts <- read_10x("../tests/testdata/")
dim(counts)
class(counts)

This function reads in the counts present in the tests/testdata directory into a sparse matrix. This subset of the mouse brain raw data from 10X contains 501 genes and 1,200 droplets. The raw counts in a sparse matrix forms the starting point for pre-processing with diem.

2.1 Creating an SCE object

The SCE object is created first: require(gridExtra)

mb_small <- create_SCE(counts, name="MouseBrain")
dim(mb_small)
class(mb_small)

The SCE object mb_small stores the counts as well as the meta data on the droplets:

drop_data <- droplet_data(mb_small)
head(drop_data)
summary(drop_data)

2.2 Plotting quality metrics

Useful diagnostics for assessing droplets include the percent of reads that align to the mitochondrial genome, in addition to percent that align to the nuclear-localized MALAT1. We can generate this using the get_gene_pct function:

mt_genes <- grep(pattern = "^mt-", x = rownames(mb_small@gene_data),
                 ignore.case = TRUE, value = TRUE)
mb_small <- get_gene_pct(x = mb_small, genes = mt_genes, name = "pct.mt")
genes <- grep(pattern = "^malat1$", x = rownames(mb_small@gene_data),
              ignore.case = TRUE, value = TRUE)
mb_small <- get_gene_pct(x = mb_small, genes = genes, name = "MALAT1")
drop_data <- droplet_data(mb_small, type="all")
summary(drop_data)

We plot the distribution of reads and MT% using the plot_data function:

require(gridExtra)
p1 <- plot_data(mb_small, feat_x = "total_counts", feat_y = "n_genes", 
                log_x = TRUE, log_y = TRUE, ret = TRUE, data_type = "all")
p2 <- plot_data(mb_small, feat_x = "n_genes", feat_y = "pct.mt", 
                log_x = TRUE, log_y = FALSE, ret = TRUE, data_type = "all")
p3 <- plot_data(mb_small, feat_x = "n_genes", feat_y = "MALAT1", 
                log_x = TRUE, log_y = FALSE, ret = TRUE, data_type = "all")
p4 <- plot_data(mb_small, feat_x = "pct.mt", feat_y = "MALAT1", 
                log_x = FALSE, log_y = FALSE, ret = TRUE, data_type = "all")
grid.arrange(p1, p2, p3, p4, ncol = 2)

The data suggests we can separate out 2 clusters, consisting of one low count/high MT% cluster, and one high count/low MT% cluster.

3. Running DIEM

The following outlines the steps involved in diem:

  1. Speciyfy the droplet we want to classify and the droplets we want to fix as debris.
  2. Initialize cell type clusters with k-means. We run k-means on the PCs to speed up the computations.
  3. Run EM to estimate the parameters of the mutlinomial mixture model and cluster droplets.
  4. Estimate the level of contamination within droplets.
  5. Filter droplets.

3.1 Specifying test and debris droplets

We first specify which droplets we would like to fix as debris to guide the semi-supervised clustering. We specify these droplets as those that have fewer UMIs than a specified count threshold. We then classify droplets with at least this many counts. We use 100 counts by default but it may be helpful to use a barcode rank plot to determine the cutoff for fixing the debris

barcode_rank_plot(mb_small, title = "MouseBrain")

Since this is only a subset, the tail of the low-count droplets in the barcode-rank plot is much smaller. Typical data sets can have tens to hundreds of thousands of droplets more than what is shown here.

The choice of the cutoff to fix debris and classify droplets is somewhat arbitrary and depends on subjective interpretation of what will likely be included in the debris. Setting the cutoff too high can result in excluding droplets that contain nuclei or cells, while setting the cutoff too low can result in losing information and misclassifying debris as a cell type. Here we set it to 10 since this is a subset.

mb_small <- set_debris_test_set(mb_small, min_counts = 10)
length(mb_small@test_set)
length(mb_small@bg_set)

3.2 Initializing cell type clusters

In order to initialize the parameters, we run PCA and then k-means on these PCs. The goal is to best initialize the centers of the cell types for the mixture model so that DIEM can accurately classify droplets.

We subset both the droplets and the genes to best identify clusters of cell types in the data. For identifying cell types, we only include droplets with at least 200 genes by default. This further avoids initializing clutsers of low-count debris droplets and improves results. For subsetting genes, we select the top 2,000 variable genes as is common practice.

First, we removed non-expressed genes.

mb_small <- filter_genes(mb_small, cpm_thresh = 0)
genes <- gene_data(mb_small)
summary(genes)

To reduce the dimensionality for k-means, we run PCA. This function also calculates the top varaible genes and sets the droplets used for initialization. Here, we select the top 250 most variable genes and droplets with at least 50 genes detected. Since this is a smaller data set, we only use 10 PCs (in contrast the default of 30)

mb_small <- get_pcs(mb_small,
                    n_var_genes = 50,
                    n_pcs = 10)

Variable genes are identified after taking into account the relationship between the mean gene expression and the variance. We do so by running loess regression of the log variance against the log mean. The 'lss' parameter of the 'get_pcs' function controls the span for the 'loess' function used. The variances are standardized by taking the residuals of the observed variance subtracted by the fitted variance.

For the initializatoin, DIEM runs k-means on the principal components. By default, we choose k = 20 cell type clusters. Then, small clusters are merged, defined as those with less than 10 droplets by default. We set this value here to 2.

mb_small <- init(mb_small,
                 k_init = 20,
                 nstart_init = 30,
                 min_size_init = 2)

We see that 4 clusters were merged as they contained less than 2 droplets, resulting in 16 cell types. The total number of clusters used in the mixture model is then 17, including the debris cluster. Now that we have the number of mixtures and their initializations, we can run EM to estimate the parameters and classify the droplets.

3.3 Running EM

We call the run_em function to estimate the parameters of the multinomial mixture model.

mb_small <- run_em(mb_small)
mb_small <- assign_clusters(mb_small)

drop_data <- droplet_data(mb_small, type="test")
table(drop_data[,"Cluster"])
tapply(drop_data[,"n_genes"],
       drop_data[,"Cluster"],
       mean)
tapply(drop_data[,"pct.mt"],
       drop_data[,"Cluster"],
       mean)
tapply(drop_data[,"MALAT1"],
       drop_data[,"Cluster"],
       mean)

We see some refinement in the clustering of droplets using the multinomial mixture model. Particularly, more droplets were assigned to the fixed debris cluster. We also see evidence of contamination in particular clusters that are separate from the debris cluster, which are clusters 8 and 13.

You can also take advantage of parallel processing if your system allows for OpenMP threading. Specify the number of threads with the threads parameter.

3.4 Estimate the level of contamination within droplets

Removing droplets that belong only to the debris cluster would not account for the different levels of contamination within individual droplets and clusters. To overcome this, we use a droplet score based on the genes up-regulated in the debris droplets. These genes are specified after running differential expression with a basic Welch's t-test. We call DE genes using an FDR threshold p < 0.5 for illustrative purposes, since there are very few droplets in this subset. We further specify that any clusters with an average number of genes detected less than 50 are also considered debris droplets (the default value is 200 for full snRNA-seq data sets). This helps in estimating the debris score by specifying adding additional droplets that are likely debris.

We estimate the debris score by ruunning the function estimate_dbr_score and can pull the DE genes with the `de_genes function.

mb_small <- estimate_dbr_score(mb_small,
                               thresh_genes = 50,
                               thresh_p = 0.5)
de_genes <- debris_genes(mb_small, p_adj = 0.5)
head(de_genes)

The debris up-regulated genes are those we would expect (the mitochondria-encoded genes). With larger data sets, we can recover hundreds of genes up-regulated in the debris cluster to help estimate the scores. If you would like to use on the most specific genes in calculating the debris score, you can cap the number by specifying the max_genes parameter.

We can explore the relationship of the debris score with total counts, genes detected, MT%, and MALAT1%. It is helpful to look at the correlation of the debris score with counts/genes to evaluate the accuracy of this measure. There should be a general negative correlation with counts and genes detected.

require(gridExtra)
p1 <- plot_clust(mb_small, feat_x = "n_genes", feat_y = "score.debris", 
                 log_x = TRUE, ret = TRUE)
p2 <- plot_clust(mb_small, feat_x = "total_counts", feat_y = "score.debris",
                 log_x = TRUE, ret = TRUE)
p3 <- plot_clust(mb_small, feat_x = "pct.mt", feat_y = "score.debris", 
                 log_x = FALSE, ret = TRUE)
p4 <- plot_clust(mb_small, feat_x = "MALAT1", feat_y = "score.debris", 
                 log_x = FALSE, ret = TRUE)
grid.arrange(p1, p2, p3, p4, ncol = 2)

In this case, the score is basically the same as the percent of reads that align to the mitochondria. In our experience, full snRNA-seq data sets will show more genes enriched in the debris droplets, and provide more information.

We also look into the clusters formed by DIEM. Specifically, we can look at the average counts, genes, debris score, and some genes up-regulated in each.

sm <- summarize_clusters(mb_small, top_n = 20)
par(mfrow=c(1,2))
plot(sm[,"avg_n_genes"], sm[,"avg_dbr_score"], pch= NA,
     xlab="Average number Genes", ylab="Avergae debris score")
text(sm[,"avg_n_genes"], sm[,"avg_dbr_score"], sm[,"Cluster"])
plot(sm[,"avg_n_counts"], sm[,"avg_dbr_score"], pch= NA,
     xlab="Average total counts", ylab="Avergae debris score")
text(sm[,"avg_n_counts"], sm[,"avg_dbr_score"], sm[,"Cluster"])

The summarize_clusters function returns a data frame with summary statistics on the clusters identified. The genes given are the top up-regulated genes ranked by difference in proportion relative to all droplets in all the other clusters.

head(sm)

To run a differential expression analysis for each cluster, we can run the de_ttest_all function to get statistically significant genes up-regulated in each cluster.

clust <- clusters(mb_small)
counts <- raw_counts(mb_small)
deg <- de_ttest_all(counts = counts, labels = clust)
head(deg)

This returns a data frame with genes in each row that are up-regualted in the indicated cluster.

3.5 Filtering droplets.

We have the choice of either removing droplets that belong to specific clusters, or removing droplets above a certain debris score. We recommend filtering by debris score with snRNA-seq data, as droplets within clusters can show a range of contamination. For single-cell RNA-seq, we recommend filtering out droplets that belong to the debris cluster(s). Use caution when filtering by the debris score, however, since there are no gaurantees that the given debris genes can accurately classify contaminated and non-contaminated dropelts. It is recommended to make sure that the debris score and the debris genes make sense.

The default threshold for the debris score is 0.5. This represents the middle value between the average debris score of the debris droplets, and the average debris score of the least contaminated cluster. We recommend using the above plots to help select the most appropriate score that reflects the desired tolerance for contamination. For example, one could increase the threshold to include as many droplets as possible and then use the score as a covariate in downstream analysis. Here, we stick with the default value of 0.5. We also allow droplets with any number of genes detected from the test set (the default value is 200). The test set of droplets are those that we specified to have at least 10 UMIs.

In either case, we remove contaminated droplets with the call_targets function. The thresh_score parameter specifies that droplets with a score above this are removed.

mb_small <- call_targets(mb_small,
                         thresh_score = 0.5,
                         min_genes = 0)

The characteristics of the kept and removed droplets are as expected

drop_data <- droplet_data(mb_small, type="test")
tapply(drop_data[,"n_genes"],
       drop_data[,"Call"],
       mean)
tapply(drop_data[,"score.debris"],
       drop_data[,"Call"],
       mean)
tapply(drop_data[,"pct.mt"],
       drop_data[,"Call"],
       mean)
tapply(drop_data[,"MALAT1"],
       drop_data[,"Call"],
       mean)
tapply(drop_data[,"Cluster"],
       drop_data[,"Call"],
       table)

We also show how to remove droplets that belong to the debris cluster. We set the clusters parameter to debris. We can also remove additional clusters by giving a numeric vector of the clusters we would like to remove.

mb_small <- call_targets(mb_small,
                         clusters = "debris",
                         thresh_score = NULL,
                         min_genes = 0)
drop_data <- droplet_data(mb_small)
tapply(drop_data[,"n_genes"],
       drop_data[,"Call"],
       mean)
tapply(drop_data[,"score.debris"],
       drop_data[,"Call"],
       mean)
tapply(drop_data[,"pct.mt"],
       drop_data[,"Call"],
       mean)
tapply(drop_data[,"MALAT1"],
       drop_data[,"Call"],
       mean)
tapply(drop_data[,"Cluster"],
       drop_data[,"Call"],
       table)

This shows the differences in removing droplets by cluster instead of by debris score. Filtering by debris score removes the contaminated droplets themselves and thus cleans up the resulting clusters. One can also use a higher threshold to include more droplets and then use the debris score as a covariate with the aim of maximizing the number of droplets while still accounting for extranuclear contamination.

4 Clustering using a Dirichlet-multinomial

While diem uses a multinomial mixture model, it is possible to model the mixtures with a Dirichlet-multinomial. This helps model overdispersion in the data for more accurate clustering. In out data sets, we found the results are highly similar, if not the same. Fitting the parameters of the Dirichlet-multinomial is much slower, so we use a multinomial by default. We still provide the option to model with a Dirichlet-multinomial. One option is to first estimate parameters using the multinomial, and then run the estimation of the Dirichlet-multinomial.

mb_small <- init(mb_small,
                 k_init = 20,
                 nstart_init = 30,
                 min_size_init = 2)

mb_small <- run_em(mb_small)
mb_small <- assign_clusters(mb_small)
drop_data <- droplet_data(mb_small, type="test")
clusters_mult <- drop_data[,"Cluster"]

mb_small <- run_em(mb_small, model = "DM")
mb_small <- assign_clusters(mb_small)
drop_data <- droplet_data(mb_small, type="test")
clusters_dm <- drop_data[,"Cluster"]

table(clusters_mult, clusters_dm)

We see that only a handful of droplets have changed clusters assignments. While this option is slower and provides practically the same results here, we include this as an option for cases where modeling overdispersion in the read counts is desirable.

5. Downstream analysis

We can extract the droplets that pass diem filtering as well as the counts for downstream analysis such as clustering.

counts <- raw_counts(mb_small)
dim(counts)

clean_drops <- get_clean_ids(mb_small)
counts <- counts[,clean_drops]
dim(counts)

drop_data <- droplet_data(mb_small, type="test")
drop_data <- drop_data[clean_drops,]

As diem also clusters the droplets, one can use this for analyses of cell types. They are stored in the data frame return by droplet_data as shown above.

We also provide a convenience function convert_to_seuratfor converting to a Seurat object. Arguments to the CreateSeuratObject function can be passed at the end of the function call, such as min.features here. For example

seur <- convert_to_seurat(mb_small, min.features = 50)

This will return a Seurat object of the clean droplets. This additionally adds the droplet data such as the debris score from diem to the meta data of the Seurat.



marcalva/diem documentation built on Jan. 1, 2023, 2:33 a.m.