knitr::opts_chunk$set(
  collapse = TRUE, fig.width = 7,
  comment = "#>"
)

1 Overview

DSAVE (Down-SAmpling based Variation Estimation) is a package that can be used for investigating the cell-to-cell variation in single-cell RNA-Seq data. Specifically, it supports the following functions:

The package works on cell populations containing similar cells. The cells are expected to be classified into cell types if the dataset has a mix of cell types, and the total cell population need to be divided into populations per cell type to give meaningful results.

2 Total Variation - estimation of cell population size needed for pooled expression profiles

DSAVE Total Variation Estimation - This metric measures the total variation, including sampling noise, for the mean expression over a gene range for a pool of single-cells of a certain size. The metric is useful for determining the pool size needed to obtain the same variation as in for example typical bulk RNA-Seq data. The calculations include generation of sampling noise only (SNO) datasets, which are randomly generated by sampling from a multinomial distribution with probabilities calculated from the mean expression of the cell population.

library(ggplot2)
library(DSAVE)
library(progress)
library(Seurat)
library("plotly")
library("downloader")
set.seed(1)
extrDir <- downloadData("https://cf.10xgenomics.com/samples/cell-exp/1.1.0/b_cells/b_cells_filtered_gene_bc_matrices.tar.gz", "B10k")
dataDir = paste0(extrDir,"/filtered_matrices_mex/hg19")
bcells <- Read10X(data.dir = dataDir)

extrDir <- downloadData("https://cf.10xgenomics.com/samples/cell-exp/2.1.0/t_4k/t_4k_filtered_gene_bc_matrices.tar.gz", "T4k")
dataDir = paste0(extrDir,"/filtered_gene_bc_matrices/GRCh38")
tcells <- Read10X(data.dir = dataDir)

extrDir <- downloadData("https://cf.10xgenomics.com/samples/cell-exp/1.1.0/fresh_68k_pbmc_donor_a/fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz", "PBMC68k")
dataDir = paste0(extrDir,"/filtered_matrices_mex/hg19")
pbmc <- Read10X(data.dir = dataDir)
#only use the 20000 first cells to speed up the demo

pbmc = pbmc[,1:20000]
pbmcCellTypes = ctPbmc68k[1:20000]

#further filter out a fraction of the cells from cell types we will not use to speed up the demo
toRem = (pbmcCellTypes == "CD8+/CD45RA+ Naive Cytotoxic") | (pbmcCellTypes == "CD8+ Cytotoxic T")
toRem[15000:20000] = F
pbmc = pbmc[,!toRem]
pbmcCellTypes = pbmcCellTypes[!toRem]
scTotalVariation_bcells <- DSAVEGetTotalVariationPoolSize(bcells,
        upperBound = 1e5, lowerBound = 5e-1)

scTotalVariation_tcells <- DSAVEGetTotalVariationPoolSize(tcells,
          upperBound = 1e5, lowerBound = 5e-1)
DSAVEPlotTotalVariation(list(scTotalVariation_bcells, scTotalVariation_tcells), list("B cells", "T cells"))

We can see that we need to pool around 1500 T cells to on average get down to bulk variation level, while we need twice as many of the B cells, likely depending on that the sampling noise is higher in the B10k dataset due to fewer reads. It is also to be noted that the number of cells that is needed in a pool highly depends on gene expression. For highly expressed genes, a few hundreds are enough, while for lowly expressed genes several thousands are needed.

3 BTM Variation score

The DSAVE BTM variation score is a metric describing the non-sampling part of the cell-to-cell variation in single-cell RNA-Seq data. It is useful both as a technical quality metric (including the cell type annotation process if applicable) and as a way to compare different clustering methods on a specific dataset.

To calculate the score, DSAVE divides the variation into two components; sampling noise and BTM (Biological, Technical and Misclassification) variation. The sampling noise is a function of the number of reads per cell, while the BTM variation contains interesting information about technical artefacts, biological variation, and cell misclassifications. The sampling noise typically varies between datasets and is often the dominating variation factor, often obscuring the BTM variation if not separated with a method such as DSAVE.

BTM variation can be explained the following way: For each mapped read from a cell in a cell population, there is a certain chance that this will belong to a certain gene. This probability is based on the mean expression of that gene in the entire population. If no BTM variation exists, the reads will be sampled according to these gene probabilities the same way in all cells, and we will get only sampling noise. In this context, BTM variation is the variation between cells in the probabilities used when sampling reads.

The function uses a template, which is used for making the metric comparable across cell populations and even datasets. The template specifies how the algorithm should align the data (downsample etc.) before measuring the variation. We provide 3 different templates for human data as part of this package: The standard template, for populations of 2000 cells or more, and two variants that require only 1000 and 500 cells, respectively.

We have shown that if the BTM variation of a whole cell population is large compared to other datasets, the reason for this is most likely technical; biological variation is often smaller when compared over the whole gene range.

The metric's use as a technical quality metric is shown below, while the use for comparing clustering methods is shown further down in this document:

templInfo <- DSAVEGetStandardTemplate()
tcells_btm <- DSAVECalcBTMScore(tcells, templInfo, skipAlignment=FALSE,
                                iterations = 15, useLogTransform=FALSE,
                                logTPMAddon=1, silent=FALSE)
DSAVEPlotDSAVEScore(tcells_btm$DSAVEScore, "T4k", DSAVE::datasetScoresHuman)

We can see that the T4k dataset has a BTM variation that is comparable to the other datasets and that the technical quality thus is what can be expected for single-cell RNA-Seq.

5 Evaluate clustering and detect misclassified cells

This section shows a typical use case, where the clustering is improved in a second clustering iteration using a newer clustering method.

so = CreateSeuratObject(counts = pbmc, project = "pbmc", min.cells = 0, min.features = 0)
so <- NormalizeData(so, normalization.method = "LogNormalize", scale.factor = 10000)

#Finds the most variable genes
so <- FindVariableFeatures(so, selection.method = "vst", nfeatures = 2000)

#Make mean and variance the same for all genes:
all.genes <- rownames(so)
so <- ScaleData(so, features = all.genes)

#Principal component analysis
so <- RunPCA(so, features = VariableFeatures(object = so))

so[["extCellTypes"]] = pbmcCellTypes

so <- RunUMAP(so, dims = 1:10)
DimPlot(so, reduction = "umap", group.by = "extCellTypes")
dend = pbmc[, pbmcCellTypes == "Dendritic"]

dend_cell_divergence <- DSAVEGetSingleCellDivergence(dend,
                                                     minUMIsPerCell = 200, tpmLowerBound = 0,
                                                     iterations = 15,silent=FALSE)

DSAVEPlotDivergence(dend, dend_cell_divergence)
rm(dend_cell_divergence)

The divergence plot shows each cell's UMI counts plotted against their divergence. The most divergent cells are the ones that have the highest divergence. By investigating the rightmost cells by hoovering over them with the mouse pointer, we can see which 5 genes that contribute the most to the divergence, and thus get an idea about which types of cells are most divergent. For example, some cells have the gene PPBP as highly divergent, suggesting that they may be megakaryocytes, while NKG7 or GNLY show up for other cells, suggesting that they may be NK cells.

We also calculate the BTM variation score, with a template for fewer cells:

btm_old_clust <- DSAVECalcBTMScore(dend, DSAVEGetStandardTemplate500())
DSAVEPlotDSAVEScore(btm_old_clust$DSAVEScore, "OldClust", DSAVE::datasetScoresHuman500)

We now use the newer clustering in Seurat:

#show data extraction from Seurat:
so <- FindNeighbors(so, dims = 1:10)
so <- FindClusters(so, resolution = 0.5)
DimPlot(so, reduction = "umap")

We now investigate cluster 7 from the new clustering, which is a cluster of similar cells. First, extract the data from the Seurat object:

clust7 = extractSeuratData(so, 7)
#clean up, we don't need the Seurat object anymore
rm(so)

Then plot in the same way as above:

#also investigate the clustering from Seurat directly
clust7_divergence <- DSAVEGetSingleCellDivergence(clust7,
                                                     minUMIsPerCell = 200, tpmLowerBound = 0,
                                                     iterations = 15,silent=FALSE)

DSAVEPlotDivergence(clust7, clust7_divergence)
#clean up memory
rm(clust7_divergence)

We see less misclassified cells here in general, although there seem to still be some NK cells present.

We also plot the BTM variation score for the new clustering

btm_new_clust <- DSAVECalcBTMScore(clust7, DSAVEGetStandardTemplate500())
DSAVEPlotDSAVEScore(btm_new_clust$DSAVEScore, "NewClust", DSAVE::datasetScoresHuman500)

btm_old_clust$DSAVEScore
btm_new_clust$DSAVEScore

rm(btm_old_clust)
rm(btm_new_clust)

We also see that the BTM variation score has decreased considerably, suggesting a better clustering performance.It seems that the new clustering is a better choice.

It is possible to further elaborate with clustering parameters with more iterations. When we are satisfied with the clustering performance, it is now possible to manually remove cells. A suggestion would be to remove cells expressing NK cell genes from this particular cluster. Since these cells are potentially doublets (they still ended up in this cluster), we will simply remove them from the dataset here.Two potential markers, as given by the interactive divergence plot, are "GZMB" and "NKG7". GZMB is also expressed in T cells, but such cells would also be misclassified.

First investigate those genes:

clust7Purified = clust7
#investigate those genes
h1 = hist(clust7[rownames(clust7) == "GZMB",], breaks = seq(-0.5, 19.5, by=1))
h1$counts 

h2 = hist(clust7[rownames(clust7) == "NKG7",], breaks = seq(-0.5, 19.5, by=1))
h2$counts 

For GZMB, the number of cells with zero counts is high(496), the cells with one count is low (3), and there are several cells with higher values. This suggests that the gene is highly expressed, but present in few cells, which makes it an excellent marker. For NKG7, the number of cells with one count is much higher(59), suggesting that the gene is expressed and present in more cells (some happen to have zero value although it is expressed in the cell). There is a risk that this gene is lowly expressed in the cells that are supposed to be in cluster 7. We will here therefore play safe and only remove cells that have six or more counts of this gene, although this strategy is open for debate.

clust7Purified = clust7
#investigate those genes
toRemove = clust7[rownames(clust7) == "GZMB",] >= 1 | clust7[rownames(clust7) == "NKG7",] >= 6;
print(paste0("To remove: ", sum(toRemove)))
clust7Purified = clust7Purified[, !toRemove]
dim(clust7)
dim(clust7Purified)

6 Plot the UMI Counts and mitochondrial content vs Cell Divergence to estimate suitable thresholds for cell filtering

bcells_id <- sample(1:dim(bcells)[2], 1000)
subb = bcells[,bcells_id]
bcells_cell_divergence <- DSAVEGetSingleCellDivergence(subb,
                                                       minUMIsPerCell = 200, tpmLowerBound = 0,
                                                       iterations = 15,silent=FALSE)
DSAVEDivUMIPlot(subb, bcells_cell_divergence)

We can see that cells with low number of UMIs are more divergent. Cells below a certain UMI count threshold are usually discarded in a quality control step. We can see that elevating that threshold would likely get rid of more divergent cells, however at the cost of loss of data. Also cells with a high UMI count are on average more divergent; this could potentially be explained by that some of them are doublets, i.e. cases where two or more cells have been accidentally joined and treated as just one cell. Divergence could then come from that the joined cells are of different cell type.

library(textTinyR)

mitoGenes <- grep(pattern = "^MT-", x = row.names(subb), value = TRUE)
fracMT <- sparse_Sums(subb[mitoGenes, ], rowSums = F)/sparse_Sums(subb, rowSums = F)

#for mitochondrial content, it makes sense to calculate the divergence without the MT genes,
#because these in themselves affect the divergence:

subbNoMT = subb[-match(mitoGenes, row.names(subb)),]
bcells_cell_divergence_no_MT <- DSAVEGetSingleCellDivergence(subbNoMT,
                                                       minUMIsPerCell = 200, tpmLowerBound = 0,
                                                       iterations = 15,silent=FALSE)

DSAVEDivParamPlot(bcells_cell_divergence_no_MT, fracMT, "MT Content")

It is evident that cells with high mitochondrial content are more divergent, dispite that the the mitochondrial genes are not part of the calculation. This suggest that a high mitochondrial content is a sign of low quality for cells, which is also commonly used in many processing pipelines.

7 Template generation

Templates are generally usable within the same species, as long as the cell population contains enough cells and has enough counts per cell. To generate a template, a dataset with clustering information (called master) from a species is needed. Only one of the cell populations is used.

Here, we generate the template from the bcells dataset (where all bcells are considered a cell population):

bcellsRed = bcells[rowSums(bcells) > 3,] #remove empty genes to speed things up for this demo
newTemplate = DSAVEGenerateTemplate (bcellsRed, genesToUse = NULL, numCells = 500, numUMIs = 750,
                                  fractionUpperOutliers = 0.025, fractionLowerOutliers = 0.025)
#test it
testScore <- DSAVECalcBTMScore(tcells, newTemplate)
testScore$DSAVEScore

genesToUse is a list of genes that can be specified to limit the genes to include. This may be useful if the master dataset may contain genes in the matrix that are not present in some datasets - such genes are better to avoid. A way to create such a gene list is to use the intersection of the available genes in multiple datasets.

numCells defines how many cells that will be used in each iteration in the calculations. A cell population must contain this many cells to work with a template.

numUMIs sets the average number of UMIs that are needed in the cells for a cell population.

fractionUpperOutliers and fractionLowerOutliers sets the fraction of genes with the highest and lowest variation that should be discarded before the calculation. In the default setting, the 2.5 % genes with highest respectively lowest variation will be discarded.

Template generation involves random sampling; therefore a template should be saved to disk rather than being regenerated for each run!



SysBioChalmers/DSAVE-R documentation built on Oct. 19, 2021, 11:37 p.m.