In MarioniLab/compareSingleCell: Comparative Analyses of Single-Cell RNA-seq Data with Bioconductor

knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE, cache=TRUE)
library(BiocStyle)

Introduction

We will demonstrate a comparative analysis using data from a study of the early mouse embryo [@pijuansala2019single]. E8.5 embryos were dissociated and used to generate scRNA-seq data from the 10X Genomics protocol [@zheng2017massively]. This was performed using chimeric embryos in which Tal1-knockout (KO) embryonic stem cells were injected into a wild-type (WT) blastocyst. The aim is to study the function of Tal1 (a transcription factor implicated in haematopoietic differentiation) in embryonic development.

The experiment uses a paired design with two batches of samples. Each batch was independently generated by pooling cells from 6-7 chimeric embryos, fluorescence-activated cell sorting by the expression of the tdTomato marker to separate WT and KO cells, and using the sorted cell partitions in the 10X protocol. This resulted in four samples in total, with two (matched) replicates per genotype containing 10,000 cells per sample.

Setting up the data

Obtaining the counts

We obtain the count data through the r Biocpkg("BiocFileCache") framework. This caches the data locally upon the first download, avoiding the need to repeat the download on subsequent analyses.

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
count.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/raw_counts.mtx.gz"))

We load in the count data from the MatrixMarket format, using methods from the r CRANpkg("Matrix") package:

library(Matrix)
counts <- readMM(count.path)
dim(counts)

For the purposes of this demonstration, we will downsample to a quarter of the cells in the original data set. This is simply to stay within time and memory limits on the Bioconductor build system^[Namely, 1 hour with 8 GB RAM.], and can be skipped by users with more computing resources.

# Taking only every 4th cell.
down.keep <- rep(c(FALSE, FALSE, FALSE, TRUE), length.out=ncol(counts))
counts <- counts[,down.keep,drop=FALSE]

Finally, we will convert the matrix into the dgCMatrix format, which is more space and time-efficient than the default dgTMatrix produced by readMM().

counts <- as(counts, "dgCMatrix")

gc()

Constructing a `SingleCellExperiment`

We download the cell- and gene-level metadata using r Biocpkg("BiocFileCache"), and read them into R.

meta.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/meta.tab.gz"))
meta.tab <- read.delim(meta.path, stringsAsFactors=FALSE)
head(meta.tab)

gene.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/genes.tsv.gz"))
gene.tab <- read.delim(gene.path, header=FALSE, stringsAsFactors=FALSE)
colnames(gene.tab) <- c("ENSEMBL", "SYMBOL")
head(gene.tab)

We store the count matrix in a SingleCellExperiment object with the metadata associated with each cell and gene.

library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=counts), 
    colData=meta.tab[down.keep,], rowData=gene.tab)
sce

This object contains cells from all four samples. KO cells should stain positive for the tdTomato marker. Samples 1 and 3 come from the first batch, while samples 2 and 4 come from the second batch.

sce$sample <- factor(sce$sample)
table(sce$sample, sce$tomato)

gc()

We set the row names to the gene symbols, augmented with the Ensembl identifier if the symbol is not unique or missing.

library(scater)
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ENSEMBL, rowData(sce)$SYMBOL)
head(rownames(sce))

Quality control on the cells

Quality control (QC) on this data set has already been performed using a procedure similar to that described r Biocpkg("simpleSingleCell", "tenx.html", "here").

Cell-containing droplets were identified using the emptyDrops() function from the r Biocpkg("DropletUtils") package at a false discovery rate of 1%. See r Biocpkg("simpleSingleCell", "tenx.html#calling-cells-from-empty-droplets", "here") for more details.
Cells with high mitochondrial content were removed using an outlier-based approach as discussed r Biocpkg("simpleSingleCell", "tenx.html#quality-control-on-the-cells", "here"). This removes putative damaged cells that have lost much of their cytoplasmic mRNA.

These steps can be performed separately per sample so existing QC methods for 10X data can be directly applied. At this point, there is no need to explicitly use information across samples. However, users should continue to inspect diagnostic statistics like the percentage of cells discarded in each sample. Large differences between samples may be indicative of (i) strong biological differences at best or (ii) underlying technical problems that could compromise the downstream comparative analysis.

The code used for the processing of the embryo data set is available at https://github.com/MarioniLab/EmbryoTimecourse2018. Note that, in this case, the mitochondrial filtering was performed on the pool of cells from all samples. This is not strictly necessary.

Normalization of cell-specific biases

We will use the computeSumFactors() function from r Biocpkg("scran") to calculate size factors for each cell. This removes scaling biases associated with cell-specific differences in capture efficiency, sequencing depth and composition biases. We refer users to the r Biocpkg("simpleSingleCell", "tenx.html#normalizing-for-cell-specific-biases", "previous workflow") for more details on the method. Note that we set block= to perform pre-clustering within each sample to reduce computational work.

library(scran)

set.seed(100) # For the irlba usage.
clusters <- quickCluster(sce, use.ranks=FALSE, pc.approx=TRUE, block=sce$sample)
length(unique(clusters))

sce <- computeSumFactors(sce, clusters=clusters, min.mean=0.1)
summary(sizeFactors(sce))

We check that the size factors are roughly aligned with the total library sizes (Figure \@ref(fig:normplot)). Deviations from the diagonal correspond to composition biases due to differential expression between cell subpopulations. In this case, the cells below the diagonal probably correspond to erythrocytes that are dominated by hemoglobin upregulation.

plot(Matrix::colSums(counts(sce)), sizeFactors(sce), xlab="Total library size", 
    ylab="Size factor", log="xy")

We use the size factors to compute log-transformed normalized expression values, stored as the "logcounts" assay.

sce <- normalize(sce)
assayNames(sce)

In practice, it is possible to perform this step separately for each sample, and then to combine the resulting SingleCellExperiment objects together. This may be logistically simpler but requires a call to multiBatchNorm() to standardize the size factors to the same scale in the combined object. See r Biocpkg("simpleSingleCell", "batch.html#feature-selection-across-batches", "here") for a demonstration of how multiBatchNorm() might be used.

Comments from Aaron:

Both quickCluster() and computeSumFactors() can be made considerably faster by turning on parallelization via the block.BPPARAM= and BPPARAM= arguments, respectively. This typically comes at the cost of increased memory usage.

Concluding remarks

Once the initial processing is done, we can save the resulting SingleCellExperiment object to file. This will be used when merging the different samples onto the same coordinate system.

saveRDS(sce, "embryo_processed.rds")

All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.