knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE, cache=TRUE)
library(BiocStyle)
compareSingleCell:::.compile("embryo_preprocess")

Introduction

Once a processed SingleCellExperiment is available, the next step is to merge different samples to place them on the same coordinate system. This allows us to perform downstream procedures like dimensionality reduction and clustering without needing to model sample-to-sample variation in expression. We will merge using the mutual nearest neighbors (MNN) method [@haghverdi2018batch] as described r Biocpkg("simpleSingleCell", "batch.html", "elsewhere"), using the SingleCellExperiment object that we constructed in the r compareSingleCell:::.link("embryo_merge", NULL, "previous workflow").

library(SingleCellExperiment)
sce <- readRDS("embryo_processed.rds")

Modelling the technical variability

We model the technical noise in each sample using the multiBlockVar() function as described r Biocpkg("simpleSingleCell", "var.html#fitting-batch-specific-trends", "here"). As this data set does not contain spike-ins, we set make.tech.trend=TRUE to estimate the technical component based on Poisson noise -
see r Biocpkg("simpleSingleCell", "tenx.html#modelling-the-mean-variance-trend", "here") for details.

library(scran)
dec.out <- multiBlockVar(sce, block=sce$sample, 
    make.tech.trend=TRUE, weighted=FALSE)
dec.out[,-7] # don't show per-block results.

We also turn off weighting to ensure that each sample contributes equally to the results, regardless of the number of cells. This is desirable here because different conditions may have different sets of highly variable genes (HVGs). If one condition contains more cells, weighting by the number of cells would bias the combined statistics in favour of that condition's HVGs. This differs from more typical applications of multiBlockVar() where all samples are replicates. In such cases, the underlying set of HVGs should be similar and thus weighting could be used to improve precision without biasing the results.

The trend lies close to the lower bound of the per-gene variances for each sample (Figure \@ref(fig:trendplots)). This indicates that the assumption of Poisson noise is reasonable.

par(mfrow=c(2,2))
per.block.stats <- dec.out$per.block
for (i in colnames(per.block.stats)) {
    cur.stats <- per.block.stats[,i]
    plot(cur.stats$mean, cur.stats$total, xlab="Mean log-expression", 
        ylab="Variance", main=sprintf("Sample %s", i), pch=16)
    curve(metadata(cur.stats)$trend(x), add=TRUE, col="dodgerblue", lwd=2)
}

Comments from Aaron:

Identifying genes of interest

We define the features of interest as those with net biological components greater than zero. This enriches for genes that contain some biological variation, reducing the effect of uninteresting Poisson noise on downstream analyses.

to.use <- dec.out$bio > 0
summary(to.use)

The injected KO cells were derived from a single male cell line while the host embryos were a mix of male or female mice. Thus, there is a (largely uninteresting) sex effect between the WT and KO samples. To mitigate this, we remove Xist and genes on the Y chromosome.

library(TxDb.Mmusculus.UCSC.mm10.ensGene)
loc <- mapIds(TxDb.Mmusculus.UCSC.mm10.ensGene, key=rowData(sce)$ENSEMBL, 
    keytype="GENEID", column="CDSCHROM")
is.y <- loc=="chrY" & !is.na(loc)
to.use <- to.use & !is.y 
to.use[which(rowData(sce)$SYMBOL=="Xist")] <- FALSE
sum(to.use)

We also remove the tdTomato marker used to sort for KO cells.

to.use[rowData(sce)$SYMBOL=="tomato-td"] <- FALSE
sum(to.use)

Dimensionality reduction with PCA

We use principal components analysis (PCA) to perform dimensionality reduction prior to merging. This reduces computational work and removes some high-dimensional noise, as r Biocpkg("simpleSingleCell", "reads.html#denoising-expression-values-using-pca", "previously discussed"). We perform PCA across all samples by using multiBatchPCA() on our selected subset of genes in to.use. This ensures each sample contributes equally to the definition of the coordinate space, as described r Biocpkg("simpleSingleCell", "batch.html#hierarchical-merging", "here").

library(batchelor)
library(BiocSingular)

set.seed(101) # for irlba.
pc.out <- batchelor::multiBatchPCA(sce, batch=sce$sample, 
    subset.row=to.use, get.variance=TRUE,
    BSPARAM=IrlbaParam(deferred=TRUE))
pc.out # one output matrix per level of 'sample'.

By default, multiBatchPCA() will return 50 PCs for all samples. We use denoisePCANumber() to choose the number of PCs to retain based on our previous estimates of technical noise in this data set. This discards later PCs until the variance lost is equal to the total technical component.

to.retain <- denoisePCANumber(
    metadata(pc.out)$var.explained, # variance explained per PC.
    sum(dec.out$tech[to.use]), # technical noise in subset of genes.
    metadata(pc.out)$var.total # total variance in the data
)
to.retain

We then subset the matrices to only retain the first to.retain PCs in each sample.

for (i in seq_along(pc.out)) {
    pc.out[[i]] <- pc.out[[i]][,seq_len(to.retain),drop=FALSE]
}

Comments from Aaron:

Batch correction with MNN

We use the fastMNN() function in a hierarchical manner as described r Biocpkg("simpleSingleCell", "batch.html#hierarchical-merging", "elsewhere"). This involves merging the most similar samples before merging those that are more different, to weaken the assumption of shared populations required for correct MNN detection. In this case, we first merge samples from the same genotype to remove the batch effect. Note the use of pc.input=TRUE to specify that the input values are already in low-dimensional PC space.

ko.out <- batchelor::fastMNN(pc.out[["1"]], pc.out[["2"]], pc.input=TRUE)
metadata(ko.out)$merge.info$lost.var
wt.out <- batchelor::fastMNN(pc.out[["3"]], pc.out[["4"]], pc.input=TRUE)
metadata(wt.out)$merge.info$lost.var

The lost.var represents the proportion of variance in each batch that is removed by the batch correction^[Specifically, the orthogonalization step, as described r Biocpkg("simpleSingleCell", "batch.html#with-diagnostics", "here").]. If the assumptions underlying the MNN approach hold, this should be low and represent removal of noise along the batch vector. A large proportion of lost variance (>10%) may be cause for concern as it suggests that biological structure within each batch is being discarded.

Our next step is to merge samples across genotypes to remove technical and uninteresting biological differences. This includes sex and changes in expression induced by injection or cell sorting.

overall <- batchelor::fastMNN(ko.out$corrected, wt.out$corrected, pc.input=TRUE)
metadata(overall)$merge.info$lost.var

We store the result in the reducedDims slot of our SingleCellExperiment object, for use in downstream functions.

# Cell order is the same between sce and corrected: see comments.
reducedDim(sce, "corrected") <- overall$corrected

Note that the merge to create overall will also eliminate changes in expression caused by loss of Tal1. This is necessary for the purposes of mapping all cells onto a common coordinate system. Without such correction, cells of the same type would be separated by the genotype difference across samples, precluding common annotation in downstream analyses. Nonetheless, differential expression upon KO is of substantial biological interest and will be recovered in our downstream differential testing.

Comments from Aaron:

Visual inspection of merge results

We use $t$-stochastic neighbor embedding (t-SNE) plots [@van2008visualizing] to perform further dimensionality reduction for visualization. In the uncorrected data, cells clearly separate by genotype with no visible batch effects between replicates (Figure \@ref(fig:beforeplot)).

library(scater)
old <- sce
reducedDim(sce, "PCA") <- do.call(rbind, pc.out)
plotTSNE(sce, rerun=TRUE, run_args=list(use_dimred="PCA"), colour_by="sample")

After correction, cells from all samples are merged together in the majority of subpopulations (Figure \@ref(fig:afterplot)). This is consistent with the removal of the inter-genotype differences and simplifies annotation and interpretation in downstream analyses.

sce <- runTSNE(sce, use_dimred="corrected", colour_by="sample")
plotTSNE(sce, colour_by="sample")

Defining common annotation

Clustering cells

We use the shared nearest-neighbour approach [@xu2015identification] to cluster cells in the corrected space, as described r Biocpkg("simpleSingleCell", "umis.html#clustering-cells-into-putative-subpopulations", "here") and r Biocpkg("simpleSingleCell", "batch.html#using-the-corrected-values-in-downstream-analyses", "here").

snn.gr <- buildSNNGraph(sce, use.dimred="corrected")
clusters <- igraph::cluster_walktrap(snn.gr)
table(clusters$membership, sce$sample)

We visually examine the clusters on a t-SNE plot to confirm that a sensible partitioning was generated (Figure \@ref(fig:clusterplot)).

sce$cluster <- factor(clusters$membership)
plotTSNE(sce, colour_by="cluster", text_by="cluster", text_colour="red")

Comments from Aaron:

Annotating cell types

We use findMarkers() to identify the genes that define each cluster. This is done by testing for differential expression between each pair of clusters and consolidating the results into a single table per cluster - see r Biocpkg("simpleSingleCell", "reads.html#detecting-marker-genes-between-clusters", "here") for details. We also block on the sample of origin to avoid confounding effects from sample-to-sample variability or differential expression between genotypes.

markers <- findMarkers(sce, sce$cluster, block=sce$sample)
blood <- "5"
hemo <- markers[[blood]]["Hbb-bh1",-(1:3)]
stopifnot(all(unlist(hemo)>0))

Of particular interest is cluster r blood. This upregulates a range of hemoglobin genes (Figure \@ref(fig:bloodheat)) and probably represents cells in the erythroid lineage.

blood.set <- markers[["5"]]
as.data.frame(blood.set[1:20,1:3])

logFCs <- as.matrix(blood.set[1:50,-(1:3)])
colnames(logFCs) <- sub("logFC.", "", colnames(logFCs))

library(pheatmap)
max.lfc <- max(abs(range(logFCs)))
pheatmap(logFCs, breaks=seq(-5, 5, length.out=101))

One could repeat this procedure with all of the other clusters. However, it is more efficient to use our differential analyses to prioritize clusters of interest, so we will delay further annotation until that point.

Detecting doublets

One potentially problematic feature of this data set is its high doublet frequency. Thus, we want a measure of how "doublet-like" each cell is, in order to assist downstream interpretation of our results. This is achieved using the doubletCells() function on each sample, as described r Biocpkg("simpleSingleCell", "doublets.html#doublet-detection-by-simulation", "here").

set.seed(1000)
doublets <- numeric(ncol(sce))
for (i in levels(sce$sample)) {
    keep <- sce$sample==i
    cur.sce <- sce[,keep]
    scores <- doubletCells(cur.sce, BSPARAM=BiocSingular::IrlbaParam(deferred=TRUE))
    doublets[keep] <- scores
}
summary(doublets)

Some clusters are entirely composed of putative doublets, while others simply contain a small proportion of doublets (Figure \@ref(fig:doubletplot)). The former are not of any interest and can be dismissed immediately. The latter are salvageable but require some care in interpretation during downstream analyses.

sce$doublet <- doublets
plotColData(sce, "doublet", "cluster", colour_by="sample")

We perform a complementary analysis based on the clusters directly, as described r Biocpkg("simpleSingleCell", "doublets.html#doublet-detection-with-clusters", "here"). The top-ranking doublet-like clusters are defined by the absence of any uniquely expressed genes distinguishing them from a putative pair of source clusters. These are consistent with the most obvious offenders in Figure \@ref(fig:doubletplot).

by.clust <- doubletCluster(sce, sce$cluster, block=sce$sample)
by.clust[,1:8]

We remove the obvious doublet clusters prior to any downstream analysis. We give the benefit of the doubt to clusters containing but not dominated by doublets, provided that their results are treated with caution.

offenders <- c("4", "8")
retain <- !sce$cluster %in% offenders
sce <- sce[,retain]
summary(retain)
# Sanity check for the identity of the offending clusters.
stopifnot(identical(rownames(by.clust)[by.clust$N==0], offenders))

by.clust <- split(doublets, clusters$membership)
per.clust <- vapply(by.clust, median, FUN.VALUE=0)
stopifnot(all(outer(per.clust[offenders], per.clust[setdiff(names(per.clust), offenders)], ">")))

Concluding remarks

We save the SingleCellExperiment object with the merged coordinates to file for downstream use in differential testing. We also save the test results for later annotation.

saveRDS(sce, "embryo_merged.rds")
saveRDS(markers, "embryo_markers.rds")

All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.

sessionInfo()

References



MarioniLab/compareSingleCell documentation built on May 25, 2019, 4 a.m.