In stephaniehicks/biocdemo: Large-scale clustering in Bioconductor

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Introduction

In this demo, we will showcase an end-to-end clustering pipeline, starting from the count data matrix stored in HDF5 (similar to what one would download from the HCA data portal) all the way to visualization and interpretation of the clustering results.

While we use a small-ish dataset for this demo for convenience, the code is computationally efficient even for (very) large datasets.

library(SingleCellExperiment)
library(TENxPBMCData)
library(scater)
library(scran)
library(scry)
library(mbkmeans)

Getting the data

sce <- TENxPBMCData("pbmc4k")
sce
counts(sce)
seed(counts(sce))

In this demo we use a small dataset for the sake of running all the code in a short amount of time. However, this workflow is designed for large data and it will run just fine with any sized dataset. For instance, we have analyzed 1.3 million cells on a machine with a moderately sized RAM (e.g., 64GB).

By running the code below, you will run the workflow on the 10X Genomics 1.3 million cells dataset. (Warning: it takes some time!) Alternatively, you can substitute the code below with your own data in SingleCellExperiment format.

library(TENxBrainData)
sce <- TENxBrainData()
sce

Filtering and normalization

Removing low-quality cells

First, we use the scater package to compute a set of QC measures and filter out the low-quality samples.

Here, we exclude those cells that have a too high percentage of mitochondrial genes or for which we detect too few genes.

sce <- addPerCellQC(sce, 
            subsets = list(Mito = grep("^MT-", rowData(sce)$Symbol_TENx)))
high_mito <- isOutlier(sce$subsets_Mito_percent, 
                       nmads = 3, type="higher")
low_detection <- (sce$detected < 1000)
high_counts <- sce$sum > 45000
sce <- sce[,!high_mito & !low_detection & !high_counts]
sce

Removing lowly expressed genes

Next, we remove the lowly expressed genes. Here, we keep only those genes that have at least 1 UMI in at least 5% of the data. These threshold are dataset-specific and may need to be taylored to specific applications.

num_reads <- 1
num_cells <- 0.01*ncol(sce)
keep <- which(DelayedArray::rowSums(counts(sce) >= num_reads ) >= num_cells)
sce <- sce[keep,]
sce

These leaves us with length(keep) genes.

Normalization

Here, we apply mbkmeans (k=10 and batch size of 500) as a preliminary step to scran normalization.

set.seed(18)
mbk <- mbkmeans(sce, whichAssay = "counts", reduceMethod = NA,
                  clusters=10, batch_size = 500)
sce$mbk10 <- paste0("mbk", mbk$Clusters)
table(mbk$Clusters)

We then compute the normalization factors and normalize the data.

sce <- computeSumFactors(sce, cluster=mbk$Clusters, min.mean=0.1)
sce <- logNormCounts(sce)
sce

Dimensionality reduction

PCA on normalized values

Here, we compute the first 50 principal components using the top variable genes.

sce <- scater::runPCA(sce, ncomponents = 50,
                      ntop = 1000,
                      scale = TRUE,
                      BSPARAM = BiocSingular::RandomParam())
plotPCA(sce, colour_by = "mbk10")

GLM-PCA

An alternative to PCA on normalized data is to use the GLM-PCA approach, implemented in the scry Bioconductor package. Here, we use the faster, approximate approach that computes the null residuals and runs PCA on them.

Other approaches implemented in Bioconductor for dimensionality reduction include correspondence analysis (in the corral package) and ZINB-WaVE (in the zinbwave and NewWave packages).

sce <- nullResiduals(sce, assay="counts", type="deviance")
sce <- scater::runPCA(sce, ncomponents = 50,
                      ntop = 1000,
                      exprs_values = "binomial_deviance_residuals",
                      scale = TRUE, name = "GLM-PCA",
                      BSPARAM = BiocSingular::RandomParam())
plotReducedDim(sce, dimred = "GLM-PCA", colour_by = "mbk10")

Clustering

Here, we use the GLM-PCA results to obtain the final cluster labels. We use two alternative approaches: Louvain and mini-batch k-means.

Louvain

g <- buildSNNGraph(sce, k=10, use.dimred = "GLM-PCA")
lou <- igraph::cluster_louvain(g)
sce$louvain <- paste0("Louvain", lou$membership)
table(sce$louvain)

If you want more control on the resolution of the clustering, you can use the Louvain implementation available in the resolution package. Alternatively, the leiden package implements the Leiden algorithm.

Mini-batch k-means

Mini-batch $k$-means is a faster version of $k$-means that uses only a random "mini-batch" of data at each iteration. The algorithm is fast enough to cluster the 1.3 million cell data in the space of the top 50 PC in under 30 seconds.

Here, we run it multiple times to select the value of $k$ with the elbow method.

k_list <- seq(5, 20)
km_res <- lapply(k_list, function(k) {
    mbkmeans(sce, clusters = k, 
             batch_size = 500,
             reduceMethod = "GLM-PCA",
             calc_wcss = TRUE)
})
wcss <- sapply(km_res, function(x) sum(x$WCSS_per_cluster))
plot(k_list, wcss, type = "b")

sce$kmeans <- paste0("mbk", km_res[[which(k_list==13)]]$Clusters)
table(sce$kmeans)
table(sce$kmeans, sce$louvain)

Cluster visualization

We can use UMAP or t-SNE to visualize the clusters.

sce <- scater::runUMAP(sce, dimred = "GLM-PCA", 
                       external_neighbors = TRUE,
                       BNPARAM = BiocNeighbors::AnnoyParam())
plotUMAP(sce, colour_by = "louvain")
plotUMAP(sce, colour_by = "kmeans")

sce <- scater::runTSNE(sce, dimred = "GLM-PCA", 
                       external_neighbors = TRUE,
                       BNPARAM = BiocNeighbors::AnnoyParam())
plotTSNE(sce, colour_by = "louvain")
plotTSNE(sce, colour_by = "kmeans")

Cluster interpretation

To interpret the 13 clusters, we start by computing a list of markers.

sce <- sce[,!(sce$kmeans %in% c("mbk11", "mbk2", "mbk5"))]
colLabels(sce) <- sce$kmeans
markers <- findMarkers(sce, pval.type="all", direction="up")

chosen <- "mbk1"
interesting <- markers[[chosen]]
interesting[1:10,1:3]

top10 <- rownames(interesting[1:10,])
rowData(sce[top10,])

library(pheatmap)
means <- t(apply(logcounts(sce[c(top10,"ENSG00000177455"),]), 1, tapply, sce$kmeans, mean))
rownames(means) <- rowData(sce[c(top10,"ENSG00000177455"),])$Symbol_TENx
pheatmap(means, scale = "row")

The presence of CD40, IL4R suggests that these are activated B-cells.

Similar analyses can be carried out for all clusters.

Session info

sessionInfo()

stephaniehicks/biocdemo documentation built on Dec. 31, 2020, 7:33 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

stephaniehicks/biocdemo
Large-scale clustering in Bioconductor

In stephaniehicks/biocdemo: Large-scale clustering in Bioconductor

Introduction

Getting the data

Filtering and normalization

Removing low-quality cells

Removing lowly expressed genes

Normalization

Dimensionality reduction

PCA on normalized values

GLM-PCA

Clustering

Louvain

Mini-batch k-means

Cluster visualization

Cluster interpretation

Session info

R Package Documentation

Browse R Packages

We want your feedback!

stephaniehicks/biocdemo Large-scale clustering in Bioconductor

In stephaniehicks/biocdemo: Large-scale clustering in Bioconductor

Introduction

Getting the data

Filtering and normalization

Removing low-quality cells

Removing lowly expressed genes

Normalization

Dimensionality reduction

PCA on normalized values

GLM-PCA

Clustering

Louvain

Mini-batch k-means

Cluster visualization

Cluster interpretation

Session info

R Package Documentation

Browse R Packages

We want your feedback!

stephaniehicks/biocdemo
Large-scale clustering in Bioconductor