knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "figures/", dpi = 36 )
Here, we demonstrate a grid search of clustering parameters with a mouse
hippocampus VeraFISH dataset. BANKSY currently provides four algorithms for
clustering the BANKSY matrix with clusterBanksy: Leiden (default), Louvain,
k-means, and model-based clustering. In this vignette, we run only Leiden
clustering. See ?clusterBanksy
for more details on the parameters for
different clustering methods.
start.time <- Sys.time()
The dataset comprises gene expression for 10,944 cells and 120 genes in 2
spatial dimensions. See ?Banksy::hippocampus
for more details.
# Load libs library(Banksy) library(SummarizedExperiment) library(SpatialExperiment) library(scuttle) library(scater) library(cowplot) library(ggplot2) # Load data data(hippocampus) gcm <- hippocampus$expression locs <- as.matrix(hippocampus$locations)
Here, gcm
is a gene by cell matrix, and locs
is a matrix specifying the
coordinates of the centroid for each cell.
head(gcm[,1:5]) head(locs)
Initialize a SpatialExperiment object and perform basic quality control. We keep cells with total transcript count within the 5th and 98th percentile:
se <- SpatialExperiment(assay = list(counts = gcm), spatialCoords = locs) colData(se) <- cbind(colData(se), spatialCoords(se)) # QC based on total counts qcstats <- perCellQCMetrics(se) thres <- quantile(qcstats$total, c(0.05, 0.98)) keep <- (qcstats$total > thres[1]) & (qcstats$total < thres[2]) se <- se[, keep]
Next, perform normalization of the data.
# Normalization to mean library size se <- computeLibraryFactors(se) aname <- "normcounts" assay(se, aname) <- normalizeCounts(se, log = FALSE)
BANKSY has a few key parameters. We describe these below.
For characterising neighborhoods, BANKSY computes the weighted neighborhood
mean (H_0
) and the azimuthal Gabor filter (H_1
), which estimates gene
expression gradients. Setting compute_agf=TRUE
computes both H_0
and H_1
.
k_geom
specifies the number of neighbors used to compute each H_m
for
m=0,1
. If a single value is specified, the same k_geom
will be used
for each feature matrix. Alternatively, multiple values of k_geom
can be
provided for each feature matrix. Here, we use k_geom[1]=15
and
k_geom[2]=30
for H_0
and H_1
respectively. More neighbors are used to
compute gradients.
We compute the neighborhood feature matrices using normalized expression
(normcounts
in the se
object).
k_geom <- c(15, 30) se <- computeBanksy(se, assay_name = aname, compute_agf = TRUE, k_geom = k_geom)
computeBanksy
populates the assays
slot with H_0
and H_1
in this
instance:
se
The lambda
parameter is a mixing parameter in [0,1]
which
determines how much spatial information is incorporated for downstream analysis.
With smaller values of lambda
, BANKY operates in cell-typing mode, while at
higher levels of lambda
, BANKSY operates in domain-finding mode. As a
starting point, we recommend lambda=0.2
for cell-typing and lambda=0.8
for
zone-finding. Here, we run lambda=0
which corresponds to non-spatial
clustering, and lambda=0.2
for spatially-informed cell-typing. We compute PCs
with and without the AGF (H_1
).
lambda <- c(0, 0.2) se <- runBanksyPCA(se, use_agf = c(FALSE, TRUE), lambda = lambda, seed = 1000)
runBanksyPCA
populates the reducedDims
slot, with each combination of
use_agf
and lambda
provided.
reducedDimNames(se)
Next, we cluster the BANKSY embedding with Leiden graph-based clustering. This
admits two parameters: k_neighbors
and resolution
. k_neighbors
determines
the number of k nearest neighbors used to construct the shared nearest
neighbors graph. Leiden clustering is then performed on the resultant graph
with resolution resolution
. For reproducibiltiy we set a seed for each
parameter combination.
k <- 50 res <- 1 se <- clusterBanksy(se, use_agf = c(FALSE, TRUE), lambda = lambda, k_neighbors = k, resolution = res, seed = 1000)
clusterBanksy
populates colData(se)
with cluster labels:
colnames(colData(se))
To compare clustering runs visually, different runs can be relabeled to
minimise their differences with connectClusters
:
se <- connectClusters(se)
Visualise spatial coordinates with cluster labels.
cnames <- colnames(colData(se)) cnames <- cnames[grep("^clust", cnames)] cplots <- lapply(cnames, function(cnm) { plotColData(se, x = "sdimx", y = "sdimy", point_size = 0.1, colour_by = cnm) + coord_equal() + labs(title = cnm) + theme(legend.title = element_blank()) + guides(colour = guide_legend(override.aes = list(size = 2))) }) plot_grid(plotlist = cplots, ncol = 2)
Compare all cluster outputs with compareClusters
. This function computes
pairwise cluster comparison metrics between the clusters in colData(se)
based
on adjusted Rand index (ARI):
compareClusters(se, func = "ARI")
or normalized mutual information (NMI):
compareClusters(se, func = "NMI")
See ?compareClusters
for the full list of comparison measures.
Vignette runtime:
Sys.time() - start.time
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.