knitr::opts_chunk$set( message = FALSE, warning = FALSE )
t0 <- Sys.time()
In this vignette, we describe how to run BANKSY with Seurat objects. If you use BANKSY in your research, please cite
BANKSY unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis
Vipul Singhal, Nigel Chou, Joseph Lee, Yifei Yue, Jinyue Liu, Wan Kee Chock, Li Lin, Yun-Ching Chang, Erica Mei Ling Teo, Jonathan Aow, Hwee Kuan Lee, Kok Hao Chen & Shyam Prabhakar
Nature Genetics, 2024
doi: 10.1038/s41588-024-01664-3
Website: https://prabhakarlab.github.io/Banksy
BANKSY is a method that incorporates neighborhood information for clustering spatial omics data. By doing so, BANKSY is able to
The amount of neighborhood information incorporated is controlled by a parameter
lambda
in [0,1], with higher values giving more weight to the neighbourhood
information during clustering.
The RunBanksy
function implemented with the SeuratWrappers package allows
users to run BANKSY with Seurat objects. We describe two options of running
RunBanksy
. The first is within Seurat's spatial framework (see
here and
here) and
requires a Seurat object and a lambda parameter as mandatory input. The second
option works with Seurat objects that do not have spatial information stored
within, and therefore requires an additional argument giving the locations of
the cell centroids or spots.
Caveat: ScaleData
should not be run after a call to RunBanksy
; RunBanksy
populates the scale.data
slot with the scaled BANKSY matrix. Calling ScaleData
after RunBanksy
performs gene-wise z-scaling, negating the effect of lambda
.
Prerequisites to install:
library(Banksy) library(Seurat) library(SeuratData) library(SeuratWrappers) library(ggplot2) library(gridExtra) library(pals) # Kelly palette for visualization mypal <- kelly()[-1]
We demonstrate how to run BANKSY within Seurat's spatial analysis framework with a mouse hippocampus Slide-seq v2 dataset from the SeuratData package.
After installing SeuratData, the data can be accessed as follows:
InstallData('ssHippo') ss.hippo <- LoadData("ssHippo")
ss.hippo <- LoadData("ssHippo")
We perform simple preprocessing by filtering beads with high mito percentage and keeping only beads within the 5th and 98th percentile of total UMI counts. To keep runtime of this vignette short, we downsample the data to 10,000 beads.
# Filtering ss.hippo[['percent.mt']] <- PercentageFeatureSet(ss.hippo, pattern = '^MT-') ss.hippo <- subset(ss.hippo, percent.mt < 10 & nCount_Spatial > quantile(ss.hippo$nCount_Spatial, 0.05) & nCount_Spatial < quantile(ss.hippo$nCount_Spatial, 0.98)) # Downsample set.seed(42) ss.hippo <- ss.hippo[,sample(colnames(ss.hippo), 1e4)]
Next, normalize the data and find variable genes:
# Normalize ss.hippo <- NormalizeData(ss.hippo) ss.hippo <- FindVariableFeatures(ss.hippo) ss.hippo <- ScaleData(ss.hippo)
To run BANKSY, we specify the following:
lambda
: a numeric value in [0,1]. With low values of lambda, BANKSY operates
in cell-typing mode, while high values of lambda find spatial domains.assay
and slot
: determines where to pull the expression data fromfeatures
: specifies features for downstream analysis. This can be 'all'
,
'variable'
or a subset of features. k_geom
: the number of neighbors that defines a cell's neighborhoodCall ?RunBanksy
for more details on function parameters.
# Run BANKSY ss.hippo <- RunBanksy(ss.hippo, lambda = 0.2, verbose=TRUE, assay = 'Spatial', slot = 'data', features = 'variable', k_geom = 15) ss.hippo
Note that the RunBanksy
function sets the default assay to BANKSY
(
determined by the assay_name
argument) and fills the scale.data
slot. Users
should not call ScaleData
on the BANKSY
assay as this negates the effects
of lambda
.
The rest of the pipeline is similar to the 'default' Seurat pipline. We scale the data and run dimensionality reduction with PCA and UMAP:
# Run PCA and UMAP ss.hippo <- RunPCA(ss.hippo, assay = 'BANKSY', features = rownames(ss.hippo), npcs = 30) ss.hippo <- RunUMAP(ss.hippo, dims = 1:30)
Next, find BANKSY clusters:
# Clustering ss.hippo <- FindNeighbors(ss.hippo, dims = 1:30) ss.hippo <- FindClusters(ss.hippo, resolution = 0.5)
Visualize the UMAP and Spatial plot:
# Viz grid.arrange( DimPlot(ss.hippo, pt.size = 0.25, label = TRUE, label.size = 3, repel = TRUE), SpatialDimPlot(ss.hippo, stroke = NA, label = TRUE, label.size = 3, repel = TRUE, alpha = 0.5, pt.size.factor = 2), ncol = 2 )
Find markers based on the BANKSY clusters and visualize them. Here, we find differentially expressed genes between the CA1 and CA3 regions.
# Find markers DefaultAssay(ss.hippo) <- 'Spatial' markers <- FindMarkers(ss.hippo, ident.1 = 4, ident.2 = 9, only.pos = F, logfc.threshold = 1, min.pct = 0.5) markers <- markers[markers$p_val_adj < 0.01,] markers genes <- c('ATP2B1', 'CHGB') SpatialFeaturePlot(ss.hippo, features = genes, pt.size.factor = 3, stroke = NA, alpha = 0.5, max.cutoff = 'q95')
One can also call RunBanksy
on a Seurat object created from counts by
providing the location of cell centroids or spots explicitly. In this case,
the locations must be stored as metadata. Here, we use a mouse hippocampus
VeraFISH dataset provided with the Banksy package.
data(hippocampus) head(hippocampus$expression[,1:5]) head(hippocampus$locations)
Construct the Seurat object by storing the locations of cell centroids as metadata. We keep cells with total count between 5th and 98th percentile:
# Create manually vf.hippo <- CreateSeuratObject(counts = hippocampus$expression, meta.data = hippocampus$locations) vf.hippo <- subset(vf.hippo, nCount_RNA > quantile(vf.hippo$nCount_RNA, 0.05) & nCount_RNA < quantile(vf.hippo$nCount_RNA, 0.98))
Next, we normalize the data by library size and scale the data:
# Normalize vf.hippo <- NormalizeData(vf.hippo, scale.factor = 100, normalization.method = 'RC') vf.hippo <- ScaleData(vf.hippo)
Now, run BANKSY. Here, we provide the column names of the x and y spatial
coordinates as stored in the metadata to dimx
and dimy
respectively:
# Run BANKSY vf.hippo <- RunBanksy(vf.hippo, lambda = 0.2, dimx = 'sdimx', dimy = 'sdimy', assay = 'RNA', slot = 'data', features = 'all', k_geom = 10)
Note that the RunBanksy
function sets the default assay to BANKSY
(
determined by the assay_name
argument) and fills the scale.data
slot. Users
should not call ScaleData
on the BANKSY
assay as this negates the effects
of lambda
.
Run PCA on the BANKSY matrix:
# PCA vf.hippo <- RunPCA(vf.hippo, assay = 'BANKSY', features = rownames(vf.hippo), npcs = 20)
Find BANKSY clusters:
# Cluster vf.hippo <- FindNeighbors(vf.hippo, dims = 1:20) vf.hippo <- FindClusters(vf.hippo, resolution = 0.5)
Visualise BANKSY clusters in spatial dimensions:
# Viz FeatureScatter(vf.hippo, 'sdimx', 'sdimy', cols = mypal, pt.size = 0.75) FeatureScatter(vf.hippo, 'sdimx', 'sdimy', cols = mypal, pt.size = 0.1) + facet_wrap(~ colors)
Find markers and visualise them. Here, we do so for a cluster defined by a thin
layer of cells expressing Gfap. We also write a simple function genePlot
that
plots marker genes in spatial dimensions.
# Find markers DefaultAssay(vf.hippo) <- 'RNA' markers <- FindMarkers(vf.hippo, ident.1 = 6, only.pos = TRUE) genePlot <- function(object, dimx, dimy, gene, assay = 'RNA', slot = 'scale.data', q.low = 0.01, q.high = 0.99, col.low='blue', col.high='red') { val <- GetAssayData(object, assay=assay, slot=slot)[gene,] val.low <- quantile(val, q.low) val.high <- quantile(val, q.high) val[val < val.low] <- val.low val[val > val.high] <- val.high pdf <- data.frame(x=object[[dimx]], y=object[[dimy]], gene=val) colnames(pdf) <- c('sdimx','sdimy', 'gene') ggplot(pdf, aes(x=sdimx,y=sdimy,color=gene)) + geom_point(size = 1) + theme_minimal() + theme(legend.title = element_blank()) + scale_color_gradient2(low = col.low, high = col.high) + ggtitle(gene) } genePlot(vf.hippo, 'sdimx', 'sdimy', 'Gfap')
This section demonstrate demonstrates multi-sample analysis. Such an approach is appropriate when analysing multiple spatial omics datasets with non-contiguous spatial coordinates, and when large batch effects are not present.
Here, we use a mouse hippocampus VeraFISH dataset provided with the Banksy package.
data(hippocampus) head(hippocampus$expression[,1:5]) head(hippocampus$locations)
For demonstration purposes, we create three separate datasets by splitting the data.
# Number of groups n_groups = 3 group_names = paste0('group', seq(n_groups)) group_size = 1000 starts = seq(1, by=group_size, length.out=n_groups) ends = starts + group_size - 1 # List of Seurat objects seu_list = lapply(seq(n_groups), function(i) { idx = seq(starts[i], ends[i]) seu = CreateSeuratObject( counts = hippocampus$expression[,idx], meta.data = data.frame(scale(hippocampus$locations[idx,], scale = FALSE)) ) # Set original identity of cell seu$orig.ident = group_names[i] seu }) seu_list
Perform normalisation for each dataset.
seu_list = lapply(seu_list, NormalizeData, scale.factor = 100, normalization.method = 'RC')
Merge the datasets. Note that the spatial coordinates overlap.
# Merge seu = Reduce(merge, seu_list) seu = JoinLayers(seu) # run this for Seurat v5 objects # Plot spatial coordinates colored by group plot(FetchData(seu, c('sdimx', 'sdimy')), col = factor(seu$orig.ident))
Now run BANKSY. For multi-sample analysis, the argument group
must be
provided, which specifies the name of the metadata column that gives the
assignment of each cell or spot to its original Seurat object. Here, we use
orig.ident
. Internally, providing the group
argument tells the function to
compute neighborhood matrices based on locations staggered by group
,
ensuring that cells from different spatial datasets do not overlap. The
staggered locations are stored in the metadata for sanity checking. The
split.scale
argument allows for within-group scaling, accounting for minor
differences in datasets.
# Grouping variable head(seu@meta.data) table(seu$orig.ident) # Run BANKSY seu = RunBanksy(seu, lambda = 0.2, assay = 'RNA', slot = 'data', dimx = 'sdimx', dimy = 'sdimy', features = 'all', group = 'orig.ident', split.scale = TRUE, k_geom = 15) # Staggered locations added to metadata head(seu@meta.data)
The rest of the workflow follows as before:
seu = RunPCA(seu, assay = 'BANKSY', features = rownames(seu), npcs = 30) seu = RunUMAP(seu, dims = 1:30) seu = FindNeighbors(seu, dims = 1:30) seu = FindClusters(seu, resolution = 1)
Visualise clusters:
mypal <- kelly()[-1] DimPlot(seu, pt.size = 0.25, label = TRUE, label.size = 3, cols = mypal)
FeatureScatter(seu, 'staggered_sdimx', 'staggered_sdimy', pt.size = 0.75, cols = mypal)
BANKSY can be used with Harmony for integrating multiple spatial omics datasets in the presence of strong batch effects.
Download the data.
library(spatialLIBD) library(ExperimentHub) library(harmony) ehub <- ExperimentHub::ExperimentHub() spe <- spatialLIBD::fetch_data(type = "spe", eh = ehub) imgData(spe) <- NULL assay(spe, "logcounts") <- NULL reducedDims(spe) <- NULL rowData(spe) <- NULL colData(spe) <- DataFrame( sample_id = spe$sample_id, clust_annotation = factor( addNA(spe$layer_guess_reordered_short), exclude = NULL, labels = seq(8) ), in_tissue = spe$in_tissue, row.names = colnames(spe) ) invisible(gc()) # Subset to first sample of each subject sample_names <- c("151507", "151669", "151673") spe_list <- lapply(sample_names, function(x) spe[, spe$sample_id == x]) rm(spe) invisible(gc())
Normalise the data and compute highly variable features.
# Convert to Seurat and Normalize data seu_list <- lapply(spe_list, function(x) { x <- as.Seurat(x, data = NULL) NormalizeData(x, scale.factor = 3000, normalization.method = 'RC') }) # Compute HVGs for each dataset and take the union hvgs <- lapply(seu_list, function(x) { VariableFeatures(FindVariableFeatures(x, nfeatures = 2000)) }) hvgs <- Reduce(union, hvgs) # Subset to HVGs seu_list <- lapply(seu_list, function(x) x[hvgs,]) seu <- Reduce(merge, seu_list) locs <- do.call(rbind.data.frame, lapply(spe_list, spatialCoords)) seu@meta.data <- cbind(seu@meta.data, locs) seu
Run BANKSY. When analysing multiple samples, the argument group
must be
provided, which specifies the name of the metadata column that gives the
assignment of each cell or spot to its original Seurat object. Here, we use
sample_id
. Internally, providing the group
argument tells the function to
compute neighborhood matrices based on locations staggered by group
,
ensuring that cells from different spatial datasets do not overlap. The
staggered locations are stored in the metadata for sanity checking.
Within-group scaling has little effect in the presence of strong batch effects,
hence, we set split.scale=FALSE
for efficiency.
# Grouping variable head(seu@meta.data) table(seu$sample_id) sdimx <- 'pxl_col_in_fullres' sdimy <- 'pxl_row_in_fullres' # Run BANKSY seu <- RunBanksy(seu, lambda = 0.2, assay = 'originalexp', slot = 'data', dimx = sdimx, dimy = sdimy, features = 'all', group = 'sample_id', split.scale = FALSE, k_geom = 6)
Compute a spatially-aware embedding with PCA on the BANKSY matrix, and run Harmony on this embedding.
seu <- RunPCA(seu, assay = 'BANKSY', features = rownames(seu), npcs = 10) seu <- RunHarmony(seu, group.by.vars='sample_id')
The rest of the workflow follows as before:
seu <- RunUMAP(seu, dims = 1:10, reduction = 'harmony') seu <- FindNeighbors(seu, dims = 1:10, reduction = 'harmony') seu <- FindClusters(seu, resolution = 0.4)
Visualise clusters:
DimPlot(seu, pt.size = 0.25, label = TRUE, label.size = 3, cols = mypal) FeatureScatter(seu, 'staggered_sdimx', 'staggered_sdimy', cols = mypal, pt.size = 0.75)
For more information, visit https://github.com/prabhakarlab/Banksy.
Vignette runtime
Sys.time() - t0
Session info
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.