In satijalab/seurat-wrappers: Community-Provided Methods and Extensions for the Seurat Object

knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE
)

t0 <- Sys.time()

Introduction

In this vignette, we describe how to run BANKSY with Seurat objects. If you use BANKSY in your research, please cite

BANKSY unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis

Vipul Singhal, Nigel Chou, Joseph Lee, Yifei Yue, Jinyue Liu, Wan Kee Chock, Li Lin, Yun-Ching Chang, Erica Mei Ling Teo, Jonathan Aow, Hwee Kuan Lee, Kok Hao Chen & Shyam Prabhakar

Nature Genetics, 2024

doi: 10.1038/s41588-024-01664-3

Website: https://prabhakarlab.github.io/Banksy

BANKSY is a method that incorporates neighborhood information for clustering spatial omics data. By doing so, BANKSY is able to

improve cell-type assignment in noisy data
distinguish subtly different cell-types stratified by microenvironment
identify spatial domains sharing the same microenvironment

The amount of neighborhood information incorporated is controlled by a parameter lambda in [0,1], with higher values giving more weight to the neighbourhood information during clustering.

Overview

The RunBanksy function implemented with the SeuratWrappers package allows users to run BANKSY with Seurat objects. We describe two options of running RunBanksy. The first is within Seurat's spatial framework (see here and here) and requires a Seurat object and a lambda parameter as mandatory input. The second option works with Seurat objects that do not have spatial information stored within, and therefore requires an additional argument giving the locations of the cell centroids or spots.

Caveat: ScaleData should not be run after a call to RunBanksy; RunBanksy populates the scale.data slot with the scaled BANKSY matrix. Calling ScaleData after RunBanksy performs gene-wise z-scaling, negating the effect of lambda.

Prerequisites to install:

library(Banksy)
library(Seurat)
library(SeuratData)
library(SeuratWrappers)

library(ggplot2)
library(gridExtra)
library(pals)

# Kelly palette for visualization
mypal <- kelly()[-1]

Running BANKSY within Seurat's spatial framework

We demonstrate how to run BANKSY within Seurat's spatial analysis framework with a mouse hippocampus Slide-seq v2 dataset from the SeuratData package.

After installing SeuratData, the data can be accessed as follows:

InstallData('ssHippo')
ss.hippo <- LoadData("ssHippo")

ss.hippo <- LoadData("ssHippo")

We perform simple preprocessing by filtering beads with high mito percentage and keeping only beads within the 5th and 98th percentile of total UMI counts. To keep runtime of this vignette short, we downsample the data to 10,000 beads.

# Filtering
ss.hippo[['percent.mt']] <- PercentageFeatureSet(ss.hippo, pattern = '^MT-')
ss.hippo <- subset(ss.hippo, percent.mt < 10 &
                    nCount_Spatial > quantile(ss.hippo$nCount_Spatial, 0.05) &
                    nCount_Spatial < quantile(ss.hippo$nCount_Spatial, 0.98))
# Downsample
set.seed(42)
ss.hippo <- ss.hippo[,sample(colnames(ss.hippo), 1e4)]

Next, normalize the data and find variable genes:

# Normalize
ss.hippo <- NormalizeData(ss.hippo)
ss.hippo <- FindVariableFeatures(ss.hippo)
ss.hippo <- ScaleData(ss.hippo)

To run BANKSY, we specify the following:

lambda: a numeric value in [0,1]. With low values of lambda, BANKSY operates in cell-typing mode, while high values of lambda find spatial domains.
assay and slot: determines where to pull the expression data from
features: specifies features for downstream analysis. This can be 'all', 'variable' or a subset of features.
k_geom: the number of neighbors that defines a cell's neighborhood

Call ?RunBanksy for more details on function parameters.

# Run BANKSY
ss.hippo <- RunBanksy(ss.hippo, lambda = 0.2, verbose=TRUE, 
                      assay = 'Spatial', slot = 'data', features = 'variable',
                      k_geom = 15)
ss.hippo

Note that the RunBanksy function sets the default assay to BANKSY ( determined by the assay_name argument) and fills the scale.data slot. Users should not call ScaleData on the BANKSY assay as this negates the effects of lambda.

The rest of the pipeline is similar to the 'default' Seurat pipline. We scale the data and run dimensionality reduction with PCA and UMAP:

# Run PCA and UMAP
ss.hippo <- RunPCA(ss.hippo, assay = 'BANKSY', features = rownames(ss.hippo), npcs = 30)
ss.hippo <- RunUMAP(ss.hippo, dims = 1:30)

Next, find BANKSY clusters:

# Clustering
ss.hippo <- FindNeighbors(ss.hippo, dims = 1:30)
ss.hippo <- FindClusters(ss.hippo, resolution = 0.5)

Visualize the UMAP and Spatial plot:

# Viz
grid.arrange(
    DimPlot(ss.hippo, pt.size = 0.25, label = TRUE, label.size = 3, repel = TRUE),
    SpatialDimPlot(ss.hippo, stroke = NA, label = TRUE, label.size = 3, 
                   repel = TRUE, alpha = 0.5, pt.size.factor = 2),
    ncol = 2
)

Find markers based on the BANKSY clusters and visualize them. Here, we find differentially expressed genes between the CA1 and CA3 regions.

# Find markers
DefaultAssay(ss.hippo) <- 'Spatial'
markers <- FindMarkers(ss.hippo, ident.1 = 4, ident.2 = 9, only.pos = F, 
                       logfc.threshold = 1, min.pct = 0.5)
markers <- markers[markers$p_val_adj < 0.01,]
markers
genes <- c('ATP2B1', 'CHGB')
SpatialFeaturePlot(ss.hippo, features = genes, pt.size.factor = 3, 
                   stroke = NA, alpha = 0.5, max.cutoff = 'q95')

Running BANKSY with locations provided explicitly

One can also call RunBanksy on a Seurat object created from counts by providing the location of cell centroids or spots explicitly. In this case, the locations must be stored as metadata. Here, we use a mouse hippocampus VeraFISH dataset provided with the Banksy package.

data(hippocampus)
head(hippocampus$expression[,1:5])
head(hippocampus$locations)

Construct the Seurat object by storing the locations of cell centroids as metadata. We keep cells with total count between 5th and 98th percentile:

# Create manually
vf.hippo <- CreateSeuratObject(counts = hippocampus$expression,
                               meta.data = hippocampus$locations)
vf.hippo <- subset(vf.hippo,
                   nCount_RNA > quantile(vf.hippo$nCount_RNA, 0.05) & 
                   nCount_RNA < quantile(vf.hippo$nCount_RNA, 0.98))

Next, we normalize the data by library size and scale the data:

# Normalize
vf.hippo <- NormalizeData(vf.hippo, scale.factor = 100, normalization.method = 'RC')
vf.hippo <- ScaleData(vf.hippo)

Now, run BANKSY. Here, we provide the column names of the x and y spatial coordinates as stored in the metadata to dimx and dimy respectively:

# Run BANKSY
vf.hippo <- RunBanksy(vf.hippo, lambda = 0.2, dimx = 'sdimx', dimy = 'sdimy', 
                      assay = 'RNA', slot = 'data', features = 'all', k_geom = 10)

Run PCA on the BANKSY matrix:

# PCA
vf.hippo <- RunPCA(vf.hippo, assay = 'BANKSY', features = rownames(vf.hippo), npcs = 20)

Find BANKSY clusters:

# Cluster
vf.hippo <- FindNeighbors(vf.hippo, dims = 1:20)
vf.hippo <- FindClusters(vf.hippo, resolution = 0.5)

Visualise BANKSY clusters in spatial dimensions:

# Viz
FeatureScatter(vf.hippo, 'sdimx', 'sdimy', cols = mypal, pt.size = 0.75)
FeatureScatter(vf.hippo, 'sdimx', 'sdimy', cols = mypal, pt.size = 0.1) + facet_wrap(~ colors)

Find markers and visualise them. Here, we do so for a cluster defined by a thin layer of cells expressing Gfap. We also write a simple function genePlot that plots marker genes in spatial dimensions.

# Find markers
DefaultAssay(vf.hippo) <- 'RNA'
markers <- FindMarkers(vf.hippo, ident.1 = 6, only.pos = TRUE)

genePlot <- function(object, dimx, dimy, gene, assay = 'RNA',
                     slot = 'scale.data', q.low = 0.01, q.high = 0.99,
                     col.low='blue', col.high='red') {
    val <- GetAssayData(object, assay=assay, slot=slot)[gene,]
    val.low <- quantile(val, q.low)
    val.high <- quantile(val, q.high)
    val[val < val.low] <- val.low
    val[val > val.high] <- val.high
    pdf <- data.frame(x=object[[dimx]], y=object[[dimy]], gene=val)
    colnames(pdf) <- c('sdimx','sdimy', 'gene')
    ggplot(pdf, aes(x=sdimx,y=sdimy,color=gene)) + geom_point(size = 1) + 
        theme_minimal() + theme(legend.title = element_blank()) +
        scale_color_gradient2(low = col.low, high = col.high) +
        ggtitle(gene)
}

genePlot(vf.hippo, 'sdimx', 'sdimy', 'Gfap')

Multi-sample analysis

This section demonstrate demonstrates multi-sample analysis. Such an approach is appropriate when analysing multiple spatial omics datasets with non-contiguous spatial coordinates, and when large batch effects are not present.

Here, we use a mouse hippocampus VeraFISH dataset provided with the Banksy package.

data(hippocampus)
head(hippocampus$expression[,1:5])
head(hippocampus$locations)

For demonstration purposes, we create three separate datasets by splitting the data.

# Number of groups
n_groups = 3
group_names = paste0('group', seq(n_groups))
group_size = 1000
starts = seq(1, by=group_size, length.out=n_groups)
ends = starts + group_size - 1

# List of Seurat objects
seu_list = lapply(seq(n_groups), function(i) {
  idx = seq(starts[i], ends[i])
  seu = CreateSeuratObject(
    counts = hippocampus$expression[,idx],
    meta.data = data.frame(scale(hippocampus$locations[idx,], scale = FALSE))
  )
  # Set original identity of cell
  seu$orig.ident = group_names[i]
  seu
})
seu_list

Perform normalisation for each dataset.

seu_list = lapply(seu_list, NormalizeData,
                  scale.factor = 100, normalization.method = 'RC')

Merge the datasets. Note that the spatial coordinates overlap.

# Merge
seu = Reduce(merge, seu_list)
seu = JoinLayers(seu) # run this for Seurat v5 objects

# Plot spatial coordinates colored by group
plot(FetchData(seu, c('sdimx', 'sdimy')), col = factor(seu$orig.ident))

Now run BANKSY. For multi-sample analysis, the argument group must be provided, which specifies the name of the metadata column that gives the assignment of each cell or spot to its original Seurat object. Here, we use orig.ident. Internally, providing the group argument tells the function to compute neighborhood matrices based on locations staggered by group, ensuring that cells from different spatial datasets do not overlap. The staggered locations are stored in the metadata for sanity checking. The split.scale argument allows for within-group scaling, accounting for minor differences in datasets.

# Grouping variable
head(seu@meta.data)
table(seu$orig.ident)

# Run BANKSY
seu = RunBanksy(seu, lambda = 0.2, assay = 'RNA', slot = 'data',
                dimx = 'sdimx', dimy = 'sdimy', features = 'all',
                group = 'orig.ident', split.scale = TRUE, k_geom = 15)

# Staggered locations added to metadata
head(seu@meta.data)

The rest of the workflow follows as before:

seu = RunPCA(seu, assay = 'BANKSY', features = rownames(seu), npcs = 30)
seu = RunUMAP(seu, dims = 1:30)
seu = FindNeighbors(seu, dims = 1:30)
seu = FindClusters(seu, resolution = 1)

Visualise clusters:

mypal <- kelly()[-1]
DimPlot(seu, pt.size = 0.25, label = TRUE, label.size = 3, cols = mypal)

FeatureScatter(seu, 'staggered_sdimx', 'staggered_sdimy', pt.size = 0.75, cols = mypal)

Spatial data integration with Harmony

BANKSY can be used with Harmony for integrating multiple spatial omics datasets in the presence of strong batch effects.

Download the data.

library(spatialLIBD)
library(ExperimentHub)
library(harmony)

ehub <- ExperimentHub::ExperimentHub()
spe <- spatialLIBD::fetch_data(type = "spe", eh = ehub)

imgData(spe) <- NULL
assay(spe, "logcounts") <- NULL
reducedDims(spe) <- NULL
rowData(spe) <- NULL
colData(spe) <- DataFrame(
  sample_id = spe$sample_id,
  clust_annotation = factor(
    addNA(spe$layer_guess_reordered_short),
    exclude = NULL, labels = seq(8)
  ),
  in_tissue = spe$in_tissue,
  row.names = colnames(spe)
)
invisible(gc())

# Subset to first sample of each subject
sample_names <- c("151507", "151669", "151673")
spe_list <- lapply(sample_names, function(x) spe[, spe$sample_id == x])
rm(spe)
invisible(gc())

Normalise the data and compute highly variable features.

# Convert to Seurat and Normalize data
seu_list <- lapply(spe_list, function(x) {
  x <- as.Seurat(x, data = NULL)
  NormalizeData(x, scale.factor = 3000, normalization.method = 'RC')
})

# Compute HVGs for each dataset and take the union
hvgs <- lapply(seu_list, function(x) {
  VariableFeatures(FindVariableFeatures(x, nfeatures = 2000))
})
hvgs <- Reduce(union, hvgs)

# Subset to HVGs
seu_list <- lapply(seu_list, function(x) x[hvgs,])
seu <- Reduce(merge, seu_list)

locs <- do.call(rbind.data.frame, lapply(spe_list, spatialCoords))
seu@meta.data <- cbind(seu@meta.data, locs)

seu

Run BANKSY. When analysing multiple samples, the argument group must be provided, which specifies the name of the metadata column that gives the assignment of each cell or spot to its original Seurat object. Here, we use sample_id. Internally, providing the group argument tells the function to compute neighborhood matrices based on locations staggered by group, ensuring that cells from different spatial datasets do not overlap. The staggered locations are stored in the metadata for sanity checking. Within-group scaling has little effect in the presence of strong batch effects, hence, we set split.scale=FALSE for efficiency.

# Grouping variable
head(seu@meta.data)
table(seu$sample_id)

sdimx <- 'pxl_col_in_fullres'
sdimy <- 'pxl_row_in_fullres'

# Run BANKSY
seu <- RunBanksy(seu, lambda = 0.2, assay = 'originalexp', slot = 'data',
                dimx = sdimx, dimy = sdimy, features = 'all',
                group = 'sample_id', split.scale = FALSE, k_geom = 6)

Compute a spatially-aware embedding with PCA on the BANKSY matrix, and run Harmony on this embedding.

seu <- RunPCA(seu, assay = 'BANKSY', features = rownames(seu), npcs = 10)
seu <- RunHarmony(seu, group.by.vars='sample_id')

The rest of the workflow follows as before:

seu <- RunUMAP(seu, dims = 1:10, reduction = 'harmony')
seu <- FindNeighbors(seu, dims = 1:10, reduction = 'harmony')
seu <- FindClusters(seu, resolution = 0.4)

Visualise clusters:

DimPlot(seu, pt.size = 0.25, label = TRUE, label.size = 3, cols = mypal)
FeatureScatter(seu, 'staggered_sdimx', 'staggered_sdimy', cols = mypal, pt.size = 0.75)