all_times <- list()  # store the time for each chunk
knitr::knit_hooks$set(time_it = local({
    now <- NULL
    function(before, options) {
        if (before) {
            now <<- Sys.time()
        } else {
            res <- difftime(Sys.time(), now, units = "secs")
            all_times[[options$label]] <<- res
        }
    }
}))
knitr::opts_chunk$set(
    tidy = TRUE,
    tidy.opts = list(width.cutoff = 85),
    fig.width = 10,
    message = FALSE,
    warning = FALSE,
    time_it = TRUE,
    error = TRUE
)
options(SeuratData.repo.use = "http://satijalab04.nygenome.org")
options(future.globals.maxSize = 8e9)
For very large datasets, the standard integration workflow can sometimes be prohibitively expensive in terms of compute time and memory. In this workflow, we employ two options that can improve efficiency and runtimes:
The main efficiency improvements are gained in FindIntegrationAnchors(). First, we use reciprocal PCA (RPCA) instead of CCA to identify an effective space in which to find anchors. When determining anchors between any two datasets using reciprocal PCA, we project each dataset into the other's PCA space and constrain the anchors by the same mutual neighborhood requirement. All downstream integration steps remain the same, and we are still able to 'correct' (or harmonize) the datasets.
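For reference, here is a minimal sketch of how the RPCA option is selected in the anchor-based interface mentioned above; the object list and feature set (obj.list, features) are placeholders, not objects created in this vignette:

anchors <- FindIntegrationAnchors(
    object.list = obj.list,      # hypothetical list of per-dataset Seurat objects
    anchor.features = features,  # shared variable features selected for integration
    reduction = "rpca",          # use reciprocal PCA rather than the default CCA
    dims = 1:50
)

In the Seurat v5 layer-based code used later in this vignette, the same choice is made by passing method = RPCAIntegration to IntegrateLayers().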
Additionally, we use reference-based integration. In the standard workflow, we identify anchors between all pairs of datasets. While this gives datasets equal weight in downstream integration, it can also become computationally intensive. For example, when integrating 10 different datasets, we perform 45 different pairwise comparisons. As an alternative, we introduce here the possibility of specifying one or more of the datasets as the 'reference' for integrated analysis, with the remainder designated as 'query' datasets. In this workflow, we do not identify anchors between pairs of query datasets, reducing the number of comparisons. For example, when integrating 10 datasets with one specified as a reference, we perform only 9 comparisons. Reference-based integration can be applied to either log-normalized or SCTransform-normalized datasets.
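As a quick check on the arithmetic, the number of anchor-finding comparisons can be computed directly (this snippet is purely illustrative and not part of the workflow):

n.datasets <- 10
choose(n.datasets, 2)  # all-pairs anchor finding: 45 comparisons
n.reference <- 1
(n.datasets - n.reference) * n.reference  # reference-based with one reference: 9 comparisons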
This alternative workflow consists of the following steps: (1) perform normalization and variable feature selection for each dataset; (2) select features for integration, scale the data, and run PCA; (3) integrate the datasets using reference-based RPCA; and (4) proceed with downstream analysis (e.g. visualization) on the integrated data.
In general, we observe strikingly similar results between the standard workflow and the one demonstrated here, with a substantial reduction in compute time and memory. However, if the datasets are highly divergent (for example, cross-modality or cross-species mapping), where only a small subset of features can be used to facilitate integration, you may observe superior results using CCA.
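If CCA is preferable for a particular comparison, the layer-based call used later in this vignette can be switched accordingly. This is a minimal sketch that assumes bm280k has been prepared exactly as in the code below:

# Sketch only: swap RPCA for CCA in the same reference-based, layer-based workflow
bm280k <- IntegrateLayers(
    object = bm280k,
    method = CCAIntegration,  # canonical correlation analysis instead of reciprocal PCA
    reference = c(1, 2),
    dims = 1:50,
    verbose = FALSE
)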
For this example, we will be using the "Immune Cell Atlas" data from the Human Cell Atlas, available from the Human Cell Atlas data portal.
library(Seurat)
options(Seurat.object.assay.version = "v5")
After acquiring the data, we first perform standard normalization and variable feature selection.
# Load the bone marrow counts and build a Seurat object
bm280k.data <- Read10X_h5("../data/ica_bone_marrow_h5.h5")
bm280k <- CreateSeuratObject(counts = bm280k.data, min.cells = 100, min.features = 500)

# Split the RNA assay into one layer per sample (orig.ident) so each dataset is processed separately
bm280k[["RNA"]] <- split(bm280k[["RNA"]], f = bm280k$orig.ident)

# Preprocessing
bm280k <- NormalizeData(bm280k, verbose = FALSE)
bm280k <- FindVariableFeatures(bm280k, verbose = FALSE)
Next, we select features for downstream integration, scale the data, and run PCA, which is required for the reciprocal PCA workflow.
features <- VariableFeatures(bm280k)
bm280k <- ScaleData(bm280k, features = features, verbose = FALSE)
bm280k <- RunPCA(bm280k, features = features, verbose = FALSE)
Since this dataset contains both men and women, we will choose one male and one female donor (BM1 and BM2) to use in the reference-based workflow. We determined donor sex by examining the expression of the XIST gene.
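As an aside, a minimal sketch of how such a check could be performed on this object (assuming XIST passes the min.cells filter used above); it is not part of the integration workflow itself:

# Hypothetical check: XIST is robustly expressed in female donors and largely absent in males
VlnPlot(bm280k, features = "XIST", group.by = "orig.ident", pt.size = 0)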
bm280k <- IntegrateLayers(
    object = bm280k,
    method = RPCAIntegration,
    reference = c(1, 2),
    new.reduction = "integrated.dr",  # name of the corrected embedding used below
    dims = 1:50,
    verbose = FALSE
)
# Run UMAP on the integrated reduction rather than the uncorrected PCA
bm280k <- RunUMAP(bm280k, reduction = "integrated.dr", dims = 1:50)
DimPlot(bm280k, group.by = "orig.ident")
library(ggplot2)
plot <- DimPlot(bm280k, group.by = "orig.ident") + xlab("UMAP 1") + ylab("UMAP 2") +
    theme(axis.title = element_text(size = 18), legend.text = element_text(size = 18)) +
    guides(colour = guide_legend(override.aes = list(size = 10)))
ggsave(filename = "../output/images/bm280k_integrated.jpg", height = 7, width = 12, plot = plot, quality = 50)
write.csv(x = t(as.data.frame(all_times)), file = "../output/timings/seurat5_integration_large_datasets.csv")
Session Info
sessionInfo()