Quickstart Tutorial

In this tutorial, we will construct a Symphony reference from two PBMC datasets from 2 technologies (10x 3'v1 and 3'v2), then map a third dataset from a new technology (10x 5') with Symphony. The analysis follows from Fig. 2 of the paper (but with downsampled datasets to fit within CRAN limits on subdirectory size).

Installation

Install Symphony with standard commands.

install.packages('symphony')

Once Symphony is installed, load it up!

library(symphony)

# Other packages for this tutorial
suppressPackageStartupMessages({
    # Analysis
    library(harmony)
    library(irlba)
    library(data.table)
    library(dplyr)

    # Plotting
    library(ggplot2)
    library(ggthemes)
    library(ggrastr)
    library(RColorBrewer)
})
plotBasic = function(umap_labels,                # metadata, with UMAP labels in UMAP1 and UMAP2 slots
                        title = 'Query',         # Plot title
                        color.by = 'cell_type',  # metadata column name for coloring
                        facet.by = NULL,         # (optional) metadata column name for faceting
                        color.mapping = NULL,    # custom color mapping
                        legend.position = 'right') {  # Show cell type legend

    p = umap_labels %>%
            dplyr::sample_frac(1L) %>% # permute rows randomly
            ggplot(aes(x = UMAP1, y = UMAP2)) + 
            geom_point_rast(aes(col = get(color.by)), size = 1, stroke = 0.4, shape = 16)
        if (!is.null(color.mapping)) { p = p + scale_color_manual(values = color.mapping) }

    # Default formatting
    p = p + theme_bw() +
            labs(title = title, color = color.by) + 
            theme(plot.title = element_text(hjust = 0.5)) +
            theme(legend.position=legend.position) +
            theme(legend.text = element_text(size=8), legend.title=element_text(size=12)) + 
            guides(colour = guide_legend(override.aes = list(size = 4))) + guides(alpha = 'none')

    if(!is.null(facet.by)) {
        p = p + facet_wrap(~get(facet.by)) +
                theme(strip.text.x = element_text(size = 12)) }    
    return(p)
}

# Colors for PBMCs
pbmc_colors = c("B" = "#66C2A5", "DC" = "#FC8D62", "HSC" = "#8DA0CB", "MK" = "#E78AC3", 
              "Mono_CD14" = "#A6D854", "Mono_CD16" = "#f2ec72", "NK" = "#62AAEA", 
              "T_CD4" = "#D1C656", "T_CD8" = "#968763")

Load the data

Get the expression and metadata.

load('../data/pbmcs_exprs_small.rda')
load('../data/pbmcs_meta_small.rda')

dim(pbmcs_exprs_small)
dim(pbmcs_meta_small)

pbmcs_meta_small %>% head(4)

Subset the dataset into reference and query.

idx_query = which(pbmcs_meta_small$donor == "5p") # use 5' dataset as the query
ref_exp_full = pbmcs_exprs_small[, -idx_query]
ref_metadata = pbmcs_meta_small[-idx_query, ]
query_exp = pbmcs_exprs_small[, idx_query]
query_metadata = pbmcs_meta_small[idx_query, ]

Build Symphony reference

There are two options for how to build a Symphony reference. Option 1 (buildReferenceFromHarmonyObj) is the more modular option, meaning that the user has more control over the preprocessing steps prior to reference compression. Option 2 (buildReference) builds a reference starting from expression, automating the procedure more but offering less flexibility.

We'll demonstrate both options below.

Option 1: Build from Harmony object (preferred method)

This option consists of more steps than Option 2 but allows your code to be more modular and flexible if you want to do your own preprocessing steps before the Harmony integration step. We recommend this option for most users.

It is important to generate vargenes_means_sds (containing variable gene means and standard deviations used to scale the genes) as well as save the loadings for the PCA step.

Starting with the reference expression,

ref_exp_full[1:5, 1:2] # Sparse matrix with the normalized genes x cells matrix

Select variable genes and subset reference expression by variable genes (the command below will select the top 1,000 genes per batch, then pool them)

var_genes = vargenes_vst(ref_exp_full, groups = as.character(ref_metadata[['donor']]), topn = 1000)
ref_exp = ref_exp_full[var_genes, ]
dim(ref_exp)

Calculate and save the mean and standard deviations for each gene

vargenes_means_sds = tibble(symbol = var_genes, mean = Matrix::rowMeans(ref_exp))
vargenes_means_sds$stddev = rowSDs(ref_exp, vargenes_means_sds$mean)
head(vargenes_means_sds)

Scale data using calculated gene means and standard deviations

ref_exp_scaled = scaleDataWithStats(ref_exp, vargenes_means_sds$mean, vargenes_means_sds$stddev, 1)

Run PCA (using SVD), save gene loadings (s$u)

set.seed(0)
s = irlba(ref_exp_scaled, nv = 20)
Z_pca_ref = diag(s$d) %*% t(s$v) # [pcs by cells]
loadings = s$u

Run Harmony integration. It is important to set return_object = TRUE.

set.seed(0)
ref_harmObj = harmony::HarmonyMatrix(
        data_mat = t(Z_pca_ref),  ## PCA embedding matrix of cells
        meta_data = ref_metadata, ## dataframe with cell labels
        theta = c(2),             ## cluster diversity enforcement
        vars_use = c('donor'),    ## variable to integrate out
        nclust = 100,             ## number of clusters in Harmony model
        max.iter.harmony = 20,    ## max number of iterations
        return_object = TRUE,     ## return the full Harmony model object
        do_pca = FALSE            ## don't recompute PCs
)

To run the next function buildReferenceFromHarmonyObj(), you need to input the saved gene loadings (loadings) and vargenes_means_sds.

# Compress a Harmony object into a Symphony reference
reference = buildReferenceFromHarmonyObj(
                        ref_harmObj,            # output object from HarmonyMatrix()
                        ref_metadata,           # reference cell metadata
                        vargenes_means_sds,     # gene names, means, and std devs for scaling
                        loadings,               # genes x PCs matrix
                        verbose = TRUE,         # verbose output
                        do_umap = TRUE,         # set to TRUE to run UMAP
                        save_uwot_path = './testing_uwot_model_1') # file path to save uwot model

Save Symphony reference for later mapping (modify with your desired output path)

saveRDS(reference, './testing_reference1.rds')

Let's take a look at what the reference object contains: meta_data: metadata vargenes: variable genes, means, and standard deviations used for scaling loadings: gene loadings for projection into pre-Harmony PC space R: Soft cluster assignments Z_orig: Pre-Harmony PC embedding Z_corr: Harmonized PC embedding centroids: locations of final Harmony soft cluster centroids cache: pre-calculated reference-dependent portions of the mixture model umap: UMAP coordinates save_uwot_path: path to saved uwot model (for query UMAP projection into reference UMAP coordinates)

str(reference)

The harmonized embedding is located in the Z_corr slot of the reference object.

dim(reference$Z_corr)
reference$Z_corr[1:5, 1:5]

Visualize reference UMAP

reference = readRDS('./testing_reference1.rds')
umap_labels = cbind(ref_metadata, reference$umap$embedding)
plotBasic(umap_labels, title = 'Reference', color.mapping = pbmc_colors)

Option 2: Build from scratch (starting with expression)

This option computes a reference object starting from expression in a unified pipeline, automating the preprocessing steps.

# Build reference
set.seed(0)
reference = symphony::buildReference(
            ref_exp_full,
            ref_metadata,
            vars = c('donor'),         # variables to integrate over
            K = 100,                   # number of Harmony clusters
            verbose = TRUE,            # verbose output
            do_umap = TRUE,            # can set to FALSE if want to run umap separately later
            do_normalize = FALSE,      # set to TRUE if input counts are not normalized yet
            vargenes_method = 'vst',   # method for variable gene selection ('vst' or 'mvp')
            vargenes_groups = 'donor', # metadata column specifying groups for variable gene selection 
            topn = 1000,               # number of variable genes to choose per group
            d = 20,                    # number of PCs
            save_uwot_path = './testing_uwot_model_2' # file path to save uwot model
)

# Save reference (modify with your desired output path)
saveRDS(reference, './testing_reference2.rds')

Visualize reference UMAP

reference = readRDS('./testing_reference2.rds')
umap_labels = cbind(ref_metadata, reference$umap$embedding)
plotBasic(umap_labels, title = 'Reference', color.mapping = pbmc_colors)

Map query

In order to map a new query dataset onto the reference, you will need a reference object saved from the steps above, as well as query cell expression and metadata.

The query dataset is assumed to have been normalized in the same manner as the reference cells (here, default is log(CP10k+1) normalization).

# Read in Symphony reference to map to
reference = readRDS('./testing_reference1.rds')
# Map query
query = mapQuery(query_exp,             # query gene expression (genes x cells)
                 query_metadata,        # query metadata (cells x attributes)
                 reference,             # Symphony reference object
                 do_normalize = FALSE,  # perform log(CP10k) normalization on query
                 do_umap = TRUE)        # project query cells into reference UMAP

Note: Symphony assumes that the query is normalized in the same manner as the reference. Our default implementation currently uses log(CP10k+1) normalization.

Let's take a look at what the query object contains: Z: query cells in reference Harmonized embedding Zq_pca: query cells in pre-Harmony reference PC embedding (prior to correction) R: query cell soft cluster assignments Xq: query cell design matrix for correction step umap: query cells projected into reference UMAP coordinates (using uwot) meta_data: metadata

str(query)

Predict query cell types using k-NN. Setting confidence = TRUE also returns the prediction confidence scores (proportion of neighbors with winning vote).

query = knnPredict(query,       # query object
                   reference,   # reference object
                   reference$meta_data$cell_type, # reference cell labels for training
                   k = 5,       # number of reference neighbors to use for prediction
                   confidence = TRUE)

Query cell type predictions are now in the cell_type_pred_knn column.

head(query$meta_data)

Visualization of mapping

# Sync the column names for both data frames
reference$meta_data$cell_type_pred_knn = NA
reference$meta_data$cell_type_pred_knn_prob = NA
reference$meta_data$ref_query = 'reference'
query$meta_data$ref_query = 'query'

# Add the UMAP coordinates to the metadata
meta_data_combined = rbind(query$meta_data, reference$meta_data)
umap_combined = rbind(query$umap, reference$umap$embedding)
umap_combined_labels = cbind(meta_data_combined, umap_combined)

# Plot UMAP visualization of all cells
plotBasic(umap_combined_labels, title = 'Reference and query cells', color.by = 'ref_query')

Plot the reference and query side by side.

plotBasic(umap_combined_labels, title = 'Reference and query cells', 
          color.mapping = pbmc_colors, facet.by = 'ref_query')

And that's a wrap! If you run into issues or have questions about Symphony or this tutorial, please open an issue on GitHub.

sessionInfo()


Try the symphony package in your browser

Any scripts or data that you put into this service are public.

symphony documentation built on Jan. 17, 2023, 1:13 a.m.