In sbrn3/disscat: Calculates Distances Between Single Cell RNA Seq Categories

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The aim of this section section is to quantify how different different categorical groups of cells are from each other. We have used a set of hierarchical clustering distance strategies to define the distance between groups in reduced dimensional space.

library(disscat)
library(Seurat)
library(SeuratData)

Data

For this we will be using data from the SeuratData package as it already has the cells sorted into categories.

# Download the data
InstallData("ifnb")

# load the data
data("ifnb")

# View the S4 Seurat object
ifnb

Subset the data to reduce time. This is purely for example data.

# Set seed for reproducability
set.seed(420)

# Sample 6000 cells
sampled.cells <- sample(x = rownames(ifnb@meta.data), size = 3000)

# Subset ifnb
ifnb <- subset(ifnb, cells = sampled.cells)

Quality Control

This follows the recommended standard pre-processing workflow from Seurat. This involved removing cells with too many or too few genes as these were considered signs of doublets and or low quality readings. Additionally cells were also screened for too many or too few molecules for similar reasons.

# Subset out cells with too few or too many features expressed
ifnb <- subset(ifnb, subset = nFeature_RNA > 200 & nFeature_RNA < 2500)

Normalisation

After removing unwanted cells from the dataset, the next step is to normalise the data. By default, we employ a global-scaling normalisation method LogNormalize that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. It is possible to run a better SCTransform integration workflow. This workflow can also be found in the getting started vignette.

# Normalise the data
ifnb <- NormalizeData(ifnb)

# Find the variable features  
ifnb <- FindVariableFeatures(ifnb, selection.method = "vst", nfeatures = 2000)

# Scale data 
ifnb <- ScaleData(ifnb, features = rownames(ifnb))

Distance Analysis

Calculating the distance between categorical groups requires reduced dimension space. Since UMAP is non-linear it is not useful for our distance analysis but it is still helpful for visualising our data.

ifnb <- RunPCA(ifnb, npcs = 30, verbose = FALSE)
ifnb <- RunUMAP(ifnb, reduction = "pca", dims = 1:30, verbose = FALSE)

Additionally it is not possible to compare the distance between two conditions split by a second factor if that second factor is not present in each condition. In our case it would be not possible to calculate the distance between stim and ctrl for each cell type, if each cell type was not represented in each group. Therefore we will need to filter out the factors that don't appear in each condition.

ifnb <- FilterSplits(ifnb, group = "stim", split_by = "seurat_annotations")

Visualisation

Before proceeding forward it may be important to define our problem and visualise our data. In this example dataset there are two groups, cells that were stimulated with interferon beta("STIM") and control cells("CTRL"). So in our analysis we will be trying to quantify the differences between these two groups. Additionally there are a number of different cell types so we can look at which cell types seem to be most effected by interferon beta.

DimPlot(ifnb, group.by = "seurat_annotations", reduction = "umap", label = TRUE)

However as mentioned before UMAP will only be used for visualisation. Distances will need to be calculated on PCA or PLSDA.

DimPlot(ifnb, group.by = "stim", reduction = "pca", label = TRUE)

This really highlights the ability of UMAP to learn a more meaningful organisation of clusters because when coloured by cell types ("seurat_annotations") the PCA plot has them all mixed together.

DimPlot(ifnb, group.by = "seurat_annotations", reduction = "pca")

Distance

There are a number of between group distance methods included in this package:

Additionally because it uses the dist function from stats it is able to use multiple methods for the distance between points. To clarify, the previous methods are methodologies for calculating the distance between two entire groups of cells (e.g. find the mean of each group and find the point to point distance between the two means). The options in the dist function are the method for calculating the point to point distance (e.g. euclidean, manhattan). More information about options can be found here.

Using the GroupDistance method we are able to calculate the distance between any combination of group pairs across different distance measurement strategies. The following code means "For the dataset ifnb, calculate the distances between the factors in 'stim', use the dimensionality reduction of 'pca', pca dimensions 1 to 4, the distance methods of centroid linkage and mahalanobis-like distance, the euclidian distance from point, and run for individually for each factor of 'seurat_annotations'. This means that we are finding the distances between stimulated and control for each cell type.

ifnb <- GroupDistance(ifnb,
              group = "stim",
              reduction = "pca",
              dims = 1:4,
              method = c("centroid", "mahalanobis"),
              distance = "euclidean",
              split_by = "seurat_annotations")

This information can be extracted from your Seurat object as a table or list depending on the type of calculation your previously ran.

GetDistance(ifnb, group = "stim", split_by = "seurat_annotations")

Or viewed as a bar plot. One of the parameters of the barplot is whether to rescale the different methods to make them comparable. If this is turned off it is possible for some methods to systematically generate larger numbers than others. Subsequently when rescale = TRUE it transforms each value into a z score in relation to the mean. In this case we centre the mean on 1 for each method. Therefore it is also possible for negative values if the distance for a particular factor is much smaller than that mean.

PlotDistance(seurat_object = ifnb,
              group = "stim",
              group_start = "CTRL",
              group_destination = "STIM",
              split_by = "seurat_annotations",
              method = "all",
              rescale = TRUE)

Unfortunatley most of these methods do not give us distance distributions, only representative distance values, thus not allowing us to generate error bars. For the average linkage we are actually able to return a distribution and generate error bars for each factor using the AverageDistance function.

# Calculate Average linkage distance
ifnb <- AverageDistance(ifnb,
                        group = "stim",
                        reduction = "pca",
                        dims = 1:4,
                        distance = "euclidean",
                        split_by = "seurat_annotations",
                        parallel = FALSE)

# BarPlot average linkage distance  with error bars
PlotAverageDistance(seurat_object = ifnb,
                    group = "stim",
                    group_start = "CTRL",
                    group_destination = "STIM",
                    split_by = "seurat_annotations",
                    error_bars = "sd",
                    mode = "bar",
                    rotate_x = 20)

Or alternatively as a density plot. In this circumstance this is quite messy because there are so many different cell types.

PlotAverageDistance(seurat_object = ifnb,
                    group = "stim",
                    group_start = "CTRL",
                    group_destination = "STIM",
                    split_by = "seurat_annotations",
                    mode = "density")

Plotting the average distance is effective in showing the distribution of the destination group however this value is also effected by the distribution of current group. Therefore we created 'total distance' which measures the within factor distances as well as the distances to the other factor.

For speed reasons we have created two different methods for total distance. If method = 'all' then it will calculate all the within start group distances for all cells. However if you choose method = 'one' then it will calculate the distance for only one cell. This allows for far fewer iterations and is far less computationally expensive. For this example we use 'one' because it is a good enough approximation of outcomes.

ifnb <- TotalDistance(seurat_object = ifnb,
                       group = "stim",
                       reduction = "pca",
                       dims = 1:4,
                       distance = "euclidian",
                       method = "one")

# Plot it
PlotTotalDistance(seurat_object = ifnb,
                  group = "stim",
                  group_start = "CTRL",
                  group_destination = "STIM")

So this data looks multimodal which suggest that there are mutliple distributions here. So we could try to split it by another factor.

ifnb <- TotalDistance(seurat_object = ifnb,
                       group = "stim",
                       reduction = "pca",
                       dims = 1:4,
                       distance = "euclidian",
                       split_by = "seurat_annotations",
                       method = "one")

# Plot it
cell.type.density <- PlotTotalDistance(seurat_object = ifnb,
                                        group = "stim",
                                        group_start = "CTRL",
                                        group_destination = "STIM",
                                        split_by = "seurat_annotations")

Another type of distance that may be interesting is how spread out an individual cluster is. This can be calculated by measuring the distance of all the cells to the centroid. Here we have implemented both the normal distance to a centroid as well as a mahalanobis distance.

ifnb <- InternalDistance(ifnb,
                         group = "stim",
                         method = "linkage", 
                         split_by = "seurat_annotations")

# Plot it
PlotInternalDistance(ifnb, group = "stim", 
                     split_by = "seurat_annotations", 
                     error_bars = "sd",
                     rotate_x = 30)

It is also possible to get the raw distance information

# Extract embeddings
embedding_data <- Embeddings(ifnb, reduction = "pca")

# Vector of all distances
head(internal_mahalanobis_distance(embedding_data, output = "raw"))

Two sample Test

Finally if you want to run a two sample test to decide if two groups are from the same distribution we have implemented a wrapper for Cramer. A nonparametric multivariate two sample test. This shows that the two groups are in fact different.

cramer.results <- RunCramer(ifnb, 
                            group = "stim", 
                            reduction = "pca", 
                            dims = 1:4)
cramer.results