panoplyCF: Cell-based Analysis Using t-SNE on Cytometric Fingerprints

knitr::opts_chunk$set(echo = TRUE)

Background

The prevalence of high-dimensional flow (or mass) cytometry data is increasing at a rapid pace. The difficulties and limitations of manual analysis of such data are obvious to anyone who has attempted it. The alternative is to find or create automated analysis approaches that facilitate this process.

There has been much recent interest in using manifold-learning algorithms to reduce the dimensionality of cell-based data to manageable levels. t-SNE and UMAP are two examples. A difficulty common to both approaches is that their computational cost grows as the total number of events and the number of measured variables per event increase. Another limitation, specific to t-SNE, is that a map learned on one data set cannot be applied to another data set without re-computing the embedding.

PanoplyCF

I developed panoplyCF to address these limitations. PanoplyCF first computes high-resolution Cytometric Fingerprinting (CF) bins using flowFP. It then computes the centroid of each bin in high-dimensional space by taking, for each independent variable, the median over all events contained in that bin. It then computes the t-SNE embedding of the bin centroids. This has two advantages over conventional t-SNE:

It's fast. It only has to deal with thousands of bins rather than millions of events.
It's reusable. New data can be inserted into an existing embedding via the CF bins.
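
The binning and centroid steps described above can be sketched in base R. This is a toy illustration under stated assumptions, not the flowFP or panoplyCF implementation: the data are synthetic, and assign_bins() is a hypothetical helper that imitates fingerprint binning by recursively splitting each bin at the median of its highest-variance marker. The optional last step calls Rtsne::Rtsne() only if that package happens to be installed; it is a stand-in for whatever t-SNE implementation panoplyCF uses internally.

```r
set.seed(1)

# Synthetic stand-in for cytometry data: 2000 events x 4 markers
events <- matrix(rnorm(2000 * 4), ncol = 4,
                 dimnames = list(NULL, paste0("marker", 1:4)))

# Hypothetical helper (not part of flowFP or panoplyCF): recursively split
# every bin at the median of its highest-variance marker
assign_bins <- function(x, levels) {
  bin <- rep(1L, nrow(x))
  for (lev in seq_len(levels)) {
    new_bin <- bin
    for (b in unique(bin)) {
      idx <- which(bin == b)
      v <- which.max(apply(x[idx, , drop = FALSE], 2, var))
      upper <- x[idx, v] > median(x[idx, v])
      # children of bin b are labelled 2b - 1 (lower half) and 2b (upper half)
      new_bin[idx] <- 2L * bin[idx] - 1L + as.integer(upper)
    }
    bin <- new_bin
  }
  bin
}

bin <- assign_bins(events, levels = 6)   # 2^6 = 64 bins

# Bin centroids: per-bin median of every marker (one row per bin)
centroids <- apply(events, 2, function(col) tapply(col, bin, median))

# Embed the 64 centroids rather than the 2000 events (optional dependency)
if (requireNamespace("Rtsne", quietly = TRUE)) {
  map <- Rtsne::Rtsne(centroids, dims = 2, perplexity = 10)$Y
}
```

Because only the centroids are embedded, the cost of the t-SNE step depends on the number of bins and not on the number of events.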

The final step is to perform conventional hierarchical agglomerative clustering of the bins, using the 2-dimensional t-SNE map coordinates. The goal is to create clusters that are homogeneous in high-dimensional space as determined by the tightness of the distribution of the independent variables within each cluster.
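
As a rough sketch of this clustering step, the base R functions hclust() and cutree() can cluster points in a 2-dimensional map. The coordinates below are random stand-ins for the t-SNE coordinates of the bins, and the Ward linkage is an assumption made for illustration; the linkage panoplyCF actually uses is not stated here.

```r
set.seed(2)

# Random stand-in for the 2-D t-SNE coordinates of 200 bins
map <- cbind(x = rnorm(200), y = rnorm(200))

# Agglomerative clustering on Euclidean distances in the map
# (Ward linkage is an illustrative choice, not necessarily panoplyCF's)
tree <- hclust(dist(map), method = "ward.D2")

# Cut the tree into a fixed number of clusters, analogous to nclust = 30
clst <- cutree(tree, k = 30)
table(clst)
```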

How well this works depends on the nature of the data that the algorithm is fed. Manifold-learning algorithms work best when there is a manifold in the data to be learned. As discussed in (https://www.nature.com/articles/s41467-019-13055-y), some data are more "manifold-like", whereas some are more "cluster-like". If your data fall in the former category, panoplyCF may work well. If in the latter, you may have more luck with clustering approaches such as flowClust or FlowSOM, or the companion to this package, fluster (https://github.com/rogerswt/fluster), which uses the same fingerprint binning approach but clusters in high-dimensional space instead of in a t-SNE embedding.

Getting started

There are currently only three methods in this package (more later!).

panoply()
decorate_sample_panoply()
panoply_map_sample()

The panoply() method

The first one performs the above-described panoplyCF calculations and returns an object of class "panoply".

# this is added so that flowCore::identifier() is available in the vignette environment
library(flowCore)

library(panoplyCF)
# load the example data
filename = system.file("extdata", "sampled_flowset_young.rda", package = "panoplyCF")
load(filename)   # fs_young
is(fs_young)

# choose the parameters to be included in the analysis.  In this case,
# exclude the scattering and LIVEDEAD markers as they were previously used to 
# gate the data.
pan_params = flowCore::colnames(fs_young)[c(7:9, 11:22)]
pan_params

# do the panoplyCF computation (takes a couple of minutes)
pan = panoply(fcs = fs_young, parameters = pan_params, nclust = 30)

# check out the resulting object
is(pan)
names(pan)

The decorate_sample_panoply() method

The second one renders a picture of the result.

decorate_sample_panoply(fcs = fs_young, panoply_obj = pan, colorscale = FALSE)

suppressMessages(decorate_sample_panoply(fcs = fs_young, panoply_obj = pan, colorscale = TRUE))

Looking at differential expression of clusters - panoply_map_sample()

Now that we have a panoply model, we can map other samples to it to look for differential expression. We've created the panoply model on the flowSet consisting of 10 sub-sampled instances from young people (a better idea might be to aggregate young and old samples and create the panoply model from that aggregate). We'll load up 10 sub-sampled instances from older people, and look for clusters that seem to differ between the two groups.

NOTE: in the interest of keeping example data small-ish, the data here are (a) subsampled, and (b) only 10 instances in each group, so this section isn't likely to find much of anything interesting but is included to illustrate the idea. We encourage you to download the full data set from FlowRepository (https://flowrepository.org/id/FR-FCM-ZZGS) and try this idea on it.
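
To make the idea of mapping a sample concrete, here is a toy sketch in base R. It assumes that the fraction returned by panoply_map_sample() is the share of a sample's events falling in each cluster; that reading, and all the variable names and values below, are made up for illustration and are not taken from the package.

```r
# Hypothetical lookup: cluster label of each of 6 fingerprint bins
bin_to_cluster <- c(1, 1, 2, 3, 3, 3)

# Hypothetical bin assignment of the 10 events in a new sample
event_bin <- c(1, 2, 2, 3, 4, 5, 6, 6, 6, 3)

# Each event inherits the cluster of the bin it falls in
event_cluster <- bin_to_cluster[event_bin]

# Fraction of the sample's events in each cluster
fraction <- tabulate(event_cluster, nbins = max(bin_to_cluster)) / length(event_bin)
fraction
```

Since the bins and their cluster labels are fixed once the model has been built, every sample yields a vector of the same length, which is what allows the per-cluster comparison between groups.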

fn_old = system.file("extdata", "sampled_flowset_old.rda", package = "panoplyCF")
load(fn_old)   # fs_old

nclust = max(pan$clustering$clst)
ninst = length(fs_young) + length(fs_old)

# make a matrix to hold results:
# one row per sample (young first, then old), one column per cluster
sadistics = matrix(NA, nrow = ninst, ncol = nclust)
k = 1
for (i in seq_along(fs_young)) {
  sadistics[k, ] = panoply_map_sample(fs_young[[i]], panoply_obj = pan)$fraction
  k = k + 1
}

for (i in seq_along(fs_old)) {
  sadistics[k, ] = panoply_map_sample(fs_old[[i]], panoply_obj = pan)$fraction
  k = k + 1
}

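# calculate significance: per-cluster Wilcoxon rank-sum test, young vs. old
idx_young = seq_along(fs_young)                    # rows holding young samples
idx_old = length(fs_young) + seq_along(fs_old)     # rows holding old samples
pval = numeric(ncol(sadistics))
for (i in seq_len(ncol(sadistics))) {
  pval[i] = wilcox.test(sadistics[idx_young, i], sadistics[idx_old, i], exact = FALSE)$p.value
}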

srt = sort(pval, index.return = TRUE)
srt$x
srt$ix

BIG CAVEAT: We're not adjusting P values for multiple comparisons (e.g. Bonferroni), because the example data set is too small. YOU SHOULD DO THIS FOR REAL WORK!!
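
For real work, the adjustment itself is a one-liner with stats::p.adjust(). The sketch below uses a made-up vector, pval_demo, so that it runs on its own; in the analysis above you would pass pval instead. Bonferroni is the method named in the caveat, and Benjamini-Hochberg ("BH") is shown as a commonly used, less conservative alternative.

```r
# Made-up p-values for illustration only (not results from the example data)
pval_demo <- c(0.001, 0.01, 0.02, 0.04, 0.3)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_bonf <- p.adjust(pval_demo, method = "bonferroni")

# Benjamini-Hochberg: controls the false discovery rate instead
p_bh <- p.adjust(pval_demo, method = "BH")

cbind(raw = pval_demo, bonferroni = p_bonf, BH = p_bh)
```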

Notice that the most significant cluster is cluster number 20. Looking back to the panoply spread, we see that this cluster is CD3+ CD8+ CD45RA+ CCR7+. This corresponds to CD8 Naive T cells, and they're lower in older folks compared to younger folks. Does this make sense to you immunologists out there?