rogerswt/fluster: Fingerprint-based clustering of flow cytometry data


Background

The prevalence of high-dimensional flow (or mass) cytometry data is increasing at a rapid pace. The difficulties and limitations of analyzing such data manually are obvious to anyone who has attempted it. The alternative is to find or create automated analysis approaches that facilitate the process.

The fluster package is closely related to another package, panoplyCF. Both packages gain speed and efficiency by using Cytometric Fingerprinting (via the flowFP package) to reduce the number of items to cluster. The difference between the two is that panoplyCF clusters in a 2-dimensional space obtained by manifold-learning dimensionality reduction with the t-SNE algorithm, whereas fluster performs hierarchical agglomerative clustering directly on the bin centroids. For visualization, fluster provides a minimum spanning tree (MST) graph representation of the multivariate clusters to aid interpretation.

Fluster

Fluster first computes high-resolution Cytometric Fingerprinting (CF) bins using flowFP. It then computes the centroid of each bin in high-dimensional space by taking, for each independent variable, the median over all events contained in that bin. The final step is conventional hierarchical agglomerative clustering of the bins, using their high-dimensional centroids. The goal is to create clusters that are homogeneous in high-dimensional space, as judged by the tightness of the distribution of the independent variables within each cluster.
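The three steps above can be sketched on synthetic data with base R alone. This is only an illustration under stated assumptions: the recursive median split below is a simplified stand-in for flowFP's binning (it is not the flowFP implementation), and the marker names, the number of bins, the linkage method and the number of clusters are all hypothetical choices.

```r
# Minimal sketch of the fluster idea on synthetic data (base R only).
set.seed(1)

# 3000 synthetic "events" on 3 hypothetical markers, from 3 populations
centers = rbind(c(0, 0, 0), c(4, 4, 0), c(0, 4, 4))
pop = sample(1:3, 3000, replace = TRUE)
events = centers[pop, ] + matrix(rnorm(3000 * 3, sd = 0.5), ncol = 3)
colnames(events) = c("markerA", "markerB", "markerC")

# step 1 (ASSUMPTION: simplified stand-in for flowFP binning):
# recursively split at the median of the highest-variance marker
split_bins = function(x, idx, levels) {
  if (levels == 0) return(list(idx))
  v = which.max(apply(x[idx, , drop = FALSE], 2, var))
  threshold = median(x[idx, v])
  lo = idx[x[idx, v] <= threshold]
  hi = idx[x[idx, v] > threshold]
  c(split_bins(x, lo, levels - 1), split_bins(x, hi, levels - 1))
}
bins = split_bins(events, seq_len(nrow(events)), levels = 6)  # 2^6 = 64 bins

# tag every event with the index of its bin
bin_of_event = integer(nrow(events))
for (b in seq_along(bins)) bin_of_event[bins[[b]]] = b

# step 2: bin centroids = per-marker medians of the events in each bin
centroids = apply(events, 2, function(x) tapply(x, bin_of_event, median))

# step 3: hierarchical agglomerative clustering of the bin centroids
tree = hclust(dist(centroids), method = "ward.D2")
cluster_of_bin = cutree(tree, k = 3)

table(cluster_of_bin)
```

Only 64 centroids are clustered here instead of 3000 events, which is where the speed of the approach comes from.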

Data can easily be mapped to the clusters via the bin tags that flowFP assigns to the events of each instance; these tags are carried through the bin indices stored in the fluster object.
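As a rough illustration of this mapping, an event's cluster can be found with a two-step lookup: event to bin, then bin to cluster. The vectors below are hypothetical toy values, not the actual slots or element names of a fluster or flowFP object.

```r
# Hypothetical lookup tables (names and values are illustrative only)
bin_of_event   = c(3, 1, 4, 4, 2, 1, 3, 4)   # bin tag for each of 8 events
cluster_of_bin = c(1, 1, 2, 3)               # cluster label for each of 4 bins

# an event's cluster is the cluster of the bin it was tagged with
cluster_of_event = cluster_of_bin[bin_of_event]
cluster_of_event

# fraction of the events that falls in each cluster
n_clusters = max(cluster_of_bin)
fraction = tabulate(cluster_of_event, nbins = n_clusters) / length(cluster_of_event)
fraction
```

Because the lookup is plain vector indexing, mapping a new sample costs little more than tagging its events with bin indices.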

Getting started

The main methods currently in this package are listed below (more later!).

fluster()
plot_fluster_graph()
plot_fluster_tsne()
fluster_map_sample()

The fluster() method

The first method, fluster(), performs the calculations described above and returns an object of class "fluster".

# this is added so that flowCore::identifier() is available in the vignette environment
library(flowCore)

library(fluster)
# load the example data
filename = system.file("extdata", "sampled_flowset_young.rda", package = "fluster")
load(filename)   # fs_young
is(fs_young)

# choose the parameters to be included in the analysis.  In this case,
# exclude the scattering and LIVEDEAD markers as they were previously used to
# gate the data.
flust_params = flowCore::colnames(fs_young)[c(7:9, 11:22)]
flust_params

# do the fluster computation (takes a couple of minutes)
fluster_obj = fluster(fcs = fs_young, parameters = flust_params, nclust = NULL)

# check out the resulting object
is(fluster_obj)
names(fluster_obj)

The plot_fluster_xxx() methods

The first, plot_fluster_graph(), renders a picture of the result using a minimum spanning tree representation:

plot_fluster_graph(fluster = fluster_obj)
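To make the idea of a minimum spanning tree concrete, here is a small sketch that builds one over a handful of hypothetical 2-dimensional cluster centroids using Prim's algorithm in base R. It is not the code used by plot_fluster_graph(); the centroid values and names are made up for illustration.

```r
# Sketch: minimum spanning tree over hypothetical cluster centroids,
# built with Prim's algorithm (not the fluster implementation).
centroids = rbind(c1 = c(0, 0), c2 = c(1, 0), c3 = c(5, 0), c4 = c(5, 1))
d = as.matrix(dist(centroids))   # pairwise Euclidean distances

n = nrow(d)
in_tree = c(TRUE, rep(FALSE, n - 1))   # start the tree at the first centroid
edges = matrix(NA_integer_, nrow = n - 1, ncol = 2)

for (e in seq_len(n - 1)) {
  # distances from nodes already in the tree to nodes not yet in it
  cand = d[in_tree, !in_tree, drop = FALSE]
  best = which(cand == min(cand), arr.ind = TRUE)[1, ]
  from = which(in_tree)[best[1]]
  to = which(!in_tree)[best[2]]
  edges[e, ] = c(from, to)   # shortest edge that grows the tree
  in_tree[to] = TRUE
}

mst_edges = data.frame(from = rownames(d)[edges[, 1]],
                       to = rownames(d)[edges[, 2]],
                       length = d[edges])
mst_edges
```

The tree connects all n centroids with n - 1 edges of minimal total length, so clusters with similar phenotypes end up adjacent in the graph.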


The second, plot_fluster_tsne(), renders the result of fluster() as a t-SNE embedding:

plot_fluster_tsne(fluster = fluster_obj)


Looking at differential expression of clusters - fluster_map_sample()

Now that we have a fluster model, we can map other samples to it to look for differential expression. We built the fluster model on a flowSet consisting of 10 sub-sampled instances from young people (a better idea might be to aggregate young and old samples and build the fluster model from that aggregate). We'll now load 10 sub-sampled instances from older people and look for clusters that seem to differ between the two groups.

NOTE: in the interest of keeping the example data small-ish, the data here are (a) subsampled, and (b) limited to 10 instances in each group, so this section isn't likely to find much of anything interesting; it is included to illustrate the idea. We encourage you to download the full data set from https://flowrepository.org/id/FR-FCM-ZZGS and follow this idea on it.
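The logic of the comparison can be tried out on simulated numbers first. The sketch below assumes that each sample is summarized by the fraction of its events in each cluster, builds a samples-by-clusters matrix for two hypothetical groups, and runs one Wilcoxon rank-sum test per cluster. All values, group labels and the helper sample_fractions() are made up for illustration.

```r
# Synthetic illustration only: every number below is simulated.
set.seed(42)
n_clusters = 4
n_per_group = 10

# simulate cluster labels for the events of one sample and return
# the fraction of events in each cluster (hypothetical helper)
sample_fractions = function(prob, n_events = 2000) {
  labels = sample(n_clusters, n_events, replace = TRUE, prob = prob)
  tabulate(labels, nbins = n_clusters) / n_events
}

# hypothetical groups: cluster 1 shrinks and cluster 4 grows in group B
prob_a = c(0.40, 0.30, 0.20, 0.10)
prob_b = c(0.30, 0.30, 0.20, 0.20)

frac_a = t(replicate(n_per_group, sample_fractions(prob_a)))
frac_b = t(replicate(n_per_group, sample_fractions(prob_b)))
fractions = rbind(frac_a, frac_b)            # one row per sample
group = rep(c("A", "B"), each = n_per_group)

# one Wilcoxon rank-sum test per cluster (column)
pval = apply(fractions, 2, function(f) {
  wilcox.test(f[group == "A"], f[group == "B"], exact = FALSE)$p.value
})
round(pval, 4)
```

Indexing rows by a group label rather than by fixed row numbers keeps the test correct if the group sizes change.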

fn_old = system.file("extdata", "sampled_flowset_old.rda", package = "fluster")
load(fn_old)   # fs_old

nclust = max(fluster_obj$clustering$clst)
ninst = length(fs_young) + length(fs_old)

# make a matrix to hold results
sadistics = matrix(NA, nrow = ninst, ncol = nclust)
k = 1
for (i in 1:length(fs_young)) {
  sadistics[k, ] = fluster_map_sample(fs_young[[i]], fluster_obj = fluster_obj)$fraction
  k = k + 1
}

for (i in 1:length(fs_old)) {
  sadistics[k, ] = fluster_map_sample(fs_old[[i]], fluster_obj = fluster_obj)$fraction
  k = k + 1
}

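# calculate significance: one Wilcoxon rank-sum test per cluster
# (rows 1-10 are the young samples, rows 11-20 the old samples)
pval = numeric(ncol(sadistics))
for (i in seq_len(ncol(sadistics))) {
  pval[i] = wilcox.test(sadistics[1:10, i], sadistics[11:20, i], exact = FALSE)$p.value
}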

srt = sort(pval, index.return = TRUE)
srt$x
srt$ix

# what are the phenotypes of the significant clusters?
idx = which(pval <= 0.05)
opar = par(mfrow = c(2, 2), mar = c(2, 8, 2, 0) + 0.1)

for (i in idx) {
  fluster_phenobars(fluster_obj = fluster_obj, cluster = i)
}
par(opar)

# boxplots
opar = par(mfrow = c(2, 2), mar = c(4, 4, 2, 0) + 0.1)
for (i in idx) {
  clus = paste("cluster_", i, sep = "")
  tit = sprintf("%s (%.1e)", clus, pval[i])
  boxplot(sadistics[1:10, i], sadistics[11:20, i], 
          col = c("lightgreen", "pink"), 
          names = c("Young", "Old"),
          ylab = '', 
          main = tit)
}
par(opar)

BIG CAVEAT: We're not adjusting p-values for multiple comparisons (e.g., with a Bonferroni correction), because the example data set is too small. YOU SHOULD DO THIS FOR REAL WORK!!
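For real work, one way to make this adjustment is base R's p.adjust(). The sketch below uses hypothetical raw p-values (not results from the example data) and compares the Bonferroni correction with the less conservative Benjamini-Hochberg procedure; which method is appropriate depends on the study.

```r
# Hypothetical raw p-values, one per cluster (not from the example data)
pval = c(0.001, 0.012, 0.030, 0.048, 0.200, 0.650)

p_bonf = p.adjust(pval, method = "bonferroni")   # family-wise error rate
p_bh   = p.adjust(pval, method = "BH")           # false discovery rate

data.frame(raw = pval, bonferroni = p_bonf, BH = p_bh)

# clusters that survive each correction at the 0.05 level
which(p_bonf <= 0.05)
which(p_bh <= 0.05)
```

With these made-up values, four clusters pass the unadjusted 0.05 threshold, but only the first survives Bonferroni and only the first two survive Benjamini-Hochberg.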

The "significant" clusters are 23, 24, 4 and 8. Clusters 23, 24 and 4 are CD3+ CD8+ CD45RA+ CCR7+, consistent with CD8 Naive T cells, and they're lower in older folks compared to younger folks. Cluster 8, which is up in old people, has a phenotype consistent with CD8 Effector Memory T cells.