knitr::opts_chunk$set(echo = TRUE)
The prevalence of high-dimensional flow (or mass) cytometry data is increasing at a rapid pace. The difficulties and limitations of manual analysis of such data are obvious to anyone who has attempted it. The alternative is to find or create automated analysis approaches that facilitate this process.
There has been much recent interest in the use of manifold-learning algorithms to reduce the dimensionality of cell-based data to manageable levels. t-SNE and UMAP are two examples. One difficulty common to these approaches is their computational cost, which grows with both the total number of events and the number of variables measured for each event. Another limitation, specific to t-SNE, is that a map learned on one data set cannot be applied to another data set without re-computing the embedding.
I developed panoplyCF to address these limitations. PanoplyCF first computes high-resolution Cytometric Fingerprinting (CF) bins using flowFP. It then computes the centroid of each bin in high-dimensional space by taking the median of each independent variable over all events contained in that bin. It then computes the t-SNE embedding of the bin centroids. This has two advantages over conventional t-SNE:
The final step is to perform conventional hierarchical agglomerative clustering of the bins, using the 2-dimensional t-SNE map coordinates. The goal is to create clusters that are homogeneous in high-dimensional space as determined by the tightness of the distribution of the independent variables within each cluster.
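The steps described above can be sketched in base R on synthetic data. This is only a toy illustration under stated assumptions, not the package's implementation: `assign_bins()` is a hypothetical helper that imitates fingerprint binning by recursive median splitting on the highest-variance variable, `cmdscale()` stands in for t-SNE so that the example needs no extra packages, and the marker names and cluster count are made up.

```r
set.seed(1)

# Toy events: 2000 rows x 5 hypothetical markers (synthetic, not cytometry data)
events <- matrix(rnorm(2000 * 5), ncol = 5,
                 dimnames = list(NULL, paste0("marker", 1:5)))
events[1:1000, 1] <- events[1:1000, 1] + 4  # shift half the events: two populations

# Step 1 (assumed stand-in for fingerprint binning): recursive median splitting.
# At each level every bin is split at the median of its highest-variance column.
assign_bins <- function(x, levels) {
  bin <- rep(1L, nrow(x))
  for (lev in seq_len(levels)) {
    new_bin <- bin
    for (b in unique(bin)) {
      idx <- which(bin == b)
      col <- which.max(apply(x[idx, , drop = FALSE], 2, var))
      upper <- x[idx, col] > median(x[idx, col])
      new_bin[idx] <- 2L * b - 1L + as.integer(upper)
    }
    bin <- new_bin
  }
  bin
}
bins <- assign_bins(events, levels = 6)  # 2^6 = 64 bins

# Step 2: bin centroids = per-bin median of every variable (64 x 5 matrix)
centroids <- apply(events, 2, function(v) tapply(v, bins, median))

# Step 3: 2-D embedding of the centroids (cmdscale used here in place of t-SNE)
map <- cmdscale(dist(centroids), k = 2)

# Step 4: hierarchical agglomerative clustering of the bins on the 2-D map
bin_cluster <- cutree(hclust(dist(map)), k = 5)

# Step 5: each event inherits the cluster of the bin it falls in
bin_ids <- sort(unique(bins))  # same order as the rows of `centroids`
event_cluster <- bin_cluster[match(bins, bin_ids)]
table(event_cluster)
```

Only the 64 centroids are embedded, not the 2000 events, which is where the computational saving comes from. If the Rtsne package is installed, `Rtsne::Rtsne(centroids, perplexity = 10)$Y` could replace the `cmdscale()` call; the perplexity has to be small when there are so few rows.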
How well this works depends on the nature of the data that the algorithm is fed. Manifold-learning algorithms work best when there is a manifold present in the data to be learned. As discussed in https://www.nature.com/articles/s41467-019-13055-y, some data are more "manifold-like", whereas some are more "cluster-like". If your data fall in the former category, panoplyCF may work well. If in the latter, you may have more luck with clustering approaches such as flowClust or FlowSOM, or with the companion to this package, fluster (https://github.com/rogerswt/fluster), which uses the same fingerprint binning approach but clusters in high-dimensional space instead of in the t-SNE embedding.
There are currently only three methods in this package (more later!).
The first one performs the above-described panoplyCF calculations and returns an object of class "panoply".
# this is added so that flowCore::identifier() is available in the vignette environment
library(flowCore)
library(panoplyCF)

# load the example data
filename = system.file("extdata", "sampled_flowset_young.rda", package = "panoplyCF")
load(filename)   # fs_young
is(fs_young)

# choose the parameters to be included in the analysis.  In this case,
# exclude the scattering and LIVEDEAD markers as they were previously used to
# gate the data.
pan_params = flowCore::colnames(fs_young)[c(7:9, 11:22)]
pan_params

# do the panoplyCF computation (takes a couple of minutes)
pan = panoply(fcs = fs_young, parameters = pan_params, nclust = 30)

# check out the resulting object
is(pan)
names(pan)
The second one renders a picture of the result.
decorate_sample_panoply(fcs = fs_young, panoply_obj = pan, colorscale = FALSE)
suppressMessages(decorate_sample_panoply(fcs = fs_young, panoply_obj = pan, colorscale = TRUE))
Now that we have a panoply model, we can map other samples to it to look for differential expression. We've created the panoply model on the flowSet consisting of 10 sub-sampled instances from young people (a better idea might be to aggregate young and old samples and create the panoply model from that aggregate). We'll load up 10 sub-sampled instances from older people, and look for clusters that seem to differ between the two groups.
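Conceptually, mapping a sample to the model reduces it to one number per cluster. Here is a minimal sketch of that idea, assuming that the `fraction` element used in the code further down holds the proportion of a sample's events falling in each cluster (that is my reading of its name and length, not something checked against the package source). The labels and the `cluster_fraction()` helper are hypothetical.

```r
nclust <- 4

# Hypothetical per-event cluster labels for two samples mapped to the same model
labels_a <- c(1, 1, 2, 3, 3, 3, 4, 4)
labels_b <- c(1, 2, 2, 2, 2, 3, 4, 4)

# Proportion of events in each cluster; clusters with no events get 0
cluster_fraction <- function(labels, nclust) {
  tabulate(labels, nbins = nclust) / length(labels)
}

frac_a <- cluster_fraction(labels_a, nclust)
frac_b <- cluster_fraction(labels_b, nclust)

# One row per sample, one column per cluster
rbind(a = frac_a, b = frac_b)
```

Because every sample is scored against the same clusters, the rows are directly comparable, and a cluster whose column differs consistently between groups is a candidate for differential abundance.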
NOTE: in the interest of keeping the example data small-ish, the data here are (a) subsampled, and (b) limited to 10 instances in each group, so this section isn't likely to find much of anything interesting, but it is included to illustrate the idea. We encourage you to download the full data set (https://flowrepository.org/id/FR-FCM-ZZGS) and follow this idea on it.
fn_old = system.file("extdata", "sampled_flowset_old.rda", package = "panoplyCF")
load(fn_old)   # fs_old

nclust = max(pan$clustering$clst)
ninst = length(fs_young) + length(fs_old)

# make a matrix to hold results
sadistics = matrix(NA, nrow = ninst, ncol = nclust)
k = 1
for (i in 1:length(fs_young)) {
  sadistics[k, ] = panoply_map_sample(fs_young[[i]], panoply_obj = pan)$fraction
  k = k + 1
}
for (i in 1:length(fs_old)) {
  sadistics[k, ] = panoply_map_sample(fs_old[[i]], panoply_obj = pan)$fraction
  k = k + 1
}

# calculate significance
pval = vector('numeric')
for (i in 1:ncol(sadistics)) {
  pval[i] = wilcox.test(sadistics[1:10, i], sadistics[11:20, i], exact = FALSE)$p.value
}
srt = sort(pval, index.return = TRUE)
srt$x
srt$ix
BIG CAVEAT: We're not adjusting P values for multiple comparisons (e.g. Bonferroni), because the example data set is too small. YOU SHOULD DO THIS FOR REAL WORK!!
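For real work, the adjustment can be done with base R's `p.adjust()`. The p-values below are made up for illustration and simply stand in for the `pval` vector computed above; which correction is appropriate for your study is a judgment call that this sketch does not settle.

```r
# Hypothetical raw p-values, one per cluster (stand-ins for `pval`)
pval_example <- c(0.001, 0.012, 0.030, 0.200, 0.650)

# Bonferroni: multiply by the number of tests, cap at 1 (controls family-wise error)
p_bonf <- p.adjust(pval_example, method = "bonferroni")

# Benjamini-Hochberg: less conservative, controls the false discovery rate
p_bh <- p.adjust(pval_example, method = "BH")

data.frame(cluster = seq_along(pval_example),
           raw = pval_example,
           bonferroni = p_bonf,
           BH = p_bh)
```

With 30 clusters, Bonferroni multiplies every raw p-value by 30, so a cluster needs a raw p-value below roughly 0.0017 to stay under 0.05.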
Notice that the most significant cluster is cluster number 20. Looking back to the panoply spread, we see that this cluster is CD3+ CD8+ CD45RA+ CCR7+. This corresponds to CD8 Naive T cells, and they're lower in older folks compared to younger folks. Does this make sense to you immunologists out there?
Please note that package versions 0.1.x are alpha, so things are likely to break. Please let me know what works and what doesn't.
Thanks, Wade wade.rogers@spcytomics.com