```r
knitr::opts_chunk$set(collapse = TRUE, knitr.kable.NA = '', comment = "")
```
```r
# load some libraries early to suppress messages for the vignette
library(flowCore)
library(flowViz)
library(flowFP)
library(splancs)
library(sp)
library(fields)
library(KernSmooth)
library(wadeTools)
library(viridis)
```
The wadeTools package is a collection of a wide variety of R functions that I've written over the years. They accomplish big and small tasks that I find myself doing over and over again, so as in any software development work, it makes sense to write the code once (carefully, and with an eye towards re-usability) and use it often.
I had originally created a 'tools' directory which had task-specific subdirectories (e.g. visualization, compensation, transformation, etc.). Any time I found I had a function that seemed to have broader utility than just the project at hand, I dumped it into the tools hierarchy. Over time functions and categories accumulated. Mostly these had to do specifically with flow cytometry data processing, but a few didn't.
I've only recently turned this collection into a formal R package. Why? Because I wanted to share these tools with Cytomics Workshop participants - people who are interested in acquiring or honing skills in developing and applying advanced computational analysis methods for flow cytometry. In earlier iterations of the workshop I simply shared the tools directory. That's akin to dumping a box of random tools on the workbench and asking you to build a barn, with no a priori knowledge of what a plane, a saw, or a chisel is used for, or how to use them. A package has several monumental advantages over this ad hoc approach to sharing code.
To start, here's a table of functions exported by the package (there are many more "helper" functions that aren't intended to be exposed to the user but that provide capabilities for the public exported functions). I've organized these public functions into categories (e.g. reading, processing, compensation, etc.), with the ones I use most often highlighted in bold. Consider this table a reference "roadmap" to what's available.
```r
tfunc = read.csv("table_of_functions.csv")
knitr::kable(tfunc)
```
Let's illustrate the use of the package with a simple example: a single FCS file, which may be one of many files in a project. For now, just consider this one file (we'll look at collections of files in a little bit). We will:

- read the file
- apply the compensation matrix stored in its header
- transform the data
- make a couple of pictures
We'll do this two ways: first using standard flowCore functions, and second using wadeTools (which of course uses flowCore under the hood). Let's start by retrieving the filename of the example data, which we'll use for both methods:
```r
filename = system.file("extdata", "example1.fcs", package = "wadeTools")
```
```r
ff = read.FCS(filename)

# get the spillover matrix from the header
spill = keyword(ff)$SPILL
ff = compensate(ff, spillover = spill)

# apply the flowCore default biexponential transformation
bt = biexponentialTransform()
fft = transform(ff, transformList(colnames(ff)[7:22], bt))

# make a couple of pictures
plot(fft, c("FSC-A", "SSC-A"), main = "Scattering")
plot(fft, c("Green D 610/20-A", "Violet B 705/70-A"), main = "CD4/CD8")
```
```r
# set up a theme needed only for the vignette
mytheme = theme(plot.title = element_text(size = 10)) +
  theme(axis.title.x = element_text(size = rel(0.75))) +
  theme(axis.title.y = element_text(size = rel(0.75)))
```
```r
ff = get_sample(fn = filename)

# make a couple of pictures
ggflow(ff, c("FSC-A", "SSC-A")) + ggtitle("Scattering") + mytheme
ggflow(ff, c("CD4PETR", "CD8Q705")) + ggtitle("CD4/CD8") + mytheme
```
Comparing the two methods, the wadeTools approach accomplishes all of the work in a single line of code, using get_sample(). The flowCore approach, on the other hand, takes several lines and requires that the user master (and remember) the somewhat arcane syntax of the transform methodology. flowCore also has fairly primitive default graphics.
In wadeTools, the function get_sample() does a bunch of heavy lifting. It reads the file, applies the compensation matrix in the header, derails (removes events piled up against the upper measurement limit, the so-called "rail"), applies a linear transform to the scattering parameters and a custom biexponential transformation to the fluorescence parameters, and swaps the conjugate names in for the detector names. Of course there's a lot going on inside this function, but the nice thing is that it's hidden from you, and you needn't worry about it on a day-to-day basis. If you do want to worry about it, you can use get_sample() arguments to customize its behavior, and consult the source code to see what it's actually doing.
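If you want to see which knobs get_sample() exposes without reading the source, standard R introspection works (nothing wadeTools-specific here):

```r
# list get_sample()'s arguments, and open its help page
args(wadeTools::get_sample)
?wadeTools::get_sample
```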
Compare the two sets of figures and you'll notice that the labeling of fluorescence parameters in the wadeTools version is a lot more user-friendly than the default labeling in flowCore (flowCore labels parameters by their detector name, e.g. Green D 610/20-A, whereas wadeTools uses the user-defined labels, e.g. CD4PETR). You'll also notice that the rendering of the distributions is a lot more like what you're accustomed to looking at (especially if you're a FlowJo user).

You will also note one more thing: the default biexponential transform in flowCore is not appropriate. It over-squashes the negatives, leading to a splitting around zero that generates "false populations" (this really just means that the default parameters of the flowCore biexp are poorly chosen). This is a warning to be careful that your transformation makes sense.

Finally, it's important to know that ggflow() by default assumes that scattering parameters have been linearly transformed and fluorescence parameters have been transformed using the biexponential transform implemented in wadeTools (consistent with the get_sample() defaults). If this isn't the case, you can use ggflow() function parameters to make an accurate picture.
So, not only is the wadeTools approach much simpler from a code perspective, the result is quite a bit nicer (IMHO).
As of version 0.3.2 of wadeTools, we've added some flexibility to get_sample().
First, compensation. Previously, you were limited to using the internal spillover matrix stored in the header of the FCS file you specified. Now you can supply your own spillover matrix; get_sample() still defaults to the internal matrix if you don't.
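As a sketch of what that might look like (the spillover argument name below is hypothetical, used only for illustration; check ?get_sample or args(get_sample) for the actual name):

```r
# grab the internal spillover matrix, then hand it (possibly edited) back to get_sample().
# NOTE: 'spill' is a hypothetical argument name for illustration; see ?get_sample.
my_spill = keyword(read.FCS(filename))$SPILL
ff_custom = get_sample(filename, spill = my_spill)
```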
Second, you may now specify which transformation to use on your fluorescence channels (or more precisely, the "non-scattering" channels). Previously, get_sample() was hard-wired to apply the wadeTools::biexpTransform method, with the adjustable parameter "a" fixed at its default value of 0.002. Now you can change this value, or switch to the other non-linear transformations "asinh" or "log". In the case of asinh, you can keep the default cofactor value of 5 or specify a different one. The log transformation has no adjustable values like "a" or "cofactor", but you'll notice warnings that result from dropping negative or zero values. Here's an illustration.
```r
ff1 = get_sample(filename)                                             # default biexp transform with the default a = 0.002
ff2 = get_sample(filename, transform.method = 'asinh')                 # asinh transform with default cofactor = 5
ff3 = get_sample(filename, transform.method = 'log')                   # log transform; warnings related to negative values
ff4 = get_sample(filename, transform.method = 'asinh', cofactor = 100) # asinh, overriding the default cofactor

tmptheme = theme(plot.title = element_text(size = 8))
ggflow(ff1, params = c("CD3Q605", "CD4PETR")) + ggtitle("Default Transformation") + tmptheme
ggflow(ff2, params = c("CD3Q605", "CD4PETR"), trans_fl = "asinh") + ggtitle("asinh Transformation w/ cofactor = 5") + tmptheme
ggflow(ff3, params = c("CD3Q605", "CD4PETR"), trans_fl = 'log') + ggtitle("Log Transformation") + tmptheme
ggflow(ff4, params = c("CD3Q605", "CD4PETR"), trans_fl = 'asinh', cofactor = 100) + ggtitle("asinh Transformation w/ cofactor = 100") + tmptheme
```
Note that it's your responsibility to be consistent in how you've transformed your data (in get_sample()) and in how you visualize your data (in ggflow()).
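For intuition about how these mappings differ, here's a plain base-R sketch (independent of wadeTools):

```r
# how asinh and log treat the same values (base R, for intuition only)
x = c(-100, -10, 0, 10, 100, 1000, 10000)
asinh(x / 5)     # cofactor 5: near-linear around zero, log-like for large x
asinh(x / 100)   # cofactor 100: gentler compression near zero
log10(x)         # undefined for x <= 0, hence the warnings about dropped events
```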
It is often (always?) the case that raw data needs to be pre-gated prior to any sophisticated computational analysis. For example, we're never interested in dead cells or doublet events, and we often include dump markers to exclude cells that aren't the ones we're targeting. It's important to get rid of the junk so that it doesn't bias or confuse downstream algorithms (being careful not to throw out the baby with the bath water!). Thus, even though we'd like to think that gating is a thing of the past, it's not.
Here, we'll talk about pre-gating, a cleanup step prior to clustering or some other computational analysis of the data (get rid of the bath water, keep the baby). We use this term to distinguish what we're doing here from a conventional gating analysis, in which the analyst specifies analytical gates that define all populations of interest, counting events inside those gates as the primary means of characterizing a sample and ignoring events that fall in no gate. You will recognize this as "hypothesis-driven" data analysis, where the gates define all hypotheses to be tested. Pre-gating, by contrast, is a preliminary step towards clustering analysis, which represents a "hypothesis-generating" approach.
What we aim to do is to create algorithmic gates that recognize and follow populations as they wander through multivariate space due to instrument or staining variability. Our goal is to (a) find major populations in an unbiased fashion, (b) do it automatically, so it can be applied to 100's or 1000's of files without user intervention and (c) do it reproducibly. Having accomplished this, we will persist the result of this pre-gating for further downstream processing via your clustering algorithm of choice (that's a computer geek way of saying, "we will save the results as, say, gated FCS files").
All that said, let's see if we can find the lymphocytes in our example file:
```r
# first get rid of debris, which can confuse blob.boundary
thresh_debris_fsc = 0.25
thresh_debris_ssc = 0.25
bb = blob.boundary(ff = ff, parameters = c("FSC-A", "SSC-A"),
                   x_range = c(thresh_debris_fsc, Inf),
                   y_range = c(thresh_debris_ssc, Inf),
                   location = c(2, 1), height = .5, convex = TRUE)
p = ggflow(ff, c("FSC-A", "SSC-A"))
b = geom_path(bb, mapping = aes(x = `FSC-A`, y = `SSC-A`))
xl = geom_vline(aes(xintercept = thresh_debris_fsc), linetype = "dotted")
yl = geom_hline(aes(yintercept = thresh_debris_ssc), linetype = "dotted")
p + b + xl + yl

# check how many events are inside the gate
ff_gated = Subset(ff, polygonGate(.gate = bb))
nrow(ff)
nrow(ff_gated)
# Out of the 198k events in the original file, 101k are inside the blob boundary
```
blob.boundary is limited by the contrast in the data. In this case, we're looking for the blob nearest the location c(2, 1). You can see by eye that there's a blob centered at about c(1.5, 1.0). Since we've specified convex = TRUE and height = .5, the function finds the blob we're looking for and searches for the largest convex contour that encloses it. The nearby blob causes that search to end at a fairly tight contour.
You should play around with blob.boundary on your data. Specifically, try different height and location values. Try setting convex to TRUE or FALSE, or log.transform to TRUE or FALSE (see the help page for blob.boundary), and watch what happens. Also experiment with pre-filtering (as in the above example with the use of the x_range and y_range arguments) to eliminate confounding but uninteresting portions of the distribution. In this case there is a tiny but intense peak down very close to 0,0. Such intense but essentially uninteresting components of a distribution are problematic for blob.boundary because it normalizes to the most intense peak. This pushes other portions of the distribution down to the point that they're hard to find. By first eliminating irrelevant regions, blob.boundary is able to re-normalize to more easily locate blobs of interest.
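For instance, here's a quick side-by-side of two height values (a sketch reusing the arguments from the example above; watch how the contour loosens or tightens, and verify the behavior on your own data):

```r
# compare contours at two different heights (same pre-filtering as above)
bb_loose = blob.boundary(ff = ff, parameters = c("FSC-A", "SSC-A"),
                         x_range = c(0.25, Inf), y_range = c(0.25, Inf),
                         location = c(2, 1), height = .1, convex = TRUE)
bb_tight = blob.boundary(ff = ff, parameters = c("FSC-A", "SSC-A"),
                         x_range = c(0.25, Inf), y_range = c(0.25, Inf),
                         location = c(2, 1), height = .8, convex = TRUE)
ggflow(ff, c("FSC-A", "SSC-A")) +
  geom_path(bb_loose, mapping = aes(x = `FSC-A`, y = `SSC-A`), color = 'steelblue') +
  geom_path(bb_tight, mapping = aes(x = `FSC-A`, y = `SSC-A`), color = 'indianred2')
```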
There is certainly an art to effectively using blob.boundary!
If you wanted this gate to be a bit more inclusive you could 'inflate' it a bit:
```r
bb_inflated = inflate.contour(bb, dist = 0.25)
p = ggflow(ff = ff, params = c("FSC-A", "SSC-A"))
b1 = geom_path(bb, mapping = aes(x = `FSC-A`, y = `SSC-A`))
b2 = geom_path(bb_inflated, mapping = aes(x = `FSC-A`, y = `SSC-A`),
               color = 'indianred2', size = 2)

# show original blob in black, inflated blob as a fat red line
p + b1 + b2

# re-gate and re-count gated events
ff_gated = Subset(ff, polygonGate(.gate = bb_inflated))
nrow(ff_gated)
# note that the number of gated events rose from 101k to 121k: the gate is
# bigger, and thus encircles ~20% more events
```
Inflating is often a useful adjunct to blob.boundary, since it's pretty hard to find the outer limits of blobs.
Let's now go back and design a complete pre-gating workflow for our example data file^[Please note that this example file is a downsampled version (due to size constraints) of a data file in a FlowRepository project. The original full dataset can be obtained from FlowRepository. These data are too large to be included in an R package, but I encourage you to download the full dataset and explore these ideas on the real data. I also strongly recommend using FlowRepositoryR to download the data.].
Here is the flowFrame we imported using get_sample():
```r
pData(parameters(ff))[, 1:2]
```
This looks like a T cell panel, but in order not to miss the NK and NKT cells we'll not pre-gate on CD3, but rather save it as an analytical parameter. Thus, a reasonable pre-gating strategy might be:

- clean up acquisition artifacts (flag and drop "bad" time slices)
- select singlets (FSC-H vs FSC-W, then SSC-H vs SSC-W)
- gate the lymphocytes (FSC-A vs SSC-A)
- exclude dead cells (LIVEDEAD)
```r
# I will select SSC-A plus one parameter from each laser.
params = c("SSC-A", "KI67FITC", "CD127BV421", "CCR4AF647", "PD1PE")

# tag events in "bad" slices
tmp = clean.fp(ff = ff, parameters = params, show = TRUE)

# filter out the events in "bad" slices
ff_clean = Subset(tmp, rectangleGate("clean" = c(0.5, Inf)))

# summarize how much we lost
no = nrow(ff)
nc = nrow(ff_clean)
nd = no - nc
sprintf("eliminating %d events in bad slices (%.1f%%)", nd, 100 * nd / no)
```
Next, singlets. There are lots of ideas on how to find them; here's one:
```r
###################
# FSC singlets
###################
p = ggflow(ff_clean, c("FSC-H", "FSC-W"))
bb = blob.boundary(ff_clean, parameters = c("FSC-H", "FSC-W"), height = .1, strongest = TRUE)
bb = get.hull(bb)                         # convex hull of bb
bb_infl = inflate.contour(bb, dist = .1)  # inflate a bit
fscw_thresh = max(bb_infl$`FSC-W`)
b1 = geom_path(bb, mapping = aes(x = `FSC-H`, y = `FSC-W`))
b2 = geom_path(bb_infl, mapping = aes(x = `FSC-H`, y = `FSC-W`), size = 1.5)
hline = geom_hline(mapping = aes(yintercept = fscw_thresh),
                   color = "indianred2", size = 1.5, linetype = 'dotdash')
p + b1 + b2 + hline
ff_sing1 = Subset(ff_clean, rectangleGate("FSC-W" = c(-Inf, fscw_thresh)))

###################
# SSC singlets
###################
p = ggflow(ff_sing1, c("SSC-H", "SSC-W"))
bb = blob.boundary(ff_sing1, parameters = c("SSC-H", "SSC-W"), height = .1, strongest = TRUE)
bb = get.hull(bb)                         # convex hull of bb
bb_infl = inflate.contour(bb, dist = .1)  # inflate a bit
sscw_thresh = max(bb_infl$`SSC-W`)
b1 = geom_path(bb, mapping = aes(x = `SSC-H`, y = `SSC-W`))
b2 = geom_path(bb_infl, mapping = aes(x = `SSC-H`, y = `SSC-W`), size = 1.5)
hline = geom_hline(mapping = aes(yintercept = sscw_thresh),
                   color = "indianred2", size = 1.5, linetype = 'dotdash')
p + b1 + b2 + hline
ff_sing2 = Subset(ff_sing1, rectangleGate("SSC-W" = c(-Inf, sscw_thresh)))

# summarize how much we lost
no = nrow(ff_clean)
nc = nrow(ff_sing2)
nd = no - nc
sprintf("eliminating %d non-singlet events (%.1f%%)", nd, 100 * nd / no)
```
```r
# we'll do this just like in the earlier example, but now starting from the
# cleaned, singlet-gated frame
thresh_debris_fsc = 0.25
thresh_debris_ssc = 0.25
bb = blob.boundary(ff = ff_sing2, parameters = c("FSC-A", "SSC-A"),
                   x_range = c(thresh_debris_fsc, Inf),
                   y_range = c(thresh_debris_ssc, Inf),
                   location = c(2, 1), height = .5, convex = TRUE)
bb_inflated = inflate.contour(bb, dist = 0.25)
p = ggflow(ff_sing2, c("FSC-A", "SSC-A"))
b = geom_path(bb_inflated, mapping = aes(x = `FSC-A`, y = `SSC-A`))  # show the boundary we'll gate on
xl = geom_vline(aes(xintercept = thresh_debris_fsc), linetype = "dotted")
yl = geom_hline(aes(yintercept = thresh_debris_ssc), linetype = "dotted")
p + b + xl + yl
ff_lymph = Subset(ff_sing2, polygonGate(.gate = bb_inflated))
```
```r
# We will illustrate the use of Kernel Density Estimation of 1D probability
# distributions to find a LIVEDEAD threshold
kde = bkde(exprs(ff_lymph)[, "LIVEDEAD"], bandwidth = 0.05)  # bkde is from package KernSmooth
kde = normalize.kde(kde)

# find valley(s)
res = find.local.minima(kde)

# if more than one minimum, find the lowest one above, say, 2000 on the LIVEDEAD axis
thresh_ld = res$x[min(which(res$x > bx(2000)))]
ff_gated = Subset(ff_lymph, rectangleGate("LIVEDEAD" = c(-Inf, thresh_ld)))

# visualize
ggflow(ff_lymph, c("LIVEDEAD", "CD3Q605"), resolution = 'coarse', indicate_zero = FALSE) +
  geom_polygon(aes(x = x, y = 5 * y), data = as.data.frame(kde), alpha = .2, color = 'black') +
  geom_vline(xintercept = res$x, linetype = 'dotdash') +
  geom_vline(xintercept = thresh_ld, color = 'indianred2', linetype = 'dotdash', size = 1.5)

# summarize how many events corresponded to non-live events
no = nrow(ff_lymph)
nc = nrow(ff_gated)
nd = no - nc
sprintf("eliminating %d dead events (%.1f%%)", nd, 100 * nd / no)
```
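Having arrived at a clean, pre-gated frame, persisting it for downstream clustering is one line with flowCore's write.FCS() (the output filename here is just illustrative; in a real project you'd derive it from the input filename):

```r
# save the pre-gated events as an FCS file for downstream analysis
# (the output path is illustrative)
write.FCS(ff_gated, filename = "example1_pregated.fcs")
```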
Topics for future versions of this vignette:

- work on collections (flowSets)
- add phenoData from spreadsheets
- draw custom graphs using bx()