```r
knitr::opts_chunk$set(collapse = TRUE, knitr.kable.NA = '', comment = "")
```
```r
# load some libraries early to suppress messages for the vignette
library(flowCore)
library(flowViz)
library(flowFP)
library(splancs)
library(sp)
library(fields)
library(KernSmooth)
library(wadeTools)
library(viridis)
```
The wadeTools package is a collection of a wide variety of R functions that I've written over the years. They accomplish big and small tasks that I find myself doing over and over again, so as in any software development work, it makes sense to write the code once (carefully, and with an eye towards re-usability) and use it often.
I had originally created a 'tools' directory which had task-specific subdirectories (e.g. visualization, compensation, transformation, etc.). Any time I found I had a function that seemed to have broader utility than just the project at hand, I dumped it into the tools hierarchy. Over time functions and categories accumulated. Mostly these had to do specifically with flow cytometry data processing, but a few didn't.
I've only recently turned this collection into a formal R package. Why? Because I wanted to share these tools with Cytomics Workshop participants - people who are interested in acquiring or honing skills in developing and applying advanced computational analysis methods for flow cytometry. In earlier iterations of the workshop I simply shared the tools directory. That's akin to dumping a box of random tools on the workbench and asking you to build a barn, with no a priori knowledge of what a plane, a saw, or a chisel is used for, or how to use them. A package has several monumental advantages over this ad hoc approach to sharing code.
To start, here's a table of functions exported by the package (there are many more "helper" functions that aren't intended to be exposed to the user but that provide capabilities for the public exported functions). I've organized these public functions into categories (e.g. reading, processing, compensation, etc.), with the ones I use most often highlighted in bold. Consider this table a reference "roadmap" to what's available.
```r
tfunc = read.csv("table_of_functions.csv")
knitr::kable(tfunc)
```
Let's illustrate the use of the package with a simple example: a single FCS file, which may be one of many files in a project. For now, just consider this one file (we'll look at collections of files in a little bit). We will:

- read the file
- apply the compensation matrix stored in its header
- transform the data
- make a couple of pictures
We'll do this two ways: first using standard flowCore functions, and second using wadeTools (which of course uses flowCore under the hood). Let's start by retrieving the filename of the example data, which we'll use for both methods:
```r
filename = system.file("extdata", "example1.fcs", package = "wadeTools")
```
```r
ff = read.FCS(filename)

# get the spillover matrix from the header
spill = keyword(ff)$SPILL
ff = compensate(ff, spillover = spill)

# apply the flowCore default biexponential transformation
bt = biexponentialTransform()
fft = transform(ff, transformList(colnames(ff)[7:22], bt))

# make a couple of pictures
plot(fft, c("FSC-A", "SSC-A"), main = "Scattering")
plot(fft, c("Green D 610/20-A", "Violet B 705/70-A"), main = "CD4/CD8")
```
```r
# set up a theme needed only for the vignette
mytheme = theme(plot.title = element_text(size = 10)) +
  theme(axis.title.x = element_text(size = rel(0.75))) +
  theme(axis.title.y = element_text(size = rel(0.75)))
```
```r
ff = get_sample(fn = filename)

# make a couple of pictures
ggflow(ff, c("FSC-A", "SSC-A")) + ggtitle("Scattering") + mytheme
ggflow(ff, c("CD4PETR", "CD8Q705")) + ggtitle("CD4/CD8") + mytheme
```
Comparing the two methods, the wadeTools approach accomplishes all of the work in a single line of code, using get_sample(). The flowCore approach, on the other hand, takes several lines and requires that the user master (and remember) the somewhat arcane syntax of the transform methodology. flowCore also has fairly primitive default graphics.
In wadeTools, the function get_sample() does a bunch of heavy lifting. It reads the file, applies the compensation matrix in the header, derails (removes events piled up against the upper measurement limit, the so-called "rail"), applies a linear transform to the scattering parameters and a custom biexponential transformation to the fluorescence parameters, and swaps the conjugate names in for the detector names. Of course there's a lot going on inside this function, but the nice thing is that it's hidden from you, and you needn't worry about it on a day-to-day basis. If you do want to worry about it, you can use get_sample() arguments to customize its behavior, and consult the source code to see what it's actually doing.
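If you want to see which knobs get_sample() exposes without reading the source, standard R introspection works (nothing wadeTools-specific here):

```r
# list get_sample()'s arguments, and open its help page
args(wadeTools::get_sample)
?wadeTools::get_sample
```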
Compare the two sets of figures and you'll notice that the labeling of fluorescence parameters in the wadeTools version is a lot more user-friendly than the default labeling in flowCore (flowCore labels parameters by their detector name, e.g. Green D 610/20-A, whereas wadeTools uses the user-defined labels, e.g. CD4PETR). You'll also notice that the rendering of the distributions is a lot more like what you're accustomed to looking at (especially if you're a FlowJo user).

You will also note one more thing: the default biexponential transform in flowCore is not appropriate. It over-squashes the negatives, leading to a splitting around zero that generates "false populations" (this really just means that the default parameters of the flowCore biexp are poorly chosen). This is a warning to be careful that your transformation makes sense.

Finally, it's important to know that ggflow() by default assumes that scattering parameters have been linearly transformed and fluorescence parameters have been transformed using the biexponential transform implemented in wadeTools (consistent with the get_sample() defaults). If this isn't the case, you can use ggflow() function parameters to make an accurate picture.
So, not only is the wadeTools approach much simpler from a code perspective, the result is quite a bit nicer (IMHO).
As of version 0.3.2 of wadeTools, we've added some flexibility to get_sample().
First, compensation. Previously, you were limited to using the internal spillover matrix stored in the header of the FCS file you specified. Now you can supply your own spillover matrix; get_sample() still defaults to the internal matrix if you don't.
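As a sketch of what that might look like (the spillover argument name below is hypothetical, used only for illustration; check ?get_sample or args(get_sample) for the actual name):

```r
# grab the internal spillover matrix, then hand it (possibly edited) back to get_sample().
# NOTE: 'spill' is a hypothetical argument name for illustration; see ?get_sample.
my_spill = keyword(read.FCS(filename))$SPILL
ff_custom = get_sample(filename, spill = my_spill)
```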
Second, you may now specify which transformation to use on your fluorescence channels (or more precisely, the "non-scattering" channels). Previously, get_sample() was hard-wired to apply the wadeTools::biexpTransform method, with the adjustable parameter "a" fixed at its default value of 0.002. Now you can change this value, or switch to the other non-linear transformations "asinh" or "log". In the case of asinh, you can keep the default cofactor value of 5 or specify a different one. The log transformation has no adjustable values like "a" or "cofactor", but you'll notice warnings that result from dropping negative or zero values. Here's an illustration.
```r
ff1 = get_sample(filename)                                             # default biexp transform with the default a = 0.002
ff2 = get_sample(filename, transform.method = 'asinh')                 # asinh transform with default cofactor = 5
ff3 = get_sample(filename, transform.method = 'log')                   # log transform; warnings related to negative values
ff4 = get_sample(filename, transform.method = 'asinh', cofactor = 100) # asinh, overriding the default cofactor

tmptheme = theme(plot.title = element_text(size = 8))
ggflow(ff1, params = c("CD3Q605", "CD4PETR")) + ggtitle("Default Transformation") + tmptheme
ggflow(ff2, params = c("CD3Q605", "CD4PETR"), trans_fl = "asinh") + ggtitle("asinh Transformation w/ cofactor = 5") + tmptheme
ggflow(ff3, params = c("CD3Q605", "CD4PETR"), trans_fl = 'log') + ggtitle("Log Transformation") + tmptheme
ggflow(ff4, params = c("CD3Q605", "CD4PETR"), trans_fl = 'asinh', cofactor = 100) + ggtitle("asinh Transformation w/ cofactor = 100") + tmptheme
```
Note that it's your responsibility to be consistent in how you've transformed your data (in get_sample()) and in how you visualize your data (in ggflow()).
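For intuition about how these mappings differ, here's a plain base-R sketch (independent of wadeTools):

```r
# how asinh and log treat the same values (base R, for intuition only)
x = c(-100, -10, 0, 10, 100, 1000, 10000)
asinh(x / 5)     # cofactor 5: near-linear around zero, log-like for large x
asinh(x / 100)   # cofactor 100: gentler compression near zero
log10(x)         # undefined for x <= 0, hence the warnings about dropped events
```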
It is often (always?) the case that raw data needs to be pre-gated prior to any sophisticated computational analysis. For example, we're never interested in dead cells or doublet events, and we often include dump markers to exclude cells that aren't the ones we're targeting. It's important to get rid of the junk so that it doesn't bias or confuse downstream algorithms (being careful not to throw out the baby with the bath water!). Thus, even though we'd like to think that gating is a thing of the past, it's not.
Here, we'll talk about pre-gating, a cleanup step prior to clustering or some other computational analysis of the data (get rid of the bath water, keep the baby). We use this term to distinguish what we're doing here from a conventional gating analysis, in which the analyst specifies analytical gates that define all populations of interest, counting events inside those gates as the primary means of characterizing a sample and ignoring events that fall in no gate. You will recognize this as "hypothesis-driven" data analysis, where the gates define all hypotheses to be tested. Pre-gating, by contrast, is a preliminary step towards clustering analysis, which represents a "hypothesis-generating" approach.
What we aim to do is to create algorithmic gates that recognize and follow populations as they wander through multivariate space due to instrument or staining variability. Our goal is to (a) find major populations in an unbiased fashion, (b) do it automatically, so it can be applied to 100's or 1000's of files without user intervention and (c) do it reproducibly. Having accomplished this, we will persist the result of this pre-gating for further downstream processing via your clustering algorithm of choice (that's a computer geek way of saying, "we will save the results as, say, gated FCS files").
All that said, let's see if we can find the lymphocytes in our example file:
```r
# first get rid of debris, which can confuse blob.boundary
thresh_debris_fsc = 0.25
thresh_debris_ssc = 0.25
bb = blob.boundary(ff = ff, parameters = c("FSC-A", "SSC-A"),
                   x_range = c(thresh_debris_fsc, Inf),
                   y_range = c(thresh_debris_ssc, Inf),
                   location = c(2, 1), height = .5, convex = TRUE)
p = ggflow(ff, c("FSC-A", "SSC-A"))
b = geom_path(bb, mapping = aes(x = `FSC-A`, y = `SSC-A`))
xl = geom_vline(aes(xintercept = thresh_debris_fsc), linetype = "dotted")
yl = geom_hline(aes(yintercept = thresh_debris_ssc), linetype = "dotted")
p + b + xl + yl

# check how many events are inside the gate
ff_gated = Subset(ff, polygonGate(.gate = bb))
nrow(ff)
nrow(ff_gated)
# Out of the 198k events in the original file, 101k are inside the blob boundary
```
blob.boundary is limited by the contrast in the data. In this case, we're looking for the blob nearest the location c(2, 1). You can see by eye that there's a blob centered at about c(1.5, 1.0). Since we've specified convex = TRUE and height = .5, the function finds the blob we're looking for and searches for the largest convex contour that encloses it. The nearby blob causes that search to end at a fairly tight contour.
You should play around with blob.boundary on your data. Specifically, try different height and location values. Try setting convex to TRUE or FALSE, or log.transform to TRUE or FALSE (see the help page for blob.boundary), and watch what happens. Also experiment with pre-filtering (as in the above example with the use of the x_range and y_range arguments) to eliminate confounding but uninteresting portions of the distribution. In this case there is a tiny but intense peak down very close to 0,0. Such intense but essentially uninteresting components of a distribution are problematic for blob.boundary because it normalizes to the most intense peak. This pushes other portions of the distribution down to the point that they're hard to find. By first eliminating irrelevant regions, blob.boundary is able to re-normalize to more easily locate blobs of interest.
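For instance, here's a quick side-by-side of two height values (a sketch reusing the arguments from the example above; watch how the contour loosens or tightens, and verify the behavior on your own data):

```r
# compare contours at two different heights (same pre-filtering as above)
bb_loose = blob.boundary(ff = ff, parameters = c("FSC-A", "SSC-A"),
                         x_range = c(0.25, Inf), y_range = c(0.25, Inf),
                         location = c(2, 1), height = .1, convex = TRUE)
bb_tight = blob.boundary(ff = ff, parameters = c("FSC-A", "SSC-A"),
                         x_range = c(0.25, Inf), y_range = c(0.25, Inf),
                         location = c(2, 1), height = .8, convex = TRUE)
ggflow(ff, c("FSC-A", "SSC-A")) +
  geom_path(bb_loose, mapping = aes(x = `FSC-A`, y = `SSC-A`), color = 'steelblue') +
  geom_path(bb_tight, mapping = aes(x = `FSC-A`, y = `SSC-A`), color = 'indianred2')
```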
There is certainly an art to effectively using blob.boundary!
If you wanted this gate to be a bit more inclusive you could 'inflate' it a bit:
```r
bb_inflated = inflate.contour(bb, dist = 0.25)
p = ggflow(ff = ff, params = c("FSC-A", "SSC-A"))
b1 = geom_path(bb, mapping = aes(x = `FSC-A`, y = `SSC-A`))
b2 = geom_path(bb_inflated, mapping = aes(x = `FSC-A`, y = `SSC-A`),
               color = 'indianred2', size = 2)

# show original blob in black, inflated blob as a fat red line
p + b1 + b2

# re-gate and re-count gated events
ff_gated = Subset(ff, polygonGate(.gate = bb_inflated))
nrow(ff_gated)
# note that the number of gated events rose from 101k to 121k: the gate is
# bigger, and thus encircles ~20% more events
```
Inflating is often a useful adjunct to blob.boundary, since it's pretty hard to find the outer limits of blobs.
Let's now go back and design a complete pre-gating workflow for our example data file^[Please note that this example file is a downsampled version (due to size constraints) of a data file in a FlowRepository project. The original full dataset can be obtained from FlowRepository. These data are too large to be included in an R package, but I encourage you to download the full dataset and explore these ideas on the real data. I also strongly recommend using FlowRepositoryR to download the data.].
Here is the flowFrame we imported using get_sample():
```r
pData(parameters(ff))[, 1:2]
```
This looks like a T cell panel, but in order not to miss the NK and NKT cells we'll not pre-gate on CD3, but rather save it as an analytical parameter. Thus, a reasonable pre-gating strategy might be:

- clean up acquisition artifacts (flag and drop "bad" time slices)
- select singlets (FSC-H vs FSC-W, then SSC-H vs SSC-W)
- gate the lymphocytes (FSC-A vs SSC-A)
- exclude dead cells (LIVEDEAD)
```r
# I will select SSC-A plus one parameter from each laser.
params = c("SSC-A", "KI67FITC", "CD127BV421", "CCR4AF647", "PD1PE")

# tag events in "bad" slices
tmp = clean.fp(ff = ff, parameters = params, show = TRUE)

# filter out the events in "bad" slices
ff_clean = Subset(tmp, rectangleGate("clean" = c(0.5, Inf)))

# summarize how much we lost
no = nrow(ff)
nc = nrow(ff_clean)
nd = no - nc
sprintf("eliminating %d events in bad slices (%.1f%%)", nd, 100 * nd / no)
```
Next, singlets. There are lots of ideas on how to find them; here's one:
```r
###################
# FSC singlets
###################
p = ggflow(ff_clean, c("FSC-H", "FSC-W"))
bb = blob.boundary(ff_clean, parameters = c("FSC-H", "FSC-W"), height = .1, strongest = TRUE)
bb = get.hull(bb)                         # convex hull of bb
bb_infl = inflate.contour(bb, dist = .1)  # inflate a bit
fscw_thresh = max(bb_infl$`FSC-W`)
b1 = geom_path(bb, mapping = aes(x = `FSC-H`, y = `FSC-W`))
b2 = geom_path(bb_infl, mapping = aes(x = `FSC-H`, y = `FSC-W`), size = 1.5)
hline = geom_hline(mapping = aes(yintercept = fscw_thresh),
                   color = "indianred2", size = 1.5, linetype = 'dotdash')
p + b1 + b2 + hline
ff_sing1 = Subset(ff_clean, rectangleGate("FSC-W" = c(-Inf, fscw_thresh)))

###################
# SSC singlets
###################
p = ggflow(ff_sing1, c("SSC-H", "SSC-W"))
bb = blob.boundary(ff_sing1, parameters = c("SSC-H", "SSC-W"), height = .1, strongest = TRUE)
bb = get.hull(bb)                         # convex hull of bb
bb_infl = inflate.contour(bb, dist = .1)  # inflate a bit
sscw_thresh = max(bb_infl$`SSC-W`)
b1 = geom_path(bb, mapping = aes(x = `SSC-H`, y = `SSC-W`))
b2 = geom_path(bb_infl, mapping = aes(x = `SSC-H`, y = `SSC-W`), size = 1.5)
hline = geom_hline(mapping = aes(yintercept = sscw_thresh),
                   color = "indianred2", size = 1.5, linetype = 'dotdash')
p + b1 + b2 + hline
ff_sing2 = Subset(ff_sing1, rectangleGate("SSC-W" = c(-Inf, sscw_thresh)))

# summarize how much we lost
no = nrow(ff_clean)
nc = nrow(ff_sing2)
nd = no - nc
sprintf("eliminating %d non-singlet events (%.1f%%)", nd, 100 * nd / no)
```
```r
# we'll do this just like in the earlier example, but now starting from the
# cleaned, singlet-gated frame
thresh_debris_fsc = 0.25
thresh_debris_ssc = 0.25
bb = blob.boundary(ff = ff_sing2, parameters = c("FSC-A", "SSC-A"),
                   x_range = c(thresh_debris_fsc, Inf),
                   y_range = c(thresh_debris_ssc, Inf),
                   location = c(2, 1), height = .5, convex = TRUE)
bb_inflated = inflate.contour(bb, dist = 0.25)
p = ggflow(ff_sing2, c("FSC-A", "SSC-A"))
b = geom_path(bb_inflated, mapping = aes(x = `FSC-A`, y = `SSC-A`))  # show the boundary we'll gate on
xl = geom_vline(aes(xintercept = thresh_debris_fsc), linetype = "dotted")
yl = geom_hline(aes(yintercept = thresh_debris_ssc), linetype = "dotted")
p + b + xl + yl
ff_lymph = Subset(ff_sing2, polygonGate(.gate = bb_inflated))
```
```r
# We will illustrate the use of Kernel Density Estimation of 1D probability
# distributions to find a LIVEDEAD threshold
kde = bkde(exprs(ff_lymph)[, "LIVEDEAD"], bandwidth = 0.05)  # bkde is from package KernSmooth
kde = normalize.kde(kde)

# find valley(s)
res = find.local.minima(kde)

# if more than one minimum, find the lowest one above, say, 2000 on the LIVEDEAD axis
thresh_ld = res$x[min(which(res$x > bx(2000)))]
ff_gated = Subset(ff_lymph, rectangleGate("LIVEDEAD" = c(-Inf, thresh_ld)))

# visualize
ggflow(ff_lymph, c("LIVEDEAD", "CD3Q605"), resolution = 'coarse', indicate_zero = FALSE) +
  geom_polygon(aes(x = x, y = 5 * y), data = as.data.frame(kde), alpha = .2, color = 'black') +
  geom_vline(xintercept = res$x, linetype = 'dotdash') +
  geom_vline(xintercept = thresh_ld, color = 'indianred2', linetype = 'dotdash', size = 1.5)

# summarize how many events corresponded to non-live events
no = nrow(ff_lymph)
nc = nrow(ff_gated)
nd = no - nc
sprintf("eliminating %d dead events (%.1f%%)", nd, 100 * nd / no)
```
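Having arrived at a clean, pre-gated frame, persisting it for downstream clustering is one line with flowCore's write.FCS() (the output filename here is just illustrative; in a real project you'd derive it from the input filename):

```r
# save the pre-gated events as an FCS file for downstream analysis
# (the output path is illustrative)
write.FCS(ff_gated, filename = "example1_pregated.fcs")
```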
Topics for future versions of this vignette:

- work on collections (flowSets)
- add phenoData from spreadsheets
- draw custom graphs using bx()