flowPloidy: Flow Cytometry Histograms
In flowPloidy: Analyze flow cytometer data to determine sample ploidy

Introduction

The application of flow cytometry (FCM) to ploidy assessment is based on a simple concept. Cell nuclei are stained with DNA-binding fluorochromes, and the amount of DNA present is determined by measuring the fluorescence emitted when the nuclei are excited with light of a particular wavelength. Greater fluorescence means more DNA in the nucleus. That is, a diploid nucleus should produce half the fluorescence of a tetraploid nucleus, and a hexaploid nucleus should produce three times more fluorescence than the diploid.

Histogram Basics

Applying this concept is complicated by several factors:

nuclear DNA content varies over the cell cycle
secondary metabolites may interfere with fluorochrome staining, or may themselves fluoresce
tissue preparations are usually contaminated with non-nuclear cellular components, damaged nuclei, and other debris that fluoresce to greater or lesser extent
the accuracy of flow cytometers is limited, and values tend to shift or drift over the course of a run

We can illustrate these challenges by comparing an ideal FCM DNA histogram with simulated experimental data.

library(flowPloidy)
plot(x = 1, type = 'n', xlim = c(0, 256), ylim = c(0, 500),
     ylab = "Nuclei", xlab = "Fluorescence", main = "Ideal Histogram")
abline(h = 0)
lines(x = c(80, 80), y = c(0, 325))
lines(x = c(160, 160), y = c(0, 150))
tmpfun <- function(x) 100 * exp(-x/80)
curve(tmpfun, 80, 160, add = TRUE)
text(x = 80, y = 325, "G1 Peak", pos = 3)
text(x = 160, y = 150, "G2 Peak", pos = 3)
text(x = 120, y = 30, "S Phase", pos = 3)

In the ideal case, we have two clear, accurate cell groups, G1 and G2. The G1 ("gap 1") group represents cells in their 'normal' state. That is, they have the usual diploid DNA complement. The G2 ("gap 2") group represents cells that have duplicated their DNA, but haven't yet undergone mitosis. Consequently, they have twice the diploid DNA complement. In between the two groups is the S phase ("synthesis") region. This area of the plot represents cells that are in the process of duplicating their DNA, in preparation for mitosis.

(N.B. in the flow cytometry literature, clusters of cells, which I've called "groups" above, are often referred to as cell "populations")

If our histogram were this clean, we could easily determine the DNA content of our sample by comparing it to a known standard (see below). In practice, you'll never see a histogram that looks like this. A more realistic example follows:

set.seed(1234)
plot(x = 1, type = 'n', xlim = c(0, 256), ylim = c(0, 500),
     ylab = "Nuclei", xlab = "Fluorescence", main = "Empirical Histogram")
abline(h = 0)

vals <- rep(0, 257)
G1 <- 3000 * dnorm(0:256, mean = 80, sd = 80 * 0.05)
vals <- vals + G1
G2 <- 3000 * dnorm(0:256, mean = 160, sd = 160 * 0.05)
SPhase <- rep(0, 257) 
SPhase[80:160] <- 100 * exp((-80:-160)/80)
JIT <- 4
G1j <- jitter(G1, amount = JIT)
G2j <- jitter(G2, amount = JIT)
SPhasej <- jitter(SPhase, amount = JIT)
vals <- G1 + G2 + SPhase
Debris <- flowPloidy:::getSingleCutVals(vals, seq_along(vals), 1)/5
Debrisj <- jitter(Debris, amount = JIT)
G1j[G1j<0] <- 0
G2j[G2j<0] <- 0
SPhasej[SPhasej<0] <- 0
Debrisj[Debrisj<0] <- 0

vals <- G1 + G2 + SPhase + Debris
valsj <- G1j + G2j + SPhasej + Debrisj

Debris[0:4] <- 0
Debrisj[0:4] <- 0
vals[0:4] <- 0
valsj[0:4] <- 0

points(valsj, type = 'l')

text(x = 80, y = 450, "G1 Peak", pos = 3)
text(x = 160, y = 220, "G2 Peak", pos = 3)
text(x = 120, y = 100, "S Phase", pos = 3)
text(x = 40, y = 170, "Debris", pos = 3)
text(x = 240, y = 40, "Aggregates", pos = 3)

We can still see the G1 and G2 groups, but they aren't exact values. Instead we see that the fluorescence values for each group are distributed around a central peak. Similarly, the S phase is not a clean polygon. The jagged edges of the region are due in part to instrument error: we can't capture DNA content perfectly. In addition, some of that noise is "real". That is, the cells in our tissue sample are not proceeding through the cell cycle in a perfectly regular manner. Even if we could measure the DNA without error, the S-phase region wouldn't necessarily be a perfectly smooth region in the histogram.

In addition to the G1, G2, and S phase regions from our ideal histogram, real histograms also contain debris. This is a mixture of cellular debris, damaged nuclei, and other contaminants. The dyes we use bind best to DNA, so most of the debris, which contains relatively little DNA, is found on the left/lower fluorescence side of the histogram. We may also see aggregates at the right side of the plot. These are nuclei that have stuck together in the flow cytometer, and so appear as higher-ploidy nuclei.

Histogram Analysis

Given that we have a noisy histogram, how do we extract the values of interest from the data? There are two main approaches. The first, perhaps more intuitive, approach requires us to manually draw boxes around the peaks we're interested in, and have the computer calculate the parameters of interest for us. For the previous example, that would look like this:

plot(x = 1, type = 'n', xlim = c(0, 256), ylim = c(0, 500),
     ylab = "Nuclei", xlab = "Fluorescence",
     main = "Manual Histogram Analysis")
abline(h = 0)

points(valsj, type = 'l')
rect(70, 0, 95, 465, border = 2, lwd = 2)

There are two main problems with this approach. First, the placement of the box around a peak is subjective. One of the most important quality-checks for ploidy analysis is the coefficient of variation (CV) for a peak. We can easily lower the CV just by drawing a narrower box around our data. Consequently, our confidence in our results is undermined by not having an objective and repeatable method for setting the limits around a peak.

The second problem is that we are not accounting for the different types of cells that contribute to each peak. What we see as the G1 peak also contains debris and S-phase nuclei:

plot(x = 1, type = 'n', xlim = c(0, 256), ylim = c(0, 500),
     ylab = "Nuclei", xlab = "Fluorescence",
     main = "Histogram Components")
abline(h = 0)

points(valsj, type = 'l')
points(Debrisj, type = 'l', col = 3, lwd = 2)
points(G1j, type = 'l', col = 4, lwd = 2)
points(SPhasej, type = 'l', col = 2, lwd = 2)
text(x = 25, y = 50, col = 3, "Debris")
text(x = 95, y = 350, col = 4, "G1")
text(x = 120, y = 120, col = 2, "S-Phase")

When we draw a box around the G1 peak, we are also including debris and s-phase cells, which distorts the true distribution of the G1 nuclei DNA content.

The second approach to FCM histogram analysis uses non-linear regression to distinguish the different cell populations that contribute to the total histogram. The G1 and G2 peaks are modeled as normal curves. A variety of phenomenological and mechanistic functions are available to model the debris and S-phase components. See @bagwell_1993 and @watson_1992 for a more detailed presentation of the modeling process.

Using the modeling approach, the analysis of our example data looks like this:

plot(x = 1, type = 'n', xlim = c(0, 256), ylim = c(0, 500),
     ylab = "Nuclei", xlab = "Fluorescence",
     main = "Histogram Modeling")
abline(h = 0)

lines(valsj)
lines(Debris, col = 3, lwd = 2)
lines(G1, col = 4, lwd = 2)
lines(SPhase, col = 2, lwd = 2)
lines(G2, col = 5, lwd = 2)

text(x = 25, y = 50, col = 3, "Debris")
text(x = 95, y = 350, col = 4, "G1")
text(x = 120, y = 120, col = 2, "S-Phase")
text(x = 180, y = 100, col = 5, "G2")

This is a simulated example, but it illustrates the modeling process very clearly. We can extract the values of interest directly from the model components. The regression results include the mean value for the G1 peak (80), as well as its CV (5%). These values have taken into account the contribution of the debris curve and the s-phase nuclei, and we'll get the same results every time - they don't depend on where we decide to draw the box around each peak.

Standards

To simplify the presentation of histograms in the previous sections, I omitted the internal standards used in FCM analysis. This is a critical component of ploidy analysis, because the fluorescence values for a given quantity of DNA are not fixed. The measured values vary based on the tissue preparation and the flow cytometer. Even for the same tissue prep run on the same flow cytometer, the fluorescence values can vary over the course of a single day.

If we were to re-run the sample that yielded a mean value of 80 in our example, we might get a different result. Which means we can't interpret the raw fluorescence values. Instead, we include a tissue standard in our preparations, so we can compare our sample peak to the known value. While the absolute fluorescence values may shift over time, the ratio between our sample mean and the standard mean will stay the same.

Adding a standard to our example data, we get a histogram that looks like this:

plot(x = 1, type = 'n', xlim = c(0, 256), ylim = c(0, 500),
     ylab = "Nuclei", xlab = "Fluorescence",
     main = "Histogram with Standard")
abline(h = 0)

standard <- jitter(3000 * dnorm(0:256, mean = 130, sd = 400 * 0.01),
                   amount = 10)
standard[standard<0] <- 0
lines(valsj + standard)
text(x = 160, y = 400, "Standard")
text(x = 95, y = 400, "G1")
text(x = 180, y = 200, "G2")

To analyze this histogram, we add another normal curve to account for the standard sample. That allows us to compare our sample mean (80) to the standard mean (130). If the standard was Solanum lycopersicum, with a known 2C value of 1.96pg, then our sample has a 2C value of $\frac{80}{130} \times 1.96pg = 1.21pg$.