Quality Control and Preprocessing

Introduction

Tracking data can be difficult to analyze because of different errors and artifacts that can occur in the data and cause bias in the analysis. This vignette will explain how the package can be used to detect and correct for some of these errors. The topics will be:

  1. Checking track lengths and dealing with short tracks
  2. Detecting and correcting tissue drift
  3. Detecting and correcting tracking errors and imaging artifacts using angle analyses
  4. Detecting and correcting variation in timesteps

Dataset

First load the package:

library( celltrackR )
library( ggplot2 )
# Save current par() settings
oldpar <- par( no.readonly =TRUE )

The package contains a dataset of T cells imaged in a mouse peripheral lymph node using two photon microscopy. We will here use this dataset as an example of how to perform quality control.

The dataset consists of 21 tracks of individual cells in a tracks object:

str( TCells, list.len = 3 )

Each element in this list is a track from a single cell, consisting of a matrix with $(x,y,z)$ coordinates and the corresponding measurement timepoints:

head( TCells[[1]] )

1 Track length

1.1 Finding the length of tracks in the dataset

Especially for in vivo imaging, cells are often imaged in a limited imaging window and for a limited time. This can result in very short tracks that make the data difficult to analyze. We can check the distribution of track lengths in a dataset as follows:

# Each track has a coordinate matrix with one row per coordinate;
# The number of steps is the number of rows minus one.
track.lengths <- sapply( TCells, nrow ) - 1
hist( track.lengths, xlab = "Track length (#steps)" )
summary( track.lengths )

Alternatively, we can directly check the maximum track length (in number of coordinates, so the number of steps will be this value minus one):

# This is the number of coordinates, so the number of steps is one less.
maxTrackLength( TCells )

1.2 Dealing with short tracks

To prevent problems from short tracks, we can filter them out using the filterTracks() function. The first argument of this function is a function we that returns TRUE or FALSE for each track:

# nrow() of a track is always the number of steps plus one.
# For steps >= min.steps, we can substitute nrow > min.steps:
filter.criterion <- function( x, min.steps ){
  nrow(x) > min.steps
}

TCells.filtered <- filterTracks( filter.criterion, TCells, min.steps = 11 )
# Or shorthand: filterTracks( function(x) nrow(x) > min.steps, TCells )

Let's check the effect of the filtering step in the histogram:

# Find lengths of the filtered dataset
track.lengths.filtered <- sapply( TCells.filtered, nrow ) - 1

# Histograms of track lengths before and after
par( mfrow=c(1,2) )
hist( track.lengths, xlab = "Track length (#steps)",
      main = "Before filtering", breaks = seq(0, 40, by = 10 ) )
hist( track.lengths.filtered, xlab = "Track length (#steps)",
      main = "After filtering",  breaks = seq(0, 40, by = 10 ))

# Check how many tracks are left:
length( TCells.filtered )

The filtering has removed all tracks of 1-10 steps, leaving 17 out of 21 tracks in the dataset.

! Note : Filtering out short tracks can introduce bias in a dataset. See Beltman et al (2009) for details and for step-based analyses as alternative methods to deal with short tracks. See the vignette on analysis methods for an explanation of how to use these methods.

2 Detecting and correcting drift

! Note : The methods suggested below all detect a form of global directionality in a dataset. However, any global directionality may either be an artifact from drift, or a true phenomenon in the data (for example because cells are following a chemotactic gradient). To distinguish between the two, it is recommended to detect and correct for drift based on tracks from structures that do not move at all, instead of the tracks from the cells of interest. If such data is absent, the original tracks can be used -- but this may also remove any true directional bias present in the data.

2.1 Adding artificial tissue drift to the Tcell data

To detect global drift of a tissue of interest, let's add some drift to the TCell data:

# Drift is 0.05 micron/sec in each dimension
drift.speed <- c( 0.05, 0.05, 0.05 )
add.drift <- function( x, drift.vector )
{
  # separate timepoints and coordinates
  tvec <- x[,1]
  coords <- x[,-1]

  # compute movement due to drift.
  drift.matrix <- matrix( rep( drift.vector, nrow(coords) ),
                          ncol = ncol(coords), byrow= TRUE )
  drift.matrix <- drift.matrix * tvec

  # Add drift to coordinates
  x[,-1] <- coords + drift.matrix
  return(x)
}

# Create data with drift
TCells.drift <- as.tracks( lapply( TCells, add.drift, drift.speed ) )

# Plot both for comparison
par(mfrow=c(1,2) )
plot(TCells, main = "original data" )
plot(TCells.drift, main = "with tissue drift" )

# Overlay starting points to view directionality
plot( normalizeTracks(TCells), main = "original data" )
plot( normalizeTracks(TCells.drift), main = "with tissue drift" )

2.2 Detecting global directionality: hotellingsTest

Textor et al (2011) proposed using Hotelling's T-square test to detect directionality in a dataset. This test computes displacement vectors of all steps in the data, and tests if the mean displacement vector is significantly different from the null vector.

In using this test, it is important to realise that steps from within a track are not independent: cells usually exhibit some kind of persistent motion at least on a short time scale (that is: the same cell does not tend to go in completely opposite directions at to subsequent timepoints). This means that blindly extracting all steps violates the "independent observations" assumption of Hotelling's test.

For example, applying this to the TCell dataset:

hotellingsTest( TCells, plot = TRUE, col = "gray" )

there seems to be a significant directionality in the $y$ direction (as shown by the blue ellipse).

However, when we only consider steps that are some distance apart using the step.spacing argument:

hotellingsTest( TCells, plot = TRUE, col = "gray", step.spacing = 5 )

there is no longer evidence for directionality. (Note, however, that the power of the test is also reduced because the larger step spacing reduces the total number of steps).

By contrast, this directionality remains obvious in the dataset with drift:

hotellingsTest( TCells.drift, plot = TRUE, col = "gray", step.spacing = 5 )

! Note : The appropriate value of step.spacing depends on the level of persistence expected in the cells of interest, as well as on the time resolution of the experiment. An autocorrelation or autocovariance plot may provide insights in the persistence time for the cells of interest (see the vignette on analysis methods for details). If tracks of static structures are used to detect and correct for drift -- as is recommended to distinguish between drift and truly directed movement -- a step.spacing of zero is safe to use because there should be no inherent persistence for something that does not move.

Of note, by default hotellingsTest() is performed on a projection of the tracks on only the $x$ and $y$ dimensions. For 3D tracking data like the TCells, we can specify that all three dimensions should be taken into account (although plotting is not supported in 3D):

hotellingsTest( TCells, dim = c("x","y","z"), step.spacing = 5 )
hotellingsTest( TCells.drift, dim = c("x","y","z"), step.spacing = 5 )
# sp <- seq(0,10)
# 
# htest.nodrift <- lapply( sp, function(x) hotellingsTest( TCells, 
#                                                  dim = c("x","y","z"),
#                                                  step.spacing = x ))
# htest.drift <- lapply( sp, function(x) hotellingsTest( TCells.drift, 
#                                                  dim = c("x","y","z"),
#                                                  step.spacing = x ))
# 
# pval.nodrift <- sapply( htest.nodrift, function(x) x$p.value )
# pval.drift <- sapply( htest.drift, function(x) x$p.value )
# stat.nodrift <- sapply( htest.nodrift, function(x) unname(x$statistic) )
# stat.drift <- sapply( htest.drift, function(x) unname(x$statistic) )
# 
# d.drift <- data.frame( step.spacing = sp,
#                        exp = "drift",
#                        pval = pval.drift,
#                        Tstatistic = stat.drift )
# d.nodrift <- data.frame( step.spacing = sp,
#                        exp = "no drift",
#                        pval = pval.nodrift,
#                        Tstatistic = stat.nodrift )
# d <- rbind( d.drift, d.nodrift )
# 
# p1 <- ggplot( d, aes( x = step.spacing, y = pval, color = exp ) ) +
#   geom_line() +
#   geom_hline( yintercept = 0.05, color = "red" ) +
#   scale_color_manual( values = c( "drift" = "black", "no drift" = "gray")  ) +
#   scale_y_log10() +
#   theme_classic()
# 
# p2 <- ggplot( d, aes( x = step.spacing, y = Tstatistic, color = exp ) ) +
#   geom_line() +
#   scale_y_continuous( limits = c(0,NA) ) +
#   scale_color_manual( values = c( "drift" = "black", "no drift" = "gray")  ) +
#   theme_classic()
# 
# gridExtra::grid.arrange(p1,p2,ncol=2)

2.3 Detecting global directionality: angle analysis

Beltman et al (2009) proposed an analysis of angles versus distance between cell pairs to detect global directionality in a dataset. Use the function analyzeCellPairs to get a dataframe with for each pair of tracks in the dataset the angle (between their displacement vectors) and the distance (min distance between the tracks at any timepoint). Note that the distance is not defined for pairs of tracks that have no overlap in timepoints -- these will get an NA value.

# compute for both original as drift data
df.drift <- analyzeCellPairs( TCells.drift )
df.norm <- analyzeCellPairs( TCells )

# Plot
p.norm <- ggplot( df.norm, aes( x = dist, y = angle ) ) +
  geom_point( color = "gray40" ) +
  stat_smooth( span = 1, color = "black" )+
  labs( x = "distance between cell pairs",
        y = "angle between cell pairs",
        title = "original data") +
  geom_hline( yintercept = 90, color = "red" ) +
  theme_classic()

p.drift <- ggplot( df.drift, aes( x = dist, y = angle ) ) +
  geom_point( color = "gray40" ) +
  stat_smooth( span = 1, color = "black" )+
  labs( x = "distance between cell pairs",
        y = "angle between cell pairs",
        title = "data with drift") +
  geom_hline( yintercept = 90, color = "red" ) +
  theme_classic()

gridExtra::grid.arrange( p.norm, p.drift, ncol = 2 )

In the original data, the mean angle is roughly 90 degrees on average for cell pairs at any distance from each other, which is what we would expect if there is no global directionality in the data. In the data with drift, the average angle is lower than we would expect even at large distances between cell pairs -- which indicates that there is global directionality (even cells far apart move roughly in the same direction).

We can perform a similar analysis on the level of single steps to boost the power. The function analyzeStepPairs() finds all pairs of steps from different cells that occur at the same timepoint, and then computes distances (between the step starting points) and angles between them. Note that we use a function in the argument filter.steps to take into account only steps with a minimum displacement, to avoid noise from steps where the cell is pausing.

# compute for both original as drift data
df.drift <- analyzeStepPairs( TCells.drift, filter.steps = function(x) displacement(x)>2  )
df.norm <- analyzeStepPairs( TCells, filter.steps = function(x) displacement(x)>2  )

# Plot
p.norm <- ggplot( df.norm, aes( x = dist, y = angle ) ) +
  geom_point( color = "gray40", size = 0.5 ) +
  stat_smooth( span = 0.75, color = "black" )+
  labs( x = "distance between step pairs",
        y = "angle between step pairs",
        title = "original data") +
  geom_hline( yintercept = 90, color = "red" ) +
  theme_classic()

p.drift <- ggplot( df.drift, aes( x = dist, y = angle ) ) +
  geom_point( color = "gray40", size = 0.5 ) +
  stat_smooth( span = 0.75, color = "black" )+
  labs( x = "distance between step pairs",
        y = "angle between step pairs",
        title = "data with drift") +
  geom_hline( yintercept = 90, color = "red" ) +
  theme_classic()

gridExtra::grid.arrange( p.norm, p.drift, ncol = 2 )

Again we see a mean step pair angle that is below 90 degrees for the data with drift, which is not the case for the original data.

2.4 Correcting drift

If Hotelling's test and/or angle analyses provide evidence for drift in the dataset, this can be corrected by computing the mean step displacement:

# Get steps and find their displacement vectors
steps.drift <- subtracks( TCells.drift, 1 )
step.disp <- t( sapply( steps.drift, displacementVector ) )

# Get the mean
mean.displacement <- colMeans( step.disp )

# Divide this by the mean timestep to get a drift speed
drift.speed <- mean.displacement/timeStep( TCells.drift )
drift.speed

This is indeed roughly the drift speed we introduced in the data. Now remove this:

correct.drift <- function( x, drift.vector )
{
  # separate timepoints and coordinates
  tvec <- x[,1]
  coords <- x[,-1]

  # compute movement due to drift.
  drift.matrix <- matrix( rep( drift.vector, nrow(coords) ),
                          ncol = ncol(coords), byrow= TRUE )
  drift.matrix <- drift.matrix * tvec

  # Add drift to coordinates
  x[,-1] <- coords - drift.matrix
  return(x)
}

# Create data with drift
TCells.corrected <- as.tracks( lapply( TCells.drift, correct.drift, drift.speed ) )

# Compare
par( mfrow = c(1,2) )
plot( TCells, col = "gray", main = "uncorrected" )
plot( TCells.drift, col = "red", add = TRUE )
plot( TCells, col = "gray", main = "corrected" )
plot( TCells.corrected, col = "red", add = TRUE )

#
par( mfrow = c(1,2) )
plot( normalizeTracks(TCells), col = "gray", main = "uncorrected" )
plot( normalizeTracks(TCells.drift), col = "red", add = TRUE )
plot( normalizeTracks(TCells), col = "gray", main = "corrected" )
plot( normalizeTracks(TCells.corrected), col = "red", add = TRUE )

While the correction is not perfect, at least the tracks resemble their original shape more.

3 Detecting artifacts using angle analyses

3.1 Detecting double tracking: angle versus distance between cell pairs

To simulate a dataset with double tracking, take the first TCell track and keep it roughly the same, but with some small noise added to the coordinates:

# Take the track with id "2"
dup.track <- TCells[["2"]]

# Add some noise to coordinates
dup.track[,"x"] <- dup.track[,"x"] + rnorm( nrow(dup.track), sd = 0.5 )
dup.track[,"y"] <- dup.track[,"y"] + rnorm( nrow(dup.track), sd = 0.5 )
dup.track[,"z"] <- dup.track[,"z"] + rnorm( nrow(dup.track), sd = 0.5 )

# Wrap the track in a tracks object and add it to the TCell data with
# a unique id number
dup.track <- wrapTrack( dup.track )
names(dup.track) <- "22"
TCells.dup <- c( TCells, dup.track )

This can again be detected by plotting the angle between cell pairs versus the distance between cell pairs (see also section 2.3 above and Beltman et al (2009)):

df <- analyzeCellPairs( TCells.dup )

# label cellpairs that have both angle and distance below threshold
angle.thresh <- 90 # in degrees
dist.thresh <- 10 # this should be the expected cell radius
df$id <- paste0( df$cell1,"-",df$cell2 )
df$id[ !(df$angle < angle.thresh & df$dist < dist.thresh) ] <- "" 

# Plot
ggplot( df, aes( x = dist, y = angle ) ) +
  geom_point( color = "gray40" ) +
  geom_text( aes( label = id ), color = "red" ) +
  labs( x = "distance between cell pairs",
        y = "angle between cell pairs" ) +
  geom_hline( yintercept = angle.thresh, col = "blue",lty=2 ) +
  geom_vline( xintercept = dist.thresh, col = "blue", lty=2) +
  theme_classic()

We indeed find the pair of tracks with ids "2" (the original track) and "22" (the noisy duplicate of the original track). Plot these tracks specifically to check:

plot( TCells.dup[c("2","22")])

That indeed looks like double tracking.

3.2 Detecting tracking errors near border or imprecise z-calibration: distances and angles to border planes

A plot of angles and distances to the border planes of the imaging volume can help detect artifacts (Beltman et al (2009) ). We can compute these for each step using the functions distanceToPlane and angleToPlane.

To specify a border plane, we first need three points on that plane. We can either define the imaging volume manually or estimate it using boundingBox:

tracks <- TCells
bb <- boundingBox( tracks )
bb

Let's take the two borders in the z-dimension as an example. The lower z-plane contains points (minx,miny,minz), (maxx,miny,minz), (maxx,maxy,minz). The upper z-plane is at distance (maxz-minz) from the lower plane.

# Define points:
lower1 <- c( bb["min","x"], bb["min","y"], bb["min","z"] )
lower2 <- c( bb["max","x"], bb["min","y"], bb["min","z"] )
lower3 <- c( bb["max","x"], bb["max","y"], bb["min","z"] )
zsize <- bb["max","z"] - bb["min","z"]

# Compute angles and distances of steps to this plane.
single.steps <- subtracks( tracks, 1 )
angles <- sapply( single.steps, angleToPlane, p1 = lower1, 
                  p2 = lower2, p3 = lower3 )
distances <- sapply( single.steps, distanceToPlane, p1 = lower1,
                     p2 = lower2, p3 = lower3 )
df <- data.frame( angles = angles,
                  distances = distances )

# Plot
ggplot( df, aes( x = distances,
                 y = angles ) ) +
  geom_point( color = "gray40" ) +
  stat_smooth( method = "loess", span = 1, color = "black" ) +
  geom_hline( yintercept = 32.7, color = "red" ) +
  scale_x_continuous( limits=c(0,zsize ), expand = c(0,0) ) +
  theme_classic()

The mean angle to the border plane should be roughly 32.7 degrees, and indeed this seems to be the case. Tracking errors near the border plane would result in a lower average angle near the left and right side of the plots, whereas imprecise z-calibration would result in a systematic deviation from the 32.7 degree angle at any distance to the plane (Beltman et al (2009) ).

4 Detecting and correcting variation in time resolution

4.1 Detecting variation in timesteps

Track analysis methods usually assume a constant time interval $\Delta t$ between consecutive images in a time-lapse microscopy dataset. In reality, there are mostly at least small fluctuations in the $\Delta t$ between consecutive images. To find the $\Delta t$ of each step, we first extract single steps using the subtracks() function (see also the vignette on analysis methods):

# Extract all subtracks of length 1 (that is, all "steps")
single.steps <- subtracks( TCells, 1 )

# The output is a new tracks object with a unique track for each step
# in the data (no longer grouped by the original cell they came from):
str( single.steps, list.len = 3 )

After extracting the single steps, we can find the median $\Delta t$ using the timeStep() function:

median.dt <- timeStep( TCells )
median.dt

We then find the actual $\Delta t$ of each step by applying the duration() function to each step in the single.steps object:

step.dt <- sapply( single.steps, duration )
str(step.dt)

And visualize the difference with the median $\Delta t$ in a histogram (expressed as a percentage of $\Delta t$:

dt.diff.perc <- (step.dt - median.dt) * 100 / median.dt
hist( dt.diff.perc, xlab = "dt (percentage difference from median)" )

Thus, the TCell dataset indeed contains fluctuations in $\Delta t$, but they are small: none are more than 1 percent of the median $\Delta t$ in the data.

4.2 Example: detecting missing data in tracks

As an example of a dataset containing gaps in tracks, we will introduce an artificial error by removing some timepoints in a few tracks in the TCell data:

# This function randomly removes coordinates from a track dataset with probability "prob"
remove.points <- function( track, prob=0.1 ){

    tlength <- nrow( track )
    remove.rows <- sample( c(TRUE,FALSE), tlength, replace=TRUE,
                           prob = c(prob, (1-prob) ) )
    track <- track[!remove.rows,]
  return(track)
}

# Apply function to dataset to randomly remove coordinates in the data
TCells.gap <- as.tracks( lapply( TCells, remove.points ) )

Create the same histogram as before:

# median dt of the new data
median.dt.gap <- timeStep( TCells.gap )

# duration of the individual steps
steps.gap <- subtracks( TCells.gap, 1 )
step.dt.gap <- sapply( steps.gap, duration )

# express difference as percentage of median dt
dt.diff.perc.gap <- (step.dt.gap - median.dt.gap) * 100 / median.dt.gap
hist( dt.diff.perc.gap, xlab = "dt (percentage difference from median)" )

Now, the percentage difference with the median $\Delta t$ is often much larger.

This has an effect on for example the step-based displacement:

T1.step.disp <- sapply( single.steps, displacement )
T1.gap.disp <- sapply( steps.gap, displacement )

lapply( list( original = T1.step.disp, gaps = T1.gap.disp ), summary )

Note that the mean and median displacement are slightly higher due to the gaps, because steps with a gap in between are actually two steps and thus have a larger displacement.

For a simple step-based analysis of displacement, we can use normalizeToDuration() to correct for differences in time resolution:

T1.norm.disp <- sapply( single.steps, normalizeToDuration( displacement ) )
T1.norm.gap.disp <- sapply( steps.gap, normalizeToDuration( displacement ) )
lapply( list( original = T1.norm.disp, gaps = T1.norm.gap.disp ), summary )

The difference in displacements is now gone. For more complicated analyses, one might want to correct the gaps in the dataset via splitting tracks or via interpolation -- see the next section.

4.3 Correcting gaps or variation in timesteps

If a dataset contains gaps, as detected in the histogram of $\Delta t$, we can correct these using the repairGaps function, where we can select one of three correction methods using the how argument:

  1. Dropping all tracks with gaps (how = "drop")
  2. Splitting tracks around the gaps(how = "split")
  3. Interpolating the track (how = "interpolate")

For example:

# Repair gaps by splitting or interpolation, the number of tracks is 
# different after each fix
split.gap <- repairGaps( TCells.gap, how = "split" )
interpolate.gap <- repairGaps( TCells.gap, how = "interpolate" )

c( "after splitting" = length( split.gap),
   "after interpolation" = length( interpolate.gap ) )

4.4 Comparing experiments with a different time resolution

Finally, if we wish to compare tracks from experiments imaged at a different time resolution $\Delta t$, we can use the function interpolateTrack() to estimate the position of each cell at any set of timepoints of interest.

For example, let's create a second T cell dataset where we keep only every second timepoint (to effectively increase $\Delta t$ twofold while keeping the same dataset):

T2 <- subsample( TCells, k = 2 )

When we now perform a step-based analysis of displacement, the two will be different:

# displacement
T1.steps <- subtracks( TCells, 1 )
T1.disp <- sapply( T1.steps, displacement )

T2.steps <- subtracks( T2, 1 )
T2.disp <- sapply( T2.steps, displacement )

lapply( list( T1 = T1.disp, T2 = T2.disp ), summary )

To correct for the difference in time resolution, we can interpolate both datasets at a fixed time resolution:

# interpolate both datasets at the time resolution of the neutrophils
dt <- timeStep( TCells )
interpolate.dt <- function( x, dt, how = "spline" ){
  trange <- range( timePoints( wrapTrack( x ) ) )
  tvec <- seq( trange[1], trange[2], by = dt )
  x <- interpolateTrack( x, tvec, how = how )
  return(x)
}

T1.corrected <- as.tracks( lapply( TCells, interpolate.dt, dt = dt ) )
T2.corrected <- as.tracks( lapply( T2, interpolate.dt, dt = dt ) )

# Check the effect on the displacement statistics:
T1.corr.steps <- subtracks( T1.corrected, 1 )
T1.corr.disp <- sapply( T1.corr.steps, displacement )

T2.corr.steps <- subtracks( T2.corrected, 1 )
T2.corr.disp <- sapply( T2.corr.steps, displacement )

lapply( list( T1 = T1.disp, T2 = T2.disp, 
              T1.corr = T1.corr.disp, T2.corr = T2.corr.disp ), 
        summary )

The difference is now much smaller.

# Reset par() settings
par(oldpar)


Try the celltrackR package in your browser

Any scripts or data that you put into this service are public.

celltrackR documentation built on April 14, 2020, 6:42 p.m.