Inference in the DeLorean model
In DeLorean: Estimates Pseudotimes for Single Cell Expression Data

library(knitr)
library(knitcitations)
library(rmarkdown)
#
# knitr options
#
opts_chunk$set(
    fig.path = 'figures/inference-',
    stop_on_error = TRUE,
    fig.width = 12.5,
    fig.height = 8)
#
# Citations
#
cleanbib()
cite_options(
    # hyperlink = 'to.doc',
    hyperlink = TRUE,
    # style = 'html',
    # citation_format = 'text',
    citation_format = "pandoc",
    cite.style = "numeric",
    check.entries = TRUE)
    # hyperlink = TRUE)
bib <- read.bibtex("DeLorean.bib")

# Development helpers for rebuilding this vignette interactively
devtools::load_all('../..')
rmarkdown::render('DeLorean-inference.Rmd')

library(DeLorean)
library(ggplot2)
#
# Stylesheet
#
options(markdown.HTML.stylesheet = system.file(file.path('Rmd', 'foghorn.css'),
                                               package="DeLorean"))

extrafont::loadfonts(quiet=TRUE)

mrc.colors <- c(
    rgb(138, 121, 103, maxColorValue=255),
    rgb(217, 165, 41 , maxColorValue=255),
    rgb(153, 152, 40 , maxColorValue=255),
    rgb(117, 139, 121, maxColorValue=255),
    rgb(33 , 103, 126, maxColorValue=255),
    rgb(208, 114, 50 , maxColorValue=255),
    rgb(106, 59 , 119, maxColorValue=255),
    rgb(130, 47 , 90 , maxColorValue=255)
)

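font.family <- "Verdana"
# theme_update() returns the previous theme, so build the font theme explicitly
font.theme <- theme(text=element_text(family=font.family))
theme_update(text=element_text(family=font.family))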

The DeLorean package uses the Stan modelling language and inference engine. Stan implements a Hamiltonian Monte Carlo algorithm that samples from the full posterior of the model. However, the posterior of the DeLorean model is multimodal, and multimodal posteriors can be difficult to sample from. The greatest source of multimodality is the pseudotime parameters: the ordering of the cells in pseudotime has many local optima. Moving between these optima is hard for the sampler because cells must reverse their pseudotemporal order, which is difficult when the expression levels of the two cells are not close.

To mitigate the difficulties associated with multimodality, DeLorean searches for good initial values for its MCMC chains. It fixes the initial values of all parameters except the pseudotimes, $\tau_c$, at reasonable values derived from the empirical Bayes hyperparameter estimates. It then samples sets of the $\tau_c$ from their prior and calculates the log probability of the model for each set. The sets with the highest log probabilities are used to initialise distinct MCMC chains.
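The initialisation search described above can be sketched in base R. This is a toy stand-in under stated assumptions, not the package's implementation: `log.prob`, `matern32` and `log.mvn` are hypothetical helpers, the data are simulated for a single gene, and the hyperparameter values and the numbers of candidates and chains are illustrative.

```r
set.seed(1)

# Matern 3/2 covariance between pseudotimes, with length scale l
matern32 <- function(tau, l) {
    r <- abs(outer(tau, tau, "-")) / l
    (1 + sqrt(3) * r) * exp(-sqrt(3) * r)
}

# Log density of a zero-mean multivariate normal, via the Cholesky factor
log.mvn <- function(x, Sigma) {
    U <- chol(Sigma)                        # Sigma = t(U) %*% U
    z <- backsolve(U, x, transpose = TRUE)  # solves t(U) %*% z = x
    -sum(log(diag(U))) - 0.5 * sum(z^2) - 0.5 * length(x) * log(2 * pi)
}

# Toy data: 20 cells captured at 4 time points, one gene (hypothetical values)
obstime <- rep(c(0, 20, 40, 60), each = 5)
true.tau <- obstime + rnorm(length(obstime), sd = 5)
expr <- sin(true.tau / 20) + rnorm(length(obstime), sd = 0.1)

# Parameters held fixed while searching over pseudotimes (assumed values)
sigma.tau <- 5
l <- 20
psi <- 1
omega <- 0.01

# Unnormalised log probability of a set of pseudotimes: prior plus GP likelihood
log.prob <- function(tau) {
    Sigma <- psi * matern32(tau, l) + omega * diag(length(tau))
    sum(dnorm(tau, mean = obstime, sd = sigma.tau, log = TRUE)) +
        log.mvn(expr - mean(expr), Sigma)
}

# Draw candidate sets of pseudotimes from the prior and score each set
num.candidates <- 200
candidates <- replicate(num.candidates,
                        rnorm(length(obstime), mean = obstime, sd = sigma.tau))
scores <- apply(candidates, 2, log.prob)

# Keep the best-scoring sets, one per MCMC chain
num.chains <- 4
best <- order(scores, decreasing = TRUE)[seq_len(num.chains)]
inits <- lapply(best, function(i) list(tau = candidates[, i]))
```

Each element of `inits` holds one set of starting pseudotimes. Starting several chains from distinct high-probability orderings gives the sampler a chance to explore more than one mode.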

Stan provides a proven Hamiltonian MCMC algorithm that is efficient and can be run in parallel. Sampling the full posterior is valuable because it captures the uncertainty in the estimates.

Introduction

DeLorean uses a probabilistic model to estimate pseudotimes in cross-sectional time series. The basic idea is that a dynamic regulatory system can be characterised by the expression profiles of its genes. That is, as the system moves between states, the genes exhibit characteristic behaviours consistent with the regulatory network that they encode. We are interested in inferring these networks from expression data. Typically, the gene expression data we capture about the system state is cross-sectional in nature, because the cell (or population of cells) is destroyed as part of the assay. We would prefer to have longitudinal data, in which a cell is tracked through time and the expression measurements are made on the same biological object at distinct time points. However, with current technologies this is difficult to achieve.

Biological systems are typically noisy and stochastic in nature. In many systems we have reason to believe each cell may progress at its own rate. This is a particular problem for cross-sectional data, as the expression measurements made at a particular time point are no longer directly comparable. The DeLorean model estimates pseudotimes that are designed to mitigate this effect. The pseudotime for a cell represents how far through the system the cell has progressed. The difference between the pseudotime and the observed time represents how quickly or slowly the cell has progressed relative to the other cells. Once a pseudotime has been estimated for each cell, it is easy to infer expression profiles for all the assayed genes.

However, the estimation of pseudotimes is an underdetermined problem and many plausible estimates are possible. The DeLorean model aims to resolve this by balancing the smoothness of the expression profiles against the noise levels in the measurements. The model expects gene expression profiles to be smooth over time. That is, we assume genes do not frequently change in their expression levels. This assumption is crucial for choosing between different interpretations of the expression data. On the one hand, any given expression data could be explained by very smooth expression profiles with high levels of noise, where the noise captures almost all the variation in the signal. On the other hand, extremely dynamic expression profiles can explain the same data with very low noise levels.

As an example, suppose we have data for a few cells taken at time points 0h, 20h, 40h, 60h and 80h. When the expression of a gene is plotted against these time points, the expression profile can look quite noisy.

# (ggplot(.data.m, aes(x=time, y=mu+expr, color=obstime))
    # + geom_point(alpha=.6, size=5)
    # + xlab("Time")
    # + ylab("Expression")
    # + scale.obs.time
    # + scale.x
    # + font.theme
    # + facet_grid(variable ~ .)
    # + guides(color=FALSE)
# )
(ggplot(.data, aes(x=to.hours(as.integer(obstime)), y=mu+expr, color=obstime))
    + geom_point(alpha=.6, size=5)
    + xlab("Cell capture time")
    + ylab("Expression")
    + scale.obs.time
    + scale.x
    + font.theme
    + guides(color="none")
)

However, if the expression data are plotted against estimated pseudotimes, the apparent noise can be reduced dramatically.

(ggplot(.data, aes(x=to.hours(tau), y=mu+expr, color=obstime))
    + geom_point(alpha=.6, size=5)
    + xlab("Pseudotime")
    + ylab("Expression")
    + scale.obs.time
    + scale.x
    + font.theme
    + guides(color="none")
)

In the case of one gene, it is trivial to find pseudotimes that reduce the noise. However, when many genes are considered simultaneously, the problem becomes far more challenging: it is difficult to find pseudotimes that yield smooth, low-noise expression profiles across many genes and cells.

If the model does not enforce smoothness on the expression profiles, the data can be explained with low levels of noise.

print(plot.predictions(make.predictions(sigma.gp=2 , sigma.noise=.3, length.scale=.1)))

If too much smoothness is enforced, the model requires high noise levels to explain the expression profiles.

print(plot.predictions(make.predictions(sigma.gp=.3, sigma.noise=2 , length.scale=1)))

The model tries to balance the smoothness against the noise to achieve expression profiles that are reasonably smooth but have low noise levels.

print(plot.predictions(make.predictions(sigma.gp=2 , sigma.noise=.3, length.scale=1)))
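The trade-off between smoothness and noise can also be examined numerically. The sketch below scores the three hyperparameter settings used above by their Gaussian process log marginal likelihood on simulated data. It rests on several assumptions: the data are a toy sinusoid rather than real expression values, the covariance is taken to be Matern 3/2, `sigma.gp` and `sigma.noise` are treated as standard deviations, and `matern32` and `log.marginal` are hypothetical helpers rather than DeLorean functions.

```r
set.seed(1)

# Matern 3/2 covariance with GP standard deviation sigma.gp and a length scale
matern32 <- function(tau, sigma.gp, length.scale) {
    r <- abs(outer(tau, tau, "-")) / length.scale
    sigma.gp^2 * (1 + sqrt(3) * r) * exp(-sqrt(3) * r)
}

# Log marginal likelihood of zero-mean GP observations with Gaussian noise
log.marginal <- function(y, tau, sigma.gp, sigma.noise, length.scale) {
    K <- matern32(tau, sigma.gp, length.scale) + sigma.noise^2 * diag(length(y))
    U <- chol(K)                            # K = t(U) %*% U
    z <- backsolve(U, y, transpose = TRUE)  # solves t(U) %*% z = y
    -sum(log(diag(U))) - 0.5 * sum(z^2) - 0.5 * length(y) * log(2 * pi)
}

# Toy data: a smooth profile observed with moderate noise (hypothetical)
tau <- sort(runif(40, 0, 3))
y <- 2 * sin(2 * tau) + rnorm(40, sd = 0.3)

# The three settings from the plots above
settings <- data.frame(
    sigma.gp     = c(2,   0.3, 2),
    sigma.noise  = c(0.3, 2,   0.3),
    length.scale = c(0.1, 1,   1))
settings$log.marginal <- mapply(
    function(s.gp, s.noise, l) log.marginal(y, tau, s.gp, s.noise, l),
    settings$sigma.gp, settings$sigma.noise, settings$length.scale)
print(settings)
```

The setting with the largest log marginal likelihood is the one the toy data favour most; the marginal likelihood penalises both unnecessary flexibility and unnecessary noise.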

Data

The DeLorean model is fit to a matrix of expression data $x_{g,c}$ for $G$ genes (rows) in $C$ cells (columns). Each cell $c$ has been captured at a time point $k_c \in \{\kappa_1,\dots,\kappa_T\}$. The expression measurements are modelled using Gaussian processes (Rasmussen and Williams, 2006). Expression values often have a roughly normal distribution on a logarithmic scale, so it is usually appropriate to log-transform the absolute expression values before fitting the DeLorean model.
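As a minimal sketch of this data layout, the following builds a small genes-by-cells matrix and log-transforms it. The values are simulated, the object names are hypothetical, and the pseudocount of 1 is an assumption added to guard against zeros; it is not something the model prescribes.

```r
set.seed(1)

# Hypothetical absolute expression values: 3 genes (rows) by 4 cells (columns)
expr.abs <- matrix(rlnorm(12, meanlog = 3, sdlog = 1), nrow = 3,
                   dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

# Log-transform; the pseudocount of 1 is an assumption, not part of the model
expr.log <- log(expr.abs + 1)

# Capture time k_c for each cell (hypothetical: two time points, in hours)
obstime <- c(cell1 = 0, cell2 = 0, cell3 = 20, cell4 = 20)
```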

Model

The model can be split into several parts: one part represents the gene expression profiles; another part represents the pseudotimes associated with each cell; and another part links the expression data to the profiles.

Gene expression profiles

The expression profiles are modelled using Gaussian processes. The expression profile of each gene $g$ is a draw from a Gaussian process
$$ x_g(\cdot) \sim \mathcal{GP}(\phi_g(\cdot), \Sigma_g(\cdot, \cdot)) $$
where $\phi_g$ is a (constant) gene-specific mean function, and $\Sigma_g$ is a gene-specific covariance function.
$$ \begin{aligned} \phi_g &\sim \mathcal{N}(\mu_\phi, \sigma_\phi) \\ \Sigma_g(\tau_1, \tau_2) &= \psi_g \Sigma_\tau(\tau_1, \tau_2) + \omega_g \delta_{\tau_1,\tau_2} \end{aligned} $$
Here $\Sigma_\tau$ is a covariance function that defines the covariance structure over the pseudotimes, that is, it imposes the smoothness constraints that are shared across genes; $\psi_g$ parameterises the amount of temporal variation this gene profile has; and $\omega_g$ models the noise levels for this gene.
$$ \begin{aligned} \log \psi_g &\sim \mathcal{N}(\mu_\psi, \sigma_\psi) \\ \log \omega_g &\sim \mathcal{N}(\mu_\omega, \sigma_\omega) \end{aligned} $$
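To make the gene-level priors concrete, the sketch below draws $\phi_g$, $\psi_g$ and $\omega_g$ for a handful of genes and assembles $\Sigma_g$ from a given $\Sigma_\tau$. The hyperparameter values are invented for illustration, the second argument of each normal distribution is assumed to be a standard deviation (as in Stan), and `Sigma.g` is a hypothetical helper.

```r
set.seed(1)
G <- 5  # number of genes (illustrative)

# Hypothetical hyperparameter values; DeLorean estimates these by empirical Bayes
hyper <- list(mu.phi = 0,    sigma.phi = 1,
              mu.psi = 0,    sigma.psi = 0.5,
              mu.omega = -2, sigma.omega = 0.5)

# One row of parameters per gene
genes <- data.frame(
    gene  = paste0("gene", seq_len(G)),
    phi   = rnorm(G, hyper$mu.phi, hyper$sigma.phi),           # mean expression
    psi   = exp(rnorm(G, hyper$mu.psi, hyper$sigma.psi)),      # temporal variation
    omega = exp(rnorm(G, hyper$mu.omega, hyper$sigma.omega)))  # noise level

# Sigma_g = psi_g * Sigma_tau + omega_g * I, for distinct pseudotimes
Sigma.g <- function(g, Sigma.tau) {
    genes$psi[g] * Sigma.tau + genes$omega[g] * diag(nrow(Sigma.tau))
}

# Example with a made-up 2 x 2 pseudotime covariance
Sigma.tau <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
Sigma.g(1, Sigma.tau)
```

Placing the priors on the log scale keeps $\psi_g$ and $\omega_g$ positive, which a covariance function requires.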

Pseudotimes

The pseudotime for a cell, $\tau_c$, is given a prior centred on the time the cell was captured.
$$ \tau_c \sim \mathcal{N}(k_c, \sigma_\tau) $$
Each $\tau_c$ is used in the calculation of the covariance structure over pseudotimes, $\Sigma_\tau$. $\Sigma_\tau$ is taken to be a Matern$_{3/2}$ covariance function. Our experience shows that this function captures our smoothness constraints well, although any reasonable covariance function could be used.
$$ \Sigma_\tau(\tau_1, \tau_2) = \textrm{Matern}_{3/2}\left(r=\frac{|\tau_1 - \tau_2|}{l}\right) = (1 + \sqrt{3}r) \exp[-\sqrt{3}r] $$
where $l$ is a length-scale hyperparameter.
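The pseudotime prior and the Matern 3/2 covariance translate directly into a few lines of R. In this sketch `matern32` is a hypothetical helper name, and the values chosen for $\sigma_\tau$ and $l$ are illustrative rather than recommended settings.

```r
set.seed(1)

# Matern 3/2 covariance between two sets of pseudotimes, with length scale l
matern32 <- function(tau1, tau2, l) {
    r <- abs(outer(tau1, tau2, "-")) / l
    (1 + sqrt(3) * r) * exp(-sqrt(3) * r)
}

# Pseudotime prior: each cell is centred on its capture time k_c
obstime <- rep(c(0, 20, 40, 60, 80), each = 4)  # capture times in hours
sigma.tau <- 5                                   # user-supplied (value assumed)
tau <- rnorm(length(obstime), mean = obstime, sd = sigma.tau)

# Covariance structure over the sampled pseudotimes
l <- 20                                          # user-supplied (value assumed)
Sigma.tau <- matern32(tau, tau, l)

# The covariance is 1 at zero separation and decays as cells move apart
matern32(0, c(0, 20, 80), l)
```

A larger $l$ makes the covariance decay more slowly, so profiles are forced to be smoother; a smaller $l$ lets nearby cells differ more.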

Expression data

The model links the expression data to the expression profiles by evaluating the profiles at the pseudotimes and adjusting for a cell size factor, $S_c$. $$ x_{g,c} = x_g(\tau_c) + S_c $$ The cell size factors represent technical and biological effects such as sequencing depth and lysis efficiency and account for data in which average expression varies by cell. In our experience this is a common effect in single cell data and should be accounted for. We place a prior on the cell sizes that are estimated by the model. $$ S_c \sim \mathcal{N}(\mu_S, \sigma_S) $$
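The link between profiles and data can be illustrated with a small simulation. This sketch makes simplifying assumptions: the profiles $x_g(\tau)$ are replaced by sinusoids instead of Gaussian process draws, `profile` is a hypothetical stand-in, and the values used for the cell size prior are invented.

```r
set.seed(1)
G <- 3   # genes
C <- 10  # cells

# Stand-in for the expression profiles x_g(tau): smooth functions of pseudotime
# (in the model these are Gaussian process draws)
profile <- function(g, tau) sin(tau / 10 + g)

tau <- sort(runif(C, 0, 60))       # pseudotimes (hypothetical)
S <- rnorm(C, mean = 0, sd = 0.5)  # cell sizes S_c ~ N(mu_S, sigma_S), values assumed

# x_{g,c} = x_g(tau_c) + S_c: the cell size shifts every gene in a cell equally
x <- outer(seq_len(G), tau, profile) +
    matrix(S, nrow = G, ncol = C, byrow = TRUE)
```

Because $S_c$ is shared by all genes in a cell, subtracting it from each column, for example with `sweep(x, 2, S)`, recovers the profile values.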

Hyperparameters

All of the hyperparameters $\mu_\phi, \sigma_\phi, \mu_\psi, \sigma_\psi, \mu_\omega, \sigma_\omega, \mu_S, \sigma_S$ are estimated by an empirical Bayes procedure (see a separate vignette). The hyperparameters $l, \sigma_\tau$ are supplied directly by the user of the DeLorean package.