Motivation

Discretization of single cell data for network inference and queries prevents understanding of system dynamics.

Causal Bayesian networks can capture the structure of signaling dynamics. Ideally, we would like to query the network to gain deeper insight into the dynamics of the system. When we learn causal Bayesian networks from single cell data such as CYTOF, the common practice is to discretize the data. Putting aside the challenges of network inference and the question of quality of the inferred network, probability queries on such a network require working with multiple many-dimensional conditional probability tables, each with several parameters that give very little intuition to the underlying biological mechanism.

Proposed contribution

Build CPDs based on statistical models with parameters with dynamic system interpretation.

We can solve this by defining continuous CPDs that are structurally appropriate to the nature of spectra data, and with parameters that are interpretable in terms of the kinetic rates. The contribution would be to go beyond network inference to helping biologists construct and/or validate dynamic models of signaling from snap-shot time-point 'omic's-style data.

For example, biologists will be able to derive p-values and confidence intervals for rate-based parameters, and test the hypothesis that a signaling protein is indeed impacted by its parents through the assumed biochemical mechanism.

Directly modeling the CYTOF data without first applying discretization or some other information-reducing measure would compare to the application of Poisson or negative binomial models to RNAseq data or log-linear mixed models to proteomics data. Relative to those methods, a novel feature of this proposed work would be build such models as conditional probability distibutions in a Bayesian network, where predictors of a protein include the variables representing its parent proteins in the network, rather than just fixed experimental treatments. Further, unlike with statistical methods for RNAseq and proteomics that just model statistical association, the model parameters have a interpretation in terms of signaling dynamics.

Proposed Methodology

Use the generalized linear model or nonlinear least squares to fit child-parent tuples in the network.

The steps would be as follows, for every child-parents tuple in the network:

  1. Make an assumption on the biochemical mechanism linking the parents to the child, perhaps using prior systems biology knowledge.
  2. Assume a kinetic model representing the mechanism, for example first-order kinetics.
  3. Determine a statistical model with model parameters based on the rate parameters in the kinetic model.
  4. Fit the statistical model on data and conduct inference on the rate parameters.

Finally, we fit each tuple model from data, and do inference on the estimated parameters.

Step 3 involves the traditional statistical workflow of trying a statistical model, evaluating the goodness-of-fit, and adjusting the model or rejecting it and trying another. However, requiring that model parameters have a signaling dynamics interpretation imposes a constraint on the space of model applicable. The following introduces some of my progress in exploring this constrained space.

Background on the generalized linear model

The generalized linear model (GLM) generalizes ordinary linear regression to response variables that have various types of distributions (not just normal). It is comprised of two components, a linear predictor;

$\eta = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p$,

and a link function:

$g(\mu) = \eta$,

where $\mu$ is the mean response, and g is a monotone function over the domain of $\eta$.

The most well known examples include linear regression where $g$ is the identity function, and logistic regression where g is the logit function. Along with selecting a link function, you make an assumption about the distribution appropriate for modeling the response variable. Which link and distributional assumption are appropriate depends on the nature of the data. However, there are common pairings, such as the identity link with the normal distribution and the logit link with the binomial distribution.

My current work has been to explore kinetic models of phosphorylation that can be reformulated as $g(\mu) = \eta$. Here $\mu$ is the mean of a response variable derived from single cell measurements of a protein and assumed to quantify abundance of the phosphorylated form of the protein. $g$ is a common glm link function. \eta is a linear combination of the parent protein measurements and rate parameters. To go from the kinetic model to the $g(\mu) = \eta$ formulation, transformations need to be applied to the predictors (parent proteins measurements), response (child protein measurements) and the kinetic rate parameters.

Example: Linear regression on first-order kinetics model

Given the network edge X -> Y, under an assumption of first order kinetics, we can represent the relationship with the following ODE:

$\frac{dY_p}{dt} = \alpha X_p Y_u - \gamma Y_p$

$Y_p$ and $Y_u$ denote the abundance of phosphorylated/unphosphorylated Y respectively, and $X_p$ and $X_u$ have the same denotation for X. $\alpha X_p Y_u$ is the rate of phosphorylation of Y by X, and $\gamma Y_p$ is the rate of dephosphorylation of Y. $\alpha$ and $\gamma$ are rate parameters.

This has the following steady state solution:

$\frac{Y_p}{Y_u + Y_p} = f\left(\frac{\alpha}{\gamma} X_p \right)$ where $f = \frac{u}{1+u}$

Below I denote the total amount of phosphorylated and unphosphorylated Y as M, so the above becomes:

$\frac{Y_p}{M} = f\left(\frac{\alpha}{\gamma} X_p \right)$

I can create simulator for this model, using random draws from a gamma distribution to simulate values for $X_p$. After simulating $Y_p$ I add some noise (using R's jitter function) so things look more realistic.

library(dplyr)
library(magrittr)
f <- function(u) u / (1 + u)
#' a alpha
#' g gamma
#' M upper bound on Y_p, M = Y_p + Y_u
#' n number of desired observations
sim_data1 <- function(n, a, g, M){
  X_p <- rgamma(n, shape = 2, scale = 2)
  Y_p <- (M *  f(((a / g) * X_p))) %>%
            jitter(500) %>%
            {ifelse(. < 0, 0, .)} %>%
            {ifelse(. >= M, M, .)}
  data.frame(X_p, Y_p)  
}
sim_data1(10, .6, 2, 15)

In the following examples I simulate 100 observations (though a single cell experiment would provide much more). Here I set $\alpha$ to .6 and $\gamma$ to 2. I fix I also add some noise to the response $Y_p$ (using R's 'jitter' function). I always assume that M is known, here I set it to 15.

set.seed(4)
.data <- sim_data1(100, .6, 2, 15) %T>% plot

Modeling with linear regression

The first step in modeling with a GML is to reformulate $\frac{Y_p}{M} = \frac{\frac{\alpha}{\gamma} X_p}{1 + \frac{\alpha}{\gamma} X_p}$ into an equation with a common link function on the LHS, and a linear predictor on the RHS.

First, for purposes of identifiability, I need to reduce the parameter set. First let $\frac{ \alpha}{\gamma} = \beta$. Despite the reduction in parameters, $\beta$ has an interpretation in terms of signaling dynamics; the ratio of the phosphorylation rate parameter to the dephosphorylation rate parameter.

$\frac{Y_p}{M} = \frac{ \beta X_p }{1 + \beta X_p}$

$\frac{M}{Y_p} = \frac{1 + \beta X_p}{\beta X_p}$

$\frac{M}{Y_p} = \frac{1 + \beta X_p}{\beta X_p}$

$\frac{M}{Y_p} = \frac{1}{\beta X_p} + 1$

To get a linear function I do transformations.

$\dot{Y_p} = 1 + \tilde{\beta} \dot{X_p}$ where $\tilde{\beta} = \beta^{-1} = \frac{\gamma}{\alpha}$, $\dot{Y_p} = MY_p^{-1}$, and $\dot{X_p} = X_p^{-1}$

With this form I can try applying the following linear regression model:

$\dot{Y_p} = 1 + \tilde{\beta} \dot{X_p} + \epsilon$ where $\epsilon \sim N(0, \sigma^2)$

So the model will estimate the inverse of $\beta$, the ratio of the dephosphorylation rate parameter to the phosphorylation rate parameter. Since I set $\alpha$ to .6 and $\gamma$ to 2, the true value of $\tilde{\beta}$ is 3.3333 and $\beta$ is .2. Note the intercept is fixed at 1 with an offset.

.data_linear <- .data %>%
  mutate(Ydot = 15 * Y_p^-1, Xdot = X_p^-1) # Transform the variables for linear regression
fit <- lm(Ydot ~ offset(rep(1, nrow(.data))) + Xdot, .data_linear)

Visualizing the fit:

.plotdata <- .data[order(.data$X_p), ] 
.plotdata <- mutate(.plotdata, Xdot = X_p^-1)
with(.plotdata, plot(X_p, Y_p, main = "Model fit"))
with(.plotdata, lines(X_p, 15 * predict(fit, newdata = .plotdata)^-1, col = "darkgreen"))
legend("bottomright",legend=c("Linear Regression"), lwd=c(2), col=c("darkgreen"),lty=c(1), cex=c(.5))

The model fit provides robust inference:

summary(fit)

The biologist can use this model out to perform hypothesis tests on the parameters. Here the estimate of $\dot{\beta}$ is r coef(fit)["Xdot"], so for $\beta$ it is r coef(fit)["Xdot"]^-1. Since the estimate of $\dot{\beta}$ is a maximum likelihood estimator, the inference easily extends to $\beta = \frac{\alpha}{\gamma}$. For example, we can generate confidence intervals for $\beta$:

structure((confint(fit)[2, ]^-1)[c(2, 1)], names = c("0.5 % ", "99.5 %"))

But clearly this is not working. The above figure shows a poor fit. The interval does not include .3. Inference fails because the assumption of constant variance is violated, as seen in plotting the linear predictor against the transformed response:

with(.data_linear, plot(Xdot, Ydot))

If I chose to continue working with this model, I could apply techniques to account for increasing variance. But for the purposes of this analysis, I will demonstrate some alternative models.

Modeling with Gamma GLM

There are several reasons the gamma distribution GLM may be a good choice for modeling this kind of kinetic model. Firstly, $\frac{Y_p}{M} = \frac{\frac{\alpha}{\gamma} X_p}{1 + \frac{\alpha}{\gamma} X_p}$ is a specific case of the Michaelis-Menton function $\mu = \frac{k_0 x}{1 + k_1 x}$. This function is a good match for the gamma GLM because both it and the mean function of the gamma GLM have upper bounds. Secondly, the gamma GLM variance function can handle increasing variance in the response. Lastly, while linear regression involved an inverse transformation on the response, the gamma GLM is commonly matched with the inverse link $g(\mu) = \mu^{-1}$, so no transformation is neccessary.

.data_gamma <- mutate(.data, Ydot = Y_p / 15, Xdot = X_p^-1)
gamma_fit <- glm(Ydot ~  (offset(rep(1, nrow(.data))) + Xdot), family=Gamma(link="inverse"),
                 .data_gamma)

Visualizing the fit:

.plotdata <- .data[order(.data$X_p), ] 
.plotdata <- mutate(.plotdata, Xdot = X_p^-1)
with(.plotdata, plot(X_p, Y_p, main = "Model fit"))
with(.plotdata, lines(X_p, 15 * predict(fit, type = "response", newdata = .plotdata)^-1, col = "darkgreen"))
with(.plotdata, lines(X_p, 15 * predict(gamma_fit, type = "response", newdata = .plotdata), col ="darkred"))
legend("bottomright",legend=c("Linear Regression", "Gamma GLM"), 
       lwd=c(2, 2), col=c("darkgreen", "darkred"),lty=c(1,1), cex=.5)

Robust inference is provided:

summary(gamma_fit)

The estimate of $\frac{\alpha}{\beta} =$ r as.numeric(coef(gamma_fit)["Xdot"] ^ -1). If we accepted this model, we could use the F-test to test hypotheses on the value of $\beta = \frac{\alpha}{\gamma}$. We can also generate confidence intervals.

structure((confint(gamma_fit, level = .99)[2, ]^-1)[c(2, 1)], names = c("0.5 % ", "99.5 %"))

Modeling with logistic regression

The binomial distribution models number of successes in a set of trials. Suppose for a given protein, we consider each instance of the protein as a trial, and when the protein is phosphorylated, it is a success. $Y_p + Y_u = M$ is the total number of trials (which we assume we know).

We can apply logistic regression with a logit link $g(\mu) = log \left( \frac{\mu}{1-\mu} \right)$.

Converting the steady state model to this form:

$\frac{Y_p}{M} = \frac{\beta X_p}{1 + \beta X_p}$

$\frac{Y_p}{M} = \frac{exp(log(\beta X_p)}{1 + exp(log(\beta X_p))}$

$logit(\frac{Y_p}{M}) = log(\beta X_p)$

$logit(\frac{Y_p}{M}) = log(\beta) + log(X_p)$

$logit(\frac{Y_p}{M}) = \dot{\beta} + \dot{X_p}$ where $\dot{\beta} = log(\beta)$ and $\dot{X_p} = log(X_p)$. So I perform logistic regression of $\frac{Y_p}{M}$ against a log transformed $X_p$ with no coefficient parameter and estimate the intercept parameter $\dot{\beta}$, which is the log of $\beta$.

.data_logistic <- .data %>%
  mutate(Xdot = log(X_p)) 
logistic_fit <- suppressWarnings(glm(cbind(Y_p, 15 - Y_p) ~ offset(Xdot), 
                                     family = binomial(link="logit"), .data_logistic))

Visualizing the fit:

.plotdata <- .data[order(.data$X_p), ] 
.plotdata <- mutate(.plotdata, Xdot = X_p^-1)
with(.plotdata, plot(X_p, Y_p, main = "Model fit"))
with(.plotdata, lines(X_p, 15 * predict(fit, type = "response", newdata = .plotdata)^-1, 
                      col = "darkgreen"))
with(.plotdata, lines(X_p, 15 * predict(gamma_fit, type = "response", newdata = .plotdata),
                      col ="darkred"))
.plotdata <- mutate(.plotdata, Xdot = log(X_p))
with(.plotdata, lines(X_p, 15 * predict(logistic_fit, type = "response", newdata = .plotdata), 
                      col ="darkblue"))
legend("bottomright",legend=c("Linear Regression", "Gamma GLM", "Logistic Regression"), 
       lwd=c(2, 2, 2), col=c("darkgreen", "darkred", "darkblue"), lty=c(1,1, 1), cex=.5)

Again robust inference is provided on an MLE estimate.

summary(logistic_fit)

Calculating the estimate of the original $\beta$

beta_tilde <- as.numeric(coefficients(logistic_fit))
beta <- exp(beta_tilde)
beta

Generating a confidence interval:

exp(suppressWarnings(confint(logistic_fit, level = .99)))

Quasi-likelihood models

Discretization abstracted away variation in CYTOF data. It remains an open question of how to model the variance in raw CYTOF data. Quasi-likelihood models, a variance-flexible extention of GLMs, may prove useful here. Supposing CYTOF data were to have more variability than the above models would expect, we can explicity model the variance with quasi-likelihood models, and have equivalent or comparable computational efficiency.

Modeling CPD for proteins with multiple network parents

Determining the statistical model for proteins with multiple parents depends heavily on the mechanism. Suppose the network has a V-structure X1 -> Y <- X2. When the parents' effect on the child is based on an AND relationship, modeling is straightforward. With first order dynamics, an AND model can be represented as:

$\frac{dY_p}{dt} = \alpha X1_p X2_p Y_u - \gamma Y_p$.

The steady state has the same Michaelis-Menton functional form as the models above:

$\frac{Y_p}{M} = \frac{\frac{\alpha}{\gamma} X1_p X2_p}{1 + \frac{\alpha}{\gamma} X1_p X2_p}$

The multiplicative relationship between $X1_p$ and $X2_p$ in the AND mechanism means we can apply those models. However, when the relationship is an OR relationship as with:

$\frac{Y_p}{M} = f\left(\frac{\alpha_1}{\gamma} X1_p + \frac{\alpha_1}{\gamma} X2_p\right)$,

the relationship between $X1_p$ and $X2_p$ is additive, and therefore log and the inverse transforms will not provide a linear predictor.

So in the OR case, as with other kinetic models that will not work with GLM, nonlinear least squares (NLS) is the logical choice. NLS would use least squares optimization to fit the model

$\frac{Y_p}{M} = f\left(\frac{\alpha_1}{\gamma} X1_p + \frac{\alpha_1}{\gamma} X2_p\right) +\epsilon$

where $\epsilon$ is an error term. For statistical inference, we'd assume the error term has a normal distribution.

Let $\beta_1 = \alpha_1/\gamma, \beta_2 = \alpha_2/\gamma$

Creating a data simulator:

#' a1 alpha_1
#' a2 alpha_2
#' g gamma
#' M upper bound on Y_p = Y_p + Y_u
#' n number of desired observations
sim_data2 <- function(n, a1, a2, g, M){
  X1_p <- rgamma(n, shape = 2, scale = 2)
  X2_p <- rgamma(n, shape = 2, scale = 2)
  Y_p <- M *  f(((a1 / g) * X1_p + (a2 / g) * X2_p)) %>%
            jitter(500) %>%
            {ifelse(. < 0, 0, .)} %>%
            {ifelse(. >= M, M, .)}
  data.frame(X1_p, X2_p, Y_p)
}
sim_data2(10, .06, .08, 2, 40)

So if a1 = .06, a2 = .08, and g = 2, then b1 = .03, b2 = .04. Let M = 40.

.data <- sim_data2(200, .06, .08, 2, 40) 
start_list <- list(b1 = .08, b2 = .1) # NLS requires specification of starting values
fit <- nls(Y_p ~ 40 * f(b1 * X1_p + b2 * X2_p), .data, start_list, 
    algorithm = "port", lower = c(0, 0), upper = c(1, 1))
summary(fit)

Confidence intervals:

confint(fit, level = .99)

However this interval wouldn't be conservative enough, because plots reveal the residual distribution has fatter tails than would be expected with a normal distribution.

qqnorm(residuals(fit))

We could remedy this with iterated least squares fitting.

Summary

The goal is to develop a statistical models for conditional probability distributions in a causal network with biologically interpretable parameters that can be inferred from continuous single cell quantifications of signaling protein activity such as with CYTOF. My purpose here has to describe what the contribution would be and demonstrate what the methodology might look like.

Selection of the statistical model depends entirely on the preceeding step of establishing some understanding the biological mechanism connecting child nodes and parent nodes in the network, and selecting an appropriate kinetic model of that connection. The methods explored above all rely on the assumption of measuring signaling at steady state and a kinetic model with a steady state solution. This requires us to discuss when this assumption would hold up and when it would fail.

Finally, with simulated data, interpretation of data in terms of abundance and concentration of signaling proteins is straightforward. However, quantifications of spectra from real data involves preserving assumptions about the relationship between spectra and abundance while applying transformations to the data that make the variance easier to model. This was not an issue when data was merely discretized, and indeed it is the primary motivation for discretization. Solving this would be a major contribution in itself.



robertness/pathflow documentation built on May 27, 2019, 10:33 a.m.