In RGLab/MIMOSA2: MIMOSA with baseline

Introduction

Intracellular Cytokine Staining (ICS) assays are single-cell, cytometric assays that measure individual-level antigen-specific T-cell responses to stimulation. The assays are frequently used in vaccine trials to assess whether inidviduals have responded to the vaccine, i.e. whether their immune system has developed cellular memory to the vaccine insert.

In its simplest form, an ICS assay provides two measurements: a measurement from a control sample (non-stimulated) and a measurement from a stimulated sample (stimulated with antigen from the vaccine). The readout from each is the number of cytokine-producing cells and the total number of T-cells assayed.

To determine if an individual exhibits a response, the proportion of cytokine producing cells is compared between stimulation and control. If the proportion of cells in the stimulated sample is greater than in the non-stimulated sample, then the subject is deemed to have an antigen-specific response to the stimulation.

The MIMOSA model (Finak et al. Biostatistics, 2013) was developed to analyze such data. It implements a Bayesian hierarchical model that models the cell counts rather than the proportion, and shares infomration across subjects to regularize small proportions and increase sensitivity and specificity.

MIMOSA2 extends the original model by allowing for time-based measurments in the form of baseline and post-vaccine measurments. The ICS assay is performed at two timepoints: the post-vaccine timepoint, but also at baseline (pre-vaccination). We are interested in identifying individuals that exhibit antigen-specific but also vaccine-specific responses. For example, an individual may respond at both timepoints with equal magnitude, in which case they would be considered to exhibit a non-specific response, or they may respond at the post-vaccine timepoint only, or more strongly at the post-vaccine than pre-vaccine timepoint, both of which would be indicative of antigen-specific and vaccine-specific responses. This package implements a five-component mixture model that allows for variations of the above possibilities in order to identify individuals with vaccine-specific responses. The model is described in more detail below.

MIMOSA extended to model baseline.

We extend the MIMOSA model to allow fitting of two timepoints (not just post-vaccine).

For each subject, $i$, we observe a vector of counts $y_i : (n^{(j,t)}{i})$, where $t \in {0,1}$ indexes the timepoint and $j \in {u,s}$ indexes the stimulation. The counts represnt the number of positive cells for a cytokine or cytokine combination under study. We also observe the corresponding totals: $(N^{j,t}{i})$

We model the counts as binomially distributed with unknown subject-specific proportions: $n^{(s,1)}{i} \sim Binomial(N^{(s,1)}{i}, p^{(s,1)i})$, $n^{(s,0)}{i} \sim Binomial(N^{(s,0)}{i}, p^{(s,0)_i})$, and $n^{(u,1)}{i} \sim Binomial(N^{(u,1)}{i}, p^{(u,1)_i})$, $n^{(u,0)}{i} \sim Binomial(N^{(u,0)}_{i}, p^{(u,0)_i})$.

The unstimulated background and the stimulated samples are allowed to be different between timepoints.

As with MIMOSA, a given subject $i$ is a responder at a given timepoint $t$ if: $p^{(s,t)} > p^{(u,t)}$ and the subject is a non-responder if $p^{(s,t)} = p^{(u,t)}$. By the nature of the data, the proportions from stimulated samples will only be greater than or equal to the proportions from the non-stimulated samples.

As an additional complication, we are interested in vaccine-specific effects, where the post-vaccine stimulation at $t = 1$ is greater than the pre-vaccine stimulation at $t = 0$, such that $p^{(s,1)}_i - p^{(u,1)}_i > p^{(s,0)}_i - p^{(u,0)}_i$

We define a latent indicator $z = {1..k}$. If $z = 1$ the subject exhibits a vaccine-specific response and if $z > 1$, the subjects exhibits a non-specific response, or a non-response.

We set $k=11$, with the following model components.
$\mathcal{M}1 &: p_i^{(s,1)} \ne p_i^{(u,1)} \ne p_i^{(s,0)} \ne p_i^{(u,0)}$
$\mathcal{M}_2 &: p_i^{(s,1)} \ne p_i^{(u,1)} \ne p_i^{(s,0)} = p_i^{(u,0)}$
$\mathcal{M}_3 &: p_i^{(s,1)} = p_i^{(s,0)} \ne p_i^{(u,1)} \ne p_i^{(u,0)}$
$\mathcal{M}_4 &: p_i^{(s,1)} \ne p_i^{(s,0)} \ne p_i^{(u,1)} = p_i^{(u,0)}$
$\mathcal{M}_5 &: p_i^{(s,1)} = p_i^{(u,1)} \ne p_i^{(s,0)} = p_i^{(u,0)}$
$\mathcal{M}_6 &: p_i^{(s,1)} = p_i^{(u,1)} \ne p_i^{(s,0)} \ne p_i^{(u,0)}$
$\mathcal{M}_7 &: p_i^{(s,1)} = p_i^{(s,0)} = p_i^{(u,1)} = p_i^{(u,0)}$
$\mathcal{M}_8 &: p_i^{(s,1)} = p_i^{(s,0)} \ne p_i^{(u,1)} = p_i^{(u,0)}, p_u \gt p_s$
$\mathcal{M}_9 &: p_i^{(s,1)} = p_i^{(s,1)} = p_i^{(u,1)}, p_i^{(u,0)}> \text{rest}$
$\mathcal{M}_10 &: p_i^{(s,1)} = p_i^{(s,0)} = p_i^{(u,0)} , p_i^{(u,1)}> \text{rest}$ $\mathcal{M}_11 &: p_i^{(s,1)} = p_i^{(s,0)}, p_i^{(u,1)} = p_i^{(u,0)}, p_s \gt p{u}$

Under models 1-4, we have responses, with the principal constraint: $p_i^{(s,1)} - p_i^{(u,1)} - p_i^{(s,0)} + p_i^{(u,0)}$ > 0

For model 1, we have the additional constraint that: $p_i^{(s,0) > p_i^{(u,0)}$
For model 2, we have:
$p_i^{(s,1) > p_i^{(u,1)}$
For model 3:
$p_i^{(u,0) > p_i^{(u,1)}$
For model 4:
$p_i^{(s,1) > p_i^{(s,0)}$

Only observations that match these constraints (empirically) will have the possibility to be assigned to these components.

The Beta distributions are parameterized as $\mu = \alpha+\beta, \phi = \alpha + \beta$. Marginalizing out the $p$s, the posterior distributions of the counts are Beta-binomial with parameters $n + \alpha$ and $N - n + \beta$.

The latent $z_i$ are modeled as independent draws from a categorical distribution with $k=5$, and parameter vector $w$.

By treating the $z_i$ as missing data, we can fit the model in an Empirical-Bayes fashion using Expectation-Maximization and estimate the parameter vector $\theta = (\mu^{(s,1)},\mu^{(s,0)},\mu^{(u,1)},\mu^{(u,0)},\phi^{(j,t)}, w)$.

The complete data log-likelihood is given by: $l(\theta|z,y) = \sum_{i,k} z_{i,k} l_k(\theta|y_i) + z_{i,k}\log(w_k)$

The $w_k$ are the mixing proportions, with the constraint that $\sum_k w_k = 1$.

Given an estimate of the parameters $\tilde{\theta} = {\tilde{\mu}^{(s,1)},\tilde{\mu}^{(s,0)},\tilde{\mu}^{(u,t)},\tilde{\phi}^{(j,t)}, \tilde{w}}$ and the data, we compute the posterior probabilities of response in the E step as $Pr(z_i = k | y, \tilde{\theta}) = \frac{\tilde{w} L_k(\tilde{\theta} | y_i)}{\sum_k \tilde{w} L_k(\tilde{\theta} | y_i)}$ In the M-step we estimate the parameters given the current assignments of the observations to model components. Because components 1 through 4 represent responders and components 5 through 8 represent non-responders, the $i$th subject is a responder ($r=1$) with probability $Pr(r = 1 | z_i) = \sum_{k \in {1..4}} z_{i,k}$, where $k$ indexes model components, and $Pr(r = 0 | z_i) = 1-Pr(r = 1 | z_i)$ is the probability that subject $i$ is a non-responder. An observation is given a hard assignment to a responder model component if $Pr(r = 1 | z_i) > Pr(r = 0 | z_i)$ and a non-responder model component otherwise. The specific model component is set to $argmax_k(z_{i,k})$, conditioning on $r$. The component weights are estimated as $\tilde{w} = \sum_i \tilde{z_i}/I$, and the remaining model parameters are estimated using numerical optimization with R's optim routine. To avoid convergence issues, the $\mu$ are fit on the $logit$ scale, while $\phi$ is fit on the $log$ scale.

Usage

library(MIMOSA2)
library(data.table)
library(ggplot2)

We simulate from the model.

set.seed(100)
s = simulate_MIMOSA2(effect=2e-4, baseline_stim_effect=5e-5,baseline_background = 2e-5,bg_effect =2e-5,phi=c(50000,50000,50000,50000),P=100,rng=c(120000,150000))

s is a list with components names(s). Ntot is a matrix with column names "ns1" "nu1" "ns0" and "nu0". These correspond to the total T-cell counts for stimulated and unstimulated conditions at time 0 and time 1 (pre-vaccine and post-vaccine). There is also an entry "truth" which holds the ground truth model component from which each subject was simulated.

tbl = table(data.frame(s$truth))
tbl = data.frame(tbl)
setnames(tbl, c("Var","Freq"))
knitr::kable(data.frame(tbl))

"R" corresponds to responders. "NR" corresponds to non-responders, and "NV1", and "NV2" correspond to non-vaccine specific responses.

The data is input into the MIMOSA2 modeling function.

# fit = with(s, MIMOSA2(Ntot=Ntot,ns1=nu1,nu1=ns1,ns0=nu0,nu0=ns0,tol=1e-5))
fit = with(s, MIMOSA2(Ntot=Ntot,ns1=ns1,nu1=nu1,ns0=ns0,nu0=nu0,tol=1e-5))

The output of MIMOSA2 is a list with components names(fit). The "z" is matrix with the probability that each observation arises from each component. "inds" holds the hard assignment of each observation to each model component. The hard assignments are done hierarchically, so if the probability that an observation arises from the two responder components is greater than the probability it arises from the non-responder or non-specific response components, then it is assigned to the response component with the greatest probability.

The confusion matrix from the model fit is shown below.

table(fit=c("R1","R2","R3","R4","NR1","NR2","NR3","NSR","NSR2","NSR3","NSR4")[max.col(fit$z)],truth = s$truth)

We plot the results.

Boxplot(obj=fit,truth=s$truth)
p =with(s,{p = do.call(cbind,mget(colnames(Ntot)))/Ntot;colnames(p)=c("pu1","ps1","pu0","ps0");data.frame(p,truth)})

ggplot(melt(p,value.var=c("pu1","ps1","pu0","ps0")))+geom_boxplot(outlier.color=NA)+facet_wrap(~variable)+geom_jitter()+aes(x=truth,y=value)

ggplot(p)+geom_boxplot(outlier.color=NA)+geom_jitter()+aes(x=truth,y=ps1-pu1-ps0+pu0)

We have r sum(table(fit=as.numeric(colnames(fit$inds)[max.col(fit$inds)]),truth = s$truth)[1:4,1:4]) false positives, and r sum(table(fit=as.numeric(colnames(fit$inds)[max.col(fit$inds)]),truth = s$truth)[-c(1:4),c("R1","R2","R3","R4")]) false negatives.

We can generate ROCs comparing against a standard test of odds ratios.

# ortest = with(s,ORTest(Ntot,nu1,ns1,nu0,ns0))
ortest = with(s,ORTest(Ntot,ns1,nu1,ns0,nu0))

toplot = ROC(or_test=ortest, fit=fit,truth=s$truth%in%c("R1","R2","R3","R4"))
ROCPlot(toplot)+guides(color=guide_legend(nrow=2))

fdr = function (z) 
{
    fdr <- rep(0, nrow(z))
    o <- order(z[, 2], decreasing = TRUE)
    fdr[o] <- (cumsum(z[o, 1])/1:nrow(z))
    return(fdr)
}
RRvTh = function(thresh,truth,add=FALSE,lty=3,...){
  th = thresh
  o = order(th,decreasing=FALSE)
  tpc = cumsum(truth[o])
  fpc = cumsum(!truth[o])
  tpc = tpc/sum(truth)
  fpc = fpc/sum(!truth)
  if(add){
      lines(y=tpc,th[o],type="l",ylab="Rate",xlab="FDR Threshold",col="blue",lty=ifelse(add,lty,1))
  }else{
      plot(y=tpc,th[o],type="l",ylab="Rate",xlab="FDR Threshold",col="blue",...)
  }
  lines(y=fpc,th[o],type="l",col="red",lty=ifelse(add,lty,1))
}
RRvTh(fdr(cbind(1-rowSums(fit$z[,1:4]),rowSums(fit$z[,1:4]))),s$truth%in%c("R1","R2","R3","R4"),xlim=c(0,0.1),ylim=c(0,1))
RRvTh(ortest,s$truth%in%c("R1","R2","R3","R4"),add=TRUE)
legend("topright",lty=c(1,2),c("MIMOSA","OR Test"))

The ROC curves show that MIMOSA2 performs better than a one-sided test for a difference in log-odds ratios.

Control of the false discovery rate.

fdr_th = fdr(cbind(1-rowSums(fit$z[,1:4]),rowSums(fit$z[,1:4])))
o = order(fdr_th,decreasing=FALSE)
plot(y=fdr_th[o],x=cumsum(!s$truth[o]%in%c("R1","R2","R3","R4"))/1:length(s$truth),type="l",ylab="observed fdr",xlab="nominal fdr",xlim=c(0,1),ylim=c(0,1),col="blue")
abline(0,1,lty=2)
o = order(ortest,decreasing=FALSE)
lines(y=ortest[o],x=cumsum(!s$truth[o]%in%c("R1","R2","R3","R4"))/1:100,type="l",ylab="observed fdr",xlab="nominal fdr",col="red")
abline(0,1,lty=2)
legend("bottomrigh",c("MIMOSA2","ORTest"),col=c("blue","red"),lty=1)