In kevinmcgregor/micore: Microbiome Covariance Regression

knitr::opts_chunk$set(echo = TRUE)

Microbiome Covariance Regression (MiCoRe) allows the estimation of how OTU co-occurrence networks vary with respect to a covariate profile using principles of covariance regression. This work was developed in the Greenwood Lab at McGill University.

Installation

MiCoRe can be installed easily from Github. Note that the name of the R package is all lowercase: \texttt{micore}.

if (!require(devtools)) {
  install.packages("devtools")
  library(devtools)
}
install_github("kevinmcgregor/micore", dependencies=TRUE)

The model

The goal of MiCoRe is to estimate how covariance matrices vary with repsect to a covariate profile in the context of microbiome data. Assume that the matrix $\Yb_{n\times (p+1)}$ contains the counts of $p+1$ taxa over $n$ samples. Taxon $p+1$ will be used as a reference taxon, and will not be included in the estimated covariance matrices. The matrix $\Xb_{n\times q}$ contains the $q-1$ covariates over which the covariance (or precision) matrix is assumed to vary, along with an intercept column. The vector $\xb_i=(1,x_{i1},\dots,x_{i(q-1)})^\top$ contains the covariates for individual $i$. We assume a multinomial logistic regression framework for the taxon counts. We denote the total count for individual $i$ as $M_i=\sum_{j=1}^{p+1} \Yb_{ij}$. We also assume that the true proportions of all the taxa in individual $i$'s microbiome is $\pib_i=(\pi_{i1},\dots,\pi_{i(p+1)})$, with $0<\pi_{ij}<1$ for all $j\in{1,\dots,p+1}$ and ${\sum_{j=1}^{p+1} \pi_{ij}=1}$. Then we assume the observed counts for individual $i$, denoted by $\Yb_{i\cdot}$, follow a multinomial distribution. The full model is written as. \begin{align} \label{eqn:model} \Yb_{i\cdot} | \etab_{i\cdot}, \Ab, \Bb, \gamma_i, \Psib, \Gammab &\sim \mbox{Multinomial}(M_i, \pib_i) \nonumber \ \etab_{i\cdot} | \Ab, \Bb, \gamma_i, \Psib, \Gammab &\sim \mbox{Normal}\left([\Ab+\gamma_i \Bb]\xb_i, \Psib\right) \nonumber \ \Cb=(\Ab,\Bb) | \Psib, \Gammab &\sim \mbox{Matrix-Normal}\left(\Cb_0, \Psib, \Gammab \right) \nonumber \ \Psib &\sim \mbox{inv-Wishart}\left(\nu_\Psi,\Psib_0\right) \nonumber \ \Gammab &\sim \mbox{inv-Wishart}\left(\nu_\Gamma,\Gammab_0\right) \nonumber \ \gamma_i &\sim \mbox{Normal}(0,1). \end{align} where the proportions $\pib_i$ are parameterized using a matrix of latent parameters, $\etab_{n\times p}$, whose elements are denoted by $\eta_{ij}$: \begin{align} \pib_i &= \left( \frac{\exp(\eta_{i1})}{1+\sum_{j=1}^p \exp(\eta_{ij})}, \dots, \frac{\exp(\eta_{ip})} {1+\sum_{j=1}^p \exp(\eta_{ij})}, \frac{1}{1+\sum_{j=1}^p \exp(\eta_{ij})} \right). \nonumber \end{align} \noindent The elements of $\etab$ can be thought of as the additive log-ratio transformed proportions with respect to the reference taxon $p+1$: \begin{align} \etab_{i\cdot} &= \left[ \log\left(\frac{\pi_{i1}}{\pi_{i(p+1)}}\right), \dots, \log\left(\frac{\pi_{ip}}{\pi_{i(p+1)}}\right) \right],\nonumber \end{align} where $\etab_{i\cdot}$ represents row $i$ of $\etab$.

Interpretations of parameters

Parameter interpretations come from marginalizing out the individual-specific term $\gamma_i$. The expected value for $\etab_{i\cdot}$ (i.e. the additive log-ratio transformed proportions for individual $i$) is written as: \begin{align} \E(\etab_{i\cdot} | \Ab, \Bb, \Psib, \Gammab) &= \Ab \xb_i . \nonumber \end{align} Hence, $\Ab$ characterizes how the covariates in $\xb_i$ affect the expected value of the additive log-ratio transformed proportions for individual $i$, and ultimately the relative abundances of the taxa for individual $i$. Likewise, the covariance matrix for $\etab_{i\cdot}$ is calculated as: \begin{align} \label{eqn:var} \var(\etab_{i\cdot} | \Ab, \Bb, \Psib, \Gammab) &= \Psib + \Bb\xb_i\xb_i^\top\Bb^\top \nonumber \ &= \Sigmab_{\xb_i} . \end{align} The matrix $\Sigmab_{\xb_i}$, or perhaps its corresponding correlation matrix, can then be used to define a taxon co-occurrence network for individual $i$ based on the covariates. In this expression, $\Psib$ can be thought of as a baseline covariance matrix and $\Bb$ describes how the covariates in $\xb_i$ affect $\Sigmab_{\xb_i}$.

Running MiCoRe

After installing the \texttt{micore} package, running the method is simple. Let's simulate some data and run the function. Note that you need to supply the model matrix $\Xb$, and you specifically need to give it an intercept column. Also note that, in this example, we run only 500 burn-in and 500 MCMC samples, but in practice, you should likely run for longer. For example, the default is 4000 burn-in and 4000 MCMC samples. After installing the \texttt{micore} package, running the method is simple. Let's load in some data and run the function. Note that you need to supply the model matrix $\Xb$, and you specifically need to give it an intercept column. This can be done easily using the \texttt{model.matrix()} function. Also note that, in this example, we run only 500 burn-in and 500 MCMC samples, but in practice, you should likely run for longer. For example, the default is 4000 burn-in and 4000 MCMC samples.

We'll run \texttt{micore} and include BMI in the model matrix.

n <- 100
p <- 5
q <- 2
# Simulating data
x <- rnorm(n)
# Model matrix with intercept column
X <- cbind(1, x)
counts <- matrix(0, n, p+1)
for (i in 1:n) {
  counts[i,] <- rmultinom(1, size=100, prob=rep(1,p+1))
}
# Loading in sample American Gut dataset included in package
library(micore)
data(amgut)
counts <- amgut$counts
X <- model.matrix(~BMI, amgut$clin.dat)
# Number of burn-in samples and number of MCMC samples to save
n.burn <- 500
n.samp <- 500
# Running micore
library(micore)
mc.fit <- micore(counts, X, n.burn = n.burn, n.samp = n.samp, 
                 n.chain=4, n.cores=4, verbose=TRUE)

Note that the \texttt{micore} object contains one list element for each MCMC chain run. In this example, we ran 4 chains, so each chain's data can be accessed like so:

# Chain 1
tmp <- mc.fit[[1]]
attributes(tmp)
# Chain 2
tmp <- mc.fit[[2]]
attributes(tmp)
# etc...

MCMC samples from any of the chains can be extracted from any of the parameters directly from this object. Each parameter is an array where the first dimension represents the . For example, we can extract the $\Bb$ parameter from chain 3:

# Extracting the B parameter from chain 3
B.3 <- mc.fit[[3]]$B
dim(B.3)
# Get 101th sample of B in chain 3
B.3[101,,]

The names of the parameters available to extract are: \begin{itemize} \item \texttt{eta}: $\etab$, the additive log-ratio transformed proportions \item \texttt{Psi}: $\Psib$, the baseline covariance matrix \item \texttt{A}: $\Ab$, the fixed effect'' parameter \item \texttt{B}: $\Bb$, therandom effect'' parameter \item \texttt{gamma}: $\gamma_i$, $i\in1,\dots,n$, the individual-specific parameter (not to be confused with \texttt{Gamma} with a capital G) \item \texttt{Gamma}: $\Gammab$ the column covariance matrix in the Matrix-Normal prior (not to be confused with \texttt{Gamma} with a lowercase g) \end{itemize} The MCMC samples from all chains can be merged together for a particular parameter in order to run summary statistics on all MCMC samples from the parameter:

# Merging all 4 chains into single array
B.merge <- mergeChains(mc.fit, par="B")
# Mean of B over all chains
apply(B.merge, 2:3, mean)

Getting estimated OTU abundances

Though the \texttt{micore} object contains all samples from all parameters from all chains, these data structures can be difficult to work with directly. This package contains a functions to get estimates (and credible intervals) of the OTU abundances for a given covariate profile $\xb_i$. These estimates can be done on either the additive log-ratio scale or on the proportions scale:

# Want to estimate OTU abundances for individual with x=1.3.
# Create the covariate profile with intercept
x.try <- matrix(c(1, 1.3), nrow=1)
# Get predicted abundances on additive log-ratio scale
p1 <- predict(mc.fit, newdata=x.try)
p1$fit
# Credible intervals
p1$quant
# Get predicted abundances on proportions scale
p2 <- predict(mc.fit, newdata=x.try, type="prop")
p2$fit
# Credible intervals
p2$quant

Getting estimated covariance matrix (or precision, correlation, partial correaltion matrix)

It's also possible to directly extract the estimated covariance matrix based on a covariate profile $\xb_i$. The function \texttt{getPredCov()} allows the user to provide a covariate profile and will return the corresponding estimated covariance, precision, correlation, or partial correlation matrix along with interval estimates. Note that the returned object saves the estimate value in and array in the \texttt{fit} slot, and the first dimension corresponds to the individuals that covariances matrices are being calculated for.

# Get estimated covariance matrix using covariate profile defined earlier...
cov1 <- getPredCov(mc.fit, newdata=x.try)
# Extract the covariance matrix for the first individual in x.try
cov1$fit[1,,]
# Credible intervals
cov1$quant$`2.5%`[1,,]
cov1$quant$`97.5%`[1,,]
# Partial correlation instead
pc1 <- getPredCov(mc.fit, newdata=x.try, type="pcor")
# Extract the partial correlation matrix for the first individual in x.try
pc1$fit[1,,]
# Credible intervals
pc1$quant$`2.5%`[1,,]
pc1$quant$`97.5%`[1,,]

Model diagnostics

When running the model, you should always run multiple chains (the default is 4), to check convergence of the parameters. This can be investigated using the \texttt{trplot()} function. Let's check the traceplot for the $\Bb_{11}$ matrix element:

trplot(mc.fit, par="B", ind=c(1,1))

In this example, the chains have not yet converged, meaning that it is possible that not enough MCMC samples have been run. In this case it would be wise to increase the parameters \texttt{n.burn} and \texttt{n.samp} in the \texttt{micore()} function.

It is also a good idea to check convergence of \texttt{eta}, as this parameter relies on the adaptive Metropolis step.

trplot(mc.fit, par="eta", ind=c(1,1))

Several parameters for the adaptive Metropolis sampler can be changed from their defaults; these parameters are specified as list elements of the \texttt{adapt.control} argument to \texttt{micore}. \begin{itemize} \item \texttt{init}: (default 0.1) The initial stepsize for the adaptive Metropolis parameters. \item \texttt{a}: (default 0.5) The adaptation rate. A higher value means the adaptations will vanish more quickly with the number of MCMC steps. \item \texttt{sigma.zero}: (default 1) The initial value of the adaptive Metropolis variance scaling parameter. \end{itemize} Sometimes the \texttt{a} parameter must be changed to get convergence for \texttt{eta}.

# Running micore with a different value for "a" parameter in adaptive Metropolis.  
# Setting a = 0.3.
mc.fit <- micore(counts, X, n.burn = n.burn, n.samp = n.samp, 
                 n.chain=4, n.cores=4, verbose=TRUE,
                 adapt.control = list(a=0.3))