stopifnot(require(knitr))
opts_chunk$set(
  comment=NA, 
  message = FALSE, 
  warning = FALSE, 
  eval = params$EVAL,
  dev = "png",
  dpi = 500,
  fig.asp = 0.618,
  fig.width = 14,
  out.width = "100%",
  fig.align = "center"
)
library("NBZIMM")

Introduction

This vignette focuses on demonstrations of Zero-inflated Negative Binomial Mixed Models (ZINBMMs) in analyzing sparse longitudinal microbiome count data with NBZIMM.

The NBZIMM provides functions for setting up and fitting zero-inflated Negative binomial mixed models. We have created the function glmm.zinb and mms for setting up and fitting the proposed ZINBMMs. The functions glmm.zinb and mms works by repeated calls to the function lme for fitting linear mixed models in the recommended package nlme in R. ZINBMMs could directly analyze sparse longitudinal metagenomics count data. We develop an efficient EM-IWLS algorithm to fit the ZINBMMs.

In this vignette we'll use the glmm.zinb function and mms function in the NBZIMM package to demonstrate the analysis of a public available dataset from Romero et al (2014).

Model

In a longitudinal studie, we denote $c_{ijh}$ as the observed count for the $h-th$ taxon from the $j-th$ sample of the $i-th$ subject, where $\ j = 1,\dots,\ n_i;\ i = 1,\dots,\ n$. Let $n$ be the total number of subjects and $n_i$ be the total number of samples for each subject. We analyze one taxon at each time. For notational simplification, we denote $y_{ij} = c_{ijh}$ for any given taxon $h$. To address the sparsity issue in some microbiome taxa, we assume that $y_{ij}$ may come from the zero-inflated negative binomial (ZINB) distribution. The distribution could be written as below:

$$y_{ij} \sim \left{ \begin{array} {rrr} 0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ & &\ with\ probability\ p_{ij} \ NB(y_{ij}|\mu_{ij}, \theta) & = \tfrac{\Gamma(y_{ij} +\theta)}{\Gamma(\theta)y_{ij}!} (\tfrac{\theta}{\mu_{ij} +\theta})^{\theta} (\tfrac{\mu_{ij}}{\mu_{ij} +\theta})^{y_{ij}} & with\ probability\ 1-p_{ij} \end{array} \right.$$

where $\mu_{ij}$ and $\theta$ are the mean and dispersion parameters in the negative binomial distribution. $p_{ij}$ is the unknown probability that $y_{ij}$ is from the zero state, which are assumed to relate some covariates through the logit link function:

$$ logit(p_{ij}) = Z_{ij}\alpha $$
where $Z_{ij}$ includes some covariates that are potentially associated with the zero state. The simplest zero-inflation model includes only the intercept in $Z_{ij}$, resulting in the same probability of belonging to the zero state for all zeros. We also can add the random-effect terms into the above model:

$$ logit(p_{ij}) = Z_{ij}\alpha+G_{ij}a_i $$

where the random effects $a_i$ are assumed to follow a multivariate normal distribution: $$a_i \sim N(0, \Psi_a)$$ where $\Psi_a$ is the variance-covariance matrix for the random effects. It can be a general positive-definite matrix to model the correlation structure among the random covariates. The most simplified case is to restrict it to be a diagonal matrix by assuming that the random effects are independent.

At the same time, the means $\mu_{ij}$ are expressed as:

$$ log(\mu_{ij}) = log(T_{ij})+X_{ij}\beta+G_{ij}b_i $$

where $log(T_{ij})$ is the offset that corrects for the variation of the total sequence reads. $X_{ij}=(1,X_i, t_{ij}, X_i^st_{ij})$, $X_i^s$ is the variable of interest in $X_i$, for example, an indicator variable for the case group and the control group, and $G_{ij}=(1, t_{ij})$; $\beta=(\beta_0, \beta_1, \beta_2, \beta_3)^T$ is the vector of fixed effects (i.e., population-level effects), including an intercept, the effects of the host variables $X_i$, the over-all time effect, and the interaction $X_i^s$ and $t_{ij}$ ; $b_i = (b_{0i}, b_{1i})^T$ is the vector of random effects (i.e., subject-level effects), including the random intercept $b_{0i}$ and the random time effect $b_{1i}$. The vector of the random effects is usually assumed to follow a multivariate normal distribution (Pinheiro and Bates, 2000;McCulloch and Searle, 2001):

$$b_i \sim N(0, \Psi_b)$$

The EM-IWLS Algorithm for Fitting the ZINBMMs

We propose an EM-IWLS algorithm to fit the ZINBMMs. The EM-IWLS algorithm combines the IWLS (Iterative Weighted Least Squares) algorithm developed by Zhang et al. (2018) into EM algorithm. We can relax the assumption of independent within-subject errors to account for special within-subject correlation structures: $$y_{ij}=X_{ij}\beta+G_{ij}b_i+(1-\hat{\xi_{ij}})^{-1/2}e_{ij},\ b_i \sim N(0, \Psi),\ e_i=(e_{i1},\ ...,\ e_{in_i})`\sim N(0, \sigma^2R_i) $$ where $\xi_{ij}$ is the indicator variable for zero-inflated distribution, and $R_i$ is a correlation matrix, which describes dependence among observations. Pinheiro and Bates (2000) describes several ways to specify the correlation matrix $R_i$, all of which can be incorporated into our ZINBMMs.

Demonstration with Example

This example used a published dataset from Romero et al.(2014). It was a case-control longitudinal study collected 22 pregnant women who delivered at term and 32 non-pregnant women, each measured at multiple time points. The data consist of two data components: OTU and SampleData. OTU contains microbiome count data for 900 samples with 143 taxa; SampleData contains data of sample variables for all the samples with 9 variables, including subject ID, pregnant status, total sequencing read, age, race, etc.

Here we import the dataset from the package NBZIMM and check the clinical variables first.

data(Romero) # see help("Romero")
names(Romero)
otu = Romero$OTU
sam = Romero$SampleData

N = sam[, "Total.Read.Counts"]        
Days = sam$GA_Days; Days = scale(Days)
Age = sam$Age; Age = scale(Age)
Race = sam$Race
preg = sam$pregnant; table(preg)

subject = sam[, "Subect_ID"]; table(subject)

non = nonzero(y = otu, total = N, plot = F)
nonzero.p = non[[1]]

data = data.frame(Days=Days, Age=Age, Race=Race, preg=preg, N=N, subject=subject)

Analysis of a single taxon each time

We then show the analysis of a single taxon (Lactobacillus) with the function glmm.nb.

The group variable of interest is pregnancy women vs non-pregnant women. We consider the sample collection time as the time variable and also include covariates, age, race. Different models can be used to analyze this dataset. The first model could be a simple random intercept without interaction term. An offset term is needed to adjust for the library size N when analyzing a longitudinal microbiome dataset.

y = otu[, names(nonzero.p)[1]] # Lactobacillus

f = glmm.zinb(y ~ Days + Age + Race + preg + offset(log(N)), random = ~ 1 | subject, data = data)

summary(f)
y = otu[, names(nonzero.p)[1]] # Lactobacillus

f = glmm.zinb(y ~ Days + Age + Race + preg + offset(log(N)), random = ~ 1 | subject, data = data)

We can summarize the results using summary function.

summary(f)

Now, we can add the interaction term. The time interaction for Lactobacillus between two groups is significant.

f = glmm.zinb(y ~ Days*preg + Age + Race  + offset(log(N)), random = ~ 1 | subject, data = data)

summary(f)
f = glmm.zinb(y ~ Days*preg + Age + Race + offset(log(N)), random = ~ 1 | subject, data = data)

We can summarize the results using summary function.

summary(f)

Or, we can add the random slope in the model.

f = glmm.zinb(y ~ Days*preg + Age + Race  + offset(log(N)), random = list(subject = pdDiag(~Days)), data = data)
f = glmm.zinb(y ~ Days*preg + Age + Race + offset(log(N)), random = list(subject = pdDiag(~Days)), data = data)

We can also incorporate autoregressive of order 1, AR(1) for correlation matrix $R_i$ to describe dependence among observations.

f = glmm.zinb(y ~ Days*preg + Age + Race  + offset(log(N)), random = ~ 1 | subject, correlation = corAR1(), data = data)
f = glmm.zinb(y ~ Days*preg + Age + Race + offset(log(N)), random = ~ 1 | subject, correlation = corAR1(), data = data)

Analysis of all taxa with a given nonzero proportion

On the contrary, if you have many taxa of interest, it is easier to analyze them all at one time with the function mms. The following example analyze all the taxa with the proportion of non-zero values $>$ min.p.

f = mms(y = otu, fixed = ~ Days + Age + Race + preg + offset(log(N)), 
random = ~ 1 | subject,
min.p = 0.2, method = "zinb", data = data)
f = mms(y = otu, fixed = ~ Days + Age + Race + preg + offset(log(N)), 
random = ~ 1 | subject,
min.p = 0.2, method = "zinb", data = data)

Visualize the results

Then we can generate a plot to view the significant taxa associated with various covariates and corresponding p-values.

out = fixed(f)$dist
out = out[out[,2]!="(Intercept)", ]
res = out[, 3:5]

par(mfrow = c(1, 1), cex.axis = 1, mar = c(2, 14, 4, 4))
plot.fixed(res=res, threshold=0.05, gap=500, col.pts=c("black", "grey"),
cex.axis=1, cex.var=1)

A heat map could also be generated to present the results for each covariate.

out = fixed(f)$dist
out = out[out[,2]!="(Intercept)", ]
heat.p(df=out, zigzag=c(T,F), abbrv=c(T,F), margin=c(1.5,10), y.size=8)
knitr::include_graphics('Heatmap for zinbmms vignette.jpeg')

References

Pinheiro, J.C., and Bates, D.M. (2000) "Mixed-Effects Models in S and S-PLUS", Springer.

Venables, W. N. and Ripley, B. D. (2002) "Modern Applied Statistics with S". Fourth edition. Springer.

Xinyan Zhang, Himel Mallick, Xiangqin Cui, Andrew K. Benson, and Nengjun Yi (2017) Negative Binomial Mixed Models for Analyzing Microbiome Count Data. BMC Bioinformatics 18(1):4.

Xinyan Zhang, Yu-Fang Pei, Lei Zhang, Boyi Guo, Amanda Pendegraft, Wenzhuo Zhuang and Nengjun Yi (2018) Negative Binomial Mixed Models for Analyzing Longitudinal Microbiome Data. Frontiers in Microbiology.

Romero, R. et al. (2014) The composition and stability of the vaginal microbiota of normal pregnant women is different from that of non-pregnant women. Microbiome 2:4.



nyiuab/NBZIMM documentation built on April 21, 2022, 7 a.m.