stopifnot(require(knitr))
opts_chunk$set(
  comment=NA, 
  message = FALSE, 
  warning = FALSE, 
  eval = params$EVAL,
  dev = "png",
  dpi = 500,
  fig.asp = 0.618,
  fig.width = 14,
  out.width = "100%",
  fig.align = "center"
)
library("NBZIMM")

Introduction

This vignette focuses on demonstrations of Negative Binomial Mixed Models in analyzing over-dispersed longitudinal microbiome count data with NBZIMM.

The NBZIMM provides functions for setting up and fitting negative binomial mixed models and zero-inflated negative binomial and Gaussian models. We have created the function glmm.nb and mms for setting up and fitting the proposed NBMMs. The functions glmm.nb and mms works by repeated calls to the function lme for fitting linear mixed models in the recommended package nlme in R, and allows for any types of random effects and within-subject correlation structures as described in the package nlme.The outputs from the function glmm.nb can be summarized by functions in nlme. The methods can be used to analyze overdispersed microbiome count data with multilevel data structures (for example, clustered and longitudinal studies).

In this vignette we'll use the glmm.nb function and mms function in the NBZIMM package to demonstrate the analysis of a public available dataset from Romero et al (2014).

Model

Longitudinal studies collect multiple subjects and measure each subject at several time points (i.e., samples). Assume that there are $n$ subjects, and subject $i$ is measured at $n_i$ time points $t_{ij};\ j = 1,\dots,\ n_i;\ i = 1,\dots,\ n$. Our goal of longitudinal microbiome studies is to detect associations between the microbiome counts and the host variables, and characterize the time trends of microbiome abundance within subjects and between subjects.We analyze each microbiome taxon at a time. For notational simplification, we denote $y_{ij} = c_{ijh}$ for any given taxon $h$. Since the microbiome count outcome is over-dispersed, we use negative binomial models. In our NBMMs, the counts $y_{ij}$ are assumed to follow the negative binomial distribution:

$$y_{ij} \sim NB(y_{ij}|\mu_{ij}, \theta)=\tfrac{\Gamma(y_{ij} +\theta)}{\Gamma(\theta)y_{ij}!} (\tfrac{\theta}{\mu_{ij} +\theta})^{\theta} (\tfrac{\mu_{ij}}{\mu_{ij} +\theta})^{y_{ij}} ,$$

where $\theta$ is the dispersion parameter that controls the amount of over-dispersion, and $\mu_{ij}$ are the means. The means $\mu_{ij}$ are related to the variables via the logarithm link function:

$$ log(\mu_{ij}) = log(T_{ij})+X_{ij}\beta+Z_{ij}b_i $$

where $log(T_{ij})$ is the offset that corrects for the variation of the total sequence reads, $X_{ij}=(1,X_i, t_{ij}, X_i^st_{ij})$, $X_i^s$ is the variable of interest in $X_i$, for example, an indicator variable for the case group and the control group, and $Z_{ij}=(1, t_{ij})$; $\beta=(\beta_0, \beta_1, \beta_2, \beta_3)^T$ is the vector of fixed effects (i.e., population-level effects), including an intercept, the effects of the host variables $X_i$, the over-all time effect, and the interaction $X_i^s$ and $t_{ij}$ ; $b_i = (b_{0i}, b_{1i})^T$ is the vector of random effects (i.e., subject-level effects), including the random intercept $b_{0i}$ and the random time effect $b_{1i}$. The vector of the random effects is usually assumed to follow a multivariate normal distribution (Pinheiro and Bates, 2000;McCulloch and Searle, 2001):

$$b_i \sim N(0, \Psi)$$ where $\Psi$ is the variance-covariance matrix, which can be a general positive-definite matrix that accounts for the correlation of the random covariates.

Accounting for Within-Subject Correlations

The IWLS (Iterative Weighted Least Squares) algorithm developed by Zhang et al. (2018) can be used to fit the above NBMMs. We can relax the assumption of independent within-subject errors to account for special within-subject correlation structures: $$z_{ij}=log(T_{ij})+X_{ij}\beta+Z_{ij}b_i+w_{ij}^{-1/2}e_{ij},\ b_i \sim N(0, \Psi),\ e_i=(e_{i1},\ ...,\ e_{in_i})`\sim N(0, \sigma^2R_i) $$ where $z_{ij}$ and $w_{ij}$ are the pseudo-responses and the pseudo-weights, as described in Zhang et al. (2017), and $R_i$ is a correlation matrix, which describes dependence among observations. Pinheiro and Bates (2000) describes several ways to specify the correlation matrix $R_i$, all of which can be incorporated into our NBMMs.

Demonstration with Example

This example used a published dataset from Romero et al.(2014). It was a case-control longitudinal study collected 22 pregnant women who delivered at term and 32 non-pregnant women, each measured at multiple time points. The data consist of two data components: OTU and SampleData. OTU contains microbiome count data for 900 samples with 143 taxa; SampleData contains data of sample variables for all the samples with 9 variables, including subject ID, pregnant status, total sequencing read, age, race, etc.

Here we import the dataset from the package NBZIMM and check the clinical variables first.

data(Romero) # see help("Romero")
names(Romero)
otu = Romero$OTU
sam = Romero$SampleData

N = sam[, "Total.Read.Counts"]        
Days = sam$GA_Days; Days = scale(Days)
Age = sam$Age; Age = scale(Age)
Race = sam$Race
preg = sam$pregnant; table(preg)

subject = sam[, "Subect_ID"]; table(subject)

non = nonzero(y = otu, total = N, plot = F)
nonzero.p = non[[1]]

data = data.frame(Days=Days, Age=Age, Race=Race, preg=preg, N=N, subject=subject)

Analysis of a single taxon each time

We then show the analysis of a single taxon (Lactobacillus) with the function glmm.nb.

The group variable of interest is pregnancy women vs non-pregnant women. We consider the sample collection time as the time variable and also include covariates, age, race. Different models can be used to analyze this dataset. The first model could be a simple random intercept without interaction term. An offset term is needed to adjust for the library size N when analyzing a longitudinal microbiome dataset.

y = otu[, names(nonzero.p)[1]] # Lactobacillus

f = glmm.nb(y ~ Days + Age + Race + preg + offset(log(N)), random = ~ 1 | subject, data = data)

summary(f)
y = otu[, names(nonzero.p)[1]] # Lactobacillus

f = glmm.nb(y ~ Days + Age + Race + preg + offset(log(N)), random = ~ 1 | subject, data = data)

We can summarize the results using summary function.

summary(f)

Now, we can add the interaction term. The time interaction for Lactobacillus between two groups is significant.

f = glmm.nb(y ~ Days*preg + Age + Race  + offset(log(N)), random = ~ 1 | subject, data = data)

summary(f)
f = glmm.nb(y ~ Days*preg + Age + Race + offset(log(N)), random = ~ 1 | subject, data = data)

We can summarize the results using summary function.

summary(f)

Or, we can add the random slope in the model.

f = glmm.nb(y ~ Days*preg + Age + Race  + offset(log(N)), random = list(subject = pdDiag(~Days)), data = data)
f = glmm.nb(y ~ Days*preg + Age + Race + offset(log(N)), random = list(subject = pdDiag(~Days)), data = data)

We can also incorporate autoregressive of order 1, AR(1) for correlation matrix $R_i$ to describe dependence among observations.

f = glmm.nb(y ~ Days*preg + Age + Race  + offset(log(N)), random = ~ 1 | subject, correlation = corAR1(), data = data)
f = glmm.nb(y ~ Days*preg + Age + Race + offset(log(N)), random = ~ 1 | subject, correlation = corAR1(), data = data)

Analysis of all taxa with a given nonzero proportion

On the contrary, if you have many taxa of interest, it is easier to analyze them all at one time with the function mms. The following example analyze all the taxa with the proportion of non-zero values $>$ min.p.

f = mms(y = otu, fixed = ~ Days + Age + Race + preg + offset(log(N)), 
        random = ~ 1 | subject,
        min.p = 0.2, method = "nb", data = data)
f = mms(y = otu, fixed = ~ Days + Age + Race + preg + offset(log(N)), 
        random = ~ 1 | subject,
        min.p = 0.2, method = "nb", data = data)

Visualize the results

Then we can generate a plot to view the significant taxa associated with various covariates and corresponding p-values.

out = fixed(f)$dist
out = out[out[,2]!="(Intercept)", ]
res = out[, 3:5]

par(mfrow = c(1, 1), cex.axis = 1, mar = c(2, 14, 4, 4))
plot.fixed(res=res, threshold=0.05, gap=500, col.pts=c("black", "grey"),
           cex.axis=1, cex.var=1)

A heat map could also be generated to present the results.

out = fixed(f)$dist
out = out[out[,2]!="(Intercept)", ]
heat.p(df=out, zigzag=c(T,F), abbrv=c(T,F), margin=c(1.5,10), y.size=8)
knitr::include_graphics('Heatmap for nbmms vignette.jpeg')

References

Pinheiro, J.C., and Bates, D.M. (2000) "Mixed-Effects Models in S and S-PLUS", Springer.

Venables, W. N. and Ripley, B. D. (2002) "Modern Applied Statistics with S". Fourth edition. Springer.

Xinyan Zhang, Himel Mallick, Xiangqin Cui, Andrew K. Benson, and Nengjun Yi (2017) Negative Binomial Mixed Models for Analyzing Microbiome Count Data. BMC Bioinformatics 18(1):4.

Xinyan Zhang, Yu-Fang Pei, Lei Zhang, Boyi Guo, Amanda Pendegraft, Wenzhuo Zhuang and Nengjun Yi (2018) Negative Binomial Mixed Models for Analyzing Longitudinal Microbiome Data. Frontiers in Microbiology.

Romero, R. et al. (2014) The composition and stability of the vaginal microbiota of normal pregnant women is different from that of non-pregnant women. Microbiome 2:4.



nyiuab/NBZIMM documentation built on April 21, 2022, 7 a.m.