MZILN: Conditional regression for microbiome analysis based on...

View source: R/MZILN.R

MZILNR Documentation

Conditional regression for microbiome analysis based on multivariate zero-inflated logistic normal model

Description

The MZILN function is for estimating and testing the associations of abundance ratios with covariates. \loadmathjax

The regression model for MZILN() can be expressed as follows: \mjdeqn\log\bigg(\frac\mathcalY_i^k\mathcalY_i^K+1\bigg)|\mathcalY_i^k>0,\mathcalY_i^K+1>0=\alpha^0k+\mathcalX_i^T\alpha^k+\epsilon_i^k,\hspace0.2cmk=1,...,K where

  • \mjeqn\mathcal

    Y_i^k is the AA of taxa \mjeqnk in subject \mjeqni in the entire ecosystem.

  • \mjeqn\mathcal

    Y_i^K+1 is the reference taxon (specified by user).

  • \mjeqn\mathcal

    X_i is the covariate matrix for all covariates including confounders.

  • \mjeqn\alpha

    ^k is the regression coefficients along with their 95% confidence intervals that will be estimated by the MZILN() function.

High-dimensional \mjeqnX_i is handled by regularization.

When using this function, most of the time, users just need to feed the first four inputs to the function: experiment_dat, microbVar, refTaxa and allCov. All other inputs can just take their default values.

Usage

MZILN(
  experiment_dat,
  microbVar = "all",
  refTaxa,
  allCov = NULL,
  sampleIDname = NULL,
  adjust_method = "BY",
  fdrRate = 0.05,
  paraJobs = NULL,
  bootB = 500,
  taxDropThresh = 0,
  standardize = FALSE,
  sequentialRun = FALSE,
  verbose = TRUE
)

Arguments

experiment_dat

A SummarizedExperiment object containing microbiome data and covariates (see example on how to create a SummarizedExperiment object). The microbiome data can be absolute abundance or relative abundance with each column per sample and each row per taxon/OTU/ASV (or any other unit). No imputation is needed for zero-valued data points. The covariates data contains covariates and confounders with each row per sample and each column per variable. The covariates data has to be numeric or binary. Categorical variables should be converted into dummy variables.

microbVar

This takes a single or vector of microbiome variable names (e.g., taxa, OTU and ASV names) of interest for the numerator of the ratios. Default is "all" meaning all microbiome variables (except denominator) will be analyzed as numerators. If a subset of microbiome variables is specified, the output will only contain the specified ratios, and p-value adjustment for multiple testing will only be applied to the subset.

refTaxa

Denominator taxa names specified by the user for the targeted ratios. This could be a vector of names.

allCov

All covariates of interest (including confounders) for estimating and testing their associations with the targeted ratios. Default is 'NULL' meaning that all covariates in covData are of interest.

sampleIDname

Name of the sample ID variable in the data. In the case that the data does not have an ID variable, this can be ignored. Default is NULL.

adjust_method

The adjusting method for p value adjustment. Default is "BY" for dependent FDR adjustment. It can take any adjustment method for p.adjust function in R.

fdrRate

The false discovery rate for identifying taxa/OTU/ASV associated with allCov. Default is 0.05.

paraJobs

If sequentialRun is FALSE, this specifies the number of parallel jobs that will be registered to run the algorithm. If specified as NULL, it will automatically detect the cores to decide the number of parallel jobs. Default is NULL.

bootB

Number of bootstrap samples for obtaining confidence interval of estimates for the high dimensional regression. The default is 500.

taxDropThresh

The threshold of number of non-zero sequencing reads for each taxon to be dropped from the analysis. The default is 0 which means taxon without any sequencing reads will be dropped from the analysis.

standardize

This takes a logical value TRUE or FALSE. If TRUE, the design matrix for X will be standardized in the analyses and the results. Default is FALSE.

sequentialRun

This takes a logical value TRUE or FALSE. Default is TRUE. It can be set to be "FALSE" to increase speed if there are multiple taxa in the argument 'refTaxa'.

verbose

Whether the process message is printed out to the console. The default is TRUE.

Value

A list with two elements.

  • full_results: The main results for MZILN containing the estimation and testing results for all associations between all taxa ratios with refTaxan being the denominator and all covariates in allCov. It is a dataframe with each row representing an association, and ten columns named as "ref_tax", "taxon", "cov", "estimate", "SE.est", "CI.low", "CI.up", "adj.p.value", "unadj.p.value", and "sig_ind". The columns correspond to the denominator taxon, numerator taxon, covariate name, association estimates, standard error estimates, lower bound and upper bound of the 95% confidence interval, adjusted p value, and the indicator showing whether the association is significant after multiple testing adjustment.

  • metadata: The metadata is a list containing total time used in minutes, FDR rate, and multiple testing adjustment method used.

References

Li et al.(2018) Conditional Regression Based on a Multivariate Zero-Inflated Logistic-Normal Model for Microbiome Relative Abundance Data. Statistics in Biosciences 10(3): 587-608

Examples


## Not run: 
library(IFAA)
## A makeup example data from Scratch. 60 taxon, 40 subjects, 3 covariates

set.seed(1)

## create an ID variable for the example data
ID=seq_len(40)

## generate three covariates x1, x2, and x3, with x2 binary
x1<-rnorm(40)
x2<-rbinom(40,1,0.5)
x3<-rnorm(40)
dataC<-data.frame(cbind(ID,x1,x2,x3))

## Coefficients for x1, x2, and x3 among 60 taxa.
beta_1<-c(0.1,rep(0,59))
beta_2<-c(0,0.2,rep(0,58))
beta_3<-rnorm(60)
beta_mat<-cbind(beta_1,beta_2,beta_3)

## Generate absolute abundance for 60 taxa in ecosystem.
dataM_eco<-floor(exp(10+as.matrix(dataC[,-1])%*%t(beta_mat) + rnorm(2400,sd=0.05)))

## Generate sequence depth and generate observed abundance
Ci<-runif(40,0.01,0.05)
dataM<-floor(apply(dataM_eco,2,function(x) x*Ci))
colnames(dataM)<-paste0("rawCount",1:60)

## Randomly introduce 0 to make 25% sparsity level.
dataM[sample(seq_len(length(dataM)),length(dataM)/4)]<-0

dataM<-data.frame(cbind(ID,dataM))

## The following steps are to create a SummarizedExperiment object.
## If you already have a SummarizedExperiment format data, you can
## ignore the following steps and directly feed it to the IFAA function.

## Merge two dataset by ID variable
data_merged<-merge(dataM,dataC,by="ID",all=FALSE)

## Seperate microbiome data and covariate data, drop ID variable from microbiome data
dataM_sub<-data_merged[,colnames(dataM)[!colnames(dataM)%in%c("ID")]]
dataC_sub<-data_merged[,colnames(dataC)]

## Create SummarizedExperiment object
test_dat<-SummarizedExperiment::SummarizedExperiment(
assays=list(MicrobData=t(dataM_sub)), colData=dataC_sub)

## Again, if you already have a SummarizedExperiment format data, you can
## ignore the above steps and directly feed it to the IFAA function.

## Run MZILN function
set.seed(123) # For full reproducibility

## If interested in the taxa ratios, say `"rawCount1", "rawCount2", 
## and "rawCount3"` over "rawCount10", one can do: 

results <- MZILN(
experiment_dat = test_dat,
microbVar = c("rawCount1", "rawCount2", "rawCount3"),
refTaxa = c("rawCount10"),
allCov = c("x1", "x2", "x3"),
sampleIDname = "ID",
fdrRate = 0.05
)

## results can be extracted as follows:

summary_res_ratio <- results$full_results
summary_res_ratio

## To explore all ratios with "rawCount10" being the denominator, 
## one can do:

results <- MZILN(experiment_dat = test_dat,
                refTaxa=c("rawCount10"),
                allCov=c("x1","x2","x3"),
                sampleIDname="ID",
                fdrRate=0.05)
## to extract the results for all ratios with rawCount10 as the denominator:
summary_res<-results$full_results
## to extract results for the ratio of a specific taxon (e.g., rawCount45) over rawCount10:
target_ratio=summary_res[summary_res$taxon=="rawCount45",]
## to extract all of the ratios having significant associations:
sig_ratios=subset(summary_res,sig_ind==TRUE)

## End(Not run)

IFAA documentation built on Jan. 12, 2023, 5:07 p.m.