MZILN | R Documentation |
The MZILN function is for estimating and testing the associations of abundance ratios with covariates. \loadmathjax
The regression model for MZILN()
can be expressed as follows:
\mjdeqn\log\bigg(\frac\mathcalY_i^k\mathcalY_i^K+1\bigg)|\mathcalY_i^k>0,\mathcalY_i^K+1>0=\alpha^0k+\mathcalX_i^T\alpha^k+\epsilon_i^k,\hspace0.2cmk=1,...,K
where
Y_i^k is the AA of taxa \mjeqnk in subject \mjeqni in the entire ecosystem.
Y_i^K+1 is the reference taxon (specified by user).
X_i is the covariate matrix for all covariates including confounders.
^k is the regression coefficients along with their 95% confidence intervals that will be estimated by the MZILN()
function.
High-dimensional \mjeqnX_i is handled by regularization.
When using this function, most of the time, users just need to feed the first four inputs to the function: experiment_dat
, microbVar
, refTaxa
and allCov
. All other inputs can just take their default values.
MZILN(
experiment_dat,
microbVar = "all",
refTaxa,
allCov = NULL,
sampleIDname = NULL,
adjust_method = "BY",
fdrRate = 0.05,
paraJobs = NULL,
bootB = 500,
taxDropThresh = 0,
standardize = FALSE,
sequentialRun = FALSE,
verbose = TRUE
)
experiment_dat |
A SummarizedExperiment object containing microbiome data and covariates (see example on how to create a SummarizedExperiment object). The microbiome data can be absolute abundance or relative abundance with each column per sample and each row per taxon/OTU/ASV (or any other unit). No imputation is needed for zero-valued data points. The covariates data contains covariates and confounders with each row per sample and each column per variable. The covariates data has to be numeric or binary. Categorical variables should be converted into dummy variables. |
microbVar |
This takes a single or vector of microbiome variable names (e.g., taxa, OTU and ASV names) of interest for the numerator of the ratios. Default is "all" meaning all microbiome variables (except denominator) will be analyzed as numerators. If a subset of microbiome variables is specified, the output will only contain the specified ratios, and p-value adjustment for multiple testing will only be applied to the subset. |
refTaxa |
Denominator taxa names specified by the user for the targeted ratios. This could be a vector of names. |
allCov |
All covariates of interest (including confounders) for estimating and testing their associations with the targeted ratios. Default is 'NULL' meaning that all covariates in covData are of interest. |
sampleIDname |
Name of the sample ID variable in the data. In the case that the data does not have an ID variable, this can be ignored. Default is NULL. |
adjust_method |
The adjusting method for p value adjustment. Default is "BY" for dependent FDR adjustment. It can take any adjustment method for p.adjust function in R. |
fdrRate |
The false discovery rate for identifying taxa/OTU/ASV associated with |
paraJobs |
If |
bootB |
Number of bootstrap samples for obtaining confidence interval of estimates for the high dimensional regression. The default is |
taxDropThresh |
The threshold of number of non-zero sequencing reads for each taxon to be dropped from the analysis. The default is |
standardize |
This takes a logical value |
sequentialRun |
This takes a logical value |
verbose |
Whether the process message is printed out to the console. The default is TRUE. |
A list with two elements.
full_results
: The main results for MZILN containing the estimation and testing results for all associations between all taxa ratios with refTaxan being the denominator and all covariates in allCov
. It is a dataframe with each row representing an association, and ten columns named as "ref_tax", "taxon", "cov", "estimate", "SE.est", "CI.low", "CI.up", "adj.p.value", "unadj.p.value", and "sig_ind". The columns correspond to the denominator taxon, numerator taxon, covariate name, association estimates, standard error estimates, lower bound and upper bound of the 95% confidence interval, adjusted p value, and the indicator showing whether the association is significant after multiple testing adjustment.
metadata
: The metadata is a list containing total time used in minutes, FDR rate, and multiple testing adjustment method used.
Li et al.(2018) Conditional Regression Based on a Multivariate Zero-Inflated Logistic-Normal Model for Microbiome Relative Abundance Data. Statistics in Biosciences 10(3): 587-608
library(IFAA)
## A makeup example data from Scratch. 60 taxon, 40 subjects, 3 covariates
set.seed(1)
## create an ID variable for the example data
ID=seq_len(40)
## generate three covariates x1, x2, and x3, with x2 binary
x1<-rnorm(40)
x2<-rbinom(40,1,0.5)
x3<-rnorm(40)
dataC<-data.frame(cbind(ID,x1,x2,x3))
## Coefficients for x1, x2, and x3 among 60 taxa.
beta_1<-c(0.1,rep(0,59))
beta_2<-c(0,0.2,rep(0,58))
beta_3<-rnorm(60)
beta_mat<-cbind(beta_1,beta_2,beta_3)
## Generate absolute abundance for 60 taxa in ecosystem.
dataM_eco<-floor(exp(10+as.matrix(dataC[,-1])%*%t(beta_mat) + rnorm(2400,sd=0.05)))
## Generate sequence depth and generate observed abundance
Ci<-runif(40,0.01,0.05)
dataM<-floor(apply(dataM_eco,2,function(x) x*Ci))
colnames(dataM)<-paste0("rawCount",1:60)
## Randomly introduce 0 to make 25% sparsity level.
dataM[sample(seq_len(length(dataM)),length(dataM)/4)]<-0
dataM<-data.frame(cbind(ID,dataM))
## The following steps are to create a SummarizedExperiment object.
## If you already have a SummarizedExperiment format data, you can
## ignore the following steps and directly feed it to the IFAA function.
## Merge two dataset by ID variable
data_merged<-merge(dataM,dataC,by="ID",all=FALSE)
## Seperate microbiome data and covariate data, drop ID variable from microbiome data
dataM_sub<-data_merged[,colnames(dataM)[!colnames(dataM)%in%c("ID")]]
dataC_sub<-data_merged[,colnames(dataC)]
## Create SummarizedExperiment object
test_dat<-SummarizedExperiment::SummarizedExperiment(
assays=list(MicrobData=t(dataM_sub)), colData=dataC_sub)
## Again, if you already have a SummarizedExperiment format data, you can
## ignore the above steps and directly feed it to the IFAA function.
## Run MZILN function
set.seed(123) # For full reproducibility
## If interested in the taxa ratios, say `"rawCount1", "rawCount2",
## and "rawCount3"` over "rawCount10", one can do:
results <- MZILN(
experiment_dat = test_dat,
microbVar = c("rawCount1", "rawCount2", "rawCount3"),
refTaxa = c("rawCount10"),
allCov = c("x1", "x2", "x3"),
sampleIDname = "ID",
fdrRate = 0.05
)
## results can be extracted as follows:
summary_res_ratio <- results$full_results
summary_res_ratio
## To explore all ratios with "rawCount10" being the denominator,
## one can do:
results <- MZILN(experiment_dat = test_dat,
refTaxa=c("rawCount10"),
allCov=c("x1","x2","x3"),
sampleIDname="ID",
fdrRate=0.05)
## to extract the results for all ratios with rawCount10 as the denominator:
summary_res<-results$full_results
## to extract results for the ratio of a specific taxon (e.g., rawCount45) over rawCount10:
target_ratio=summary_res[summary_res$taxon=="rawCount45",]
## to extract all of the ratios having significant associations:
sig_ratios=subset(summary_res,sig_ind==TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.