R/MarZIC.R
In MarZIC: Marginal Mediation Effects with Zero-Inflated Compositional Mediator

Documented in MarZIC

##' Marginal Mediation Model for Zero-Inflated Compositional Mediators
##'
##' @description
##' \loadmathjax
##' MarZIC is used for calculating marginal mediation effects for zero-inflated compositional
##' mediators. For microbiome data, the marginal outcome model for the \mjeqn{j}{}th taxon (or OTU, ASV) is:
##' \mjdeqn{Y=\beta_0+\beta_1M_j+\beta_21_{M_j>0}+\beta_3X+\beta_4X1_{M_j>0}+\beta_5XM_j+\epsilon}{}
##' where \mjeqn{1_{()}}{} is indicator function, 
##' X is the covariate of interest 
##' and \mjeqn{M_j}{} is the relative abundance of the \mjeqn{j}{}th taxon. 
##' The probability of \mjeqn{M_j}{} being structure zero (ie, true zeros) is:
##' \mjdeqn{\log(\frac{\Delta_j}{1-\Delta_j})=\gamma_0 + \gamma_1X}{}
##' The mean of \mjeqn{M_j}{} in compositional structure is modeled as:
##' \mjdeqn{\log(\frac{\mu_j}{1-\mu_j})=\alpha_0 + \alpha_1X}{}
##' Typically, users just need to feed the first seven inputs to the function: 
##'`MicrobData`, `CovData`, `lib_name`, `y_name`, `x_name`, `conf_name` and `taxa_of_interest`.
##'
##' @param MicrobData A dataset contains microbiome data. The microbiome data could be relative abundance or absolute
##' abundance. Subjects with missing value will be removed during analysis.
##' @param CovData A dataset contains outcome, library size and covariates. 
##' @param lib_name      Name of library size variable within colData.
##' @param y_name   Name of outcome variable within colData.
##' @param x_name Name of covariate of interest within colData.
##' @param conf_name Name of confounders within colData. Defaule is NULL, meaning no confounder.
##' @param x4_inter Whether to include the interaction term \mjeqn{\beta_4}{}. Default is TRUE.
##' @param x5_inter Whether to include the interaction term \mjeqn{\beta_5}{}. Default is TRUE.
##' @param taxa_of_interest A character vector for taxa names indicating taxa that should be analyzed. Default
##' is "all", meaning all taxa should be included into analysis.
##' @param mediator_mix_range Number of mixtures in mediator. Default is 1, meaning no mixture.
##' @param transfer_to_RA Logical variable indicating whether the microbiome data should be
##' transferred to relative abundance. Default is TRUE. If TRUE, microbiome data will be rescaled
##' by its row sum.
##' @param num_cores Number of CPU cores to be used in parallelization task.
##' @param adjust_method P value adjustment method. Same as p.adjust. Default is "fdr".
##' @param fdr_rate FDR cutoff for significance. Default is 0.05.
##' @param taxDropThresh The threshold of dropping taxon due to high zero percentage. Default is
##' 0.9, meaning taxon will be dropped for analysis if zero percentage is higher than 90\%.
##' @param taxDropCount The threshold of dropping taxon due to not enough non-zero observation counts.
##' Default is 4 * (length(conf_name)+2), meaning taxon will be dropped if non-zero observation is less than four times
##' of number of covariates plus 1.
##' @param zero_prop_NIE2 The threshold of zero percentage for calculating NIE2. Default is 0.1,
##' meaning NIE2 will be calculated for taxon with zero percentage greater than 10\%.
##' @param zero_count_NIE2 The threshold of zero counts for calculating NIE2.
##' Default is 4 * (length(conf_name)+2), meaning NIE2 will be calculated for taxon with zero counts
##' greater than four times of number of covariates plus 1.
##' @param SDThresh The threshold of dropping taxon due to low coefficient of variation (CV)
##' to avoid constant taxon.
##' Default is 0.05, meaning any taxon has CV less than 0.05 will be dropped.
##' @param SDx The threshold of stopping analysis due to low CV of covariate of interest.
##' Default is 0.05, meaning when CV of covariate of interest is less than 0.05, the analysis
##' will be stopped.
##' @param SDy The threshold of stopping analysis due to low CV of outcome.
##' Default is 0.05, meaning when CV of outcome. is less than 0.05, the analysis
##' will be stopped.
##'
##' @return
##' A `list` of `4` datasets containing the results for `NIE1`, `NIE2`, `NDE`, and `NIE`.
##' Each dataset has row representing each taxon, 6 columns for `Estimates`, `Standard Error`,
##' `Lower bound for 95% Confidence Interval`, `Upper bound for 95% Confidence Interval`,
##' `Adjusted p value`, `Significance indicator`.
##'
##' @examples {
##' library(MarZIC)
##'
##' ## A make up example with 2 taxon and 20 subjects.
##' set.seed(1)
##' nSub <- 20
##' nTaxa <- 2
##' ## generate covariate of interest X
##' X <- rbinom(nSub, 1, 0.5)
##' ## generate confounders
##' conf1<-rnorm(nSub)
##' conf2<-rbinom(nSub,1,0.5)
##' ## generate mean of each taxon. All taxon are having the same mean for simplicity.
##' mu <- exp(-5 + X + 0.1 * conf1 + 0.1 * conf2) /
##'  (1 + exp(-5 + X + 0.1 * conf1 + 0.1 * conf2))
##' phi <- 10
##'
##' ## generate true RA
##' M_taxon<-t(sapply(mu,function(x) dirmult::rdirichlet(n=1,rep(x*phi,nTaxa))))
##'
##' P_zero <- exp(-3 + 0.3 * X + 0.1 * conf1 + 0.1 * conf2) /
##'  (1 + exp(-3 + 0.3 * X + 0.1 * conf1 + 0.1 * conf2))
##'
##' non_zero_ind <- t(sapply(P_zero,function(x) 1-rbinom(nTaxa,1,rep(x,nTaxa))))
##'
##' True_RA<-t(apply(M_taxon*non_zero_ind,1,function(x) x/sum(x)))
##'
##' ## generate outcome Y based on true RA
##' Y <- 1 + 100 * True_RA[,1] + 5 * (True_RA[,1] > 0) + X + conf1 + conf2 + rnorm(nSub)
##'
##' ## library size was set to 10,000 for all subjects for simplicity.
##' libsize <- 10000
##'
##' ## generate observed RA
##' observed_AA <- floor(M_taxon*libsize*non_zero_ind)
##'
##' observed_RA <- t(apply(observed_AA,1,function(x) x/sum(x)))
##' colnames(observed_RA)<-paste0("rawCount",seq_len(nTaxa))
##' CovData <- cbind(Y = Y, X = X, libsize = libsize, conf1 = conf1, conf2 = conf2)
##'
##'
##' ## run the analysis
##' res <- MarZIC(
##'   MicrobData = observed_RA,
##'   CovData = CovData,
##'   lib_name = "libsize",
##'   y_name = "Y",
##'   x_name = "X",
##'   conf_name = c("conf1","conf2"),
##'   taxa_of_interest = NULL,
##'   num_cores = 1,
##'   mediator_mix_range = 1
##' )
##' }
##' @references Wu et al.(2022) MarZIC: A Marginal Mediation Model for Zero-Inflated Compositional Mediators with Applications to Microbiome Data. Genes 2022, 13, 1049.
##'
##'
##' @importFrom foreach foreach %dopar% registerDoSEQ
##' @importFrom parallel makeCluster clusterExport stopCluster clusterSetRNGStream detectCores
##' @importFrom doParallel registerDoParallel stopImplicitCluster
##' @importFrom NlcOptim solnl
##' @importFrom betareg betareg
##' @importFrom pracma jacobian grad hessian
##' @importFrom dirmult rdirichlet
##' @import Rcpp
##' @import stats
##' @import mathjaxr
##'
##'
##'
##' @export
##' @useDynLib MarZIC, .registration=TRUE


MarZIC <- function(MicrobData,
                   CovData,
                   lib_name,
                   y_name,
                   x_name,
                   conf_name = NULL,
                   x4_inter = TRUE,
                   x5_inter = TRUE,
                   taxa_of_interest = "all",
                   mediator_mix_range = 1,
                   transfer_to_RA = TRUE,
                   num_cores = max(detectCores() - 2,1),
                   adjust_method = "fdr",
                   fdr_rate = 0.05,
                   taxDropThresh = 0.8,
                   taxDropCount = 4 * (length(conf_name) + 2),
                   zero_prop_NIE2 = 0.1,
                   zero_count_NIE2 = 4 * (length(conf_name) + 2),
                   SDThresh = 0.05,
                   SDx = 0.05,
                   SDy = 0.05) {

  clean_dat <- data_clean(
    MicrobData = MicrobData,
    CovData = CovData,
    lib_name = lib_name,
    y_name = y_name,
    x_name = x_name,
    conf_name = conf_name,
    taxa_of_interest = taxa_of_interest,
    taxDropThresh = taxDropThresh,
    taxDropCount = taxDropCount,
    SDThresh = SDThresh,
    SDx = SDx,
    SDy = SDy,
    transfer_to_RA = transfer_to_RA
  )
  
  MicrobData_clean <- clean_dat$MicrobData_clean
  conf_name_remain <- clean_dat$conf_name_remain
  CovData <- clean_dat$CovData
  res_list <- suppressWarnings(
    apply_real_data_func(
      MicrobData = MicrobData_clean,
      CovData = CovData,
      lib_name = lib_name,
      y_name = y_name,
      x_name = x_name,
      conf_name = conf_name_remain,
      k_range = mediator_mix_range,
      num_cores = num_cores,
      zero_prop_NIE2 = zero_prop_NIE2,
      zero_count_NIE2 = zero_count_NIE2,
      x4_inter = x4_inter,
      x5_inter = x5_inter
    )
  )
  
  nTaxa <- res_list$nTaxa
  nSub <- res_list$nSub
  
  NIE1_save <- data.frame(matrix(nrow = nTaxa, ncol = 7))
  NIE2_save <- data.frame(matrix(nrow = nTaxa, ncol = 7))
  NDE_save <- data.frame(matrix(nrow = nTaxa, ncol = 7))
  NIE_save <- data.frame(matrix(nrow = nTaxa, ncol = 7))
  
  rownames(NIE1_save) <-
    rownames(NIE2_save) <-
    rownames(NDE_save) <-
    rownames(NIE_save) <- res_list$taxon_ori_name
  colnames(NIE1_save) <-
    colnames(NIE2_save) <- colnames(NDE_save) <- colnames(NIE_save) <-
    c("est",
      "se",
      "CI low",
      "CI up",
      "p value unadj" ,
      "p value adj",
      "significance")
  for (i in seq_len(length(res_list$list_save))) {
    if (is.na(res_list$list_save[[i]]$res_fin_med)[1]) {
      NIE1_save[i,] <-
        NIE2_save[i,] <- NDE_save[i,] <- NIE_save[i,] <- NA
    } else {
      res_temp <- res_list$list_save[[i]]$res_fin_med
      NIE1_save[i, 1] <- res_temp$mediation_effect[1]
      NIE2_save[i, 1] <- res_temp$mediation_effect[2]
      NDE_save[i, 1] <- res_temp$mediation_effect[3]
      NIE_save[i, 1] <- res_temp$mediation_effect[4]
      
      NIE1_save[i, 2] <- res_temp$NIE_sd[1]
      NIE2_save[i, 2] <- res_temp$NIE_sd[2]
      NDE_save[i, 2] <- res_temp$NIE_sd[3]
      NIE_save[i, 2] <- res_temp$NIE_sd[4]
      
      NIE1_save[i, 3] <-
        res_temp$mediation_effect[1] - 1.96 * res_temp$NIE_sd[1]
      NIE2_save[i, 3] <-
        res_temp$mediation_effect[2] - 1.96 * res_temp$NIE_sd[2]
      NDE_save[i, 3] <-
        res_temp$mediation_effect[3] - 1.96 * res_temp$NIE_sd[3]
      NIE_save[i, 3] <-
        res_temp$mediation_effect[4] - 1.96 * res_temp$NIE_sd[4]
      
      NIE1_save[i, 4] <-
        res_temp$mediation_effect[1] + 1.96 * res_temp$NIE_sd[1]
      NIE2_save[i, 4] <-
        res_temp$mediation_effect[2] + 1.96 * res_temp$NIE_sd[2]
      NDE_save[i, 4] <-
        res_temp$mediation_effect[3] + 1.96 * res_temp$NIE_sd[3]
      NIE_save[i, 4] <-
        res_temp$mediation_effect[4] + 1.96 * res_temp$NIE_sd[4]
    }
  }
  
  NIE1_save[, 5] <-
    (1 - pnorm(abs(NIE1_save[, 1] / NIE1_save[, 2]))) * 2
  NIE2_save[, 5] <-
    (1 - pnorm(abs(NIE2_save[, 1] / NIE2_save[, 2]))) * 2
  NDE_save[, 5] <-
    (1 - pnorm(abs(NDE_save[, 1] / NDE_save[, 2]))) * 2
  NIE_save[, 5] <-
    (1 - pnorm(abs(NIE_save[, 1] / NIE_save[, 2]))) * 2
  
  NIE1_save[, 6] <-
    p.adjust((1 - pnorm(abs(
      NIE1_save[, 1] / NIE1_save[, 2]
    ))) * 2, adjust_method)
  NIE2_save[, 6] <-
    p.adjust((1 - pnorm(abs(
      NIE2_save[, 1] / NIE2_save[, 2]
    ))) * 2, adjust_method)
  NDE_save[, 6] <-
    p.adjust((1 - pnorm(abs(
      NDE_save[, 1] / NDE_save[, 2]
    ))) * 2, adjust_method)
  NIE_save[, 6] <-
    p.adjust((1 - pnorm(abs(
      NIE_save[, 1] / NIE_save[, 2]
    ))) * 2, adjust_method)
  
  NIE1_save[, 7] <- NIE1_save[, 6] < fdr_rate
  NIE2_save[, 7] <- NIE2_save[, 6] < fdr_rate
  NDE_save[, 7] <- NDE_save[, 6] < fdr_rate
  NIE_save[, 7] <- NIE_save[, 6] < fdr_rate
  
  output_list <- list(
    NIE1_save = NIE1_save,
    NIE2_save = NIE2_save,
    NDE_save = NDE_save,
    NIE_save = NIE_save
  )
  
  return(output_list)
}