R/emBinRegMixedMAR.R
In glmfitmiss: Fitting GLMs with Missing Data in Both Responses and Covariates

Documented in emBinRegMixedMAR

#' Fits binary regression models with both nonignorable missing responses and missing categorical covariates.
#'@description This function allows users to fit generalized linear models with presence of both missing responses that are nonignorable and incomplete predictors that are categorical. The model is fitted using an EM-based method, which ensures reliable parameter estimation even when dealing with missing data. For more information on the underlying methodology, please refer to Pradhan, Nychka, and Bandyopadhyay (2025).
#' @param formula a formula expression as for regression models, of the form response ~ predictors. The response should be a numeric binary variable with missing values, and predictors can be any variables. A predictor with categorical values with missing can be used in the model. See the documentation of formula for other details.
#' @param data  Input data for fitting the model.
#' @param adtnlCovforR an optional list of covariates to be used to fit the logistic regression \code{logit(R) ~ response+predictors+adtnlCovforR}. \code{adtnlCovforR} has to be supplied as a vector. Default is \code{NULL}.
#' @param family  A character string specifying the type of model family. The default is \code{family=binomial (lin=logit)}.
#' @param vcorctn a TRUE or FALSE value, by default it is FALSE. If TRUE, it calculates a variance and standard error using Louis (1982). The default is \code{vcorctn= TRUE}.
#' @param conflev a value for the confidence interval, the default is 0.95
#' @param biascorrectn a TRUE or FALSE value, an option for bias reduced estimates due to Firth (1993). The default is \code{biascorrectn=TRUE}.
#' @param verbose a TRUE or FALSE value, default is verbose = TRUE
#' @details The `family` parameter in the `emBinRegMixedMAR` function allows you to specify the probability distribution and link function for the response variable in the linear model. It determines the nature of the relationship between the predictors and the response variable.
#' The `family` argument is particularly important when working with binary data, where the response variable has only two possible outcomes. In such cases, you typically want to fit a logistic regression model.
#'
#' Currently family=binomial is supported for binary data:
#'
#' You can also specify different link functions within binomial family. The default link function is the logit function, which models the log-odds of success. Other available link functions include:
#'
#' - "probit" for the probit link function, which models the cumulative standard normal distribution.
#'
#' - "cloglog" for the complementary log-log link function, which models the complementary log-log of the survival function.
#'
#' It is important to choose the appropriate `link` function based on the specific characteristics and assumptions of your binary data. The default "binomial" family with the logit link function is often a good starting point, but alternative link functions might be more appropriate depending on the research question and the nature of the data.
#'
#' @return return the glm estimates
#' @export
#'
#' @examples
#'
#' \donttest{
#' data(testyxm) # testyxm is a list called dt
#' dataWithMiss <- testyxm$dataMissing
#' fit <- emBinRegMixedMAR(Wheeze ~ city + soc + cond,
#'                         data = dataWithMiss, adtnlCovforR = c("age"),
#'                         biascorrectn=TRUE)
#' #display summary of the beta estimates of the model
#' fit$beta
#'
#' #display summary of the alpha estimates of the model used
#' #for non-ignorability setting of the missing responses
#' fit$alpha
#'
#' # Examples using Firth (1993) type bias reduction. Complete case analysis or
#' # biascorrection=FALSE encounters separation
#' fit <- emBinRegMixedMAR(resp~Numnill+Numsleep+Smoke+Set+Reftime,
#'                         data=meningitis60ymis, biascorrectn=TRUE)
#' #display summary of the beta estimates of the model
#' fit$beta
#'
#' #display summary of the alpha estimates of the model used
#' #for non-ignorability setting of the missing responses
#' fit$alpha
#' }

#' @references
#' Firth, D. (1993). Bias reduction of maximum likelihood estimates, Biometrika, 80, 27-38. doi:10.2307/2336755.
#'
#' Ibrahim, J. G. (1990). Incomplete data in generalized linear models. Journal of the American Statistical Association 85, 765–769.
#'
#' Kosmidis, I., Firth, D. (2021). Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models. Biometrika, 108, 71-82. doi:10.1093/biomet/asaa052.
#'
#' Louis, T. A. (1982). Finding the observed information when using the EM algorithm. Proceedings of the Royal Statistical Society, Ser B, 44, 226-233.
#'
#' Maiti, T., Pradhan, V. (2009). Bias reduction and a solution of separation of logistic regression with missing covariates. Biometrics, 65, 1262-1269.
#'
#' Pradhan, V., Nychka, D. and Bandyopadhyay, S. (2025). Addressing Missing Responses and Categorical Covariates in Binary Regression Modeling: An Integrated Framework (submitted).
#'
#' @importFrom stats glm qnorm pnorm
#' @importFrom brglm2 brglmFit
#' @importFrom abind abind
#' @importFrom dplyr arrange group_by mutate summarise syms
#' @importFrom utils globalVariables
emBinRegMixedMAR <- function(formula, data, conflev=0.95, adtnlCovforR=NULL,
                              vcorctn= TRUE,
                              family=binomial(link="logit"),
                              biascorrectn=TRUE,
                             verbose = TRUE){

  # Extract the response variable and covariates from the formula
  response_var <- all.vars(formula)[1]
  covariates <- all.vars(formula)[-1]

  # Check for missing values in the response variable
  if (!any(is.na(data[[response_var]]))) {
    if (verbose) message("This function is not applicable as there are no missing values in the response variable.\n")
    return(NULL)
  }

  # Check for missing values in the covariates
  if (!any(sapply(data[covariates], function(x) any(is.na(x))))) {
    if (verbose) message("This function is not applicable as there are no missing values in the covariates.\n")
    return(NULL)
  }

    # Extract variable names from the formula
  term_labels <- attr(terms(formula), "term.labels")

  # Add the intercept name
  param_names <- c("(Intercept)", term_labels)

  zalpha <- abs(stats::qnorm((1-conflev)/2))

  if (biascorrectn==TRUE){
    f_fit <- emyxmiss(formula, data=data, adtnlCovforR=adtnlCovforR, family=family, method="brglmFit")
  }
  else {
    f_fit <- emyxmiss(formula, data=data, adtnlCovforR=adtnlCovforR, family=family, method="glm.fit")
  }

  if (!f_fit$converged) {
    message("Model did not converge. Try setting biascorrectn=TRUE or use other binary regression methods")
    return(NULL)  # It's a good practice to return NULL explicitly if the function should not proceed.
  } else {
    if (vcorctn== TRUE){
      #--without Louis (1982) correction of the vcov-----------------------
      summary_glm <- summary(f_fit$fit_y)

      rowlabels <- rownames(f_fit$beta)
      beta <- f_fit$beta[,2]
      names(beta) <- rowlabels
      # vcov_beta<-f_fit$cvcov #creates variance using Louis (1982)
      vcov_beta <- f_fit$vcov_beta
      se_beta_em<-sqrt(diag(vcov_beta))

      # Compute the z-values from the betas and standard errors
      z_values <- beta / se_beta_em

      # Calculate the p-values using the standard normal distribution
      p_values <- 2 * (1 - stats::pnorm(abs(z_values)))

      LowerCI <- beta -zalpha*se_beta_em
      UpperCI <- beta +zalpha*se_beta_em
      emest <- data.frame(beta, se_beta_em, LowerCI, UpperCI, p_values)

      colnames(emest) <- c("Estimate", "StdError", "LowerCI", "UpperCI","Pr(>|z|)")
      summary_glm$coefficients <- emest
    }
    else{
      emest=summary(f_fit$fit_y)
    }
    return (list("beta"=summary_glm, "alpha"=summary(f_fit$fit_r)))
  }
 }