R/emforbeta.R
In glmfitmiss: Fitting GLMs with Missing Data in Both Responses and Covariates

Documented in emforbeta

#' Fitting binary regression with missing categorical covariates using likelihood based method
#'
#'@description This function allows users to fit generalized linear models with incomplete predictors that are categorical. The model is fitted using a likelihood-based method, which ensures reliable parameter estimation even when dealing with missing data. For more information on the underlying methodology, please refer to Pradhan, Nychka, and Bandyopadhyay (2025).
#' @param formula a formula expression as for regression models, of the form response ~ predictors. The response should be a numeric binary variable with missing values, and predictors can be any variables. A predictor with categorical values with missing can be used in the model. See the documentation of formula for other details.
#' @param data  Input data for fitting the model
#' @param family  a character string specifying the type of model family. The default is family=binomial (lin=logit)
#' @param NIterations is the number of iterations to be used for convergence. The default is NIterations=50
#' @param verbose a TRUE or FALSE value, by default it is FALSE. A value TRUE prints all intermediate computational details
#' @param theta a vector containing multinomial parameters that sums to 1, default is NULL
#' @param convergenceCriterion Convergence criteria to be used for convergence. The default is 1e-4
#' @param vcorctn a TRUE or FALSE value, by default it is FALSE. If TRUE, it calculates a variance and standard error using Louis (1982)
#' @param method a method="brglmFit" or method="glm.fit" will be used for fitting model. The method="brglmFit" fits generalized linear models using bias reduction methods (Kosmidis, 2014), and other penalized maximum likelihood methods.The deafult option method="glm.fit" fits regression with generalized linear models.
#' @param augmented is the name of an augmented data. The default is NULL
#' @param VarWithMissingVal is a vector of variables including missing values. The default is NULL
#' @return return the glm estimates
#' @details The `family` parameter in the `emforbeta` function allows you to specify the probability distribution and link function for the response variable in the linear model. It determines the nature of the relationship between the predictors and the response variable.
#' The `family` argument is particularly important when working with binary data, where the response variable has only two possible outcomes. In such cases, you typically want to fit a logistic regression model.
#'
#' The following commonly used families are supported for binary data:
#'
#' - "binomial" for a binomial distribution, suitable for binary or dichotomous response variables.
#'
#' You can also specify different link functions within binomial family. The default link function is the logit function, which models the log-odds of success. Other available link functions include:
#'
#' - "probit" for the probit link function, which models the cumulative standard normal distribution.
#'
#' - "cloglog" for the complementary log-log link function, which models the complementary log-log of the survival function.
#'
#' It is important to choose the appropriate `link` function based on the specific characteristics and assumptions of your binary data. The default "binomial" family with the logit link function is often a good starting point, but alternative link functions might be more appropriate depending on the research question and the nature of the data.
#' See also the function 'emBinRegMAR' function.

#' @export
#'
#' @references
#' Firth, D. (1993). Bias reduction of maximum likelihood estimates, Biometrika, 80, 27-38. doi:10.2307/2336755.
#'
#' Ibrahim, J. G. (1990). Incomplete data in generalized linear models. Journal of the American Statistical Association 85, 765–769.
#'
#' Kosmidis, I., Firth, D. (2021). Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models. Biometrika, 108, 71-82. doi:10.1093/biomet/asaa052.
#'
#'Louis, T. A. (1982). Finding the observed information when using the EM algorithm. Proceedings of the Royal Statistical Society, Ser B, 44, 226-233.
#'
#' Maiti, T., Pradhan, V. (2009). Bias reduction and a solution of separation of logistic regression with missing covariates. Biometrics, 65, 1262-1269.
#'
#' Pradhan, V., Nychka, D. and Bandyopadhyay, S. (2025). Beyond the Odds: Fitting Logistic Regression with Missing Data in Small Samples (submitted).
#'
#' @examples
#' \donttest{
#'data(sixcitydata)
#' f_fit <- emforbeta(Wheeze~city+soc+cond,
#'                    data=sixcitydata,
#'                    vcorctn= TRUE,
#'                    family=binomial(link="logit"),
#'                    method="glm.fit")
#' summary(f_fit$mfit) #creates the summary like glm using the return object mfit
#' vcov_beta<-f_fit$cvcov #creates variance using Louis (1982)
#'
#' # Computes the standard error of the estimates
#' se_beta_em<-sqrt(diag(vcov_beta))
#' se_beta_em
#'
#' # Firth correction
#' f_fit <- emforbeta(Wheeze~city+soc+cond,
#'                    data=sixcitydata,
#'                    family=binomial(link="logit"),
#'                    method="brglmFit")
#' # creates the summary like glm using the return object mfit
#'
#' data(ibrahim)
#' f_fit2 <- emforbeta(y~x1+x2+x3,
#'                     data=ibrahim,
#'                     family="binomial")
#' summary(f_fit2$mfit) #creates the summary like glm using the return object mfit
#'
#' f_fit2 <- emforbeta(y~x1+x2+x3,
#'                     data=ibrahim,
#'                     family=binomial (link="probit"),
#'                     method="brglmFit")
#' # creates the summary like glm using the return object mfit
#' summary(f_fit2$mfit) #
#'
#' data(est)
#' f_fit <- emforbeta(survive~Fetoprtn+Antigen+Jaundice+Age,
#'                    data=est,
#'                    family=binomial,
#'                    method="glm.fit")
#' summary(f_fit$mfit)
#'
#' f_fit <- emforbeta(survive~Fetoprtn+Antigen+Jaundice+Age,
#'                    data=est,
#'                    family=binomial,
#'                    method="brglmFit")
#' # Firth corrected estimates with out Louis (1982) correction (see Maiti and Pradhan (2009))
#' summary(f_fit$mfit)
#'
#' data(metastmelanoma)
#' f_fit <- emforbeta(failcens~size+type+nodal+age+sex+trt,
#'                    data=metastmelanoma,
#'                    family=binomial,
#'                    method="glm.fit")
#' summary(f_fit$mfit)
#'
#' f_fit <- emforbeta(failcens~size+type+nodal+age+sex+trt,
#'                    data=metastmelanoma,
#'                    family=binomial,
#'                    method="brglmFit")
#' # Firth corrected estimates with out Louis (1982) correction (see Maiti and Pradhan (2009))
#' summary(f_fit$mfit)
#'
#' data(felinedata)
#' f_fit <- emforbeta(chlamy~Season+Agegrp+Conj+FHV1,
#'                    data=felinedata,
#'                    family=binomial,
#'                    method="glm.fit")
#' summary(f_fit$mfit)
#'
#' f_fit <- emforbeta(chlamy~Season+Agegrp+Conj+FHV1,
#'                    data=felinedata,
#'                    family=binomial,
#'                    method="brglmFit")
#' # Firth corrected estimates with out Louis (1982) correction
#' summary(f_fit$mfit)
#' }
#'
#' @importFrom stats glm
#' @importFrom brglm2 brglmFit
#' @importFrom abind abind
#' @importFrom dplyr arrange group_by mutate summarise syms distinct group_keys
#' @importFrom utils globalVariables
#' @import data.table
emforbeta<-function(formula, data, family = "binomial", vcorctn=FALSE,
                    method = "glm.fit", NIterations=50,
                    verbose=FALSE, theta=NULL,
                    convergenceCriterion= 1e-4,
                    augmented = NULL, VarWithMissingVal = NULL){

  required_packages <- c("dplyr", "abind", "brglm2", "MASS")
  for (pkg in required_packages) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop(sprintf("Package '%s' needed for this function to work. Please install it.", pkg), call. = FALSE)
    }
  }

  formula_vars <- all.vars(formula)

  resp <- all.vars(formula)[1]

  # Checking if any character variables exist in a formula
  checkCharacterVariablesInFormula (formula, data)

  if (is.null(augmented) || is.null(VarWithMissingVal)) {
    # Derive VarWithMissingVal and augmented as before
    missing_cols <- colSums(is.na(data))
    df_data <- subset(data, select = formula_vars)
    predictor_vars <- attr(terms(formula), "term.labels")
    VarWithMissingVal <- predictor_vars[colSums(is.na(df_data[, predictor_vars])) > 0]
    df_data$m <- ifelse(rowSums(is.na(data.frame(df_data[, VarWithMissingVal]))) > 0, 1, 0)
    df_data$grp <- seq_len(nrow(df_data))
    augmented <- dataAugmentation(df_data, formula)
    #initialization of the theta parameters
    # Check if 'resp' is in the column names and subset accordingly
    if (resp %in% names(augmented$ObjPattern)) {
      # removing the resp variable from the list of missing data
      df_aug_temp<-subset(augmented$ObjPattern, select= -which(names(augmented$ObjPattern) == resp))
    } else {
      df_aug_temp <- augmented$ObjPattern
    }
    
    dfn <- nrow(distinct(df_aug_temp))
    
    theta=rep(1/dfn, dfn)
    
    #initialization of the beta parameters
    beta=rep(0.5,length(predictor_vars)+1 )
    
  }
  df_augmented<-augmented$augData

  df_augmented <- df_augmented[order(df_augmented$grp),]

  #initializing the weight variable with 1
  df_augmented$wgt<-1

  #created a variable origGr to keep track of the predicted prob
  df_augmented$origGr<- seq(1:nrow(df_augmented))

  #getting the beta and predicted prob of the model [y|X] for the next iteration
  fit.m<-suppressWarnings(stats::glm(formula, data = df_augmented, family = family, weights=wgt, method = if (method == "brglmFit") brglm2::brglmFit else method))

  beta=fit.m$coefficients
  df_augmented$fit=fit.m$fitted.values


  #merging theta into the data assuming x2=soc and x3=cond include missing values
  bb=theta_back_2_data(theta, df_augmented,VarWithMissingVal)$data
  df_augmented <- bb


  #m==1 indicates original rows with missing values, replacing all theta=0
  #correponding to all m==0 rows
  combo1<-mutate(df_augmented, theta=ifelse(m==1,theta,0))

  df_augmented<-combo1

  #initializing beta and theta based on the last updated values
  new.estimate=c(beta, theta)

  diff=100
  it=0
  for( it in 1:NIterations) {

    old.estimate=new.estimate

    if (verbose) {
      message("Iteration ", it, " beta estimates: ", 
              paste(old.estimate[1:length(beta)], collapse = " "))
      message("Iteration ", it, " theta estimates: ", 
              paste(old.estimate[(length(beta) + 1):length(old.estimate)], collapse = " "))
    }

    #creating weights based on the current predicted prob and theta
    tt.wgt=df_augmented%>%dplyr::group_by(grp)%>% dplyr::mutate_at(resp, as.integer) %>% dplyr::mutate(ttl.prob=fit*theta)%>%dplyr::mutate(wgt=ttl.prob/sum(ttl.prob))

    combo2<-mutate(tt.wgt, wgt=ifelse(m==1,wgt,1))
    combo2$wgt=combo2$wgt
    df_augmented<-combo2

    #getting the beta and predicted prob of the model [y|X] for the next iteration
    fit.m<-suppressWarnings(stats::glm(formula, data = df_augmented, family = family, method = if (method == "brglmFit") brglm2::brglmFit else method, weights=wgt))
    reg.beta=fit.m$coefficients
    df_augmented$fit=fit.m$fitted.values
    converged<-fit.m$converged

    # combo3<-mutate(tt.wgt, wgt=ifelse(m==1,wgt,1))
    combo2$ttl <- colSums(combo2)["wgt"]


    a.a=combo2%>%dplyr::group_by(!!!syms(VarWithMissingVal))%>%dplyr::summarise(wt=mean(sum(wgt)/ttl))
    theta=as.vector(t(a.a$wt))

    #removing the theta and weight column from the original data x.1
    df_augmented$theta<-NULL
    # putting back the newly computed theta to the data assuming soc and cond missing
    #bb=theta_back_2_data(theta, x.1)$data
    bb <- theta_back_2_data(theta, df_augmented,VarWithMissingVal)$data
    df_augmented<-bb
    combo1<-mutate(df_augmented, theta=ifelse(m==1,theta,0))
    df_augmented<-combo1

    #computes the predicted probabilities based on observed response y=Wheeze
    df_augmented=mutate(df_augmented, fit=ifelse(.data[[resp]]==1,fit,1-fit))

    new.estimate=c(reg.beta,theta)

    # test for convergence
    testCriterion<- sum(abs(old.estimate - new.estimate))

    if(testCriterion <= convergenceCriterion){
      break
    }
}

  if (vcorctn==TRUE)
  {
    #returns the Louis correction and update the fit.m object
    louiscorrection <- louisvcov(formula, data=df_augmented)
    # fit.m$vcov <- louiscorrection
  }
  else {

    louiscorrection <- NULL
  }

  return(list("beta" =reg.beta,"alpha"=theta, "wgt" =df_augmented$wgt,"fitem"=df_augmented$fit, "converged"=converged, "finalData"=df_augmented, "mfit"=fit.m, "cvcov"=louiscorrection))
}