R/GEcalib.R
In GECal: Generalized Entropy Calibration

Documented in GEcalib

#' @title Generalized Entropy Calibration
#' 
#' @description
#' \code{GEcalib} computes the calibration weights.
#' Generalized entropy calibration weights maximize the generalized entropy:
#' \deqn{H(\bm{\omega}) = -\sum_{i \in A} G(\omega_i),}
#' subject to the calibration constraints \eqn{\sum_{i \in A} \omega_i \bm{z}_i = \sum_{i \in U} \bm{z}_i},
#' where \eqn{A} denotes the sample index, and \eqn{U} represents the population index. 
#' The auxiliary variables, whose population totals are known, are defined as \eqn{\bm{z}_i^T = (\bm{x}_i^T, g(d_i))}, 
#' where \eqn{g} is the first-order derivative of the gerenalized entropy \eqn{G}, 
#' and \eqn{d_i} is the design weight for each sampled unit \eqn{i \in A}.
#' 
#' @import nleqslv
#' @importFrom stats model.frame model.matrix nlm quantile reformulate
#' 
#' @param formula An object of class "formula" specifying the calibration model. 
#' @param dweight A vector of sampling weights.
#' @param data An optional data frame containing the variables in the model (specified by \code{formula}).
#' @param const A vector used in the calibration constraint for population totals( or means).
#' @param method The method to be used in calibration. See "Details" for more information.
#' @param entropy The generalized entropy used in calibration, which can be either a numeric value or a string. 
#' If numeric, \code{entropy} represents the order of Renyi's entropy, where 
#' \eqn{G(\omega) = r^{-1}(r+1)^{-1}\omega^{r+1}} if \eqn{r \neq 0, -1}.
#' If a string, valid options include: 
#' "SL" (Squared-loss, \eqn{r = 1}), "EL" (Empirical Likelihood, \eqn{r = -1}), "ET" (Exponential Tilting, \eqn{r = 0}), 
#' "HD" (Hellinger Distance, \eqn{r = -1/2}), "CE" (Cross-Entropy), and "PH" (Pseudo-Huber). See "Summary" for details.
#' @param weight.scale Positive scaling factor for the calibration weights \eqn{\omega_i}. Asymptotics justify setting \code{weight.scale} 
#' to the finite population correction (\eqn{fpc = n / N}).
#' @param G.scale Positive scaling factor for the generalized entropy function \eqn{G}. Asymptotics justify setting 
#' \code{G.scale} to the variance of the error term in a linear super-population model.
#' @param K_alpha The \eqn{K} function used in joint optimization when the \code{const} of the debiasing covariate 
#' \eqn{g(d_i)} is not available. \code{K_alpha} can be \code{NULL}, \code{"log"}, or custom functions. See "Details".  
#' @param is.total Logical, \code{TRUE} if \code{sum(const[1])} equals the population size.
#' @param del The optional threshold (\eqn{\delta}) used when Pseudo-Huber (PH) entropy is selected.
#' @param xtol Optional relative steplength tolerance in nleqslv
#' @param maxit Optional maximum number of major iterations in nleqslv
#' @param allowSingular Optional logical value indicating if a small correction to the Jacobian is allowed in nleqslv
#' \code{del = quantile(dweight, 0.75)} if not specified.
#' 
#' @return A list of class \code{calibration} including the calibration weights 
#' and data needed for estimation.
#' 
#' @details 
#' The \code{GEcal} object returns the calibration weights and necessary information for estimating population totals(or mean).
#'
#' The terms to the right of the ~ symbol in the \code{formula} argument define the calibration constraints.
#' When \code{method == "GEC"}, the debiasing covariate \code{g(dweight)} must be included in the \code{formula}.
#' If the population total(mean) of \code{g(dweight)} is unavailable, \code{const} that corresponds to \code{g(dweight)} can be set to \code{NA}.
#' In this case, \code{GECalib} performs joint optimization over
#' both the calibration weights \eqn{\omega_i} and the missing value of \code{const}.
#'
#' The length of the \code{const} vector should match the number of columns in the \code{model.matrix} generated by \code{formula}.
#' Additionally, the condition number of the \code{model.matrix} must exceed \code{.Machine$double.eps} to ensure its invertibility.
#'
#' Both \code{weight.scale} and \code{G.scale} are positive scaling factors used for calibration.
#' Note that \code{weight.scale} is not supported when \code{method == "DS"}.
#' 
#' Let \eqn{q_i} be the scaling factor for the generalized entropy function \eqn{G}, 
#' and \eqn{\phi_i} be the scaling factor for the calibration weights \eqn{\omega_i}.
#' 
#' If \code{method == "GEC"}, \code{GEcalib} minimizes the negative entropy:
#' \deqn{\sum_{i \in A} q_iG(\phi_i\omega_i),}
#' with respect to \eqn{\bm \omega} subject to the calibration constraints \eqn{\sum_{i \in A} \omega_i \bm{z}_i = \sum_{i \in U} \bm{z}_i},
#' where \eqn{\bm{z}_i^T = (\bm{x}_i^T, q_i \phi_i g(\phi_i d_i))}, \eqn{A} denotes the sample index, and \eqn{U} represents the population index. 
#' 
#' If \code{method == "GEC"}, but an element of \code{const} corresponding to the debiasing covariate 
#' \eqn{g(d_i)} is \code{NA}, \code{GEcalib} minimizes the negative adjusted entropy:
#' \deqn{\sum_{i \in A} q_iG(\phi_i\omega_i) - K(\alpha),}
#' with respect to \eqn{\bm \omega} and \eqn{\alpha} subject to the calibration constraints 
#' \eqn{\sum_{i \in A} \omega_i (\bm{x}_i^T, q_i \phi_i g(\phi_i d_i)) = \left(\sum_{i \in U} \bm x_i, \alpha \right)},
#' where the solution \eqn{\hat \alpha} is an estimate of population total for \eqn{g(d_i)}.
#' Examples of \eqn{K(\alpha)} includes \eqn{K(\alpha) = \alpha} when \code{K_alpha == NULL}, and 
#' \deqn{K(\alpha) = \left(\sum_{i \in A} d_i g(d_i) + N \right) 
#' \log \left| \frac{1}{N}\sum_{i \in A}q_i \phi_i \omega_i g(\phi_i \omega_i) + 1  \right|}
#' when \code{K_alpha == "log"}.
#' 
#' If \code{method == "GEC0"}, \code{GEcalib} minimizes the negative adjusted entropy:
#' \deqn{\sum_{i \in A} q_iG(\phi_i\omega_i) - q_i\phi_i\omega_i g(\phi_i d_i)}
#' with respect to \eqn{\bm \omega} subject to the calibration constraints \eqn{\sum_{i \in A} \omega_i \bm{x}_i = \sum_{i \in U} \bm{x}_i}.
#' 
#' If \code{method == "DS"}, \code{GEcalib} minimizes the divergence between \eqn{\bm \omega} and \eqn{\bm d}:
#' \deqn{\sum_{i \in A} q_id_i \tilde G(\omega_i / d_i)}
#' with respect to \eqn{\bm \omega} subject to the calibration constraints \eqn{\sum_{i \in A} \omega_i \bm{x}_i = \sum_{i \in U} \bm{x}_i}. 
#' When \code{method == "DS"}, \code{weight.scale}, the scaling factor for the calibration weights \eqn{\phi_i}, is not applicable.
#' 
#' Examples of \eqn{G} and \eqn{\tilde G} are given in "Summary".
#' 
#' @section Summary:
#' The table below provides a comparison between the \strong{GEC} and \strong{DS} methods.
#' 
#' \tabular{cc}{
#' \strong{GEC} \tab \strong{DS} \cr
#' \tab \cr % This line adds an empty row as a spacer
#' \eqn{\min_{\bm \omega} \left(-H(\bm \omega)\right) = \sum_{i \in A}G(\omega_i) \quad} \tab 
#' \eqn{\quad \min_{\bm \omega} D(\bm \omega, \bm d) = \sum_{i \in A}d_i \tilde G(\omega_i / d_i)} \cr
#' \tab \cr % This line adds an empty row as a spacer
#' s.t. \eqn{\sum_{i \in A} \omega_i (\bm{x}_i^T, g(d_i)) = \sum_{i \in U} (\bm{x}_i^T, g(d_i))} \tab
#' s.t. \eqn{\sum_{i \in A} \omega_i \bm{x}_i^T = \sum_{i \in U} \bm{x}_i^T} \cr
#' \tab \cr % This line adds an empty row as a spacer
#' \eqn{G(\omega) = \begin{cases} \frac{1}{r(r+1)} \omega^{r+1} & r \neq 0, -1\\ 
#' \omega \log \omega - \omega & r = 0\text{(ET)} \\ 
#' -\log \omega & r = -1\text{(EL)} \end{cases}} 
#' \tab \eqn{\tilde G(\omega) = \begin{cases} \frac{1}{r(r+1)} \left(\omega^{r+1} - (r+1)\omega + r\right) & r \neq 0, -1 \\
#' \omega \log \omega - \omega + 1 & r = 0\text{(ET)} \\
#' -\log \omega + \omega - 1 & r = -1\text{(EL)} \end{cases}} \cr
#' }
#' 
#' If \code{method == "GEC"}, further examples include
#' \deqn{G(\omega) = (\omega - 1) \log (\omega-1) - \omega \log \omega}
#' when \code{entropy == "CE"}, and
#' \deqn{G(\omega) = \delta^2 \left(1 + (\omega / \delta)^2 \right)^{1/2}}
#' for a threshold \eqn{\delta} when \code{entropy == "PH"}.
#' 
#' @author Yonghyun Kwon
#' 
#' @references
#' Kwon, Y., Kim, J., & Qiu, Y. (2024). Debiased calibration estimation using generalized entropy in survey sampling.
#' Arxiv preprint <\url{https://arxiv.org/abs/2404.01076}>
#' 
#' Deville, J. C., and Särndal, C. E. (1992). Calibration estimators in survey sampling.
#' Journal of the American statistical Association, 87(418), 376-382.
#' 
#' @examples
#' set.seed(11)
#' N = 10000
#' x = data.frame(x1 = rnorm(N, 2, 1), x2= runif(N, 0, 4))
#' pi = pt((-x[,1] / 2 - x[,2] / 2), 3);
#' pi = ifelse(pi >.7, .7, pi)
#' 
#' delta = rbinom(N, 1, pi)
#' Index_S = (delta == 1)
#' pi_S = pi[Index_S]; d_S = 1 / pi_S
#' x_S = x[Index_S,]
#' 
#' # Deville & Sarndal(1992)'s calibration using divergence
#' w1 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S,
#'                     const = colSums(cbind(1, x)),
#'                     entropy = "ET", method = "DS")$w
#' 
#' # Generalized entropy calibration without debiasing covariate 
#' w2 <- GECal::GEcalib(~ ., dweight = d_S, data = x_S,
#'                     const = colSums(cbind(1, x)),
#'                     entropy = "ET", method = "GEC0")$w
#' all.equal(w1, w2)
#' 
#' # Generalized entropy calibration with debiasing covariate
#' w3 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S,
#'                     const = colSums(cbind(1, x, log(1 / pi))),
#'                     entropy = "ET", method = "GEC")$w
#'                     
#' # Generalized entropy calibration with debiasing covariate
#' # when its population total is unknown
#' w4 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S,
#'                     const = colSums(cbind(1, x, NA)),
#'                     entropy = "ET", method = "GEC")$w
#' all.equal(w1, w4)
#' 
#' w5 <- GECal::GEcalib(~ . + g(d_S), dweight = d_S, data = x_S,
#' const = colSums(cbind(1, x, NA)),
#' entropy = "ET", method = "GEC", K_alpha = "log")$w 


#' @export
GEcalib = function(formula, dweight, data = NULL, const, 
                    method = c("GEC", "GEC0", "DS"),
                    entropy = c("SL", "EL", "ET", "CE", "HD", "PH"), 
                    # weight.bound = NULL, 
                   weight.scale = 1, G.scale = 1,
                    # opt.method = c("nleqslv", "optim", "CVXR"),
                    K_alpha = NULL, is.total = TRUE, del = NULL,
                   xtol = 1e-16, maxit = 1e5, allowSingular = T
){
  entropy <- if (is.numeric(entropy)) {
    entropy  # Assign entropy directly if it's numeric
  } else {
    switch(entropy,
           "EL" = -1, "ET" = 0, "SL" = 1, "HD" = -1 / 2,
           "CE" = "CE", "PH" = "PH",
           stop("Invalid entropy value")  # To handle unexpected values
    )
  }
  
  environment(g) <- environment(); environment(formula) <- environment()
  assign("entropy", entropy, envir = environment())
  assign("del", del, envir = environment())
  
  dweight_name <- deparse(substitute(dweight))
  if (is.null(data)) {
    mf <- model.frame(formula, environment())
    dweight0 = dweight
  } else {
    if(exists(dweight_name, where = data)){
      dweight0 = eval(substitute(dweight), envir = data)
    }else{
      assign(dweight_name, dweight, envir = environment())
      dweight0 = dweight
    }
    mf <- model.frame(formula, data = data)  # Evaluate in the provided data
  }
  
  if(is.null(del)) del = quantile(dweight0, 0.75)
  
  Xs <- model.matrix(attr(mf, "terms"), mf)
  if(length(Xs) > 0 && rcond(Xs) < .Machine$double.eps){
    stop("Error: rcond(model.matrix(formula, data)) < .Machine$double.eps")
  }
  
  if(length(const) != ncol(Xs)){
    stop("Error: length(const) != ncol(model.matrix(formula, data))")
  }
  
  # Get the name of the transformed column g(dweight)
  transformed_name <- paste0("g(", dweight_name, ")")
  
  # Find the location of the column in the model matrix
  col_position <- which(colnames(Xs) == transformed_name)
  
  if(method == "GEC" & length(col_position) == 0){
    stop("Method GEC needs g(dweight) in the formula")
  } 
  
  if(length(weight.scale) == 0) stop("weight.scale has to be positive length")
  
  if(any(weight.scale != 1)){  
    if(any(weight.scale <= 0)) stop("weight.scale has to be positive")
    if(method == "GEC" | method == "GEC0"){
      Xs[,col_position] <- weight.scale * g(dweight0 * weight.scale, entropy = entropy, del = del)
    }else if(method == "DS"){
      warning("weight.scale is not supported in DS method")
    }
  }
  Xs[,col_position] <- Xs[,col_position] * G.scale
  
  What = sum(g(dweight0 * weight.scale, entropy = entropy, del = del) * dweight0 * weight.scale * G.scale)
  if(is.null(K_alpha)){
    K_alpha = identity
  # }else if(is.total & attr(attr(mf, "terms"), "intercept") == 1 & K_alpha == "log"){
  }else if(is.total & K_alpha == "log"){    
    N = const[1]
    K_alpha = function(x) (What + N) * log(abs(x / N + 1))
  }
  
  init = rep(0, length(const))
  if(method == "GEC"){
    init[col_position] = 1
    d = rep(1, nrow(Xs))
    intercept = rep(0, length(d))
  }else if(method == "DS"){
    d = dweight0 * weight.scale
    if(entropy == 0){
      intercept = rep(0, length(d))      
    }else{
      intercept = rep(1 / entropy, length(d))
    }
  }else if(method == "GEC0"){
    d = rep(1, nrow(Xs))
    intercept = g(dweight0 * weight.scale, entropy = entropy, del = del)
  }
  
  if(length(const) == 0){
    w <- dweight0
  }else{
    if(any(is.na(const)) & method == "GEC"){
      if(all(which(is.na(const)) == col_position)){
        nlmres= nlm(targetftn, p = What, d = d, Xs = Xs, init = init,
                    const = const, entropy = entropy, del = del, 
                    weight.scale = weight.scale, G.scale = G.scale,
                    intercept = intercept, K_alpha = K_alpha,
                    maxit = maxit, allowSingular = allowSingular, xtol = xtol)
        if(nlmres$code != 1 & nlmres$code != 2 & nlmres$code != 3){
          stop(message(paste("Messeage from nlm: nlmres$code =", nlmres$code)))
        }
        W = nlmres$estimate
        if(nlmres$minimum >= .Machine$double.xmax){
          message(paste("Messeage from nlm: nlmres$code =", nlmres$code,
                        ", nlmres$minimum =", nlmres$minimum))
          warning("Convergence failed")
          w = rep(NA, length(d))
        }else{
          w = targetftn(W, d = d, Xs = Xs, init = init,
                        const = const, entropy = entropy, del = del,
                        weight.scale = weight.scale, G.scale = G.scale,
                        intercept = intercept, K_alpha = K_alpha, returnw = TRUE,
                        maxit = maxit, allowSingular = allowSingular, xtol = xtol)
        }
      }else{
        stop("NA appears in const outside g(d)")
      }
    }else{
      nleqslv_res = nleqslv::nleqslv(init, f, jac = h, d = d, Xs = Xs, 
                                     const = const, entropy = entropy, del = del,
                                     weight.scale = weight.scale, G.scale = G.scale,
                                     intercept = intercept, 
                                     control = list(maxit = maxit, allowSingular = allowSingular, xtol = xtol),
                                     xscalm = "auto")
      # control = control
      
      if(nleqslv_res$termcd != 1){
        tmpval <- f(nleqslv_res$x, d = d, Xs = Xs, 
                    const = const, entropy = entropy, del = del, 
                    weight.scale = weight.scale, G.scale = G.scale,
                    intercept = intercept)
        
        if(any(is.nan(tmpval)) | (norm(tmpval, type = "2") > 1e-3)){
          message(paste("Messeage from nleqslv: nleqslv_res$message =", nleqslv_res$message,
                        ", norm(tmpval, type = \"2\") =", norm(tmpval, type = "2"),
                        ", nleqslv_res$termcd =", nleqslv_res$termcd))
          warning("Convergence failed")
          w = rep(NA, length(d))   
        }else{
          w = NULL
        }
      }else{
        w = NULL  
      }
      if(is.null(w)){
        w = f(nleqslv_res$x, d = d, Xs = Xs, 
              const = const, entropy = entropy, del = del, 
              weight.scale = weight.scale, G.scale = G.scale,
              intercept = intercept, returnw = TRUE)    
      }
    }
  }
  res_list <- list(w = w, method = method, entropy = entropy,
                   Xs = Xs, weight.scale = weight.scale, G.scale = G.scale,
                   dweight0 = dweight0, del = del, const = const, What = What,
                   col_position = col_position, K_alpha = K_alpha)
  class(res_list) <- "calibration"
  return(res_list)
}