R/wroc.R
In svyROC: Estimation of the ROC Curve and the AUC for Complex Survey Data

Documented in wroc

#' Estimation of the ROC curve of logistic regression models with complex survey data
#'
#' Calculate the ROC curve of a logistic regression model
#' considering sampling weights with complex survey data
#' @param response.var A character string with the name of the column indicating the response variable in the data set
#' or a vector (either numeric or character string) with information of the response variable for all the units.
#' @param phat.var A character string with the name of the column indicating the estimated probabilities in the data set
#' or a numeric vector containing estimated probabilities for all the units.
#' @param weights.var A character string indicating the name of the column with sampling weights or
#' a numeric vector containing information of the sampling weights.
#' It could be \code{NULL} if the sampling design is indicated in the \code{design} argument.
#' For unweighted estimates, set all the sampling weight values to 1.
#' @param tag.event A character string indicating the label used to indicate the event of interest in \code{response.var}.
#' The default option is \code{tag.event = NULL}, which selects the class with the lowest number of units as event.
#' @param tag.nonevent A character string indicating the label used for non-event in \code{response.var}.
#' The default option is \code{tag.nonevent = NULL}, which selects the class with the greatest number of units as non-event.
#' @param data A data frame which, at least, must incorporate information on the columns
#' \code{response.var}, \code{phat.var} and \code{weights.var}.
#' If \code{data=NULL}, then specific numerical vectors must be included in
#' \code{response.var}, \code{phat.var} and \code{weights.var},
#' or the sampling design should be indicated in the argument \code{design}.
#' @param design An object of class \code{survey.design} generated by
#' \code{survey::svydesign} indicating the complex sampling design of the data.
#' If \code{design = NULL}, information on the data set (argument \code{data})
#' and/or sampling weights (argument \code{weights.var}) must be included.
#' @param cutoff.method A character string indicating the method to be used to select the optimal cut-off point.
#' If `cutoff.method = NULL`, then no optimal cut-off point is calculated.
#' If an optimal cut-off point is to be calculated, one of the following methods needs to be selected:
#' `Youden`, `MaxProdSpSe`, `ROC01`, `MaxEfficiency`.
#'
#' @return The output object of this function is a list of class `wroc`, which contains information about the weighted ROC curve of a logistic regression model and some of its components. In particular, this list contains a total of 5 or 6 elements (depending on the selected arguments) with the following information:
#' - `wroc.curve`: this element is a list that contains three numerical vectors. Specifically,
#'   - `Sew.values`: a vector of all the different values for the weighted estimate of the sensitivity across all the possible cut-off points.
#'   - `Spw.values`: a vector of all the different values for the weighted estimate of the specificity across all the possible cut-off points.
#'   - `cutoffs`: this vector contains all the cut-off points that have been considered to estimate sensitivity and specificity parameters.
#' - `wauc`: a numeric value indicating the area under the weighted estimate of the ROC curve.
#' - `optimal.cutoff`: if the argument `cutoff.method != NULL`, this object is a list containing the 4 elements described below:
#'   - `method`: character string indicating the method implemented to calculate the optimal cut-off point.
#'   - `cutoff.value`: the optimal cut-off point value.
#'   - `Spw`: the weighted estimate of the specificity for the optimal cut-off point value (indicated in `cutoff.value`).
#'   - `Sew`: the weighted estimate of the sensitivity for the optimal cut-off point value (indicated in `cutoff.value`).
#' - `tags`: a list containing two elements with the following information:
#'   - `tag.event`: a character string indicating the event of interest.
#'   - `tag.nonevent`: a character string indicating the non-event.
#' - `basics`: a list containing information of the following 4 elements:
#'   - `n.event`: number of units with the event of interest in the data set.
#'   - `n.nonevent`: number of units without the event of interest in the data set.
#'   - `hatN.event`: number of units with the event of interest represented in the population by all the event units in the data set, i.e., the sum of the sampling weights of the units with the event of interest in the data set.
#'   - `hatN.nonevent`: a numeric value indicating the number of non-event units in the population represented by means of the non-event units in the data set, i.e., the sum of the sampling weights of the non-event units in the data set.
#' - `call`: an object saving the information about the way in which the function has been run.
#'
#' @details
#' \eqn{S} indicate a sample of \eqn{n} observations of the vector of random variables \eqn{(Y,\pmb X)}, and \eqn{\forall i=1,\ldots,n,} \eqn{y_i} indicate the \eqn{i^{th}} observation of the response variable \eqn{Y},
#' and \eqn{\pmb x_i} the observations of the vector covariates \eqn{\pmb X}. Let \eqn{w_i} indicate the sampling weight corresponding to the unit \eqn{i} and \eqn{\hat p_i} the estimated probability of event.
#' Let \eqn{S_0} and \eqn{S_1} be subsamples of \eqn{S}, formed by the units without the event of interest (\eqn{y_i=0}) and with the event of interest (\eqn{y_i=1}), respectively.
#' Then, the ROC curve is estimated as follows:
#' \deqn{\widehat{ROC}_w(\cdot)=\{(1-\widehat{Sp}_w(c),\widehat{Se}_w(c)),\:c\in (-\infty, \infty)\}},
#' where, the sensitivity and specificity parameters for a given cut-off point \eqn{c} are estimated as follows:
#' \deqn{\widehat{Se}_w(c)=\dfrac{\sum_{i\in S_1}w_i\cdot I (\hat p_i\geq c)}{\sum_{i\in S_1}w_i}\:;\:\widehat{Sp}_w(c)=\dfrac{\sum_{i\in S_0}w_i\cdot I (\hat p_i<c)}{\sum_{i\in S_0}w_i}.}
#'
#' See Iparragirre et al (2023) for more information. More information of the rest of the elements is given in the documentation of the functions \code{wauc()} and \code{wocp()}.
#'
#'
#' @references Iparragirre, A., Barrio, I. and Arostegui, I. (2023).
#' Estimation of the ROC curve and the area under it with complex survey data.
#' *Stat* **12**(1), e635. (https://doi.org/10.1002/sta4.635)
#'
#' @export
#'
#'
#' @examples
#' data(example_data_wroc)
#'
#' mycurve <- wroc(response.var = "y", phat.var = "phat", weights.var = "weights",
#'                 data = example_data_wroc,
#'                 tag.event = 1, tag.nonevent = 0,
#'                 cutoff.method = "Youden")
#'
#' # Or equivalently
#' \donttest{
#' mycurve <- wroc(response.var = example_data_wroc$y,
#'                 phat.var = example_data_wroc$phat,
#'                 weights.var = example_data_wroc$weights,
#'                 tag.event = 1, tag.nonevent = 0,
#'                 cutoff.method = "Youden")
#'}
wroc <- function(response.var, phat.var, weights.var = NULL,
                 tag.event = NULL, tag.nonevent = NULL,
                 data = NULL, design = NULL,
                 cutoff.method = NULL){

  if(inherits(response.var, "character")){
    response.var <- data[,response.var]
  }

  if(length(table(response.var))!=2){stop("Response variable must have two classes.")}

  if(inherits(phat.var, "character")){
    phat.var <- data[,phat.var]
  }

  if(inherits(weights.var, "character")){
    weights.var <- data[,weights.var]
  }

  if(!is.null(design)){
    data <- get(design$call$data)
    weights <- as.character(design$call$weights[2])
    weights.var <- data[,weights]
  }

  if(length(response.var) != length(phat.var) || length(response.var) != length(weights.var)){
    stop("Vectors indicating the responses, predicted probabilities and sampling weights must be the same length.")
  }

  # Missing tags:
  if(is.null(tag.nonevent)){
    if(is.null(tag.event)){
      tag.nonevent <- names(table(response.var)[which.max(table(response.var))])
      tag.event <- names(table(response.var)[which.min(table(response.var))])
    } else {
      tags <- names(table(response.var))
      tag.other <- which(tags == tag.event)
      tag.nonevent <- tags[-tag.other]
    }
  } else {
    if(is.null(tag.event)){
      tags <- names(table(response.var))
      tag.other <- which(tags == tag.nonevent)
      tag.event <- tags[-tag.other]
    }
  }


  all.phat <- sort(unique(phat.var))

  lower <- all.phat[1:(length(all.phat)-1)]
  upper <- all.phat[2:length(all.phat)]
  cutoffs <- rev(c(0, (lower+upper)/2, 1))

  # ROCw curve:
  sew.v <- sapply(lapply(cutoffs, wse, response.var = response.var, tag.event = tag.event, phat.var = phat.var, weights.var = weights.var), "[[", 1)
  spw.v <- sapply(lapply(cutoffs, wsp, response.var = response.var, tag.nonevent = tag.nonevent, phat.var = phat.var, weights.var = weights.var), "[[", 1)

  res <- list()
  res$wroc.curve <- list(Sew.values = sew.v, Spw.values = spw.v, cutoffs = cutoffs)

  # Area under the ROCw curve:
  #auc <- wauc(response.var, phat.var, weights.var, tag.event, tag.nonevent, data)
  lc <- length(cutoffs)
  auc <- sum(spw.v[-lc]*sew.v[-1]-spw.v[-1]*sew.v[-lc])/2
  res$wauc <- auc

  # Cut-off point:
  if(!is.null(cutoff.method)){
    if(any(!(cutoff.method %in% c("MaxProdSpSe","ROC01","Youden","MaxEfficiency")))) {
      stop ("You have entered an invalid method.", call. = FALSE)
    }
    ocp <- wocp(response.var = response.var, phat.var = phat.var,
                weights.var=weights.var,
                tag.nonevent=tag.nonevent,
                method=cutoff.method)
    res$optimal.cutoff$method <- cutoff.method
    res$optimal.cutoff$cutoff.value <- ocp[["optimal.cutoff"]][["optimal"]][["cutoff"]]
    res$optimal.cutoff$Spw <- ocp[["optimal.cutoff"]][["optimal"]][["Spw"]]
    res$optimal.cutoff$Sew <- ocp[["optimal.cutoff"]][["optimal"]][["Sew"]]
  }

  # Output object:
  res$tags <- list()
  res$tags$tag.event <- as.character(tag.event)
  res$tags$tag.nonevent <- as.character(tag.nonevent)
  res$basics <- list()
  res$basics$n.event <- length(which(response.var==tag.event))
  res$basics$n.nonevent <- length(which(response.var==tag.nonevent))
  res$basics$hatN.event <- sum(weights.var[which(response.var==tag.event)])
  res$basics$hatN.nonevent <- sum(weights.var[which(response.var==tag.nonevent)])
  res$call <- match.call()

  class(res) <- "wroc"

  return(res)

}