R/wsp.R
In svyROC: Estimation of the ROC Curve and the AUC for Complex Survey Data

Documented in wsp

#' Estimation of the specificity with complex survey data
#'
#' Estimate the specificity parameter for a given cut-off point
#' considering sampling weights with complex survey data.
#' @param response.var A character string with the name of the column indicating the response variable in the data set
#' or a vector (either numeric or character string) with information of the response variable for all the units.
#' @param phat.var A character string with the name of the column indicating the estimated probabilities in the data set
#' or a numeric vector containing estimated probabilities for all the units.
#' @param weights.var A character string indicating the name of the column with sampling weights or
#' a numeric vector containing information of the sampling weights.
#' It could be \code{NULL} if the sampling design is indicated in the \code{design} argument.
#' For unweighted estimates, set all the sampling weight values to 1.
#' @param tag.nonevent A character string indicating the label used for non-event in \code{response.var}.
#' The default option is \code{tag.nonevent = NULL}, which selects the class with the greatest number of units as non-event.
#' @param cutoff.value A numeric value indicating the cut-off point to be used.
#' No default value is set for this argument, and a numeric value must be indicated necessarily.
#' @param data A data frame which, at least, must incorporate information on the columns
#' \code{response.var}, \code{phat.var} and \code{weights.var}.
#' If \code{data=NULL}, then specific numerical vectors must be included in
#' \code{response.var}, \code{phat.var} and \code{weights.var},
#' or the sampling design should be indicated in the argument \code{design}.
#' @param design An object of class \code{survey.design} generated by
#' \code{survey::svydesign} indicating the complex sampling design of the data.
#' If \code{design = NULL}, information on the data set (argument \code{data})
#' and/or sampling weights (argument \code{weights.var}) must be included.
#'
#' @return The output of this function is a list of 4 elements containing the following information:
#' - `Spw`: a numeric value indicating the weighted estimate of the specificity parameter.
#' - `tags`: a list containing one element with the following information:
#'    - `tag.nonevent`: a character string indicating the label used for non-events.
#' - `basics`: a list containing information of the following 6 elements:
#'    - `n`: a numeric value indicating the number of units in the data set.
#'    - `n.nonevent`: a numeric value indicating the number of units in the data set without the event of interest.
#'    - `n.nonevent.class`: a numeric value indicating the number of units in the data set without the event of interest that are correctly classified as non-events based on the selected cut-off point.
#'    - `hatN`: a numeric value indicating the number of units in the population that are represented by means of the units in the data set, i.e., the sum of the sampling weights of all the units in the data set.
#'    - `hatN.nonevent`: a numeric value indicating the number of non-event units in the population represented by means of the non-event units in the data set, i.e., the sum of the sampling weights of the non-event units in the data set.
#'    - `hatN.nonevent.class`: number of non-event units represented in the population by the non-event units in the data set that have been correctly classified as non-events based on the selected cut-off point, i.e., the sum of the sampling weights of the correctly classified non-event units in the data set.
#' - `call`: an object saving the information about the way in which the function has been run.
#'
#' @details
#' Let \eqn{S} indicate a sample of \eqn{n} observations of the vector of random variables \eqn{(Y,\pmb X)}, and \eqn{\forall i=1,\ldots,n,} \eqn{y_i} indicate the \eqn{i^{th}} observation of the response variable \eqn{Y},
#' and \eqn{\pmb x_i} the observations of the vector covariates \eqn{\pmb X}. Let \eqn{w_i} indicate the sampling weight corresponding to the unit \eqn{i} and \eqn{\hat p_i} the estimated probability of event.
#' Let \eqn{S_0} and \eqn{S_1} be subsamples of \eqn{S}, formed by the units without the event of interest (\eqn{y_i=0}) and with the event of interest (\eqn{y_i=1}), respectively.
#' Then, the specificity parameter for a given cut-off point \eqn{c} is estimated as follows:
#' \deqn{\widehat{Sp}_w(c)=\dfrac{\sum_{i\in S_0}w_i\cdot I (\hat p_i<c)}{\sum_{i\in S_0}w_i}.}
#' See Iparragirre et al. (2022) and Iparragirre et al. (2023) for more details.
#'
#' @references Iparragirre, A., Barrio, I., Aramendi, J. and Arostegui, I. (2022).
#' Estimation of cut-off points under complex-sampling design data.
#' *SORT-Statistics and Operations Research Transactions* **46**(1), 137--158. (https://doi.org/10.2436/20.8080.02.121)
#' @references Iparragirre, A., Barrio, I. and Arostegui, I. (2023).
#' Estimation of the ROC curve and the area under it with complex survey data.
#' *Stat* **12**(1), e635. (https://doi.org/10.1002/sta4.635)
#'
#' @export
#' @examples
#' data(example_data_wroc)
#'
#' sp.obj <- wsp(response.var = "y",
#'               phat.var = "phat",
#'               weights.var = "weights",
#'               tag.nonevent = 0,
#'               cutoff.value = 0.5,
#'               data = example_data_wroc)
#'
#' # Or equivalently
#' sp.obj <- wsp(response.var = example_data_wroc$y,
#'               phat.var = example_data_wroc$phat,
#'               weights.var = example_data_wroc$weights,
#'               tag.nonevent = 0,
#'               cutoff.value = 0.5)
#' sp.obj

wsp <- function(response.var, phat.var, weights.var = NULL, tag.nonevent = NULL, cutoff.value, data = NULL, design = NULL){

  if(missing(cutoff.value) || !is.numeric(cutoff.value)){stop("Please, indicate a numeric value in the 'cutoff.value' argument to be used as cut-off point.")}

  if(inherits(response.var, "character")){
    response.var <- data[,response.var]
  }

  if(length(table(response.var))!=2){stop("Response variable must have two classes.")}

  if(inherits(phat.var, "character")){
    phat.var <- data[,phat.var]
  }

  if(inherits(weights.var, "character")){
    weights.var <- data[,weights.var]
  }

  if(!is.null(design)){
    data <- get(design$call$data)
    weights <- as.character(design$call$weights[2])
    weights.var <- data[,weights]
  }

  if(length(response.var) != length(phat.var) || length(response.var) != length(weights.var)){
    stop("Vectors indicating the responses, predicted probabilities and sampling weights must be the same length.")
  }

  if(is.null(tag.nonevent)){
    tag.nonevent <- names(table(response.var)[which.max(table(response.var))])
  }

  sum.weight <- sum(weights.var[which(response.var == tag.nonevent)])
  yhat.nonevent.w <- sum(weights.var[which(response.var == tag.nonevent & phat.var < cutoff.value)])

  spw.hat <- yhat.nonevent.w/sum.weight

  spw.obj <- list()

  spw.obj$Spw <- spw.hat
  spw.obj$tags <- list(tag.nonevent = as.character(tag.nonevent))
  spw.obj$basics <- list(n = length(response.var),
                         n.nonevent = length(which(response.var == tag.nonevent)),
                         n.nonevent.class = length(which(response.var == tag.nonevent & phat.var < cutoff.value)),
                         hatN = sum(weights.var),
                         hatN.nonevent = sum.weight,
                         hatN.nonevent.class = yhat.nonevent.w)
  spw.obj$call <- match.call()

  return(spw.obj)

}