R/MagnitudeRule.R
In GaussSuppression: Tabular Data Suppression using Gaussian Elimination

Documented in DominanceRule FindDominantCells MagnitudeRule PPercentRule

#' Dominance `(n,k)` or p% rule for magnitude tables
#'
#' Supports application of multiple values for `n` and `k`. The function works
#' on magnitude tables containing negative cell values by calculating
#'  contribution based on absolute values.
#'
#' This method only supports suppressing a single numeric variable. There are
#' multiple ways of handling sampling weights in the dominance rule. the default
#' method implemented here compares unweighted sample values with the corresponding
#' weighted cell totals. if `domWeightMethod` is set to `"tauargus"`, the
#' method implemented in tauArgus is used. For more information on this
#' method, see "Statistical Disclosure Control" by Hundepool et al (2012,
#'  p. 151).
#'
#' @param data the dataset
#' @param x ModelMatrix generated by parent function
#' @param numVar vector containing numeric values in the data set
#' @param n Parameter `n` in dominance rule.
#' @param k Parameter `k` in dominance rule.
#' @param pPercent Parameter in the p% rule, when non-NULL.  
#'                 Parameters `n` and  `k` will then be ignored.
#'                 Technically, calculations are performed internally as if 
#'                 `n = 1:2`. The results of these intermediate calculations can 
#'                 be viewed by setting `allDominance = TRUE`.
#' @param protectZeros Parameter determining whether cells with value 0 should
#'  be suppressed.
#'  Unless `structuralEmpty` is `TRUE` (see below), cells that result in a value 
#'  of 0 due to removed `removeCode` contributions are also suppressed.
#' @param charVar Variable in data holding grouping information. Dominance will
#'  be calculated after aggregation within these groups.
#' @param removeCodes A vector of `charVar` codes that are to be excluded when 
#'     calculating dominance percentages. Essentially, the corresponding numeric 
#'     values from `dominanceVar` or `numVar` are set to zero before proceeding 
#'     with the dominance calculations. With empty `charVar` row indices are 
#'     assumed and conversion to integer is performed.
#'     See also `removeCodesFraction` below.
#' @param removeCodesFraction Numeric value(s) in the range `[0, 1]`. 
#'    This can be either a single value or a vector with the same length as `removeCodes`. 
#'    A value of 1 represents the default behavior, as described above. 
#'    A value of 0 indicates that dominance percentages are calculated as if `removeCodes` 
#'    were not removed, but percentages associated with `removeCodes` are still excluded 
#'    when identifying major contributions. Values between 0 and 1 modify the 
#'    contributions of `removeCodes` proportionally in the calculation of percentages.
#' @param sWeightVar variable with sampling weights to be used in dominance rule
#' @param domWeightMethod character representing how weights should be treated
#' in the dominance rule. See Details.
#' @param allDominance Logical. If `TRUE`, additional information is included in the output. 
#'   When `n = 2`, the following variables are added:
#'   - `"dominant2"`: The fraction associated with the dominance rule.
#'   - `"max2contributor"`: IDs associated with the second largest contribution. These IDs 
#'       are taken from `charVar` if provided, or the row indices if `charVar` is not supplied.
#'   - `"n_contr"` and `"n_non0_contr"`: Outputs from \code{\link[SSBtools]{max_contribution}}. 
#'       If `removeCodes` is used as input, `"n_contr_all"` and `"n_non0_contr_all"` are also included.
#'       The parameter `max_contribution_output` can be used to specify custom outputs 
#'       from \code{\link[SSBtools]{max_contribution}}. Note that if `max_contribution_output` 
#'       is provided, only the specified outputs will be included, and the default outputs 
#'       (`"n_contr"` and `"n_non0_contr"`) will not be added unless explicitly listed.    
#' @param outputWeightedNum logical value to determine whether weighted numerical
#' value should be included in output. Default is `TRUE` if `sWeightVar` is provided.
#' @param dominanceVar When specified, `dominanceVar` is used in place of `numVar`. 
#'          Specifying `dominanceVar` is beneficial for avoiding warnings when there 
#'          are multiple `numVar` variables. Typically, `dominanceVar` will be one 
#'          of the variables already included in `numVar`.
#' @param structuralEmpty Parameter as input to \code{\link{GaussSuppressionFromData}}.
#'            It is needed also here to handle structural zeros caused by `removeCodes`.
#' @param max_contribution_output See the description of the `allDominance` parameter.           
#' @param apply_abs_directly Logical. Determines how negative values are treated in the rules. 
#'   When `apply_abs_directly = FALSE` (default), absolute values are taken after summing 
#'   contributions, as performed by \code{\link[SSBtools]{max_contribution}}. 
#'   When `apply_abs_directly = TRUE`, absolute values are computed directly on the input values, 
#'   prior to any summation. This corresponds to the old behavior of the function.
#' @param num Output numeric data generated by parent function. 
#'            This parameter is needed when `protectZeros` is `TRUE`.
#' @param ... unused parameters
#'
#' @return logical vector that is `TRUE` in positions corresponding to cells
#' breaching the dominance rules.
#' @export
#' @importFrom SSBtools max_contribution
#' 
#' @examples
#'   set.seed(123)
#' z <- SSBtools::MakeMicro(SSBtoolsData("z2"), "ant")
#' z$value <- sample(1:1000, nrow(z), replace = TRUE)
#' 
#' GaussSuppressionFromData(z, dimVar = c("region", "fylke", "kostragr", "hovedint"), 
#' numVar = "value", candidates = CandidatesNum, primary = DominanceRule, preAggregate = FALSE,
#' singletonMethod = "sub2Sum", n = c(1, 2), k = c(65, 85), allDominance = TRUE)
#' 
#' 
#'num <- c(100,
#'          90, 10,
#'          80, 20,
#'          70, 30,
#'          50, 25, 25,
#'          40, 20, 20, 20,
#'          25, 25, 25, 25)
#' v1 <- c("v1",
#'         rep(c("v2", "v3", "v4"), each = 2),
#'         rep("v5", 3),
#'         rep(c("v6", "v7"), each = 4))
#' sw <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1)
#' d <- data.frame(v1 = v1, num = num, sw = sw)
#' 
#' # without weights
#' GaussSuppressionFromData(d, formula = ~v1 - 1, 
#'  numVar = "num",  n = c(1,2), k = c(80,70),
#'   preAggregate = FALSE, allDominance = TRUE, candidates = CandidatesNum,
#'   primary = DominanceRule)
#'
#' # with weights, standard method
#' GaussSuppressionFromData(d, formula = ~v1 - 1,
#'  numVar = "num",  n = c(1,2), k = c(80,70), sWeightVar = "sw",
#'  preAggregate = FALSE, allDominance = TRUE, candidates = CandidatesNum,
#'  primary = DominanceRule)
#'
#' # with weights, tauargus method
#' GaussSuppressionFromData(d, formula = ~v1 - 1,
#'  numVar = "num",  n = c(1,2), k = c(80,70), sWeightVar = "sw",
#'  preAggregate = FALSE, allDominance = TRUE, candidates = CandidatesNum,
#'  primary = DominanceRule, domWeightMethod = "tauargus")
#'
#' @author Daniel Lupp and Øyvind Langsrud
#'
MagnitudeRule <- function(data,
                          x,
                          numVar,
                          n = NULL,
                          k = NULL,
                          pPercent = NULL,
                          protectZeros = FALSE,
                          charVar = NULL,
                          removeCodes = character(0), 
                          removeCodesFraction = 1,
                          sWeightVar = NULL,
                          domWeightMethod = "default",
                          allDominance = FALSE,
                          outputWeightedNum = !is.null(sWeightVar),
                          dominanceVar = NULL,
                          structuralEmpty = FALSE,
                          apply_abs_directly = FALSE,  
                          max_contribution_output = NULL,
                          num,
                          ...) {
  if (!is.null(pPercent)) {
    n <- 1:2
    k <- c(0, 0)
  }
  
  if (length(n) != length(k))
    stop("You must provide an equal number of inputs for n and k.")
  
  if(length(dominanceVar)){
    if(length(dominanceVar) != 1){
      stop("dominanceVar must be a single variable")
    }
    numVar <- dominanceVar
  }
  
  
  if (is.null(numVar))
    stop("You must provide a numeric variable numVar to use the dominance rule.")
  
  tauArgusDominance <- as.character(domWeightMethod) == "tauargus"
  
  if (length(charVar) & tauArgusDominance)
    stop("the tauArgus weight method does not work with charVar.")
  if (length(numVar) > 1) {
    warning("Multiple numVar were supplied, only the first is suppressed.")
    numVar <- numVar[1]
  }
  if (!is.null(sWeightVar) & tauArgusDominance) {
    if (any(data[[sWeightVar]] < 1))
      warning("Some sample weights are < 1. Consider using other weighted domininace method.")
  }
  
  
  abs_inputnum <- data[, numVar, drop = FALSE]
  if (apply_abs_directly) {
    abs_inputnum <- abs(abs_inputnum)
  } 
  
  if (is.null(sWeightVar)) {
    sweight <- as.matrix(rep(1, nrow(data)))
    sWeightVarName <- "one.weight"
  }
  else {
    sweight <- as.matrix(data[, sWeightVar, drop = FALSE])
    sWeightVarName <- sWeightVar 
  }
  
  if (length(charVar)) {
    if (length(charVar) == 1) {
      charVar_groups <- data[[charVar]]
    } else {
      stop("Only single charVar implemented")
    }
  } else {
    charVar_groups <- NULL
  }
  
  if (length(removeCodes)) {
    if (length(charVar)) {
      remove_fraction <- rep_len(removeCodesFraction, length(removeCodes))
      names(remove_fraction) <- removeCodes
    } else {
      removeCodesFraction <- rep_len(removeCodesFraction, length(removeCodes))
      remove_fraction <- rep_len(NA_integer_, nrow(x))
      remove_fraction[removeCodes] <- removeCodesFraction
    }
  } else {
    remove_fraction <- NULL
  }

  
  sweight_original <- sweight  # For outputWeightedNum below
  
  
  #
  # Now changed below, but sweight <- NULL is important 
  # Trick to call FindDominantCells without sweight
  if(!is.null(charVar_groups) & !tauArgusDominance){
    abs_num <- as.data.frame(as.matrix(crossprod(x, as.matrix(abs_inputnum * as.vector(sweight)))))
    sweight <- NULL
  } else {
    abs_num <- as.data.frame(as.matrix(crossprod(x, as.matrix(abs_inputnum)))) # as before 
  }
  
  abs_inputnum <- abs_inputnum[[numVar]]

  
  # In old code only abs_num == 0 was used
  if (protectZeros) {
    zeros_to_be_protected <- num[[numVar]] == 0    # ordinary zeros
    zeros_to_be_protected[abs_num == 0] <- TRUE    # possible extra zeros after removeCodes 
    if (structuralEmpty) {
        zeros_to_be_protected[colSums(x) == 0] <- FALSE
    } 
  }
  
  index <- !is.null(sweight)
  
  mc_output <- c("y", "sums") 

  if(index | allDominance){
    mc_output <- c(mc_output, "id")
  }
  if (allDominance) {
    if (is.null(max_contribution_output)) {
      if (is.null(remove_fraction)) {
        max_contribution_output <- c("n_contr", "n_non0_contr")
      } else {
        max_contribution_output <- c("n_contr", "n_non0_contr", "n_contr_all", "n_non0_contr_all")
      }
    }
    mc_output <- unique(c(mc_output, max_contribution_output))
  }
    
  max_contribution_ <- max_contribution(x,
                                        abs_inputnum,
                                        n = max(n),
                                        id = charVar_groups,
                                        output = mc_output, 
                                        drop = FALSE, 
                                        remove_fraction = remove_fraction)
  
  
  if(index) {
    maxContribution <- max_contribution_[["id"]]
  } else {
    maxContribution <- max_contribution_[["y"]]
  }
  
  if (allDominance) {
    maxContribution_id <- max_contribution_[["id"]]
    colnames(maxContribution_id) <- paste0("max", seq_len(max(n)) ,"contributor")
    maxContribution_info <- cbind(as.data.frame(maxContribution_id),
                                  as.data.frame(max_contribution_[max_contribution_output]))
                                  
  }
  
    if(!is.null(charVar_groups) & !tauArgusDominance & !is.null(sWeightVar)) {
      abs_num <- max_contribution(x,
                       abs_inputnum * as.vector(sweight_original),
                       n = max(n),
                       id = charVar_groups,
                       output = "sums", 
                       drop = TRUE, 
                       remove_fraction = remove_fraction)
    } else {
      abs_num <- max_contribution_[["sums"]]
    }
    abs_num <- as.data.frame(matrix(abs_num,  dimnames = list(NULL, numVar)))
    if(!is.null(charVar_groups) & !tauArgusDominance) {
      sweight <- NULL
    }
    
    if (protectZeros) {
      zeros_to_be_protected <- max_contribution_[["sums"]] == 0    
      if (structuralEmpty) {
        zeros_to_be_protected[max_contribution_[["n_contr"]] == 0] <- FALSE
      }
    }
  
  prim <-
    mapply(
      function (a, b)
        FindDominantCells(
          x,
          abs_inputnum,
          abs_num,
          a,
          b,
          charVar_groups = charVar_groups,
          samplingWeight = sweight,
          tauArgusDominance = tauArgusDominance,
          returnContrib = TRUE,
          maxContribution = maxContribution
        ),
      n,
      k
    )
  if (is.null(pPercent)) {
    primary <- sapply(seq_len(ncol(prim)), function(x) prim[, x] >= k[x]/100)
    dominant <- apply(primary, 1, function(x) Reduce(`|`, x))
  } else {
    dominant <- abs(1 - prim[, 2]) < abs(pPercent/100 * prim[, 1])
  }
  colnames(prim) <- paste0("dominant", paste(n, sep = ""))
  if (!protectZeros)
    output <- list(primary = dominant)
  else
    output <- list(primary = (dominant | zeros_to_be_protected))
  if (outputWeightedNum) {
    wnum <- data.frame(v1 = as.vector(crossprod(x, sweight_original)),
                       v2 = as.vector(crossprod(x, sweight_original * data[[numVar]])))
    names(wnum) <- c(sWeightVarName, paste0("weighted.", numVar))
    output[["numExtra"]] <- wnum
  }
  if (allDominance) {
    numExtra <- cbind(as.data.frame(prim),
                      maxContribution_info)
    if ("numExtra" %in% names(output))
      output[["numExtra"]] <- cbind(output[["numExtra"]], numExtra)
    else 
      output[["numExtra"]] <- numExtra
  }
  if (length(names(output)) == 1)
    output <- unlist(output)
  output
}


#' @rdname MagnitudeRule
#' @note Explicit  `protectZeros` in wrappers 
#'       since default needed by \code{\link{GaussSuppressionFromData}}
#' @export
DominanceRule <- function(data, n, k, 
                          protectZeros = FALSE, ...) {
  MagnitudeRule(data = data, n = n, k = k, 
                protectZeros = protectZeros, ...) 
}


#' @rdname MagnitudeRule
#' @export
PPercentRule <- function(data, pPercent,  
                         protectZeros = FALSE, ...) {
  MagnitudeRule(data = data, pPercent = pPercent,
                protectZeros = protectZeros, ...)
}



#' Method for finding dominant cells according to (possibly multiple) n,k
#' dominance rules.
#'
#' Supports functionality for grouping contributions according to holding
#' variables, as well as calculating dominance in surveys with a given sampling
#' weight. Two methods are implemented, depending on whether the sampling
#' weights sum to total population. The parameter `tauArgusDominance`
#' determines this. If `FALSE`, unweighted contributions are compared to weighted
#' cell values. If `TRUE`, the method described in  in the
#' book "Statistical Disclosure Control" (Hundepool et al 2012, p. 151) is used.
#'
#' @param x model matrix describing relationship between input and published
#' cells
#' @param inputnum vector of numeric contributions for each of the input records
#' @param num vector of numeric values for each of the published cells
#' @param n vector of integers describing n parameters in n,k rules. Must be
#' same length as `k` parameter.
#' @param k vector of numeric values describing k parameters in n,k rules, where
#' percentages are described as numbers less than 100. Must be same length as
#' `n` parameter.
#' @param charVar_groups vector describing which input records should be grouped
#' @param samplingWeight vector of sampling weights associated to input records
#' @param tauArgusDominance logical value, default `FALSE`. determines how to
#' handle sampling weights in the dominance rule (see details).
#' @param returnContrib logical value, default `FALSE`. If `TRUE` return value is 
#' the percentage of the first n contributors
#' @param maxContribution Possible precalculated output from `MaxContribution` as input. 
#'                        To speed up. 
#'
#' @return logical vector describing which publish-cells need to be suppressed.
#'
FindDominantCells <- function(x,
                              inputnum,
                              num,
                              n,
                              k,
                              charVar_groups,
                              samplingWeight,
                              tauArgusDominance = FALSE,
                              returnContrib = FALSE, 
                              maxContribution = NULL) {
  
  if (is.null(samplingWeight)) {
    # without sampling weight, calculate dominance directly from numerical values
    if (is.null(maxContribution)) {
      max_cont <- MaxContribution(x, inputnum, n = n, groups = charVar_groups)
    } else {
      max_cont <- maxContribution[, 1:n, drop = FALSE]
    }
    
    max_cont[is.na(max_cont)] <- 0
    if (returnContrib) {
      out <- as.vector(rowSums(max_cont)/unlist(num))
      out[is.nan(out)] <- 0
      return(out)
    } else {
      return(as.vector(num > 0 & rowSums(max_cont) > num * k / 100))
    }
  } else {
    # with sampling weights, need to weight the numerical values
    if (is.null(maxContribution)) {
      max_cont_index <-
        MaxContribution(x,
                        inputnum,
                        n = n,
                        groups = charVar_groups,
                        index = TRUE)
    } else {
      max_cont_index <- maxContribution[, 1:n, drop = FALSE]
    }
    
    max_cont <-
      apply(max_cont_index, 2, function(t)
        ifelse(is.na(t), 0, inputnum[t]))
    
    if (!tauArgusDominance) {
      weighted_num <- crossprod(x, inputnum * samplingWeight)
      ncontributions <- rowSums(max_cont)
    }
    else {
      cont_weights <-
        apply(max_cont_index, 2, function(t)
          samplingWeight[t])
      # last_index_t describes cumulative sum of contributing weights, used to
      # calculate which of the contributions need to be considered
      if (n == 1)
        last_index_t <- cont_weights
      else
        last_index_t <-
          t(apply(cont_weights, 1, function(t)
            cumsum(t)))
      # index of last contribution to be added to upsampled contribution
      last_index <- apply(last_index_t, 1,
                          function(t) {
                            ind <- which(t >= n)[1]
                            ifelse(!length(ind), sum(!is.na(t)), ind)
                          })
      # only keep the first n contributors (counting weights) and ensure
      # rowSums(cont_weights) == n
      for (ind in seq(length(last_index))) {
        cont_weights[ind, last_index[ind]] <- ifelse(last_index[ind] == 1,
                                                     n,
                                                     n - last_index_t[ind, last_index[ind] - 1])
        cont_weights[ind, 1:ncol(cont_weights) > last_index[ind]] <-
          0
      }
      cont_weights[is.na(cont_weights)] <- 0
      # sampling weights multiplied with contributions and added up
      ncontributions <- rowSums(max_cont * cont_weights)
      weighted_num <- crossprod(x, inputnum * samplingWeight)
    }
    if (returnContrib) {
      out <- as.vector(ncontributions/weighted_num)
      out[is.nan(out)] <- 0
      return(out)
    }
    return(as.vector(weighted_num > 0 &
                       ncontributions >= weighted_num * k / 100))
  }
}