R/SelectorMaybe.R

#' @title Selector-Combination that Selects According to Two Selectors
#'
#' @include Selector.R
#'
#' @name dict_selectors_maybe
#'
#' @description
#' [`Selector`] that wraps two other [`Selector`]s given during construction and uses both for selection proportionally.
#' Each of the resulting `n_select` individuals is chosen either from `$selector`, or from `$selector_not`.
#'
#' This makes it possible to implement selection methods such as random interleaving, where only a fraction of `p`
#' individuals were selected by a criterion, while the others are taken randomly.
#'
#' @section Algorithm:
#' To perform selection, `n_selector_in` rows of `values` are given to `$selector`, and the remaining
#' `nrow(values) - n_selector_in` rows are given to `$selector_not`. Both selectors are used to generate
#' a subset of selected individuals: `$selector` generates `n_selector_out` individuals, and `$selector_not` generates
#' `n_select - n_selector_out` individuals.
#'
#' `n_selector_in` is either set to
#' `round(nrow(values) * p_in)` when `proportion_in` is `"exact"`, or to `rbinom(1, nrow(values), p_in)` when `proportion_in` is `"random"`.
#'
#' `n_selector_out` is set to `round(n_select * p_out)` when `proportion_out` is `"exact"`, or to `rbinom(1, n_select, p_out)` when `proportion_out` is `"random"`.
#'
#' When `odds_correction` is `TRUE`, then `p_out` is adjusted depending on the used `n_selector_in` value before being applied. Let `odds(p) = p/(1-p)`.
#' Then the effective `p_out` is set such that `odds(effective p_out) = odds(p_out) * n_selector_in / (nrow(values) - n_selector_in) / odds(p_in)`.
#' This corrects for the discrepancy between the chosen `p_in` and the effective proportion of `n_selector_in / nrow(values)` caused either by rounding
#' errors or when `proportion_in` is `"random"`.
#'
#' When `p_in` is exactly 1 or exactly 0, and `p_out` is not equal to `p_in`, then an error is given.
#'
#' If `nrow(values)` is 1, then this individuum is returned and `$selector` / `$selector_not` are not called.
#'
#' If `try_unique` is `TRUE`, then `n_selector_out` is set to at most `n_selector_in` and at least `n_select - nrow(values) + n_selector_in`,
#' and an error is generated when `nrow(values)` is less than `n_select`.
#'
#' If `try_unique` is `FALSE` and `odds_correction` is `TRUE` and `n_selector_in` is either 0 or `nrow(values)`, then `$p_out` is set to either 0 or 1, respectively.
#'
#' If `try_unique` is `FALSE` and `odds_correction` is `FALSE` and `n_selector_in` is either 0 or `nrow(values)`, and `n_selector_out` is not equal
#' to 0 or `n_select`, respectively, then
#' `n_selector_in` is increased / decreased by 1 to give `$selector_not` / `$selector` at least one individuum to choose from. While this behaviour
#' may seem pathological, it is to ensure continuity with sampled values of `n_selector_in` that are close to 0 or `n_select`.
#'
#' If `n_selector_out` is `n_select` or 0, or if `n_selector_in` is `nrows(values) - 1` or 1,
#' then only `$selector` / `$selector_not` is executed, respectively; possibly with a subset
#' of `values` if `n_selector_in` differs from `nrow(values)` / 0.
#'
#' @section Configuration Parameters:
#' This operator has the configuration parameters of the [`Selector`]s that it wraps: The configuration parameters of the operator given to the `selector` construction argument
#' are prefixed with `"maybe."`, the configuration parameters of the operator given to the `selector_not` construction argument are prefixed with `"maybe_not."`.
#'
#' Additional configuration parameters:
#' * `p_in` :: `numeric(1)` \cr
#'   Probability per individual (when `random_choise` is `TRUE`), or fraction of individuals (when `random_choice` is `FALSE`),
#'   that are given to `$selector` instead of `$selector_not`. This may be overriden when `try_unique` is `TRUE`, in which
#'   case at least as many rows are given to `$selector` and `$selector_not` as they are generating output values respectively.
#'   When this is exactly 1 or exactly 0, then `p_out` must be equal to `p_in`.
#'   Must be set by the user.
#' * `p_out` :: `numeric(1)` \cr
#'   Probability per output value (when `random_choise` is `TRUE`), or fraction of output values (when `random_choice` is `FALSE`),
#'   that are generated by `$selector` instead of `$selector_not`. When this values is not given, it defaults to `p_in`.
#' * `shuffle_input` :: `logical(1)` \cr
#'   Whether to distribute input values randomly to `$selector` / `$selector_not`. If `FALSE`, then the first part of `values`
#'   is given to `$selector`. This only randomizes *which* lines of `values` are given to `$selector` / `$selector_not`, but it
#'   does not necessarily reorder the lines of values given to each. In particular, if `p_out` is 0 or 1, then no shuffling takes place. Initialized to `TRUE`.
#' * `proportion_in` :: `character(1)` \cr
#'   When set to `"random"`, sample the number of individuals given to `$selector` according to `rbinom(1, nrow(values), p_in)`.
#'   When set to `"exact"`, give `$selector` `round(nrow(values) * p_in)` individuals. Initialized to `"exact"`.
#' * `proportion_out` :: `character(1)` \cr
#'   When set to `"random"`, sample the number of individuals generated by `$selector` according to `rbinom(1, n_select, p_out)`.
#'   When set to `"exact"`, have `$selector` generate `round(n_select * p_out)` individuals.
#' * `odds_correction` :: `logical(1)`\cr
#'   When set, the effectively used value of `p_out` is set to
#'   `1 / (1 + ((nrow(values) - n_selector_in) * p_in * (1 - p_out)) / (n_selector_in * p_out * (1 - p_in)))`, see the **Algorithm** section.
#'   Initialized to `FALSE`.
#' * `try_unique` :: `logical(1)`\cr
#'   Whether to give at least as many rows of `values` to each of `$selector` and `$selector_not` as they are generating output
#'   values. This should be set to `TRUE` whenever `SelectorMaybe` is used to select unique values, and can be set to
#'   `FALSE` when selecting values multiple times is acceptable. When this is `TRUE`, then having `n_select > nrow(values)`
#'   generates an error. Initialized to `TRUE`.
#'
#' @templateVar id maybe
#' @templateVar additional , \<selector\> \[, \<selector_not\>\]
#' @template autoinfo_prepare_ftr
#'
#' @section Supported Operand Types:
#'
#' Supported [`Domain`][paradox::Domain] classes are the set intersection of supported classes of `selector` and `selector_not`.
#'
#' @template autoinfo_dict
#'
#' @family selectors
#' @family selector wrappers
#' @export
SelectorMaybe = R6Class("SelectorMaybe",
  inherit = Selector,
  public = list(
    #' @description
    #' Initialize the `SelectorMaybe` object.
    #' @param selector ([`Selector`])\cr
    #'   [`Selector`] to wrap. This operator gets run with probability / fraction `p_in` (Configuration parameter).\cr
    #'   The constructed object gets a *clone* of this argument.
    #'   The `$selector` field will reflect this value.
    #' @param selector_not ([`Selector`])\cr
    #'   Another [`Selector`] to wrap. This operator runs when `selector` is not chosen. By
    #'   default, this is [`SelectorRandom`], i.e. selecting randomly.\cr
    #'   The constructed object gets a *clone* of this argument.
    #'   The `$selector_not` field will reflect this value.
    initialize = function(selector, selector_not = SelectorRandom$new()) {
      private$.wrapped = assert_r6(selector, "Selector")$clone(deep = TRUE)
      private$.wrapped_not = assert_r6(selector_not, "Selector")$clone(deep = TRUE)

      if (!paradox_s3) {
        private$.wrapped$param_set$set_id = "maybe"
        private$.wrapped_not$param_set$set_id = "maybe_not"
      }
      private$.maybe_param_set = ps(
        p_in = p_dbl(0, 1, tags = "required"),
        p_out = p_dbl(0, 1),
        shuffle_input = p_lgl(tags = "required"),
        proportion_in = p_fct(c("random", "exact"), tags = "required"),
        proportion_out = p_fct(c("random", "exact"), tags = "required"),
        odds_correction = p_lgl(tags = "required"),
        try_unique = p_lgl(tags = "required")
      )
      private$.maybe_param_set$values = list(shuffle_input = TRUE,
        proportion_in = "exact", proportion_out = "exact", try_unique = TRUE)
      super$initialize(param_classes = intersect(selector$param_classes, selector_not$param_classes),
        param_set = alist(private$.maybe_param_set, maybe = private$.wrapped$param_set, maybe_not = private$.wrapped_not$param_set),
        supported = intersect(selector$supported, selector_not$supported),
        packages = c("stats", selector$packages, selector_not$packages), dict_entry = "maybe",
        own_param_set = quote(private$.maybe_param_set))
    },
    #' @description
    #' See [`MiesOperator`] method. Primes both this operator, as well as the wrapped operators
    #' given to `selector` and `selector_not` during construction.
    #' @param param_set ([`ParamSet`][paradox::ParamSet])\cr
    #'   Passed to [`MiesOperator`]`$prime()`.
    #' @return [invisible] `self`.
    prime = function(param_set) {
      private$.wrapped$prime(param_set)
      private$.wrapped_not$prime(param_set)
      super$prime(param_set)
      invisible(self)
    }
  ),
  active = list(
    #' @field selector ([`Selector`])\cr
    #' [`Selector`] being wrapped. This operator gets run with probability / proportion `p_in` and generates
    #' output with probability / proportion `p_out` (configuration parameters).
    selector = function(val) {
      if (!missing(val)) stop("mutator is read-only.")
      private$.wrapped
    },
    #' @field selector_not ([`Selector`])\cr
    #' Alternative [`Selector`] being wrapped. This operator gets run with probability / proportion `1 - p_in`
    #' and generates output with probability / proportion `1 - p_out` (configuration parameters).
    selector_not = function(val) {
      if (!missing(val)) stop("mutator_not is read-only.")
      private$.wrapped_not
    }
  ),
  private = list(
    .select = function(values, fitnesses, n_select, group_size) {
      params = private$.maybe_param_set$get_values()
      params$p_out = params$p_out %??% params$p_in

      ###########
      ## checks
      if (params$p_in %in% c(0, 1) && params$p_out != params$p_in) {
        stopf("When p_in is %s, then p_out must also be %s, but it is %s instead.", params$p_in, params$p_in, params$p_out)
      }

      if (params$try_unique && nrow(values) < n_select) {
        stopf("try_unique is TRUE, but nrow(values) (which is %s) is less than n_select (which is %s).", nrow(values), n_select)
      }

      ###########
      ## default returns
      if (nrow(values) == 1) return(rep(1L, n_select))

      ###########
      ## determine n_selector_in
      n_selector_in = switch(params$proportion_in,
        random = stats::rbinom(1, nrow(values), params$p_in),
        excact = round(nrow(values) * params$p_in)
      )
      n_selector_not_in = nrow(values) - n_selector_in

      ###########
      ## odds_correction
      if (params$odds_correction) {
        if (params$p_in == params$p_out || n_selector_in == 0 || n_selector_not_in == 0) {
          # catch mostly the p_in == 0 or 1 cases
          # If p_out were not equal to p_in in these cases we would throw an error above.
          # If they *are* equal, we would get a 0 / 0 division below.
          params$p_out = n_selector_in / nrow(values)
        } else {
          params$p_out = 1 /
            (1 +
              (n_selector_not_in * params$p_in * (1 - params$p_out)) /
              (n_selector_in * params$p_out * (1 - params$p_in))
            )
        }
      }

      ###########
      ## determine n_selector_out
      n_selector_out = switch(params$proportion_in,
        random = stats::rbinom(1, nrow(values), params$p_out),
        excact = round(nrow(values) * params$p_out)
      )

      ###########
      ## handle try_unique
      if (params$try_unique) {
        # make sure:
        # 1) n_selector_out >= n_selector_in
        # 2) n_select - n_selector_out >= nrow(values) - n_selector_in
        #   <=> n_selector_out <= n_select - nrow(values) + n_selector_in
        n_selector_out = min(max(n_selector_out, n_select - nrow(values) + n_selector_in), n_selector_in)
      } else if (n_selector_out %nin% c(0, n_select)) {  # can only happen when !odds_correction, sine otherwise p_out is 0 or 1 by now
        n_selector_in = min(max(n_selector_in, 1), nrow(values) - 1)
      }

      ###########
      ## determine group_size
      group_size = vectorize_group_size(group_size, n_select)

      if (params$proportion_out == "random") {
        # random: distribute inputs among groups
        # bitmap to represent all output positions, and whether they are given to selector 'maybe'
        #   [<group 1>|-group 2   >|-grp3>]
        #   [_____________________________] -- bitmap
        group_bitmap = logical(n_select)
        # distribution n_selector_out positions on that bitmap
        #   [<group 1>|-group 2   >|-grp3>]
        #   [______#_______#___#__________]
        group_bitmap[sample.int(n_select, n_selector_out, replace = FALSE)] = TRUE
        # for each position, compute how many go to 'maybe' selector at that position or before
        #   [<group 1>|-group 2   >|-grp3>]
        #   [______#_______#___#__________]
        #   [00000011111111222233333333333] -- take_le
        take_le = cumsum(group_bitmap)
        # index into this vector through cumsum of group size
        #   [<group 1>|-group 2   >|-grp3>]
        #   [______#_______#___#__________]
        #   [00000011111111222233333333333]
        #            ^            ^      ^
        group_index = cumsum(group_size)
        # this many are in the given group, or below
        #   [        1            3      3]
        group_le = take_le[group_index]
        # differentiate to get items in each group, starting with 0
        #   [ 1 2 0 ]
        group_size_out = diff(c(0, group_le))
      } else {
        # exact: distribute as equally as possible
        group_size_out = floor(group_size * n_selector_out / n_select)
        remainder = n_selector_out - sum(group_size_out)
        adding = sample.int(length(group_size), remainder, replace = FALSE)
        group_size_out[adding] = group_size_out[adding] + 1
        group_bitmap = unlist(Map(function(gs, gs_out) {
          bitmap = logical(gs)
          bitmap[sample.int(gs, gs_out, replace = FALSE)] = TRUE
          bitmap
        }, group_size, group_size_out))
      }
      group_size_not_out = group_size - group_size_out

      selector_in_positions = seq_len(n_selector_in)
      selector_not_in_positions = seq_len(nrow(values) - n_selector_in) + n_selector_in
      if (params$shuffle_input) {
        # sample a logical vector, so we can also negate
        s_in = sample.int(nrow(values)) <= n_selector_in
        selector_in_positions = s_in[selector_in_positions]
        selector_not_in_positions = s_in[selector_not_in_positions]
      }

      if (n_selector_out == 0) {
        selected_out = integer(0)
      } else if (n_selector_in == 1) {
        selected_out = rep(selector_in_positions, n_selector_out)
      } else {
        selected_out = selector_in_positions[private$.wrapped$operate(
          values[selector_in_positions],
          fitnesses[selector_in_positions, , drop = FALSE],
          n_selector_out,
          group_size_out[group_size_out != 0]
        )]
      }
      if (n_selector_out == n_select) {
        selected_not_out = integer(0)
      } else if (n_selector_in == nrow(values) - 1) {
        selected_not_out = rep(selector_not_in_positions, n_select - n_selector_out)
      } else {
        selected_not_out = selector_not_in_positions[private$.wrapped$operate(
          values[selector_not_in_positions],
          fitnesses[selector_not_in_positions, , drop = FALSE],
          n_select - n_selector_out,
          group_size_not_out[group_size_not_out != 0]
        )]
      }
      output = integer(n_select)
      output[group_bitmap] = selected_out
      output[!group_bitmap] = selected_not_out
      output
    },
    .wrapped = NULL,
    .wrapped_not = NULL,
    .maybe_param_set = NULL
  )
)
dict_selectors$add("maybe", SelectorMaybe)
mlr-org/miesmuschel documentation built on April 5, 2025, 6:08 p.m.