R/dig_grid.R
In nuggets: Extensible Data Pattern Searching Framework

Documented in dig_grid

#' Search for grid-based rules
#'
#' @description
#' `r lifecycle::badge("experimental")`
#'
#' This function creates a grid column names specified
#' by `xvars` and `yvars` (see [var_grid()]). After that, it enumerates all
#' conditions created from data in `x` (by calling [dig()]) and for each such
#' condition and for each row of the grid of combinations, a user-defined
#' function `f` is executed on each sub-data created from `x` by selecting all
#' rows of `x` that satisfy the generated condition and by selecting the
#' columns in the grid's row.
#'
#' Function is useful for searching for patterns that are based on the
#' relationships between pairs of columns, such as in [dig_correlations()].
#'
#' @param x a matrix or data frame with data to search in.
#' @param f the callback function to be executed for each generated condition.
#'      The arguments of the callback function differ based on the value of the
#'      `type` argument (see below):
#'      \itemize{
#'      \item If `type = "crisp"` (that is, boolean),
#'      the callback function `f` must accept a single argument `pd` of type
#'      `data.frame` with single (if `yvars == NULL`) or two (if `yvars != NULL`)
#'      columns, accessible as `pd[[1]]` and `pd[[2]]`. Data frame `pd` is
#'      a subset of the original
#'      data frame `x` with all rows that satisfy the generated condition.
#'      Optionally, the callback function may accept an argument `nd` that
#'      is a subset of the original data frame `x` with all rows that do not
#'      satisfy the generated condition.
#'      \item If `type = "fuzzy"`, the callback function `f` must accept an argument
#'      `d` of type `data.frame` with single (if `yvars == NULL`) or two (if
#'      `yvars != NULL`) columns, accessible as `d[[1]]` and `d[[2]]`, and
#'      a numeric argument `weights` with the same length as the number of rows
#'      in `d`. The `weights` argument contains the truth degree
#'      of the generated condition for each row of `d`. The truth degree is
#'      a number in the interval \eqn{[0, 1]} that represents the degree of
#'      satisfaction of the condition in the original data row.
#'      }
#'      In all cases, the function must return a list of scalar values, which
#'      will be converted into a single row of result of final tibble.
#' @param condition a tidyselect expression (see
#'      [tidyselect syntax](https://tidyselect.r-lib.org/articles/syntax.html))
#'      specifying the columns to use as condition predicates. The selected
#'      columns must be logical or numeric. If numeric, fuzzy conditions are
#'      considered.
#' @param xvars a tidyselect expression (see
#'      [tidyselect syntax](https://tidyselect.r-lib.org/articles/syntax.html))
#'      specifying the columns of `x`, whose names will be used as a domain for
#'      combinations use at the first place (xvar)
#' @param yvars `NULL` or a tidyselect expression (see
#'      [tidyselect syntax](https://tidyselect.r-lib.org/articles/syntax.html))
#'      specifying the columns of `x`, whose names will be used as a domain for
#'      combinations use at the second place (yvar)
#' @param disjoint an atomic vector of size equal to the number of columns of `x`
#'      that specifies the groups of predicates: if some elements of the `disjoint`
#'      vector are equal, then the corresponding columns of `x` will NOT be
#'      present together in a single condition. If `x` is prepared with
#'      [partition()], using the [var_names()] function on `x`'s column names
#'      is a convenient way to create the `disjoint` vector.
#' @param allow a character string specifying which columns are allowed to be
#'      selected by `xvars` and `yvars` arguments. Possible values are:
#'      \itemize{
#'      \item `"all"` - all columns are allowed to be selected
#'      \item `"numeric"` - only numeric columns are allowed to be selected
#'      }
#' @param na_rm a logical value indicating whether to remove rows with missing
#'      values from sub-data before the callback function `f` is called
#' @param type a character string specifying the type of conditions to be processed.
#'      The `"crisp"` type accepts only logical columns as condition predicates.
#'      The `"fuzzy"` type accepts both logical and numeric columns as condition
#'      predicates where numeric data are in the interval \eqn{[0, 1]}. The
#'      callback function `f` differs based on the value of the `type` argument
#'      (see the description of `f` above).
#' @param min_length the minimum size (the minimum number of predicates) of the
#'      condition to be generated (must be greater or equal to 0). If 0, the empty
#'      condition is generated in the first place.
#' @param max_length the maximum size (the maximum number of predicates) of the
#'      condition to be generated. If equal to Inf, the maximum length of conditions
#'      is limited only by the number of available predicates.
#' @param min_support the minimum support of a condition to trigger the callback
#'      function for it. The support of the condition is the relative frequency
#'      of the condition in the dataset `x`. For logical data, it equals to the
#'      relative frequency of rows such that all condition predicates are TRUE on it.
#'      For numerical (double) input, the support is computed as the mean (over all
#'      rows) of multiplications of predicate values.
#' @param max_support the maximum support of a condition to trigger the callback
#'      function for it. See argument `min_support` for details of what is the
#'      support of a condition.
#' @param max_results the maximum number of generated conditions to execute the
#'      callback function on. If the number of found conditions exceeds
#'      `max_results`, the function stops generating new conditions and returns
#'      the results. To avoid long computations during the search, it is recommended
#'      to set `max_results` to a reasonable positive value. Setting `max_results`
#'      to `Inf` will generate all possible conditions.
#' @param verbose a logical scalar indicating whether to print progress messages.
#' @param threads the number of threads to use for parallel computation.
#' @param error_context a list of details to be used in error messages.
#'      This argument is useful when `dig_grid()` is called from another
#'      function to provide error messages, which refer to arguments of the
#'      calling function. The list must contain the following elements:
#'      \itemize{
#'      \item `arg_x` - the name of the argument `x` as a character string
#'      \item `arg_condition` - the name of the argument `condition` as a character
#'         string
#'      \item `arg_xvars` - the name of the argument `xvars` as a character string
#'      \item `arg_yvars` - the name of the argument `yvars` as a character string
#'      \item `call` - an environment in which to evaluate the error messages.
#'      }
#' @return A tibble with found patterns. Each row represents a single call of
#'      the callback function `f`.
#' @author Michal Burda
#' @seealso [dig()], [var_grid()]; see also [dig_correlations()] and
#'     [dig_paired_baseline_contrasts()], as they are using this function internally.
#' @examples
#' # *** Example of crisp (boolean) patterns:
#' # dichotomize iris$Species
#' crispIris <- partition(iris, Species)
#'
#' # a simple callback function that computes mean difference of `xvar` and `yvar`
#' f <- function(pd) {
#'     list(m = mean(pd[[1]] - pd[[2]]),
#'          n = nrow(pd))
#'     }
#'
#' # call f() for each condition created from column `Species`
#' dig_grid(crispIris,
#'          f,
#'          condition = starts_with("Species"),
#'          xvars = starts_with("Sepal"),
#'          yvars = starts_with("Petal"),
#'          type = "crisp")
#'
#' # *** Example of fuzzy patterns:
#' # create fuzzy sets from Sepal columns
#' fuzzyIris <- partition(iris,
#'                        starts_with("Sepal"),
#'                        .method = "triangle",
#'                        .breaks = 3)
#'
#' # a simple callback function that computes a weighted mean of a difference of
#' # `xvar` and `yvar`
#' f <- function(d, weights) {
#'     list(m = weighted.mean(d[[1]] - d[[2]], w = weights),
#'          w = sum(weights))
#' }
#'
#' # call f() for each fuzzy condition created from column fuzzy sets whose
#' # names start with "Sepal"
#' dig_grid(fuzzyIris,
#'          f,
#'          condition = starts_with("Sepal"),
#'          xvars = Petal.Length,
#'          yvars = Petal.Width,
#'          type = "fuzzy")
#' @export
dig_grid <- function(x,
                     f,
                     condition = where(is.logical),
                     xvars = where(is.numeric),
                     yvars = where(is.numeric),
                     disjoint = var_names(colnames(x)),
                     allow = "all",
                     na_rm = FALSE,
                     type = "crisp",
                     min_length = 0L,
                     max_length = Inf,
                     min_support = 0.0,
                     max_support = 1.0,
                     max_results = Inf,
                     verbose = FALSE,
                     threads = 1L,
                     error_context = list(arg_x = "x",
                                          arg_f = "f",
                                          arg_condition = "condition",
                                          arg_xvars = "xvars",
                                          arg_yvars = "yvars",
                                          arg_disjoint = "disjoint",
                                          arg_allow = "allow",
                                          arg_na_rm = "na_rm",
                                          arg_type = "type",
                                          arg_min_length = "min_length",
                                          arg_max_length = "max_length",
                                          arg_min_support = "min_support",
                                          arg_max_support = "max_support",
                                          arg_max_results = "max_results",
                                          arg_verbose = "verbose",
                                          arg_threads = "threads",
                                          call = current_env())) {
    .must_be_flag(na_rm,
                  arg = error_context$arg_na_rm,
                  call = error_context$call)
    .must_be_enum(type, c("crisp", "fuzzy"),
                  arg = error_context$arg_type,
                  call = error_context$call)

    if (type == "crisp") {
        .must_be_function(f,
                          required = c("pd"),
                          optional = c("nd"),
                          arg = error_context$arg_f,
                          call = error_context$call)
    } else {
        .must_be_function(f,
                          required = c("d", "weights"),
                          optional = NULL,
                          arg = error_context$arg_f,
                          call = error_context$call)
    }

    condition <- enquo(condition)

    cols <- .convert_data_to_list(x, error_context = error_context)
    .extract_cols(cols,
                  !!condition,
                  allow_numeric = (type == "fuzzy"),
                  allow_empty = TRUE,
                  error_context = list(arg_selection = error_context$arg_condition,
                                       call = error_context$call))

    xvars <- enquo(xvars)
    yvars <- enquo(yvars)
    grid <- var_grid(x,
                     !!xvars,
                     !!yvars,
                     allow = allow,
                     error_context = error_context)

    processF <- function(condition, support, result) {
        isnull <- sapply(result, is.null)
        result <- lapply(result[!isnull], as_tibble)
        result <- do.call(rbind, result)

        if (!is.null(result)) {
            cond <- format_condition(names(condition))
            gr <- grid[!isnull, ]
            result <- cbind(condition = rep(cond, nrow(gr)),
                            support = support,
                            gr,
                            result)
        }

        result
    }

    if (type == "fuzzy") {
        # fuzzy variant
        tempF1 <- function(condition, support, weights) {
            result <- apply(grid, 1, function(row) {
                dd <- x[, row, drop = FALSE]
                if (na_rm) {
                    dd <- na.omit(dd)
                    weights <- weights[attr(dd, "na.action")]
                }

                f(d = dd, weights = weights)
            })

            processF(condition, support, result)
        }
        callbackF <- tempF1
    } else if ("nd" %in% formalArgs(f)) {
        # crisp variant with nd
        tempF2 <- function(condition, support, indices) {
            pd <- x[indices, , drop = FALSE]
            nd <- x[!indices, , drop = FALSE]

            result <- apply(grid, MARGIN = 1, simplify = FALSE, FUN = function(row) {
                pdd <- pd[, row, drop = FALSE]
                ndd <- nd[, row, drop = FALSE]
                if (na_rm) {
                    pdd <- na.omit(pdd)
                    ndd <- na.omit(ndd)
                }

                f(pd = pdd, nd = ndd)
            })

            processF(condition, support, result)
        }
        callbackF <- tempF2
    } else {
        # crisp variant without nd
        tempF3 <- function(condition, support, indices) {
            pd <- x[indices, , drop = FALSE]

            result <- apply(grid, MARGIN = 1, simplify = FALSE, FUN = function(row) {
                pdd <- pd[, row, drop = FALSE]
                if (na_rm)
                    pdd <- na.omit(pdd)

                f(pd = pdd)
            })

            processF(condition, support, result)
        }
        callbackF <- tempF3
    }

    res <- dig(x = x,
               f = callbackF,
               condition = !!condition,
               disjoint = disjoint,
               min_length = min_length,
               max_length = max_length,
               min_support = min_support,
               max_support = max_support,
               max_results = max_results,
               verbose = verbose,
               threads = threads,
               error_context = list(arg_x = error_context$arg_x,
                                    arg_condition = error_context$arg_condition,
                                    arg_disjoint = error_context$arg_disjoint,
                                    arg_min_length = error_context$arg_min_length,
                                    arg_max_length = error_context$arg_max_length,
                                    arg_min_support = error_context$arg_min_support,
                                    arg_max_support = error_context$arg_max_support,
                                    arg_max_results = error_context$arg_max_results,
                                    arg_verbose = error_context$arg_verbose,
                                    arg_threads = error_context$arg_threads,
                                    call = error_context$call))

    res <- do.call(rbind, res)

    as_tibble(res)
}