Nothing
#######################################################################
# nuggets: An R framework for exploration of patterns in data
# Copyright (C) 2025 Michal Burda
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
#######################################################################
#' Search for patterns of a custom type
#'
#' A general function for searching for patterns of a custom type. The function
#' allows selection of columns of `x` to be used as condition predicates. It
#' enumerates all possible conditions in the form of elementary conjunctions of
#' selected predicates, and for each condition executes a user-defined callback
#' function `f`. The callback is expected to perform some analysis and return an
#' object (often a list) representing a pattern or patterns related to the
#' condition. The results of all calls are returned as a list.
#'
#' The callback function `f` may accept a number of arguments (see `f` argument
#' description). The algorithm automatically provides condition-related
#' information to `f` based on which arguments are present.
#'
#' In addition to conditions, the function can evaluate *focus* predicates
#' (foci). Foci are specified separately and are tested within each generated
#' condition. Extra information about them is then passed to `f`.
#'
#' Restrictions may be imposed on generated conditions, such as:
#' - minimum and maximum condition length (`min_length`, `max_length`);
#' - minimum condition support (`min_support`);
#' - minimum focus support (`min_focus_support`), i.e. support of rows where
#' both the condition and the focus hold.
#'
#' @details
#' Let \eqn{P} be the set of condition predicates selected by `condition` and
#' \eqn{E} be the set of focus predicates selected by `focus`. The function
#' generates all possible conditions as elementary conjunctions of distinct
#' predicates from \eqn{P}. These conditions are filtered using `disjoint`,
#' `excluded`, `min_length`, `max_length`, `min_support`, and `max_support`.
#'
#' For each remaining condition, all foci from \eqn{E} are tested and filtered
#' using `min_focus_support` and `min_conditional_focus_support`. If at least
#' one focus remains (or if `filter_empty_foci = FALSE`), the callback `f` is
#' executed with details of the condition and foci. Results of all calls are
#' collected and returned as a list.
#'
#' Let \eqn{C} be a condition (\eqn{C \subseteq P}), \eqn{F} the set of
#' filtered foci (\eqn{F \subseteq E}), \eqn{R} the set of rows of `x`, and
#' \eqn{\mu_C(r)} the truth degree of condition \eqn{C} on row \eqn{r}. The
#' parameters passed to `f` are defined as:
#'
#' - `condition`: a named integer vector of column indices representing the
#' predicates of \eqn{C}. Names correspond to column names.
#'
#' - `sum`: a numeric scalar value of the number of rows satisfying \eqn{C} for
#' logical data, or the sum of truth degrees for fuzzy data,
#' \eqn{sum = \sum_{r \in R} \mu_C(r)}.
#'
#' - `support`: a numeric scalar value of relative frequency of rows satisfying \eqn{C},
#' \eqn{supp = sum / |R|}.
#'
#' - `pp`, `pn`, `np`, `nn`: a numeric vector of entries of a contingency table
#' for \eqn{C} and \eqn{F}, satisfying the Ruspini condition
#' \eqn{pp + pn + np + nn = |R|}.
#' The \eqn{i}-th elements of these vectors correspond to the \eqn{i}-th focus
#' \eqn{F_i} from \eqn{F} and are defined as:
#' * `pp[i]`: rows satisfying both \eqn{C} and \eqn{F_i},
#' \eqn{pp_i = \sum_{r \in R} \mu_{C \land F_i}(r)}.
#' * `pn[i]`: rows satisfying \eqn{C} but not \eqn{F_i},
#' \eqn{pn_i = \sum_{r \in R} \mu_C(r) - pp_i}.
#' * `np[i]`: rows satisfying \eqn{F_i} but not \eqn{C},
#' \eqn{np_i = \sum_{r \in R} \mu_{F_i}(r) - pp_i}.
#' * `nn[i]`: rows satisfying neither \eqn{C} nor \eqn{F_i},
#' \eqn{nn_i = |R| - (pp_i + pn_i + np_i)}.
#'
#' @param x A matrix or data frame. If a matrix, it must be numeric (double) or
#' logical. If a data frame, all columns must be numeric (double) or logical.
#' @param f A callback function executed for each generated condition. It may
#' declare any subset of the arguments listed below. The algorithm detects
#' which arguments are present and provides only those values to `f`. This
#' design allows the user to control both the amount of information received
#' and the computational cost, as some arguments are more expensive to
#' compute than others. The function `f` is expected to return an object
#' (typically a list) representing a pattern or patterns related to the
#' condition. The results of all calls of `f` are collected and returned as
#' a list. Possible arguments are: `condition`, `sum`, `support`, `indices`,
#' `weights`, `pp`, `pn`, `np`, `nn`, or `foci_supports` (deprecated), which
#' are thoroughly described below in the "Details" section.
#' @param condition tidyselect expression (see
#' [tidyselect syntax](https://tidyselect.r-lib.org/articles/syntax.html))
#' specifying columns of `x` to use as condition predicates
#' @param focus tidyselect expression (see
#' [tidyselect syntax](https://tidyselect.r-lib.org/articles/syntax.html))
#' specifying columns of `x` to use as focus predicates
#' @param disjoint An atomic vector (length = number of columns in `x`) defining
#' groups of predicates. Columns in the same group cannot appear together in
#' a condition. With data from [partition()], use [var_names()] on column
#' names to construct `disjoint`.
#' @param excluded `NULL` or a list of character vectors, each representing an
#' implication formula. In each vector, all but the last element form the
#' antecedent and the last element is the consequent. These formulae are
#' treated as *tautologies* and used to filter out generated conditions. If
#' a condition contains both the antecedent and the consequent of any such
#' formula, it is not passed to the callback function `f`. Likewise, if a
#' condition contains the antecedent, the corresponding focus (the consequent)
#' is not passed to `f`.
#' @param min_length Minimum number of predicates in a condition required to
#' trigger the callback `f`. Must be \eqn{\ge 0}. If set to 0, the empty
#' condition also triggers the callback.
#' @param max_length Maximum number of predicates allowed in a condition.
#' Conditions longer than `max_length` are not generated. If `Inf`, the only
#' limit is the total number of available predicates. Must be \eqn{\ge 0} and
#' \eqn{\ge min_length}. This setting strongly influences both the number of
#' generated conditions and the speed of the search.
#' @param min_support Minimum support of a condition required to trigger `f`.
#' Support is the relative frequency of the condition in `x`. For logical
#' data, this is the proportion of rows where all condition predicates are
#' `TRUE`. For numeric (double) data, support is the mean (over all rows) of
#' the products of predicate values. Must be in \eqn{[0,1]}. If a condition’s
#' support falls below `min_support`, recursive generation of its extensions
#' is stopped. Thus, `min_support` directly affects search speed and the
#' number of callback calls.
#' @param min_focus_support Minimum support of a focus required for it to be
#' passed to `f`. For logical data, this is the proportion of rows where both
#' the condition and the focus are `TRUE`. For numeric (double) data, support
#' is computed as the mean (over all rows) of a t-norm of predicate values
#' (the t-norm is selected by `t_norm`). Must be in \eqn{[0,1]}. Foci with
#' support below this threshold are excluded. Together with
#' `filter_empty_foci`, this parameter influences both search speed and the
#' number of triggered calls of `f`.
#' @param min_conditional_focus_support Minimum conditional support of a focus
#' within a condition. Defined as the relative frequency of rows where the
#' focus is `TRUE` among those where the condition is `TRUE`. If \eqn{sum}
#' (see `support` in *Details*) is the number of rows (or sum of truth
#' degrees for fuzzy data) satisfying the condition, and \eqn{pp} (see
#' `pp[i]` in *Details*) is the sum of truth degrees where both the condition
#' and the focus hold, then conditional support is \eqn{pp / sum}. Must be in
#' \eqn{[0,1]}. Foci below this threshold are not passed to `f`. Together with
#' `filter_empty_foci`, this parameter influences search speed and the number
#' of callback calls.
#' @param max_support Maximum support of a condition to trigger `f`. Conditions
#' with support above this threshold are skipped, but recursive generation of
#' their supersets continues. Must be in \eqn{[0,1]}.
#' @param filter_empty_foci Logical; controls whether `f` is triggered for
#' conditions with no remaining foci after filtering by `min_focus_support`
#' or `min_conditional_focus_support`. If `TRUE`, `f` is called only when at
#' least one focus remains. If `FALSE`, `f` is called regardless.
#' @param t_norm T-norm used for conjunction of weights: `"goedel"` (minimum),
#' `"goguen"` (product), or `"lukas"` (Łukasiewicz).
#' @param max_results Maximum number of results (objects returned by the
#' callback `f`) to store and return in the output list. When this limit
#' is reached, generation of further conditions stops. Use a positive
#' integer to enable early stopping; set to `Inf` to remove the cap.
#' @param verbose Logical; if `TRUE`, print progress messages.
#' @param threads Number of threads for parallel computation.
#' @param error_context A list of details to be used when constructing error
#' messages. This is mainly useful when `dig()` is called from another
#' function and errors should refer to the caller’s argument names rather
#' than those of `dig()`. The list must contain:
#' \itemize{
#' \item `arg_x` – name of the argument `x` as a character string
#' \item `arg_f` – name of the argument `f` as a character string
#' \item `arg_condition` – name of the argument `condition`
#' \item `arg_focus` – name of the argument `focus`
#' \item `arg_disjoint` – name of the argument `disjoint`
#' \item `arg_excluded` – name of the argument `excluded`
#' \item `arg_min_length` – name of the argument `min_length`
#' \item `arg_max_length` – name of the argument `max_length`
#' \item `arg_min_support` – name of the argument `min_support`
#' \item `arg_min_focus_support` – name of the argument
#' `min_focus_support`
#' \item `arg_min_conditional_focus_support` – name of the argument
#' `min_conditional_focus_support`
#' \item `arg_max_support` – name of the argument `max_support`
#' \item `arg_filter_empty_foci` – name of the argument `filter_empty_foci`
#' \item `arg_t_norm` – name of the argument `t_norm`
#' \item `arg_threads` – name of the argument `threads`
#' \item `call` – environment in which to evaluate error messages
#' }
#' @returns A list of results returned by the callback function `f`.
#' @seealso [partition()], [var_names()], [dig_grid()]
#' @author Michal Burda
#'
#' @examples
#' library(tibble)
#'
#' # Prepare iris data
#' d <- partition(iris, .breaks = 2)
#'
#' # Simple callback: return formatted condition names
#' dig(x = d,
#' f = function(condition) format_condition(names(condition)),
#' min_support = 0.5)
#'
#' # Callback returning condition and support
#' res <- dig(x = d,
#' f = function(condition, support) {
#' list(condition = format_condition(names(condition)),
#' support = support)
#' },
#' min_support = 0.5)
#' do.call(rbind, lapply(res, as_tibble))
#'
#' # Within each condition, evaluate also supports of columns starting with
#' # "Species"
#' res <- dig(x = d,
#' f = function(condition, support, pp) {
#' c(list(condition = format_condition(names(condition))),
#' list(condition_support = support),
#' as.list(pp / nrow(d)))
#' },
#' condition = !starts_with("Species"),
#' focus = starts_with("Species"),
#' min_support = 0.5,
#' min_focus_support = 0)
#' do.call(rbind, lapply(res, as_tibble))
#'
#' # Multiple patterns per condition based on foci
#' res <- dig(x = d,
#' f = function(condition, support, pp) {
#' lapply(seq_along(pp), function(i) {
#' list(condition = format_condition(names(condition)),
#' condition_support = support,
#' focus = names(pp)[i],
#' focus_support = pp[[i]] / nrow(d))
#' })
#' },
#' condition = !starts_with("Species"),
#' focus = starts_with("Species"),
#' min_support = 0.5,
#' min_focus_support = 0)
#'
#' # Flatten result and convert to tibble
#' res <- unlist(res, recursive = FALSE)
#' do.call(rbind, lapply(res, as_tibble))
#'
#' @export
dig <- function(x,
f,
condition = everything(),
focus = NULL,
disjoint = var_names(colnames(x)),
excluded = NULL,
min_length = 0,
max_length = Inf,
min_support = 0.0,
min_focus_support = 0.0,
min_conditional_focus_support = 0.0,
max_support = 1.0,
filter_empty_foci = FALSE,
t_norm = "goguen",
max_results = Inf,
verbose = FALSE,
threads = 1L,
error_context = list(arg_x = "x",
arg_f = "f",
arg_condition = "condition",
arg_focus = "focus",
arg_disjoint = "disjoint",
arg_excluded = "excluded",
arg_min_length = "min_length",
arg_max_length = "max_length",
arg_min_support = "min_support",
arg_min_focus_support = "min_focus_support",
arg_min_conditional_focus_support = "min_conditional_focus_support",
arg_max_support = "max_support",
arg_filter_empty_foci = "filter_empty_foci",
arg_t_norm = "t_norm",
arg_max_results = "max_results",
arg_verbose = "verbose",
arg_threads = "threads",
call = current_env())) {
cols <- .convert_data_to_list(x,
error_context = error_context)
condition <- enquo(condition)
focus <- enquo(focus)
condition_cols <- .extract_cols(cols,
!!condition,
allow_numeric = TRUE,
allow_empty = TRUE,
error_context = list(arg_selection = error_context$arg_condition,
call = error_context$call))
foci_cols <- .extract_cols(cols,
!!focus,
allow_numeric = TRUE,
allow_empty = TRUE,
error_context = list(arg_selection = error_context$arg_focus,
call = error_context$call))
.must_be_function(f,
required = NULL,
optional = c("condition", "foci_supports",
"pp", "np", "pn", "nn",
"indices", "sum", "support", "weights"),
arg = error_context$arg_f,
call = error_context$call)
arguments <- formalArgs(f)
if (is.null(arguments)) {
arguments <- ""
}
fun <- function(l) {
do.call(f, l)
}
.must_be_vector_or_factor(disjoint,
null = TRUE,
arg = error_context$arg_disjoint,
call = error_context$call)
if (!isTRUE(length(disjoint) == 0 || length(disjoint) == ncol(x))) {
cli_abort(c("The length of {.arg {error_context$arg_disjoint}} must be 0 or must be equal to the number of columns in {.arg {error_context$arg_x}}.",
"x" = "The number of columns in {.arg {error_context$arg_x}} is {ncol(x)}.",
"x" = "The length of {.arg {error_context$arg_disjoint}} is {length(disjoint)}."),
call = error_context$call)
}
orig_disjoint <- disjoint
if (length(disjoint) > 0) {
disjoint <- as.integer(as.factor(disjoint))
} else {
disjoint <- seq_along(cols)
}
.must_be_list_of_characters(excluded,
null = TRUE,
arg = error_context$arg_excluded,
call = error_context$call)
orig_excluded <- excluded
if (is.null(excluded)) {
excluded <- list()
} else {
excluded_predicates <- unique(unlist(excluded))
excluded_undefined <- setdiff(excluded_predicates, colnames(x))
if (length(excluded_undefined) > 0) {
details <- paste0("Column {.var ", excluded_undefined, "} can't be found.")
cli_abort(c("Can't find some column names in {.arg {error_context$arg_x}} that correspond to all predicates in {.arg {error_context$arg_excluded}}.",
"i" = "Consider using {.fn remove_ill_conditions()} to remove conditions with undefined predicates.",
..error_details(details)))
}
excluded <- lapply(excluded,
fmatch,
colnames(x))
}
.must_be_integerish_scalar(min_length,
arg = error_context$arg_min_length,
call = error_context$call)
.must_be_finite(min_length,
arg = error_context$arg_min_length,
call = error_context$call)
.must_be_greater_eq(min_length, 0,
arg = error_context$arg_min_length,
call = error_context$call)
min_length <- as.integer(min_length)
.must_be_integerish_scalar(max_length,
arg = error_context$arg_max_length,
call = error_context$call)
.must_be_greater_eq(max_length, 0,
arg = error_context$arg_max_length,
call = error_context$call)
if (max_length < min_length) {
cli_abort(c("{.arg {error_context$arg_max_length}} must be greater or equal to {.arg {error_context$arg_min_length}}.",
"x" = "{.arg {error_context$arg_min_length}} equals {min_length}.",
"x" = "{.arg {error_context$arg_max_length}} equals {max_length}."),
call = error_context$call)
}
if (!is.finite(max_length)) {
max_length <- -1L;
}
max_length <- as.integer(max_length)
.must_be_double_scalar(min_support,
arg = error_context$arg_min_support,
call = error_context$call)
.must_be_in_range(min_support, c(0, 1),
arg = error_context$arg_min_support,
call = error_context$call)
min_support <- as.double(min_support)
.must_be_double_scalar(min_focus_support,
arg = error_context$arg_min_focus_support,
call = error_context$call)
.must_be_in_range(min_focus_support, c(0, 1),
arg = error_context$arg_min_focus_support,
call = error_context$call)
min_focus_support <- as.double(min_focus_support)
.must_be_double_scalar(min_conditional_focus_support,
arg = error_context$arg_min_conditional_focus_support,
call = error_context$call)
.must_be_in_range(min_conditional_focus_support, c(0, 1),
arg = error_context$arg_min_conditional_focus_support,
call = error_context$call)
min_conditional_focus_support <- as.double(min_conditional_focus_support)
.must_be_double_scalar(max_support,
arg = error_context$arg_max_support,
call = error_context$call)
.must_be_in_range(max_support, c(0, 1),
arg = error_context$arg_max_support,
call = error_context$call)
max_support <- as.double(max_support)
.must_be_flag(filter_empty_foci,
arg = error_context$arg_filter_empty_foci,
call = error_context$call)
.must_be_enum(t_norm, c("goguen", "goedel", "lukas"),
arg = error_context$arg_t_norm,
call = error_context$call)
.must_be_integerish_scalar(max_results,
arg = error_context$arg_max_results,
call = error_context$call)
.must_be_greater_eq(max_results, 1,
arg = error_context$arg_max_results,
call = error_context$call)
if (is.finite(max_results)) {
max_results <- as.integer(max_results)
} else {
max_results <- -1L
}
.must_be_flag(verbose,
arg = error_context$arg_verbose,
call = error_context$call)
.must_be_integerish_scalar(threads,
arg = error_context$arg_threads,
call = error_context$call)
.must_be_greater_eq(threads, 1,
arg = error_context$arg_threads,
call = error_context$call)
threads <- as.integer(threads)
config <- list(nrow = nrow(x),
arguments = arguments,
disjoint = disjoint,
excluded = excluded,
minLength = min_length,
maxLength = max_length,
minSupport = min_support,
minFocusSupport = min_focus_support,
minConditionalFocusSupport = min_conditional_focus_support,
maxSupport = max_support,
filterEmptyFoci = filter_empty_foci,
tNorm = t_norm,
maxResults = max_results,
verbose = verbose,
threads = threads)
res <- dig_(cols,
names(cols),
condition_cols$selected,
foci_cols$selected,
fun,
config)
nugget(res,
flavour = NULL,
call_function = "dig",
call_data = list(nrow = nrow(x),
ncol = ncol(x),
colnames = as.character(colnames(x))),
call_args = list(x = deparse(substitute(x)),
condition = names(cols)[condition_cols$selected],
focus = names(cols)[foci_cols$selected],
disjoint = orig_disjoint,
excluded = orig_excluded,
min_length = min_length,
max_length = if (max_length < 0) Inf else max_length,
min_support = min_support,
min_focus_support = min_focus_support,
min_conditional_focus_support = min_conditional_focus_support,
max_support = max_support,
filter_empty_foci = filter_empty_foci,
t_norm = t_norm,
max_results = max_results,
verbose = verbose,
threads = threads))
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.