knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(nuggets)
library(dplyr)
library(ggplot2)
library(tidyr)
options(tibble.width = Inf)
Package nuggets searches for patterns that can be expressed as formulae in
the form of elementary conjunctions, referred to in this text as conditions.
Conditions are constructed from predicates, which correspond to data
columns. The interpretation of conditions depends on the choice of underlying
logic:
Crisp (Boolean) logic: each predicate takes the value TRUE (1) or FALSE (0). The truth value of a condition is computed according to the rules of classical Boolean algebra.
Fuzzy logic: each predicate is assigned a truth degree from the interval $[0, 1]$. The truth degree of a conjunction is then computed using a chosen triangular norm (t-norm). The package supports three common t-norms, which are defined for predicates' truth degrees $a, b \in [0, 1]$ as follows:
Gödel (minimum) t-norm: $\min(a, b)$
Goguen (product) t-norm: $a \cdot b$
Łukasiewicz t-norm: $\max(0, a + b - 1)$
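As an illustration, the three t-norms commonly used in this setting (Gödel minimum, Goguen product, and Łukasiewicz) can be sketched in base R. This assumes these are the t-norms meant here; the helper names tnorm_goedel, tnorm_goguen and tnorm_lukas are hypothetical and are not part of the package.

```r
# Truth degrees of two fuzzy predicates on four rows
a <- c(1.0, 0.8, 0.5, 0.3)
b <- c(1.0, 0.6, 0.5, 0.2)

# Hypothetical helper names, defined here only for illustration
tnorm_goedel <- function(a, b) pmin(a, b)          # minimum
tnorm_goguen <- function(a, b) a * b               # product
tnorm_lukas  <- function(a, b) pmax(0, a + b - 1)  # Lukasiewicz

conj <- data.frame(
  goedel = tnorm_goedel(a, b),
  goguen = tnorm_goguen(a, b),
  lukas  = tnorm_lukas(a, b)
)
print(conj)
```

On crisp inputs (0 or 1) all three coincide with the Boolean AND. They differ only for intermediate degrees, where the Łukasiewicz value is never larger than the product, and the product is never larger than the minimum.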
Before applying nuggets, data columns intended as predicates must be prepared
either by dichotomization (conversion into dummy logical variables) or by
transformation into fuzzy sets. The package provides functions for both
transformations. See the section Data Preparation below
for a quick overview, or the Data Preparation vignette
for a comprehensive guide.
nuggets implements functions to search for pre-defined types of patterns or
to discover patterns of user-defined type. For example, the package provides:
dig_associations() for association rules,
dig_baseline_contrasts(), dig_complement_contrasts(), and dig_paired_baseline_contrasts() for various contrast patterns on numeric variables,
dig_correlations() for conditional correlations.
To provide custom evaluation functions for conditions and to search for user-defined types of patterns, the package offers two general functions:
dig() is a general function for searching arbitrary pattern types.
dig_grid() is a wrapper around dig() for patterns defined by conditions and a pair of columns evaluated by a user-defined function.
See the section Pre-defined Patterns below for examples and details on using the pre-defined pattern discovery functions and the section Advanced Use for examples of custom pattern discovery.
Discovered rules and patterns can be post-processed, visualized, and explored interactively. That part is covered in the section Post-processing and Visualization below.
Before applying nuggets, data columns intended as predicates must be prepared
either by dichotomization (conversion into dummy variables) or by
transformation into fuzzy sets. The package provides the partition()
function for both transformations.
This section gives a quick overview of data preparation with nuggets. For
a detailed guide, including information about all available functions and
advanced techniques, please see the
Data Preparation Vignette.
For crisp patterns, numeric columns are transformed to logical (TRUE/FALSE)
columns. To show the process, we start with the built-in mtcars dataset,
which we first slightly modify by converting the cyl column to a factor:
# For demonstration, convert 'cyl' column of the mtcars dataset to a factor
mtcars <- mtcars |>
    mutate(cyl = factor(cyl,
                        levels = c(4, 6, 8),
                        labels = c("four", "six", "eight")))
head(mtcars, n = 3)
Now we can use the partition() function to transform all columns into crisp
predicates:
# Transform the whole dataset to crisp predicates
crisp_mtcars <- mtcars |>
    partition(cyl, vs:gear, .method = "dummy") |>
    partition(mpg, .method = "crisp", .breaks = c(-Inf, 15, 20, 30, Inf)) |>
    partition(disp:carb, .method = "crisp", .breaks = 3)
head(crisp_mtcars, n = 3)
As seen above, the "dummy" method can be used to create logical columns for
each category of processed variables. Here, it was applied to create dummy
variables for the factor variable cyl as well as for the numeric variables
vs, am, and gear.
The method "crisp" creates logical columns representing intervals for
numeric variables. In the example, it was used to create intervals for mpg
based on specified breakpoints (-Inf, 15, 20, 30, Inf), and for
disp, hp, drat, wt, qsec, and carb using equal-width intervals
(3 intervals each).
Now all columns are logical and can be used as predicates in crisp conditions.
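To make the idea of interval-based crisp predicates concrete, here is a minimal base-R sketch that uses cut() on the mpg column. It is not the package's implementation, and the column labels produced by the package will look different. The name crisp_mpg is hypothetical.

```r
cars <- datasets::mtcars

# Assign each mpg value to one of the intervals given by the breakpoints
breaks <- c(-Inf, 15, 20, 30, Inf)
interval <- cut(cars$mpg, breaks = breaks)

# One logical column per interval: TRUE where the row falls into it
crisp_mpg <- sapply(levels(interval), function(lev) interval == lev)
colnames(crisp_mpg) <- paste0("mpg=", levels(interval))

head(crisp_mpg, n = 3)

# Every row belongs to exactly one interval
table(rowSums(crisp_mpg))
```

Because the intervals do not overlap, the resulting predicates for one variable are mutually exclusive.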
Fuzzy predicates express the degree to which a condition is satisfied, with values in the interval $[0,1]$. This allows modeling of smooth transitions between categories:
# Start with fresh mtcars and transform to fuzzy predicates
fuzzy_mtcars <- mtcars |>
    partition(cyl, vs:gear, .method = "dummy") |>
    partition(mpg, .method = "triangle", .breaks = c(-Inf, 15, 20, 30, Inf)) |>
    partition(disp:carb, .method = "triangle", .breaks = 3)
head(fuzzy_mtcars, n = 3)
Similar to the crisp example, the "dummy" method creates logical columns for
categorical variables (cyl, vs, am, gear).
The "triangle" method creates fuzzy predicates with triangular membership
functions. For mpg, it uses specified breakpoints to define fuzzy intervals.
For the remaining numeric variables (disp through carb), it automatically
creates 3 overlapping fuzzy sets with smooth transitions between intervals.
Note that the cyl, vs, am, and gear columns are still represented by
dummy logical columns, while the numeric columns are now represented by fuzzy
sets. This combination allows both crisp and fuzzy predicates to be used
together in pattern discovery.
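A triangular membership function can be sketched in a few lines of base R. The sketch assumes that each fuzzy set is described by three consecutive breakpoints (left foot, peak, right foot) and that an infinite foot yields a flat shoulder. The helper triangle() is hypothetical and is not the package's implementation.

```r
# Hypothetical helper: membership degree of x in a triangular fuzzy set
triangle <- function(x, left, peak, right) {
  rising  <- if (is.infinite(left))  rep(1, length(x)) else (x - left) / (peak - left)
  falling <- if (is.infinite(right)) rep(1, length(x)) else (right - x) / (right - peak)
  pmax(0, pmin(1, rising, falling))
}

mpg <- c(10, 15, 17.5, 20, 25, 30, 35)
breaks <- c(-Inf, 15, 20, 30, Inf)

# One fuzzy set per triple of consecutive breakpoints
fuzzy_mpg <- sapply(1:(length(breaks) - 2), function(i) {
  triangle(mpg, breaks[i], breaks[i + 1], breaks[i + 2])
})
rownames(fuzzy_mpg) <- mpg
print(fuzzy_mpg)
```

Under these assumptions a value such as 17.5 belongs to the first and the second set with degree 0.5 each, and the degrees in every row sum to 1.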
The nuggets package provides powerful and flexible data preparation tools.
The Data Preparation vignette covers these capabilities
in depth, including:
Custom breakpoints for domain-specific intervals
Fuzzy partitioning for modeling gradual transitions and uncertainty:
Trapezoidal shapes using .span and .inc parameters for overlapping
fuzzy sets
Quality control utilities to improve pattern mining:
is_almost_constant() and remove_almost_constant() to identify and filter uninformative columns
dig_tautologies() to find always-true or almost-always-true rules that can be used to prune search spaces
Custom labels for predicates to make discovered patterns more interpretable
For example, you can use quantile-based partitioning to ensure balanced predicates, or use raised-cosine fuzzy sets with custom labels to create meaningful linguistic terms like "very_low", "low", "medium", "high", and "very_high". These preparation choices significantly impact the interpretability and usefulness of patterns discovered in subsequent analyses.
The package nuggets provides a set of functions for discovering some of the
best-known pattern types. These functions can process Boolean data, fuzzy data,
or both. Each function returns a tibble, where every row represents one detected
pattern.
Note: This section assumes that the data have already been preprocessed, i.e., transformed into a binarized or fuzzified form. See the previous section Data Preparation for details on how to prepare your dataset (for example, crisp_mtcars and fuzzy_mtcars).
For more advanced workflows — such as defining custom pattern types or computing user-defined measures — see the section Advanced Use.
Association rules identify conditions (antecedents) under which a specific feature (consequent) is present very often.
$$A \Rightarrow C$$
If condition A is satisfied, then the feature C tends to be present.
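For example, the rule
university_edu & middle_age & IT_industry => high_income
can be read as:
People with a university education who are middle-aged and work in the IT industry are very likely to have a high income.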
In practice, the antecedent A is a set of predicates, and the consequent C
is usually a single predicate.
For a set of predicates $I$, let $\text{supp}(I)$ denote the support, that is, the relative frequency (for logical data) or the mean truth degree (for fuzzy data) of rows satisfying all predicates in $I$. Using this notation, the following rule properties and quality measures may be defined:
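For a rule $A \Rightarrow C$, the usual textbook definitions are given below. They are assumed, not confirmed by this text, to match the measures the package reports; $C$ is treated as the set containing the consequent predicate.

```latex
\begin{align*}
\text{coverage}(A \Rightarrow C)   &= \text{supp}(A) \\
\text{support}(A \Rightarrow C)    &= \text{supp}(A \cup C) \\
\text{confidence}(A \Rightarrow C) &= \frac{\text{supp}(A \cup C)}{\text{supp}(A)} \\
\text{lift}(A \Rightarrow C)       &= \frac{\text{confidence}(A \Rightarrow C)}{\text{supp}(C)}
\end{align*}
```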
Rules with high support are frequent in the data. Rules with high confidence indicate a strong association between antecedent and consequent. Rules with lift greater than 1 suggest that satisfying the antecedent increases the likelihood of the consequent.
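These measures are easy to compute by hand on a small crisp example. The sketch below uses made-up data and assumes the usual definitions: coverage is the support of the antecedent, confidence is the rule support divided by the coverage, and lift is the confidence divided by the support of the consequent.

```r
# Toy crisp data: one antecedent predicate and one consequent predicate
ante <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
cons <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)

supp <- function(x) mean(x)   # relative frequency of TRUE rows

coverage   <- supp(ante)               # 5/8 = 0.625
support    <- supp(ante & cons)        # 4/8 = 0.5
confidence <- support / coverage       # 0.5 / 0.625 = 0.8
lift       <- confidence / supp(cons)  # 0.8 / 0.625 = 1.28

c(coverage = coverage, support = support,
  confidence = confidence, lift = lift)
```

Here the lift of 1.28 means the consequent is 28% more frequent among rows satisfying the antecedent than in the data overall.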
Before searching for rules, it is recommended to create a vector of disjoints, which specifies predicates that must not appear together in the same condition. This vector should have the same length as the number of dataset columns.
For example, columns representing gear=3 and gear=4 are mutually exclusive,
so their shared group label in disj prevents meaningless conditions like
gear=3 & gear=4. You can conveniently generate this vector with
var_names():
disj <- var_names(colnames(fuzzy_mtcars))
print(disj)
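To see what such a vector looks like, the sketch below derives group labels from predicate names with base R only. It assumes predicate names follow the name=value pattern seen in gear=3. The column names are made up for illustration, and the regular expression is a stand-in for var_names(), not its actual implementation.

```r
# Made-up predicate names in the "variable=value" style
predicates <- c("gear=3", "gear=4", "gear=5", "am=0", "am=1", "mpg=(15;20]")

# Keep only the part before the first "=": predicates sharing a label
# come from the same variable and should not be combined
disj_sketch <- sub("=.*$", "", predicates)
print(disj_sketch)

# Predicates that may not co-occur in one condition, grouped by label
split(predicates, disj_sketch)
```

The vector has one entry per column, in column order, which is why its length must equal the number of dataset columns.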
The dig_associations() function searches for association rules. Its main arguments are:
x: the data matrix or data frame (logical or numeric);
antecedent, consequent: tidyselect expressions selecting columns for each side of the rule;
disjoint: a vector defining mutually exclusive predicates;
thresholds such as min_support, min_confidence, min_coverage, and limits like min_length, max_length;
further options such as t_norm and contingency_table.
In the following example, we search for fuzzy association rules in the dataset fuzzy_mtcars, such that:
any column except those starting with "am" may appear in the antecedent;
only columns starting with "am" may appear in the consequent;
the minimum support is 0.02, i.e., 2% of data rows have to contain both the antecedent and consequent of the rule;
the minimum confidence is 0.8, i.e., the conditional probability of the consequent given the antecedent should be at least 80%;
the result includes the contingency table columns pp, pn, np and nn, which contain the counts (or sums of degrees) of rows satisfying antecedent & consequent (pp), antecedent & not consequent (pn), not antecedent & consequent (np), and not antecedent & not consequent (nn). These values are important for further computation of various additional interestingness measures.
result <- dig_associations(fuzzy_mtcars,
                           antecedent = !starts_with("am"),
                           consequent = starts_with("am"),
                           disjoint = disj,
                           min_support = 0.02,
                           min_confidence = 0.8,
                           contingency_table = TRUE)
The result is a tibble containing the discovered rules and their quality metrics. You can arrange them, for example, by decreasing support:
result <- arrange(result, desc(support))
print(result)
This example illustrates the typical workflow for mining association rules with
nuggets. The same structure and arguments apply when analyzing either fuzzy or
Boolean datasets.
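The contingency-table entries pp, pn, np and nn can be illustrated on fuzzy truth degrees with base R. This sketch assumes the product t-norm and the negation 1 - x. How the package itself computes the negated cells is not stated in this text, so treat this as one plausible reading, not as the package's algorithm.

```r
# Truth degrees of the antecedent and the consequent on four rows
ante <- c(1.0, 0.8, 0.5, 0.0)
cons <- c(1.0, 0.5, 1.0, 0.2)

# Sums of degrees for the four cells (product t-norm, negation 1 - x)
pp <- sum(ante * cons)              # antecedent & consequent
pn <- sum(ante * (1 - cons))        # antecedent & not consequent
np <- sum((1 - ante) * cons)        # not antecedent & consequent
nn <- sum((1 - ante) * (1 - cons))  # not antecedent & not consequent

n <- length(ante)
c(pp = pp, pn = pn, np = np, nn = nn)

# Measures derived from the cells
support    <- pp / n
confidence <- pp / (pp + pn)
c(support = support, confidence = confidence)
```

With this choice the four cells add up to the number of rows, just as counts do for crisp data, so measures defined on a crisp contingency table carry over unchanged.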
Conditional correlations identify strong relationships between pairs of numeric variables under specific conditions.
The dig_correlations() function searches for pairs of variables that are
significantly correlated within sub-data satisfying generated conditions. This
is useful for discovering context-dependent relationships.
In the following example, we search for correlations between different numeric
variables in the original mtcars data under conditions defined by the prepared
predicates in crisp_mtcars:
# Prepare combined dataset with both condition predicates and numeric variables
combined_mtcars <- cbind(crisp_mtcars, mtcars[, c("mpg", "disp", "hp", "wt")])
# Extend disjoint vector for the new numeric columns
disj_combined <- c(var_names(colnames(crisp_mtcars)),
                   c("mpg", "disp", "hp", "wt"))
# Search for conditional correlations
corr_result <- dig_correlations(combined_mtcars,
    condition = colnames(crisp_mtcars),
    xvars = c("mpg", "hp"),
    yvars = c("wt", "disp"),
    disjoint = disj_combined,
    min_length = 1,
    max_length = 2,
    min_support = 0.2,
    method = "pearson")
print(corr_result)
This example combines crisp predicates (from crisp_mtcars) with numeric
variables from the original mtcars dataset. The function searches for
conditions under which pairs of numeric variables show significant Pearson
correlations. The disjoint vector is extended to include the new numeric
columns, preventing conflicts in the search algorithm.
The result shows conditions under which specific pairs of variables exhibit strong correlations, along with correlation coefficients and p-values.
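What a single conditional correlation amounts to can be reproduced with base R for one hand-picked condition. The sketch below is only an illustration of the concept: it fixes the condition am == 1 and skips the systematic search over conditions that the package performs.

```r
cars <- datasets::mtcars

# Condition: cars with manual transmission (am == 1)
subset_rows <- cars$am == 1

# Share of rows satisfying the condition (its support)
support <- mean(subset_rows)

# Pearson correlation of mpg and wt within the sub-data
res <- cor.test(cars$mpg[subset_rows], cars$wt[subset_rows],
                method = "pearson")

c(support = support,
  estimate = unname(res$estimate),
  p_value = res$p.value)
```

A minimum support threshold such as 0.2 guards against conditions that select so few rows that the correlation estimate becomes unreliable.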
Contrast patterns identify conditions under which numeric variables show
statistically significant differences. The nuggets package provides several
functions for different types of contrasts.
Baseline contrasts identify conditions under which a variable is significantly different from a baseline value (typically zero) using a one-sample statistical test.
# Prepare combined dataset with predicates and numeric variables
combined_mtcars2 <- cbind(crisp_mtcars, mtcars[, c("mpg", "hp", "wt")])
# Extend disjoint vector for the new numeric columns
disj_combined2 <- c(var_names(colnames(crisp_mtcars)),
                    c("mpg", "hp", "wt"))
# Search for baseline contrasts
baseline_result <- dig_baseline_contrasts(combined_mtcars2,
    condition = colnames(crisp_mtcars),
    vars = c("mpg", "hp", "wt"),
    disjoint = disj_combined2,
    min_length = 1,
    max_length = 2,
    min_support = 0.2,
    method = "t")
head(baseline_result)
This example tests whether the mean of numeric variables (mpg, hp, wt)
significantly differs from zero under various conditions. The method = "t"
parameter specifies a t-test. The results show which combinations of
conditions lead to statistically significant deviations from the baseline.
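For a single hand-picked condition, the underlying test can be written directly with base R. This sketch assumes an ordinary one-sample t-test against zero and is not the package's code.

```r
cars <- datasets::mtcars
cond <- cars$am == 1   # condition: manual transmission

# One-sample t-test of mpg against the baseline value 0 within the sub-data
baseline <- t.test(cars$mpg[cond], mu = 0)

c(estimate = unname(baseline$estimate),
  p_value = baseline$p.value)
```

For strictly positive variables such as mpg, hp or wt, a contrast against zero is significant almost by construction. The test is most informative for variables that can take both signs, for example differences or residuals.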
Complement contrasts identify conditions under which a variable differs significantly between elements that satisfy the condition and those that don't.
complement_result <- dig_complement_contrasts(combined_mtcars2,
    condition = colnames(crisp_mtcars),
    vars = c("mpg", "hp", "wt"),
    disjoint = disj_combined2,
    min_length = 1,
    max_length = 2,
    min_support = 0.15,
    method = "t")
head(complement_result)
This example uses a two-sample t-test to compare the mean values of numeric variables between rows that satisfy a condition and rows that don't. The results identify conditions where subgroups have significantly different characteristics compared to the rest of the data.
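The comparison behind one complement contrast can be sketched in base R for a fixed condition. Note that t.test() performs Welch's unequal-variance test by default; whether the package's method = "t" uses the same variant is an assumption here.

```r
cars <- datasets::mtcars
cond <- cars$am == 1   # condition: manual transmission

# Two-sample (Welch) t-test: rows satisfying the condition vs. the rest
complement <- t.test(cars$mpg[cond], cars$mpg[!cond])

c(mean_in = mean(cars$mpg[cond]),
  mean_out = mean(cars$mpg[!cond]),
  p_value = complement$p.value)
```

In this example, cars with manual transmission have a clearly higher mean mpg than the remaining cars.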
Paired baseline contrasts identify conditions under which there is a significant difference between two paired numeric variables.
paired_result <- dig_paired_baseline_contrasts(combined_mtcars2,
    condition = colnames(crisp_mtcars),
    xvars = c("mpg", "hp"),
    yvars = c("wt", "wt"),
    disjoint = disj_combined2,
    min_length = 1,
    max_length = 2,
    min_support = 0.2,
    method = "t")
head(paired_result)
This example performs paired t-tests to compare two variables within the same
rows under specific conditions. Here, it tests whether mpg differs from wt
(and hp from wt) in various subgroups. This is useful for detecting
context-dependent relationships between paired measurements.
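A paired test for one fixed condition looks as follows in base R. It is a sketch of the statistical idea only and assumes a standard paired t-test on row-wise differences.

```r
cars <- datasets::mtcars
cond <- cars$am == 1   # condition: manual transmission

# Paired t-test: is the mean of the row-wise differences mpg - wt zero?
paired <- t.test(cars$mpg[cond], cars$wt[cond], paired = TRUE)

c(mean_difference = unname(paired$estimate),
  p_value = paired$p.value)
```

Since mpg and wt are measured in different units, the difference here is only a mechanical illustration. Paired contrasts are most meaningful for commensurable variables, such as before and after measurements of the same quantity.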
After discovering patterns with nuggets, you'll often want to manipulate, format, and visualize the results. The package provides several tools for these tasks.
The geom_diamond() function provides a specialized visualization for association
rules and their hierarchical structure. It displays rules as a lattice where
broader (more general) conditions appear above their descendants:
# Search for rules with various confidence levels for visualization
vis_rules <- dig_associations(fuzzy_mtcars,
    antecedent = starts_with(c("gear", "vs")),
    consequent = "am=1",
    disjoint = disj,
    min_support = 0,
    min_confidence = 0,
    min_length = 0,
    max_length = 3,
    max_results = 50)
print(vis_rules)
# Create diamond plot showing rule hierarchy
ggplot(vis_rules) +
    aes(condition = antecedent,
        fill = confidence,
        linewidth = confidence,
        size = support,
        label = paste0(antecedent, "\nconf: ", round(confidence, 2))) +
    geom_diamond(nudge_y = 0.25) +
    scale_x_discrete(expand = expansion(add = 0.5)) +
    scale_y_discrete(expand = expansion(add = 0.25)) +
    labs(title = "Association Rules Hierarchy",
         subtitle = "consequent: am=1")
This example creates a hierarchical visualization of association rules. The
geom_diamond() function arranges rules in a lattice structure where simpler
rules (with fewer predicates) appear at the top and more complex rules below.
Visual properties (fill color, edge width, node size) encode
rule quality measures, making it easy to identify the most interesting patterns.
The custom label combines each antecedent with its confidence value for better readability.
The additional scale adjustments (scale_x_discrete(), scale_y_discrete()) add padding around the plot.
The diamond plot helps identify:
The explore() function launches an interactive Shiny application for exploring
discovered patterns. This is particularly useful for association rules:
# Launch interactive explorer for association rules
rules <- dig_associations(fuzzy_mtcars,
                          antecedent = everything(),
                          consequent = everything(),
                          min_support = 0.05,
                          min_confidence = 0.7)
# Open interactive explorer
explore(rules, data = fuzzy_mtcars)
The interactive explorer provides:
For advanced workflows, the nuggets package allows users to define custom
pattern types and evaluation functions. This section demonstrates how to use
the general dig() function with custom callbacks and the specialized
dig_grid() wrapper.
The dig() function allows you to execute a user-defined callback function on
each generated frequent condition. This enables searching for custom pattern
types beyond the pre-defined functions.
The following example replicates the search for association rules using a custom callback function with the datasets prepared earlier:
# Define thresholds for custom association rules
min_support <- 0.02
min_confidence <- 0.8
# Define custom callback function
f <- function(condition, support, pp, pn) {
    # Calculate confidence for each focus (consequent)
    conf <- pp / support
    # Filter rules by confidence and support thresholds
    sel <- !is.na(conf) & conf >= min_confidence &
        !is.na(pp) & pp >= min_support
    conf <- conf[sel]
    supp <- pp[sel]
    # Return list of rules meeting criteria
    lapply(seq_along(conf), function(i) {
        list(antecedent = format_condition(names(condition)),
             consequent = names(conf)[[i]],
             support = supp[[i]],
             confidence = conf[[i]])
    })
}
# Search using custom callback
custom_result <- dig(fuzzy_mtcars,
                     f = f,
                     condition = !starts_with("am"),
                     focus = starts_with("am"),
                     disjoint = disj,
                     min_length = 1,
                     min_support = min_support)
# Flatten and format results
custom_result <- custom_result |>
    unlist(recursive = FALSE) |>
    lapply(as_tibble) |>
    do.call(rbind, args = _) |>
    arrange(desc(support))
print(custom_result)
The callback function f() receives information based on its argument names:
condition: vector of column indices forming the condition
support: relative frequency of the condition
pp, pn: contingency table entries
This approach gives you full control over pattern evaluation and filtering logic.
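The idea of passing a callback only what its argument names ask for can be sketched in base R. This illustrates the general mechanism and is not the package's code; call_with_requested() and the values in available are hypothetical.

```r
# Hypothetical dispatcher: call f with only the arguments it declares
call_with_requested <- function(f, available) {
  wanted <- intersect(names(formals(f)), names(available))
  do.call(f, available[wanted])
}

# Values a search procedure might have computed for one condition
available <- list(
  condition = c(gear_4 = 10L),
  support   = 0.375,
  pp        = c(am_1 = 0.25),
  pn        = c(am_1 = 0.125)
)

# This callback asks only for support and pp
f_small <- function(support, pp) pp / support
conf <- call_with_requested(f_small, available)
print(conf)
```

A design like this lets a search procedure skip computing quantities that the callback never requests.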
The dig_grid() function is useful for patterns based on relationships between
pairs of columns. It creates a grid of column combinations and evaluates a
user-defined function for each condition and column pair.
Here's an example that computes custom statistics for pairs of numeric variables:
# Define callback for grid-based patterns
grid_callback <- function(d, weights) {
    if (nrow(d) < 5) return(NULL)  # Skip if too few observations
    # Compute weighted correlation
    wcor <- cov.wt(d, wt = weights, cor = TRUE)$cor[1, 2]
    list(
        correlation = wcor,
        n_obs = sum(weights > 0.1),
        mean_x = weighted.mean(d[[1]], weights),
        mean_y = weighted.mean(d[[2]], weights)
    )
}
# Prepare combined dataset
combined_fuzzy <- cbind(fuzzy_mtcars, mtcars[, c("mpg", "hp", "wt")])
# Extend disjoint vector for new numeric columns
combined_disj3 <- c(var_names(colnames(fuzzy_mtcars)),
                    c("mpg", "hp", "wt"))
# Search using grid approach
grid_result <- dig_grid(combined_fuzzy,
                        f = grid_callback,
                        condition = colnames(fuzzy_mtcars),
                        xvars = c("mpg", "hp"),
                        yvars = c("wt"),
                        disjoint = combined_disj3,
                        type = "fuzzy",
                        min_length = 1,
                        max_length = 2,
                        min_support = 0.15,
                        max_results = 20)
# Display results
print(grid_result)
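The role of weights in a callback like the one above can be checked with base R alone. The sketch assumes that weights holds, for each row, the degree to which the condition is satisfied. The fuzzy weights below are made up from the hp column purely for illustration.

```r
cars <- datasets::mtcars
d <- cars[, c("mpg", "wt")]

# Crisp weights: 1 where the condition (am == 1) holds, 0 elsewhere
w_crisp <- as.numeric(cars$am == 1)
wcor_crisp <- cov.wt(d, wt = w_crisp, cor = TRUE)$cor[1, 2]

# Ordinary correlation on the rows that satisfy the condition
cor_subset <- cor(d$mpg[cars$am == 1], d$wt[cars$am == 1])

# Made-up fuzzy weights: partial membership lets rows contribute by degree
w_fuzzy <- pmin(1, pmax(0, (cars$hp - 50) / 200))
wcor_fuzzy <- cov.wt(d, wt = w_fuzzy, cor = TRUE)$cor[1, 2]

c(crisp = wcor_crisp, subset = cor_subset, fuzzy = wcor_fuzzy)
```

With crisp 0/1 weights the weighted correlation equals the ordinary correlation on the selected rows, so the fuzzy case is a direct generalization of subsetting.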
The dig_grid() function is particularly useful for:
This vignette has introduced the core functionality of the nuggets package for discovering patterns in data through systematic exploration of conditions. Key takeaways:
Data Preparation: Transform your data into predicates using partition().
Pre-defined Pattern Discovery: The package provides specialized functions for common pattern types:
dig_associations() finds association rules (A → C)
dig_correlations() discovers conditional correlations between variable pairs
dig_baseline_contrasts() identifies when variables deviate from baseline under conditions
dig_complement_contrasts() finds subgroups differing from the rest
dig_paired_baseline_contrasts() compares paired variables within contexts
Post-processing: Manipulate and visualize discovered patterns:
geom_diamond() for visualizing the hierarchy of association rules
explore() for interactive exploration in a Shiny application
Advanced Usage: Define custom pattern types:
dig() with custom callback functions for specialized analyses
dig_grid() for patterns based on variable pairs
See the function documentation (e.g., ?dig_associations) for detailed parameter descriptions, and use the interactive explorer (explore()) to gain insights into discovered patterns.
The nuggets package provides a flexible framework for pattern discovery that scales from simple association rule mining to complex custom pattern searches, all while supporting both crisp and fuzzy logic approaches.