knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(nuggets)
library(dplyr)
The nuggets package searches for patterns that can be described with formulae in
the form of elementary conjunctions, which are called conditions in this
text. The conditions are constructed from predicates, which represent data
columns. The user may select the interpretation of conditions by selecting the
underlying logic: nuggets supports the three most common t-norms, Goedel
(minimum), Goguen (product), and Lukasiewicz. Let $a, b \in [0, 1]$ be the
truth degrees of two predicates. The Goedel t-norm is defined as $\min(a, b)$,
the Goguen t-norm as $a \cdot b$, and the Lukasiewicz t-norm as
$\max(0, a + b - 1)$.

Before being analyzed by nuggets, the data columns that serve as predicates in
conditions have to be either dichotomized or transformed to fuzzy sets. The
package provides functions for both transformations. See the section
Data Preparation for more details.
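The three t-norms defined above can be sketched in a few lines of base R. This is only an illustration of the formulas; the helper names `tnorm_goedel()`, `tnorm_goguen()`, and `tnorm_lukasiewicz()` are hypothetical and are not part of nuggets.

```r
# Hypothetical helpers illustrating the three t-norms (not the nuggets API).
# All of them are vectorized over truth degrees from [0, 1].
tnorm_goedel <- function(a, b) pmin(a, b)                 # minimum
tnorm_goguen <- function(a, b) a * b                      # product
tnorm_lukasiewicz <- function(a, b) pmax(0, a + b - 1)    # bounded difference

a <- 0.7
b <- 0.6
tnorm_goedel(a, b)        # the smaller of the two degrees
tnorm_goguen(a, b)        # the product of the two degrees
tnorm_lukasiewicz(a, b)   # a + b - 1, but never below 0
```

For the same pair of truth degrees, the Goedel t-norm never yields a smaller value than the Goguen t-norm, which in turn never yields a smaller value than the Lukasiewicz t-norm.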
nuggets provides functions to search for patterns of pre-defined types, such
as dig_associations() for association rules, dig_paired_baseline_contrasts()
for contrast patterns on paired numeric variables, and dig_correlations()
for conditional correlations. See the section Pre-defined Patterns for more
details.
The user may also define a custom function to evaluate the conditions and
search for patterns of a different type. The dig() function is a general
function that searches for patterns of any type. The dig_grid() function is
a wrapper around dig() that searches for patterns defined by
conditions and a pair of columns, whose combination is evaluated by the
user-defined function. See the section Custom Patterns for more details.
For patterns based on crisp conditions, the data columns that serve as
predicates in conditions have to be transformed to logical (TRUE/FALSE)
data. Factors are converted to dummy logical columns and numeric columns are
split into intervals; both operations can be done with the help of the
partition() function. The partition() function requires the dataset as its
first argument and a tidyselect selection expression to select the columns to
be transformed.
Factors and logical columns are transformed to dummy logical columns.
For numeric columns, the partition() function requires the .method argument
to specify the method of partitioning. The "crisp" method divides the range of
values of the selected columns into intervals specified by the .breaks
argument and codes the values into dummy logical columns. The .breaks argument
is a numeric vector that specifies the border values of the intervals.
For example, consider the CO2 dataset from the datasets package:
head(CO2)
The Plant, Type, and Treatment columns are factors and they will be
transformed to dummy logical columns without any special arguments added to the
partition() function:
partition(CO2, Plant:Treatment)
The conc and uptake columns are numeric. For instance, we can split the
conc column into four intervals: (-Inf, 175], (175, 350], (350, 675], and
(675, Inf). The breaks are thus c(-Inf, 175, 350, 675, Inf).
partition(CO2, conc, .method = "crisp", .breaks = c(-Inf, 175, 350, 675, Inf))
Similarly, we can split the uptake column into three intervals: (-Inf, 10],
(10, 20], and (20, Inf) by specifying the breaks c(-Inf, 10, 20, Inf).
The transformation of the whole CO2 dataset to crisp predicates can be done as
follows:
crispCO2 <- CO2 |>
    partition(Plant:Treatment) |>
    partition(conc, .method = "crisp", .breaks = c(-Inf, 175, 350, 675, Inf)) |>
    partition(uptake, .method = "crisp", .breaks = c(-Inf, 10, 20, Inf))
head(crispCO2)
Each call to the partition() function returns a tibble data frame with the
selected columns transformed to dummy logical columns while the other columns
remain unchanged.
Each original factor column was replaced by a set of logical columns, whose
names consist of the original column name followed by the factor level name.
For example, the Type column, which is a factor with two levels Quebec and
Mississippi, was replaced by two logical columns: Type=Quebec and
Type=Mississippi. Numeric columns were replaced by logical columns with names
that indicate the interval to which the original value belongs. For example,
the conc column was replaced by four logical columns: conc=(-Inf,175],
conc=(175,350], conc=(350,675], and conc=(675,Inf).
Other columns were transformed similarly:
colnames(crispCO2)
Now all the columns are logical and can be used as predicates in crisp conditions.
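To make the idea behind the "crisp" method more concrete, the same kind of dichotomization can be sketched with base R only. This is an illustration of the concept, not a description of how partition() is implemented, and the column labels produced by cut() may differ in detail from those produced by partition().

```r
# Illustration only: dichotomize CO2$conc into interval-based dummy columns.
breaks <- c(-Inf, 175, 350, 675, Inf)

# cut() assigns each value to one right-closed interval
intervals <- cut(CO2$conc, breaks = breaks)

# one logical column per interval: TRUE where the value falls into it
dummies <- sapply(levels(intervals), function(lev) intervals == lev)
colnames(dummies) <- paste0("conc=", levels(intervals))

head(dummies)
```

Because the intervals do not overlap and together cover the whole real line, every row has exactly one TRUE among the resulting columns.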
For patterns based on fuzzy conditions, the data columns that serve as predicates in conditions have to be transformed to fuzzy predicates. A fuzzy predicate is represented by a vector of truth degrees from the interval $[0, 1]$. The truth degree is the degree to which the predicate is true, with 0 meaning that the predicate is false and 1 meaning that it is true; a value between 0 and 1 indicates partial truth.
In order to search for fuzzy patterns, the numeric input data columns therefore have to be transformed to such vectors of truth degrees. (Fuzzy methods can be used with logical columns too.)
The transformation to fuzzy predicates can be done again with the help of the
partition() function. Again, factors will be transformed to dummy logical
columns. On the other hand, numeric columns will be transformed to fuzzy
predicates. For that, the partition() function provides two fuzzy partitioning
methods: "triangle" and "raisedcos". The "triangle" method creates fuzzy
sets with triangular membership functions, while the "raisedcos" method creates
fuzzy sets with raised cosine membership functions.
More advanced fuzzy partitioning of numeric columns may be achieved with the
help of the lfl package, which provides tools for defining fuzzy sets of many
types, including fuzzy sets that model linguistic terms such as "very small",
"extremely big" and so on. See the lfl documentation for more information.
In the following example, both the conc and uptake columns are
transformed to fuzzy sets with triangular membership functions. For that, the
partition() function requires the .breaks argument to specify the shape
of fuzzy sets. For .method = "triangle", each consecutive triplet of values
in the .breaks vector specifies a single triangular fuzzy set: the first and
the last value of the triplet are the borders of the triangle, and the middle
value is the peak of the triangle.
For instance, the conc column's .breaks may be specified as
c(-Inf, 175, 350, 675, Inf), which creates three triangular fuzzy sets:
conc=(-Inf,175,350), conc=(175,350,675), and conc=(350,675,Inf).
Similarly, the uptake column's .breaks may be specified as
c(-Inf, 18, 28, 37, Inf).
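The triplet interpretation of .breaks described above can be illustrated with a small base R membership function. The helper `triangle_membership()` is hypothetical and not part of nuggets, and the treatment of an infinite border as a plateau at 1 is an assumption made for this sketch.

```r
# Hypothetical helper (not the nuggets API): truth degree of x in a
# triangular fuzzy set given by its left border, peak, and right border.
triangle_membership <- function(x, left, peak, right) {
  # Assumption: an infinite border yields a flat side with degree 1
  rising <- if (is.infinite(left)) rep(1, length(x)) else (x - left) / (peak - left)
  falling <- if (is.infinite(right)) rep(1, length(x)) else (right - x) / (right - peak)
  # clip the smaller of both slopes to the interval [0, 1]
  pmax(0, pmin(1, rising, falling))
}

x <- c(95, 175, 262.5, 350)
low <- triangle_membership(x, -Inf, 175, 350)   # first triplet of the breaks
mid <- triangle_membership(x, 175, 350, 675)    # second triplet of the breaks
rbind(x, low, mid)
```

In this sketch, the value 262.5 lies halfway between the peaks 175 and 350, so it belongs to both neighbouring fuzzy sets with degree 0.5.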
The transformation of the whole CO2 dataset to fuzzy predicates can be done as
follows:
fuzzyCO2 <- CO2 |>
    partition(Plant:Treatment) |>
    partition(conc, .method = "triangle", .breaks = c(-Inf, 175, 350, 675, Inf)) |>
    partition(uptake, .method = "triangle", .breaks = c(-Inf, 18, 28, 37, Inf))
head(fuzzyCO2)
colnames(fuzzyCO2)
nuggets provides a set of functions for searching for some of the best-known
pattern types. These functions can process Boolean data, fuzzy data, or both.
The result of these functions is always a tibble with patterns stored as rows.
For more advanced usage, such as searching for custom patterns or computing
user-defined measures and statistics, see the section Custom Patterns.
Association rules are rules of the form $A \Rightarrow B$, where $A$ is either a Boolean or a fuzzy condition in the form of a conjunction, and $B$ is a Boolean or fuzzy predicate.
Before continuing with the search for rules, it is advisable to create the so-called
vector of disjoints. The vector of disjoints is a character vector with the same
length as the number of columns in the analyzed dataset. It specifies predicates that
are mutually exclusive and should not be combined in a single pattern's condition:
columns with equal values in the disjoint vector will not appear together in a single
condition. Providing the vector of disjoints to the algorithm speeds up the search,
because it makes no sense, e.g., to combine Plant=Qn1 and Plant=Qn2 in a condition
Plant=Qn1 & Plant=Qn2, as such a formula is never true for any data row.
The vector of disjoints can be easily created from the column names of the dataset, e.g.,
by obtaining the first part of column names before the equal sign, which is neatly
provided by the var_names() function as follows:
disj <- var_names(colnames(fuzzyCO2))
print(disj)
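The idea of taking the part of each column name before the equal sign can be sketched in base R as follows. This only mirrors the behaviour described in the text; it is not the implementation of var_names(), and the column names below are a small hand-picked sample in the naming style shown earlier.

```r
# Illustration only: derive disjoint groups from predicate names by
# dropping everything from the first "=" onwards.
predicate_names <- c("Plant=Qn1", "Plant=Qn2", "Type=Quebec",
                     "Type=Mississippi", "conc=(-Inf,175,350)")

disjoint_groups <- sub("=.*$", "", predicate_names)
print(disjoint_groups)
```

Predicates that end up with the same group label, such as the two Plant columns, are the ones that should never be combined in a single condition.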
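The function dig_associations() takes the analyzed dataset as its first parameter and
a pair of tidyselect expressions to select the column names to appear
on the left-hand (antecedent) and right-hand (consequent) side of the rule. The following
command searches for association rules whose antecedent may contain any column except
those starting with "Treatment", whose consequent is a column starting with "Treatment",
and whose support and confidence are at least 0.02 and 0.8, respectively: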
result <- dig_associations(fuzzyCO2,
                           antecedent = !starts_with("Treatment"),
                           consequent = starts_with("Treatment"),
                           disjoint = disj,
                           min_support = 0.02,
                           min_confidence = 0.8)
The result is a tibble with the rules found. We may arrange it by support in descending order:
result <- arrange(result, desc(support))
print(result)
TBD (dig_correlations)
TBD (dig_contrasts)
The nuggets package allows the user to execute a user-defined callback function on each
generated frequent condition. That way, custom types of patterns may be searched for. The
following example replicates the search for association rules with a custom callback
function. For that, the dataset has to be transformed to predicates and the disjoint vector
created as in the Data Preparation section above:
head(fuzzyCO2)
print(disj)
As we want to search for association rules with some minimum support and confidence, we define variables to hold those thresholds. We also need to define a callback function that will be called for each frequent condition found. Its purpose is to generate the rules with the obtained condition as an antecedent:
min_support <- 0.02
min_confidence <- 0.8

f <- function(condition, support, foci_supports) {
    # confidence of each candidate rule: support of (condition & focus) / support of condition
    conf <- foci_supports / support
    # keep only the foci that pass both thresholds
    sel <- !is.na(conf) & conf >= min_confidence &
        !is.na(foci_supports) & foci_supports >= min_support
    conf <- conf[sel]
    supp <- foci_supports[sel]
    # return one list per rule
    lapply(seq_along(conf), function(i) {
        list(antecedent = format_condition(names(condition)),
             consequent = format_condition(names(conf)[[i]]),
             support = supp[[i]],
             confidence = conf[[i]])
    })
}
The callback function f() defines three arguments: condition, support, and foci_supports.
The names of the arguments are not arbitrary. Based on the argument names of the callback
function, the searching algorithm decides which information to pass to the function. Here,
condition is a vector of indices representing the conjunction of predicates in a condition.
By a predicate we mean a column in the source dataset. The support argument gets the
relative frequency of the condition in the dataset. foci_supports is a vector of supports
of special predicates, which we call "foci" (plural of "focus"), within the rows satisfying
the condition. For association rules, foci are potential rule consequents.
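The relation between support, foci_supports, and confidence used in the callback can be illustrated on a toy example in base R. The truth degrees below are hypothetical values chosen for illustration, and the sketch assumes that support is the mean truth degree over rows and that the conjunction is evaluated with the Goedel t-norm (minimum).

```r
# Hypothetical truth degrees of a condition and of one focus on four rows
condition_degrees <- c(1.0, 0.8, 0.5, 0.0)
focus_degrees <- c(1.0, 0.5, 1.0, 1.0)

# Assumption: support = mean truth degree across rows
support <- mean(condition_degrees)

# support of "condition AND focus" under the Goedel t-norm (minimum)
focus_support <- mean(pmin(condition_degrees, focus_degrees))

# confidence of the rule "condition => focus", as in the callback above
confidence <- focus_support / support
c(support = support, focus_support = focus_support, confidence = confidence)
```

With crisp (TRUE/FALSE) columns the same computation reduces to the usual relative frequencies, since the minimum of two logical values coincides with their conjunction.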
Now we can run the digging for rules:
result <- dig(fuzzyCO2,
              f = f,
              condition = !starts_with("Treatment"),
              focus = starts_with("Treatment"),
              disjoint = disj,
              min_length = 1,
              min_support = min_support)
As we return a list of lists in the callback function, we have to flatten the first level of lists in the result and bind it into a data frame:
result <- result |>
    unlist(recursive = FALSE) |>
    lapply(as_tibble) |>
    do.call(rbind, args = _) |>
    arrange(desc(support))
print(result)
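The flattening step can be tried in isolation on a small nested list in base R. The values below are hypothetical and only imitate the shape of what a callback like the one above returns: one list per condition, each holding zero or more rules.

```r
# Hypothetical nested result: one element per condition, each a list of rules
nested <- list(
  list(list(antecedent = "{a}", support = 0.30)),
  list(),                                             # a condition with no rules
  list(list(antecedent = "{b}", support = 0.20),
       list(antecedent = "{a,b}", support = 0.10))
)

# remove one level of nesting; empty elements disappear
flat <- unlist(nested, recursive = FALSE)

# turn each rule into a one-row data frame and stack the rows
rules <- do.call(rbind, lapply(flat, as.data.frame))
print(rules)
```

Using `recursive = FALSE` is what keeps each rule intact as a list; the default would flatten the rules themselves into a single vector.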