| dig_associations | R Documentation |
Association rules identify conditions (antecedents) under which a specific feature (consequent) is present very often.
A => C
If condition A is satisfied, then the feature C is present very often.
university_edu & middle_age & IT_industry => high_income
Middle-aged people with a university education who work in the IT industry
very likely have a high income.
Antecedent A is usually a set of predicates, and consequent C is a single
predicate.
The following explanations rely on a function supp(I), which is defined for
a set I of predicates as the relative frequency of rows satisfying
all predicates from I. For logical data, supp(I) equals the relative
frequency of rows for which all predicates i_1, i_2, ..., i_n from I are TRUE.
For numerical (double) input, supp(I) is computed as the mean (over all rows)
of truth degrees of the formula i_1 AND i_2 AND ... AND i_n, where
AND is a triangular norm selected by the t_norm argument.
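The definition of supp(I) can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper supp(), the t_norms list, and the sample data are hypothetical, and only the label "goguen" is taken from the usage above, while "goedel" and "lukas" are assumed labels for the minimum and Łukasiewicz t-norms.

```r
# Hypothetical helpers for illustration; not part of the package API.
t_norms <- list(
  goedel = function(a, b) pmin(a, b),          # minimum t-norm
  goguen = function(a, b) a * b,               # product t-norm
  lukas  = function(a, b) pmax(0, a + b - 1)   # Lukasiewicz t-norm
)

# Mean over all rows of the truth degree of the conjunction of `preds`.
# Logical columns are treated as 0/1, so every t-norm reduces to plain AND.
# The neutral element 1 is used as the starting value, so an empty set of
# predicates has support 1.
supp <- function(data, preds, t_norm = "goguen") {
  cols <- as.data.frame(data)[preds]
  degrees <- Reduce(t_norms[[t_norm]], cols, rep(1, nrow(data)))
  mean(degrees)
}

# Made-up sample data
crisp <- data.frame(a = c(TRUE, TRUE, FALSE, TRUE),
                    b = c(TRUE, FALSE, TRUE, TRUE))
fuzzy <- data.frame(a = c(1, 0.5, 0.2, 0.8),
                    b = c(0.5, 0.5, 1, 0.5))

supp(crisp, c("a", "b"))            # 0.5: two of four rows satisfy both
supp(fuzzy, c("a", "b"), "goguen")  # 0.3375
supp(fuzzy, c("a", "b"), "goedel")  # 0.425
supp(fuzzy, c("a", "b"), "lukas")   # 0.25
```

On logical data all three t-norms give the same result; they differ only when truth degrees lie strictly between 0 and 1.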
Association rules are characterized with the following quality measures.
Length of a rule is the number of elements in the antecedent.
Coverage of a rule is equal to supp(A).
Consequent support of a rule is equal to supp({c}), where c is the consequent predicate.
Support of a rule is equal to supp(A ∪ {c}).
Confidence of a rule is the fraction supp(A ∪ {c}) / supp(A).
Lift of a rule is the ratio of its support to the expected support
assuming antecedent and consequent are independent, i.e.,
supp(A ∪ {c}) / (supp(A) * supp({c})).
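As a worked example of these measures, the following base R sketch evaluates the rule a & b => c on a small made-up logical data set. The data frame d and the helper supp() are hypothetical and serve only to illustrate the definitions; confidence is computed as rule support divided by coverage.

```r
# Made-up logical data: 8 rows, predicates a, b and c
d <- data.frame(
  a = c(TRUE, TRUE, TRUE,  FALSE, TRUE, FALSE, TRUE,  TRUE),
  b = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE,  TRUE,  FALSE),
  c = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Relative frequency of rows in which all listed predicates are TRUE
supp <- function(cols) mean(Reduce(`&`, d[cols], rep(TRUE, nrow(d))))

antecedent <- c("a", "b")
consequent <- "c"

rule_length  <- length(antecedent)                 # 2
coverage     <- supp(antecedent)                   # 0.5
conseq_supp  <- supp(consequent)                   # 0.5
rule_support <- supp(c(antecedent, consequent))    # 0.375
confidence   <- rule_support / coverage            # 0.75
lift         <- rule_support / (coverage * conseq_supp)  # 1.5
```

A lift above 1, as here, means the antecedent and consequent occur together more often than they would if they were independent.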
dig_associations(
  x,
  antecedent = everything(),
  consequent = everything(),
  disjoint = var_names(colnames(x)),
  excluded = NULL,
  min_length = 0L,
  max_length = Inf,
  min_coverage = 0,
  min_support = 0,
  min_confidence = 0,
  contingency_table = FALSE,
  measures = deprecated(),
  t_norm = "goguen",
  max_results = Inf,
  verbose = FALSE,
  threads = 1,
  error_context = list(
    arg_x = "x",
    arg_antecedent = "antecedent",
    arg_consequent = "consequent",
    arg_disjoint = "disjoint",
    arg_excluded = "excluded",
    arg_min_length = "min_length",
    arg_max_length = "max_length",
    arg_min_coverage = "min_coverage",
    arg_min_support = "min_support",
    arg_min_confidence = "min_confidence",
    arg_contingency_table = "contingency_table",
    arg_measures = "measures",
    arg_t_norm = "t_norm",
    arg_max_results = "max_results",
    arg_verbose = "verbose",
    arg_threads = "threads",
    call = current_env()
  )
)
x |
a matrix or data frame with data to search in. The matrix must be
numeric (double) or logical. If |
antecedent |
a tidyselect expression (see tidyselect syntax) specifying the columns to use in the antecedent (left) part of the rules |
consequent |
a tidyselect expression (see tidyselect syntax) specifying the columns to use in the consequent (right) part of the rules |
disjoint |
an atomic vector of size equal to the number of columns of |
excluded |
NULL or a list of character vectors, where each character vector contains the names of columns that must not appear together in a single antecedent. |
min_length |
the minimum length, i.e., the minimum number of predicates in the antecedent, of a rule to be generated. The value must be greater than or equal to 0. If 0, rules with an empty antecedent are generated first. |
max_length |
the maximum length, i.e., the maximum number of predicates in the antecedent, of a rule to be generated. If equal to Inf, the maximum length is limited only by the number of available predicates. |
min_coverage |
the minimum coverage of a rule in the dataset |
min_support |
the minimum support of a rule in the dataset |
min_confidence |
the minimum confidence of a rule in the dataset |
contingency_table |
a logical value indicating whether to provide a contingency
table for each rule. If |
measures |
(Deprecated. Search for associations using
|
t_norm |
a t-norm used to compute conjunction of weights. It must be one of
|
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical value indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
error_context |
a named list providing context for error messages.
This is mainly useful when
|
An S3 object that is an instance of the associations and nugget
classes. It is a tibble containing the found patterns and their computed quality measures.
Michal Burda
partition(), var_names(), dig()
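library(nuggets)

# Turn the columns of mtcars into predicates (two breaks per column)
d <- partition(mtcars, .breaks = 2)

# Search for rules whose consequent is one of the mpg predicates
dig_associations(d,
                 antecedent = !starts_with("mpg"),
                 consequent = starts_with("mpg"),
                 min_support = 0.3,
                 min_confidence = 0.8)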