discover | R Documentation
Finds all the minimal functional dependencies represented in a data frame.
discover(
  df,
  accuracy,
  digits = getOption("digits"),
  full_cache = TRUE,
  store_cache = TRUE,
  skip_bijections = FALSE,
  exclude = character(),
  exclude_class = character(),
  dependants = names(df),
  detset_limit = ncol(df) - 1L,
  progress = FALSE,
  progress_file = ""
)
df
a data.frame, the relation to evaluate.
accuracy
a numeric in (0, 1]: the accuracy threshold required in order to conclude a dependency.
digits
a positive integer, indicating how many significant digits are to be used for rounding numeric and complex variables before comparison.
full_cache
a logical, indicating whether to store information about how sets of attributes group the relation records (stripped partitions). Otherwise, only the number of groups is stored. Storing the stripped partition is expected to let the algorithm run more quickly, but might be inefficient for small data frames or small amounts of memory.
store_cache
a logical, indicating whether to keep cached information to use when finding dependencies for other dependants. This allows the algorithm to run more quickly by not having to re-calculate information, but takes up more memory.
skip_bijections
a logical, indicating whether to skip some dependency searches that are made redundant by discovered bijections between attributes. This can significantly speed up the search when accuracy is 1; see Details for why it should be avoided for approximate dependencies.
exclude
a character vector, containing names of attributes not to consider as members of determinant sets.
exclude_class
a character vector, indicating classes of attributes to not consider as members of determinant sets. Attributes are excluded if they inherit from any given class.
dependants
a character vector, containing names of all attributes for which to find the minimal functional dependencies in which they are the dependant. By default, this is all of the attribute names. A smaller set of attribute names reduces the amount of searching required, so can reduce the computation time if only some potential dependencies are of interest.
detset_limit |
an integer, indicating the largest determinant set size that should be searched for. By default, this is large enough to allow all possible determinant sets. See Details for comments about the effect on the result, and on the computation time. |
progress
a logical, for whether to display progress to the user during the dependency search.
progress_file
a scalar character or a connection, determining where progress information is written when progress is TRUE.
Column names for df
must be unique.
The algorithm used for finding dependencies is DFD. This searches for determinant sets for each dependent attribute (dependant) by traversing the powerset of the other (non-excluded) attributes in a depth-first manner.
The implementation for DFD differs a little from the algorithm presented in the original paper:
Some attributes, or attribute types, can be designated, ahead of time, as not being candidate members for determinant sets. This reduces the number of candidate determinant sets to be searched, saving time by not searching for determinant sets that the user would remove later anyway.
Attributes that have a single unique value, i.e. are constant, get attributed a single empty determinant set. In the standard DFD algorithm, they would be assigned all the other non-excluded attributes as length-one determinant sets. Assigning them the empty set distinguishes them as constant, allowing for special treatment at normalisation and later steps.
As was done in the original Python library, there is an extra case in seed generation for when there are no discovered maximal non-dependencies. In this case, we take all of the single-attribute nodes, then filter out by minimal dependencies as usual. This is equivalent to taking the empty set as the single maximal non-dependency.
There are three results when checking whether a candidate node is minimal/maximal. TRUE indicates the node is minimal/maximal, as usual. FALSE has been split into FALSE and NA. NA indicates that we cannot yet determine whether the node is minimal/maximal. FALSE indicates that we have determined that it is not minimal/maximal, and we can set its category as such. This is done by checking whether any of its adjacent subsets/supersets are dependencies/non-dependencies, rather than waiting until the adjacent subsets/supersets are exhausted when picking the next node to visit.
We do not yet keep hashmaps to manage subset/superset relationships, as described in Section 3.5 of the original paper.
skip_bijections
allows for additional optimisation for finding
functional dependencies when there are pairwise-equivalent attributes.
Missing values (NA) are treated as a normal value, with NA = NA being true, and x = NA being false for any non-NA value of x.
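This comparison rule can be checked by hand in base R, by grouping rows on the determinant (keeping NA as its own group) and testing that the dependant is constant within each group. The data frame below is illustrative, not part of the package:

```r
# Hand-check a -> b under discover's NA semantics: NA matches NA,
# so the two NA rows form one group, and b must be constant there.
df <- data.frame(a = c(NA, NA, 1), b = c("x", "x", "y"))
groups <- split(df$b, addNA(df$a))  # addNA() keeps NA as its own level
holds <- all(vapply(groups, function(g) length(unique(g)) == 1L, logical(1)))
holds  # TRUE: a -> b holds, because the NA rows agree on b
```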
Numerical/complex values, i.e. floating-point values, represent difficulties for stating functional dependencies. A fundamental condition for stating functional dependencies is that we can compare two values for the same variable, and they are equivalent or not equivalent.
Usually, this is done by checking they're equal – this is the approach used in discover – but we can use any comparison that is an equivalence relation.
However, checking floating-point values for equality is not simple. == is not appropriate, even when comparing non-calculated values we've read from a file, because how a given number is converted into a float can vary by computer architecture, meaning that two values can be considered equal on one computer, and not equal on another. This can happen even if they're both using 64-bit R, and even though all R platforms work with values conforming to the same standard (see double). For example, 8.54917750000000076227 and 8.54917749999999898591 are converted into different floating-point representations on x86, but the same representation on ARM, resulting in inequality and equality respectively.
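The two literals above differ by roughly one unit in the last place of a 64-bit double, which is why their parsed representations can agree on one platform and differ on another. A quick base R check of how close they are:

```r
x <- 8.54917750000000076227
y <- 8.54917749999999898591
# The decimal values differ by about one ULP of a double, so the parsed
# binary values are either identical or adjacent, depending on the
# platform's decimal-to-binary conversion.
abs(x - y) <= .Machine$double.eps * abs(x)  # TRUE on any IEEE 754 platform
```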
For this and other reasons, checking numerical/complex values for (near-)equality in R is usually done with all.equal. This determines values x and y to be equal if their absolute/relative difference is within some tolerance value. However, we cannot use this. Equivalence relations must be transitive: if we have values x, y, and z, and x is equivalent to both y and z, then y and z must also be equivalent. This tolerance-based equivalence is not transitive: it is reasonably straightforward to set up three values so that the outer values are far enough apart to be considered non-equivalent, but the middle value is close enough to be considered equivalent to both of them. Using this to determine functional dependencies, therefore, could easily result in a large number of inconsistencies.
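The non-transitivity is easy to demonstrate with all.equal's default relative tolerance of about 1.5e-8; the specific values below are illustrative:

```r
x <- 1
y <- 1 + 1e-8  # within tolerance of x
z <- 1 + 2e-8  # within tolerance of y, but not of x
isTRUE(all.equal(x, y))  # TRUE
isTRUE(all.equal(y, z))  # TRUE
isTRUE(all.equal(x, z))  # FALSE: tolerance-based equality isn't transitive
```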
This means we have no good option for comparing numerical/complex values as-is for equivalence, with consistent results across different machines, so we must treat them differently. We have three options:
Round/truncate the values, before comparison, to some low degree of precision;
Coerce the values to another class before passing them into discover;
Read values as characters if reading data from a file.
discover takes the first option, with a default number of significant digits low enough to ensure consistency across different machines. However, the user can also use any of these options when processing the data before passing it to discover. The third option, in particular, is recommended if reading data from a file.
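The rounding behind the first option corresponds to base R's signif(), which the digits argument controls. A small illustrative sketch of why it gives platform-consistent comparisons:

```r
x <- 1/3
y <- 1/3 + 1e-10  # differs from x only past the 9th significant digit
x == y                        # FALSE: distinct as doubles
signif(x, 7) == signif(y, 7)  # TRUE: equal after rounding to 7 digits
```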
Skipping bijections allows skipping redundant searches. For example, if the search discovers that A -> B and B -> A, then only one of those attributes is considered for the remainder of the search. Since the search time increases exponentially with the number of attributes considered, this can significantly speed up search times. At the moment, this is only done for bijections between single attributes, such as A <-> B; if A <-> {B, C}, nothing is skipped. Whether bijections are skipped doesn't affect which functional dependencies are present in the output, but it might affect their order.
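A single-attribute bijection can be checked directly in base R: two attributes are bijective when the mapping between their values is one-to-one in both directions. The helper below is illustrative, not part of the package:

```r
# code <-> label: each value of one pairs with exactly one value of the
# other, so a search need only consider one of them as a determinant.
bijective <- function(x, y) {
  pairs <- unique(data.frame(x, y))  # distinct value pairs
  anyDuplicated(pairs$x) == 0L && anyDuplicated(pairs$y) == 0L
}
bijective(c(1, 2, 3, 1), c("a", "b", "c", "a"))  # TRUE
bijective(c(1, 1, 2), c("a", "b", "b"))          # FALSE: 1 maps to both a and b
```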
Skipping bijections for approximate dependencies, i.e. when accuracy < 1
,
should be avoided: it can result in incorrect output, since an approximate
bijection doesn't imply equivalent approximate dependencies.
Setting detset_limit smaller than the largest-possible value has different behaviour for different search algorithms, but the result is always that discover(x, 1, detset_limit = n) is equivalent to doing a full search, fds <- discover(x, 1), then filtering by determinant set size post-hoc: fds[lengths(detset(fds)) <= n].
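The stated equivalence can be checked directly. This sketch assumes the package (here assumed to load as autodb, providing discover() and the detset() accessor) is installed:

```r
library(autodb)  # assumed package providing discover() and detset()
limited <- discover(CO2, 1, detset_limit = 1)
full <- discover(CO2, 1)
filtered <- full[lengths(detset(full)) <= 1]
# limited and filtered should contain the same dependencies
length(limited) == length(filtered)
```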
For DFD, the naive way to implement it is by removing determinant sets larger than the limit from the search tree for possible functional dependencies for each dependant. However, this usually results in the search taking much more time than without a limit.
For example, suppose we search for determinant sets for a dependant that has none (the dependant is the only key for df, for example). Using DFD, we begin with a single attribute, then add other attributes one-by-one, since every set gives a non-dependency. When we reach a maximum-size set, we can mark all subsets as also being non-dependencies.
With the default limit, there is only one maximum-size set, containing all of the available attributes. If there are n candidate attributes for determinants, the search finishes after visiting n sets.
With a smaller limit k, there are \binom{n}{k} maximum-size sets to explore. Since a DFD search adds or removes one attribute at each step, the search must take at least k - 2 + 2\binom{n}{k} steps, which is larger than n for all non-trivial cases 0 < k < n.
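Concretely, with n = 10 candidate determinant attributes, the default limit needs only n visits to maximum-size sets, while a naive smaller limit of k = 5 forces many more steps:

```r
n <- 10
k <- 5
choose(n, k)               # 252 maximum-size sets at the limit k
k - 2 + 2 * choose(n, k)   # 507 steps at minimum, far more than n = 10
```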
We therefore use a different approach: determinant sets above the size limit are not allowed to be candidate seeds for new search paths, and any discovered dependencies with a size above the limit are discarded at the end of the entire DFD search. This means that nodes for determinant sets above the size limit are only visited in order to determine maximality of non-dependencies within the size limit. It turns out to be rare that this results in a significant speed-up, but it never results in the search having to visit more nodes than it would without a size limit, so the average search time is never made worse.
A functional_dependency
object, containing the discovered
dependencies. The column names of df
are stored in the attrs
attribute, in order, to serve as a default priority order for the
attributes during normalisation.
Abedjan Z., Schulze P., Naumann F. (2014) DFD: efficient functional dependency discovery. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (CIKM '14). New York, U.S.A., 949–958.
# simple example
discover(ChickWeight, 1)
# example with spurious dependencies
discover(CO2, 1)
# exclude attributes that can't be determinants.
# in this case, the numeric attributes are now
# not determined by anything, because of repeat measurements
# with no variable to mark them as such.
discover(CO2, 1, exclude_class = "numeric")
# include only dependencies with dependants of interest.
discover(CO2, 1, dependants = c("Treatment", "uptake"))