knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(errorlocate)
Errorlocate uses the linear, categorical and conditional rules from a rules set formulated with R package validate
, to create a Mixed Integer Problem.
For most users the details of the translation are not relevant and hidden in locate_errors
.
Often the number of errors found and the processing time are much more relevant parameters.
In a few cases, you may run into a problems with your error localization problem:
a) The processing time of (some of the records) of locate_errors
is high.
b) locate_errors
missed an obvious error.
c) locate_errors
indicates that it did not find a valid solution (for some records) .
Problem a. can be addressed by using the parallel argument of locate_errors
(and replace_errors
).
Problem b can be due to that error_locate
ignores non-linear rules, and therefore
is not able to deduce the errors, because it only takes linear, categorical and
conditional rules into a account.
There may also be problems with your rules set. Problems set may be mitigated
by using the validatetools
package that can detect conflicting and redundant rules and has methods to
simplify your rule set.
If you want to dive deep into the mixed integer problem that is created
by error_locate
you can use the inspect_mip
function.
In the following sections an example is given of how linear, categorical and conditional rules are written as Mixed Integer Problems. First let's see how these rules in validator can be formally defined.
Each translatable rule $r_i(\mathbf{x})$ can be written as a disjunction of
atomic clauses $C_i^j(x)$: it is a function $r_i$ that operates on (some of) the values
of record $\mathbf{x} = (x_1, \ldots, x_n)$ and is TRUE
(valid) or FALSE
(not valid)
$$ r_i(\mathbf{x}) = \bigvee_j C_i^j(\mathbf{x}) $$
with each atomic clause:
$$ C_i^j(\mathbf{x}) = \left{ \begin{array}{l} \mathbf{a}^T\mathbf{x} \leq b \ \mathbf{a}^T\mathbf{x} = b \ x_j \in F_{ij} \textrm{with } F_{ij} \subseteq D_j \ x_j \not\in F_{ij} \textrm{with } F_{ij} \subseteq D_j \ \end{array} \right. $$
Each linear, categorical or conditional rule $r_i$ can be written in this form.
rules <- validator(example_1 = if (income > 0) age >= 16) rules$exprs()
So the rule if (income > 0) age >= 16
can be written as (income <= 0
OR age >=16
)
rules <- validator(example_2 = if (has_house == "yes") income >= 1000) rules$exprs()
So the rule if (has_house == "yes") income >= 1000)
can be written as (has_house != "yes"
OR age >=1000
)
The rules form a system $R(\mathbf{x})$:
$$ R(\mathbf{x}) = \bigwedge_i r_i $$ which means that all rules $r_i$ must be valid. If $R(\mathbf{x})$ is true for record $\mathbf{x}$, then the record is valid, otherwise one (or more) of the rules is violated.
Each rule set $R(\mathbf{x})$ can be translated into a mip problem and solved.
$$ \begin{array}{r} \textrm{Minimize } f(\mathbf{x}) = 0; \ \textrm{s.t. }\mathbf{Rx} \leq \mathbf{d} \ \end{array} $$ - $f(\mathbf{x})$ is the (weighted) number of changed variable: $\delta_i \in {0,1}$
$$ f(\mathbf{x}) = \sum_{i=1}^N w_i \delta_i $$
$\mathbf{R}$ contains rules:
$\mathbf{R}_H(\mathbf{x}) \leq \mathbf{d}_H$ that were specified with validate
/validator
$\mathbf{R}_0(\mathbf{x}, \mathbf{\delta}) \leq \mathbf{d}_0$ : soft constraints that try fix the current record of $\mathbf{x}$ to the observed values.
inspect_mip
:Most users will use the function locate_errors
to find errors.
The function inspect_mip
works exactly same, except that it operates on
just one record in stead of a whole data.frame
. The result of inspect_mip
is a mip object, that is not yet executed and can be inspected.
rules <- validator( r1 = age >= 18 , r2 = income >= 0 ) data <- data.frame(age = c(12, 35), income = c(2000, -1000)) data
So we detect two errors in the dataset:
summary(confront(data, rules))
locate_errors(data, rules)$errors
Lets inspect the first record
mip <- inspect_mip(data, rules)
The mip
object contains the mip problem before it is executed.
We can inspect the lp problem, prior to solving it with lpSolveApi
with
# inspect the lp problem (prior to solving it with lpsolveAPI) lp <- mip$to_lp() print(lp)
Validator rules r1
and r2
are encoded in two lines of the model. The values of the current record are encoded as soft constraints in age_ub
, age_lb
, income_lb
and income_ub
. These constraints try to fix the values of age
at 12 and income
at 2000, but can be violated, setting .delta_age
or .delta_income
to 1.
For large problems the lp problem can be written to disk for inspection
mip$write_lp("my_problem.lp")
Once we execute the mip project, the lp solver is executed on the problem:
res <- mip$execute()
Extra arguments are passed through to lpSolveAPI
. The result object contains several properties:
names(res)
res$solution
indicates of a solution was found
res$solution
res$s
indicates the lpSolveAPI
status, what kind of solution was found.
res$s
res$errors
indicates which fields/values are deemed erroneous:
res$errors
res$values
contains the values for the valid solution that has been found by the lpsolver:
res$values
Note that the solver has found that setting age
from 12 to 18 gives a valid
solution. .delta_age = 1
indicates that age
contained an error.
The result object res
also contains an lp
object after optimization. This
object can be further investigated using lpSolveAPI
functions.
# lp problem after solving res$lp
Note that the lp problem has been simplified. For example the single
variable constraints,the lp problem/object after solving shows that the solver
has optimized some of the rules: it has moved rule r1
and r2
into the
Lower boundary
conditions. It also removed age_lb
because that was
superfluous with respect to the boundary conditions.
In categorical rules, each category is coded in a separate column/mip variable:
e.g. if we have a working
variable, with two categories ("job", "retired"), the mip problem is encoded as follows:
rules <- validator( r1 = working %in% c("job","retired") ) data <- data.frame(working="?")
knitr::kable(data)
mip <- inspect_mip(data, rules) mip$to_lp()
Row r1
indicates that either working:job
or working:retired
must be true. The Kind
row (SOS
) indicates that these variables share the same switch, only one of them can be set.
mip$execute()$values
With categorical variables it is also possible to specify if-then
rules. These are encoded as one mip rule:
rules <- validator( r1 = if (voted == TRUE) adult == TRUE) data <- data.frame(voted = TRUE, adult = FALSE)
knitr::kable(data)
mip <- inspect_mip(data, rules) mip$to_lp()
mip$execute()$values
rules <- validator( r1 = if (income > 0) age >= 16) data <- data.frame(age = 12, income = 2000)
knitr::kable(data)
mip <- inspect_mip(data, rules)
errorlocate
encodes this rule into multiple rules
(as noted in the theoretical section above), so rule r1
is chopped into
1 rule + 2 sub rules:
r1: if (income > 0) age >= 16
:
r1._lin1: if (r1._lin1 == FALSE) income <= 0
r1._lin2: if (r1._lin2 == FALSE) age >= 16
r1: r1._lin1 == FALSE | r1._lin2 == FALSE
This can be seen with:
mip$mip_rules()
The resulting lp model is:
mip$to_lp()
mip$execute()$values
This works together with categorical, linear and conditional rules.
rules <- validator( r1 = working %in% c("no_job", "job","retired") , r2 = if (age < 12) working == "no_job" , r3 = if (working == "retired") age > 50 , r4 = age >=0 ) data <- data.frame(age = 12, working="retired") mip <- inspect_mip(data, rules) mip$execute()$errors
The weights for each variable are normally set to 1, and errorlocate
adds some random remainder to the weights: so the solutions are unique and reproducible (using set.seed
).
set.seed(42) rules <- validator( r1 = if (voted == TRUE) adult == TRUE) data <- data.frame(voted = TRUE, adult = FALSE) mip <- inspect_mip(data, rules, weight = c(voted = 3, adult=1))
$objective
contains the generated weights:
mip$objective
These are assigned to the delta
variables in the objective function of the mip.
mip$to_lp()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.