knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(Ncpus = 1)
errorlocate uses validation rules from the package validate to locate faulty values in observations (or, in database slang, erroneous fields in records).
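It follows the Fellegi-Holt principle: for each record that violates a rule, find the smallest (weighted) set of values that must be changed so that the record can satisfy all rules.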
errorlocate does this by translating the task into a mixed integer problem (see vignette("inspect_mip", package="errorlocate")) and solving it using lpSolveAPI.
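As a rough sketch of the optimisation problem behind this: the notation below is an assumption made here for illustration and is not taken from the package. For a single record with observed values $x^{0}$ and variable weights $w_j$ (1 unless specified otherwise), a binary indicator $\delta_j$ equals 1 when variable $j$ is marked as faulty.

```latex
\begin{aligned}
\min_{\delta,\, x}\quad & \sum_{j} w_j\,\delta_j \\
\text{subject to}\quad & x \text{ satisfies all validation rules}, \\
& x_j = x_j^{0} \quad \text{whenever } \delta_j = 0, \\
& \delta_j \in \{0, 1\}.
\end{aligned}
```

Variables with $\delta_j = 1$ are free to take any value that makes the record valid, which is why the located values still need to be imputed afterwards.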
errorlocate has two main functions:
- locate_errors for detecting errors
- replace_errors for replacing faulty values with NA
library(validate)
library(errorlocate)
Let's start with a simple example:
We have a rule that age must be positive:
rules <- validator(age > 0)
And we have the following data set
"age, income
-10, 0
15, 2000
25, 3000
NA, 1000
" -> csv
d <- read.csv(textConnection(csv), strip.white = TRUE)
d
le <- locate_errors(d, rules)
summary(le)
summary(le) gives an overview of the errors found in this data set.
The complete error listing can be found with:
le$errors
This shows that record 1 has a faulty value for age.
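Because the errors field is a logical matrix shaped like the data, it can be summarised with base R. The following is a minimal sketch, not part of the package documentation; it rebuilds the example data so that it runs on its own, and it uses na.rm = TRUE on the assumption that entries may be NA where the data are missing.

```r
library(validate)
library(errorlocate)

# Same data and rule as in the example above.
d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))
rules <- validator(age > 0)
le <- locate_errors(d, rules)

# Number of located errors per variable (NA entries are skipped).
errors_per_variable <- colSums(le$errors, na.rm = TRUE)
errors_per_variable

# Row numbers of records with at least one located error.
faulty_records <- unname(which(rowSums(le$errors, na.rm = TRUE) > 0))
faulty_records
```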
Suppose we expand our rules:
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)
With validate::confront we can see that rule r2 is violated (record 2).
summary(confront(d, rules))
What errors will be found by locate_errors?
set.seed(1)
le <- locate_errors(d, rules)
le$errors
It now detects that age in observation 2 is also faulty, since it violates the second rule. Note that we use set.seed. This is needed because in this example either age or income can be considered faulty for record 2: changing either one resolves the violation. set.seed ensures that the procedure is reproducible.
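The tie for record 2 can be checked directly with validate. The sketch below is illustrative only: the replacement values 17 and 0 and the helper satisfies_all are hypothetical choices made here, not something locate_errors produces.

```r
library(validate)

rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)

# Record 2 from the example data.
rec <- data.frame(age = 15, income = 2000)

# Two hypothetical repairs, each changing a single value.
fix_age    <- transform(rec, age = 17)
fix_income <- transform(rec, income = 0)

# Hypothetical helper: TRUE when a record satisfies every rule.
satisfies_all <- function(x) all(values(confront(x, rules)))

satisfies_all(rec)         # original record
satisfies_all(fix_age)     # after changing only age
satisfies_all(fix_income)  # after changing only income
```

The original record fails r2, while each single-value repair satisfies both rules, so with equal weights the two solutions are equally good.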
With replace_errors we can remove the errors (which still need to be imputed).
d_fixed <- replace_errors(d, le)
summary(confront(d_fixed, rules))
Here replace_errors has set all faulty values to NA.
d_fixed
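To see exactly which cells were blanked, the original and the fixed data can be compared. This is a minimal sketch in base R that repeats the steps above so it runs on its own; the name blanked is introduced here for illustration.

```r
library(validate)
library(errorlocate)

d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)

set.seed(1)
le <- locate_errors(d, rules)
d_fixed <- replace_errors(d, le)

# Cells that were observed in d but are NA in d_fixed.
blanked <- is.na(d_fixed) & !is.na(d)
which(blanked, arr.ind = TRUE)

# No rule fails any more; checks touching a blanked value evaluate to NA.
check <- summary(confront(d_fixed, rules))
check[, c("name", "passes", "fails", "nNA")]
```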
locate_errors allows for supplying weights for the variables.
It is common that the quality of the observed variables differs.
When we trust age more than income, we can give age a higher weight, so that locate_errors marks income as faulty when it has to choose between the two (record 2):
set.seed(1) # good practice, although not needed in this example
weight <- c(age = 2, income = 1)
le <- locate_errors(d, rules, weight)
le$errors
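Weights do not have to be the same for every record, and a variable can be excluded from error localization altogether. The following is a sketch under two assumptions: that a weight matrix carries the same column names as the data, and that weights are passed through the weight argument. The object names w, le_w and le_inf are chosen here for illustration.

```r
library(validate)
library(errorlocate)

d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)

# Per-record weights: a matrix with the same dimensions as the data.
w <- matrix(1, nrow = nrow(d), ncol = ncol(d),
            dimnames = list(NULL, names(d)))
w[2, "age"] <- 2  # trust age more than income in record 2 only

set.seed(1)
le_w <- locate_errors(d, rules, weight = w)
le_w$errors

# An infinite weight pins a variable: income is never marked as faulty.
set.seed(1)
le_inf <- locate_errors(d, rules, weight = c(age = 1, income = Inf))
le_inf$errors
```

Pinning a variable only makes sense when the remaining variables can still be changed to satisfy every rule; otherwise no solution exists for that record.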
Weights can be specified in different ways (see also errorlocate::expand_weights):
- vector: all records will have the same set of weights. Unspecified columns will have weight 1.
- matrix or data.frame, same dimension as the data: specify weights per record.
- Inf weights to fixate a variable, so it won't be changed.

locate_errors solves a mixed integer problem. When the number of interactions between validation rules is large, finding an optimal solution can become computationally intensive. Both locate_errors and replace_errors have a parallelization option, Ncpus, which makes use of multiple processors. The $duration property of each solution indicates the time (in seconds) spent finding a solution for each record. This time can be restricted with the argument timeout (also in seconds).
# duration is in seconds.
le$duration
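Both options can be combined in a single call. This is a minimal sketch; the values 2 and 10 are arbitrary choices for illustration, and on a data set this small parallel processing is unlikely to pay off.

```r
library(validate)
library(errorlocate)

d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)

set.seed(1)
# Use two processes and allow at most 10 seconds per record.
le <- locate_errors(d, rules, Ncpus = 2, timeout = 10)

# Time spent per record, in seconds.
le$duration
```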