Find errors in data given a set of validation rules.
The errorlocate package helps to identify obvious errors in raw datasets.
It works in tandem with the package validate.
With validate you formulate data validation rules with which the data must comply.
For example:

    age >= 0
    if (married == TRUE) age > 16
    profit == turnover - cost

While validate can check whether a record is valid, it does not identify
which of the variables are responsible for the invalidation. This may seem a simple task,
but it is actually quite tricky: a set of validation rules forms a web
of dependent variables, so changing a value in an invalid record to satisfy rule 1 may invalidate
the record for rule 2.
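To make the limitation concrete, here is a minimal sketch that checks a single record against the three example rules using validate's confront() and values(). The record itself is invented for illustration and is not part of the original README. The result reports which rules fail, but says nothing about which fields are wrong.

```r
library(validate)

# The three example rules from the text
rules <- validator(
  age >= 0,
  if (married == TRUE) age > 16,
  profit == turnover - cost
)

# Hypothetical record, invented for this illustration
record <- data.frame(
  age = 12, married = TRUE,
  profit = 750, turnover = 200, cost = 125
)

cf <- confront(record, rules)

# One logical column per rule: TRUE means the record satisfies that rule
rule_results <- values(cf)
print(rule_results)

# Per-rule counts of passes, fails and missing results
print(summary(cf))
```

For this record the second and third rules fail. Whether age or married is at fault, or which of profit, turnover and cost, cannot be read from the rule results alone.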
errorlocate provides a small framework for record-based error detection and implements the Fellegi-Holt
algorithm. This algorithm assumes there is no other information available than the values of a record
and a set of validation rules. The algorithm minimizes the (weighted) number of values that need
to be adjusted to remove the invalidation.
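The minimal-change principle can be sketched with locate_errors(), which returns the error locations without altering the data. The weights below are assumptions chosen for illustration: a higher weight marks a field as more trustworthy, so the solver is more reluctant to flag it. Passing the weights as a named numeric vector is how I understand the weight argument to work; check the package documentation before relying on it.

```r
library(validate)
library(errorlocate)

rules <- validator(
  profit == turnover - cost,
  cost >= 0.6 * turnover,
  turnover >= 0,
  cost >= 0
)

data <- data.frame(profit = 750, cost = 125, turnover = 200)

# Default: every field has the same weight, so the smallest set of
# fields that can be changed to satisfy all rules is flagged.
le <- locate_errors(data, rules)
print(values(le))

# Assumed weights for illustration: profit is trusted three times
# as much as cost or turnover.
le_weighted <- locate_errors(
  data, rules,
  weight = c(profit = 3, cost = 1, turnover = 1)
)
print(values(le_weighted))
```

With equal weights, changing profit alone (to 75) satisfies every rule, while no change to only cost or only turnover does, so profit is flagged. With the assumed weights, adjusting both cost and turnover has a total weight of 2, which is less than the weight of 3 on profit, so the solver may prefer that pair instead.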
errorlocate can be installed from CRAN:
install.packages("errorlocate")
Beta versions can be installed with drat:
drat::addRepo("data-cleaning")
install.packages("errorlocate")
The latest development version of errorlocate can be installed from GitHub with devtools:
devtools::install_github("data-cleaning/errorlocate")
library(errorlocate)

rules <- validator( profit == turnover - cost
                  , cost >= 0.6 * turnover
                  , turnover >= 0
                  , cost >= 0 # is implied
                  )

data <- data.frame(profit=750, cost=125, turnover=200)

data_no_error <- replace_errors(data, rules)

# faulty data was replaced with NA
print(data_no_error)

er <- errors_removed(data_no_error)
print(er)
summary(er)
er$errors