knitr::opts_chunk$set(eval = FALSE)
library(errorlocate)
library(magrittr)
\begin{centering} \includegraphics[width=0.8\paperwidth]{img/bad-data} \par \end{centering}
A large part of your job is spent on data cleaning:
getting your data in the right shape (e.g. tidyverse)
assessing missing data (e.g. VIM)
checking validity (e.g. validate)
locating and removing errors: errorlocate!
imputing values for missing or erroneous data (e.g. simputation)
\begin{center} \includegraphics[height=1\paperheight]{img/keep-calm-and-validate} \end{center}
\usebackgroundtemplate{Hi}
Package validate allows you to check data against a set of validation rules:
library(validate)
check_that(
  data.frame(age = 160, driver_license = TRUE),
  age >= 0,
  age < 150,
  if (driver_license == TRUE) age >= 16
)
Error localization is a procedure that points out fields in a data set that can be altered or imputed in such a way that all validation rules can be satisfied.
library(validate)
check_that(
  data.frame(age = 160, driver_license = TRUE),
  age >= 0,
  age < 150,
  if (driver_license == TRUE) age >= 16
)
It is clear that age has an erroneous value, but for more complex rule sets it is less clear.
check_that(
  data.frame(age = 3, married = TRUE, attends = "kindergarten"),
  if (married == TRUE) age >= 16,
  if (attends == "kindergarten") age <= 6
)
OK, this is clearly a faulty record, but which value is the error?
Find the minimal (weighted) number of variables that must be changed so that all data rules can be satisfied.
Makes sense! (But there are exceptions...)
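To make the principle concrete, here is a minimal brute-force sketch in base R for the kindergarten record. It is only an illustration: the value domains are assumptions made up for this example, all weights are taken to be 1, and errorlocate itself does not enumerate subsets like this.

```r
# Brute-force illustration of the minimal-change principle (base R only).
record <- list(age = 3, married = TRUE, attends = "kindergarten")

# Hypothetical value domains, assumed for this sketch only.
domains <- list(
  age     = 0:120,
  married = c(TRUE, FALSE),
  attends = c("kindergarten", "school")
)

# The two rules from the example, written as a plain predicate.
is_valid <- function(r) {
  (!r$married || r$age >= 16) &&
    (r$attends != "kindergarten" || r$age <= 6)
}

# Can the record be made valid by changing only the variables in `vars`?
can_fix <- function(vars) {
  candidates <- expand.grid(domains[vars], stringsAsFactors = FALSE)
  for (i in seq_len(nrow(candidates))) {
    r <- record
    r[vars] <- as.list(candidates[i, , drop = FALSE])
    if (is_valid(r)) return(TRUE)
  }
  FALSE
}

# Try subsets of increasing size and stop at the first size that works.
vars <- names(record)
solution <- NULL
for (k in seq_along(vars)) {
  fixable <- Filter(can_fix, combn(vars, k, simplify = FALSE))
  if (length(fixable) > 0) {
    solution <- fixable
    break
  }
}
print(solution)
```

Under these assumed domains, no single change to age or attends can satisfy both rules, so the search points at married. This enumeration grows exponentially with the number of variables, which is why a solver-based approach is used in practice.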
Implemented in errorlocate (second generation of editrules).
errorlocate::locate_errors
locate_errors(
  data.frame(age = 3, married = TRUE, attends = "kindergarten"),
  validator(
    if (married == TRUE) age >= 16,
    if (attends == "kindergarten") age <= 6
  )
)$errors
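The "(weighted)" part of the principle can be tried out by passing weights to locate_errors. The sketch below assumes the weight argument accepts a named numeric vector with one weight per variable; the weight values themselves are made up for illustration. A higher weight marks a variable as more trustworthy, so the solver prefers to flag cheaper variables.

```r
library(validate)
library(errorlocate)

rules <- validator(
  if (married == TRUE) age >= 16,
  if (attends == "kindergarten") age <= 6
)
record <- data.frame(age = 3, married = TRUE, attends = "kindergarten")

# Illustrative weights (an assumption): married is trusted more than
# age and attends, so flagging it is more expensive.
le <- locate_errors(
  record,
  rules,
  weight = c(age = 1, married = 3, attends = 1)
)

# Logical matrix: TRUE marks a value located as erroneous.
le$errors
```

Comparing this result with the unweighted call shows how the choice of weights steers which fields are flagged.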
errorlocate::replace_errors
replace_errors(
  data.frame(age = 3, married = TRUE, attends = "kindergarten"),
  validator(
    if (married == TRUE) age >= 16,
    if (attends == "kindergarten") age <= 6
  )
)
errorlocate:
translates the error localization problem into a mixed integer problem, which is solved with lpSolveAPI.
contains a small framework for implementing your own error localization algorithms.
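As a rough sketch of what such a mixed integer problem looks like, using notation introduced here rather than taken from the package: let $x$ be the observed record with $n$ variables, $\tilde{x}$ an adapted record, $w_i$ the weight of variable $i$, and $\delta_i$ a binary variable that equals 1 when variable $i$ is marked as erroneous.

```latex
\[
\begin{aligned}
\min_{\delta,\,\tilde{x}} \quad & \sum_{i=1}^{n} w_i \delta_i \\
\text{subject to} \quad & \tilde{x} \text{ satisfies all validation rules}, \\
& \tilde{x}_i = x_i \quad \text{whenever } \delta_i = 0, \\
& \delta_i \in \{0, 1\}, \quad i = 1, \dots, n.
\end{aligned}
\]
```

The variables with $\delta_i = 1$ in an optimal solution are the ones reported as errors. How the rules and the "whenever" condition are encoded as linear constraints is left out of this sketch.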
The replace_errors function is pipe friendly:
rules <- validator(age < 150)

data_noerrors <-
  data.frame(age = 160, driver_license = TRUE) %>%
  replace_errors(rules)

errors_removed(data_noerrors) # shows which values were removed as errors
\Large{Interested?}
install.packages("errorlocate")
Or visit:
http://github.com/data-cleaning/errorlocate