knitr::opts_chunk$set(eval = FALSE)
library(errorlocate)
library(magrittr)

{.plain .fragile}

\begin{centering} \includegraphics[width=0.8\paperwidth]{img/bad-data} \par \end{centering}

Data cleaning...

A large part of your job is spent in data-cleaning:

{.plain}

\begin{center} \includegraphics[height=1\paperheight]{img/keep-calm-and-validate} \end{center}

\usebackgroundtemplate{Hi}

Validation rules?

The validate package lets you check data against explicit rules:

library(validate)
check_that( data.frame(age=160, driver_license=TRUE), 
  age >= 0, 
  age < 150,
  if (driver_license == TRUE) age >= 16
)

Explicit validation rules:

Note:

Error localization

Error localization is a procedure that points out fields in a data set that can be altered or imputed in such a way that all validation rules can be satisfied.

Find the error:

library(validate)
check_that( data.frame(age=160, driver_license=TRUE), 
  age >= 0, 
  age < 150,
  if (driver_license == TRUE) age >= 16
)

Here age clearly has an erroneous value, but with more complex rule sets the faulty field is harder to spot.

Multivariate example:

check_that( data.frame( age     = 3
                      , married = TRUE
                      , attends = "kindergarten"
                      )
          , if (married == TRUE) age >= 16
          , if (attends == "kindergarten") age <= 6
          )

This record is clearly faulty, but which field is in error?

Fellegi-Holt formalism:

Find the minimal (weighted) number of variables whose values must be changed so that the record can satisfy all validation rules.

Makes sense! (But there are exceptions...)
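To make the principle concrete, here is a minimal brute-force sketch in base R for the kindergarten record above. It is an illustration only and not how errorlocate solves the problem. The helpers `satisfies_rules` and `find_minimal_fix` and the candidate values in `domains` are hypothetical, and all variables are assumed to have equal weight.

```r
# Hypothetical helper: TRUE if a record satisfies both example rules.
satisfies_rules <- function(rec) {
  rule1 <- (!rec$married) || (rec$age >= 16)
  rule2 <- (rec$attends != "kindergarten") || (rec$age <= 6)
  rule1 && rule2
}

record <- list(age = 3, married = TRUE, attends = "kindergarten")

# Assumed candidate values per variable (a small illustrative grid).
domains <- list(
  age     = c(0, 3, 6, 16, 40),
  married = c(TRUE, FALSE),
  attends = c("kindergarten", "school", "none")
)

# Try subsets of variables in order of increasing size and return the
# first subset for which some replacement values make the record valid.
find_minimal_fix <- function(record, domains, check) {
  if (check(record)) return(character(0))  # nothing to fix
  vars <- names(record)
  for (k in seq_along(vars)) {
    for (s in combn(vars, k, simplify = FALSE)) {
      grid <- expand.grid(domains[s], stringsAsFactors = FALSE)
      for (i in seq_len(nrow(grid))) {
        candidate <- record
        candidate[s] <- as.list(grid[i, , drop = FALSE])
        if (check(candidate)) return(s)
      }
    }
  }
  NULL  # no fix exists within the assumed domains
}

fix <- find_minimal_fix(record, domains, satisfies_rules)
print(fix)  # [1] "married"
```

Changing age alone cannot work, because no age is both at least 16 and at most 6. Changing married alone does, so it is the smallest fix for this record.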

Implemented in errorlocate (second generation of editrules).

errorlocate::locate_errors

locate_errors( data.frame( age     = 3
                         , married = TRUE
                         , attends = "kindergarten"
                         )
             , validator( if (married == TRUE) age >= 16
                        , if (attends == "kindergarten") age <= 6
                        )
             )$errors
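The "weighted" part of the formalism can be explored with the `weight` argument of `locate_errors`. The sketch below assumes a per-column numeric weight vector, and the weights themselves are made up for illustration. A higher weight marks a variable as more trustworthy, so with married weighted at 3 the solver may prefer to flag age and attends together (total weight 2).

```r
library(validate)
library(errorlocate)

rules <- validator( if (married == TRUE) age >= 16
                  , if (attends == "kindergarten") age <= 6
                  )

record <- data.frame( age     = 3
                    , married = TRUE
                    , attends = "kindergarten"
                    )

# Illustrative weights (an assumption): changing `married` costs 3,
# changing `age` or `attends` costs 1 each.
errs <- locate_errors( record
                     , rules
                     , weight = c(age = 1, married = 3, attends = 1)
                     )$errors

# One row per record, one column per variable; TRUE marks a field
# selected as erroneous.
print(errs)
```

Comparing this result with the unweighted call shows how the weights steer which fields are blamed.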

errorlocate::replace_errors

replace_errors( 
    data.frame( age     = 3
              , married = TRUE
              , attends = "kindergarten"
              )
  , validator( if (married == TRUE) age >= 16
             , if (attends == "kindergarten") age <= 6
             )
)

Internal workings:

errorlocate:

Pipe friendly

The replace_errors function is pipe friendly:

rules <- validator(age < 150)

data_noerrors <- 
  data.frame(age=160, driver_license = TRUE) %>% 
  replace_errors(rules)

errors_removed(data_noerrors) # reports which values replace_errors() removed

Thank you!

\Large{Interested?}

install.packages("errorlocate")

Or visit:

http://github.com/data-cleaning/errorlocate



data-cleaning/errorlocate documentation built on Oct. 1, 2023, 1:04 p.m.