knitr::opts_chunk$set(eval = FALSE)
library(errorlocate)
library(magrittr)

Who am I?

Data cleaning...

A large part of your job is spent on data cleaning:

Statistical Value Chain

\begin{center} \includegraphics[width=\textwidth]{img/valuechain.pdf} \end{center}


\begin{center} \includegraphics[height=1\paperheight]{img/keep-calm-and-validate} \end{center}

Validation rules?

Package validate allows you to:

library(validate)
check_that( data.frame(age=160, driver_license=TRUE), 
  age >= 0, 
  age < 150,
  if (driver_license == TRUE) age >= 16
)

Explicit validation rules:

Note:

Error localization

Error localization is a procedure that points out fields in a data set that can be altered or imputed in such a way that all validation rules can be satisfied.

Find the error:

library(validate)
check_that( data.frame(age=160, driver_license=TRUE), 
  age >= 0, 
  age < 150,
  if (driver_license == TRUE) age >= 16
)

It is clear that age has an erroneous value, but for more complex rule sets it is less clear.
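One way to see which rule a record breaks, rather than spotting it by eye, is to store the rules in a `validator()` object and confront the data with it. The following is a minimal sketch using validate's `validator()`, `confront()`, `summary()` and `values()`; the object names `rules`, `record` and `cf` are chosen here for illustration only.

```r
library(validate)

# Store the rules once so they can be reused on other data sets
rules <- validator(
  age >= 0,
  age < 150,
  if (driver_license == TRUE) age >= 16
)

record <- data.frame(age = 160, driver_license = TRUE)

# Confront the data with the rules
cf <- confront(record, rules)

summary(cf)  # one row per rule, with counts of passes and fails
values(cf)   # logical matrix: one row per record, one column per rule
```

For this record only the second rule (`age < 150`) fails, which tells us that a rule is violated but not yet which field should be changed.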

Multivariate example:

check_that( data.frame( age     = 3
                      , married = TRUE
                      , attends = "kindergarten"
                      )
          , if (married == TRUE) age >= 16
          , if (attends == "kindergarten") age <= 6
          )

This is clearly a faulty record, but which field is in error?

Fellegi-Holt formalism:

Find the minimal (weighted) number of variables that cause the invalidation of the data rules.

Makes sense! (But there are exceptions...)

Implemented in errorlocate (second generation of editrules).
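To make the principle concrete, here is a brute-force sketch in base R for the kindergarten record. It is an illustration only and not how errorlocate works internally. The finite `domains` and the helper names `is_valid`, `can_fix` and `locate_minimal` are assumptions made for this example, and all variables get equal weight.

```r
# Hypothetical finite domains for each variable (an assumption of this sketch)
domains <- list(
  age     = 0:120,
  married = c(TRUE, FALSE),
  attends = c("kindergarten", "school", "none")
)

# The two rules of the multivariate example as plain predicates;
# "if (A) B" is equivalent to "(!A) | B"
rules <- list(
  function(r) (!r$married) | (r$age >= 16),
  function(r) (r$attends != "kindergarten") | (r$age <= 6)
)

# A record is valid when every rule holds
is_valid <- function(r) all(vapply(rules, function(f) f(r), logical(1)))

# Can the record be made valid by changing only the variables in `vars`?
can_fix <- function(record, vars) {
  candidates <- expand.grid(domains[vars], stringsAsFactors = FALSE)
  for (i in seq_len(nrow(candidates))) {
    trial <- record
    trial[vars] <- as.list(candidates[i, , drop = FALSE])
    if (is_valid(trial)) return(TRUE)
  }
  FALSE
}

# Try subsets of variables in order of increasing size
# and return the first subset that can repair the record
locate_minimal <- function(record) {
  if (is_valid(record)) return(character(0))
  vars <- names(domains)
  for (k in seq_along(vars)) {
    for (s in combn(vars, k, simplify = FALSE)) {
      if (can_fix(record, s)) return(s)
    }
  }
  NULL
}

record <- list(age = 3, married = TRUE, attends = "kindergarten")
locate_minimal(record)
# "married"
```

Changing `age` alone cannot work, because no age is both at least 16 and at most 6. Changing `married` alone does, so it is the smallest set of fields to alter. The search grows exponentially with the number of variables, which is why a smarter formulation is needed in practice.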

Formal description (1)

Rule $r_i(x)$

A rule is a disjunction of atomic clauses:

$$ r_i(\la{x}) = \bigvee_j C_i^j(\la{x}) $$ with:

$$ C_i^j(\la{x}) = \left\{ \begin{array}{l} \la{a}^T\la{x} \leq b \\ \la{a}^T\la{x} = b \\ x_j \in F_{ij} \textrm{ with } F_{ij} \subseteq D_j \\ x_j \not\in F_{ij} \textrm{ with } F_{ij} \subseteq D_j \end{array} \right. $$

Rule system:

The rules form a system $R(\la{x})$:

$$ R(\la{x}) = \bigwedge_i r_i(\la{x}) $$ If $R(\la{x})$ is true for record $\la{x}$, then the record is valid; otherwise one or more of the rules are violated.
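As a worked illustration (a sketch, not taken from the package documentation), the two rules of the multivariate example can be written in this form by reading "if $A$ then $B$" as $\neg A \vee B$:

$$ \begin{array}{l} r_1(\la{x}) = \left(\textrm{married} \in \{\textrm{FALSE}\}\right) \vee \left(-\textrm{age} \leq -16\right) \\ r_2(\la{x}) = \left(\textrm{attends} \not\in \{\textrm{kindergarten}\}\right) \vee \left(\textrm{age} \leq 6\right) \end{array} $$

For the record with age $= 3$, married $=$ TRUE and attends $=$ kindergarten, both clauses of $r_1$ are false, so $r_1$ is false, while the second clause of $r_2$ is true, so $r_2$ is true. The conjunction $r_1 \wedge r_2$ is therefore false and the record is invalid.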

Mixed Integer Programming to FH

Each rule set $R(\la{x})$ can be translated into a mixed integer programming (MIP) problem and solved. $$ \begin{array}{r} \textrm{Minimize } f(\mathbf{x}) = 0; \\ \textrm{s.t. }\mathbf{Rx} \leq \mathbf{d} \end{array} $$

$$ f(\la{x}) = \sum_{i=1}^N w_i \delta_i $$
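A small worked example may help here. It reads $\delta_i \in \{0, 1\}$ as an indicator that field $i$ is changed and $w_i$ as the weight of that field, which is an interpretation consistent with the "minimal (weighted) number of variables" above. The weights in the last column are assumed values chosen for illustration. For the kindergarten record, with fields ordered as (age, married, attends), two candidate repairs are:

$$ \begin{array}{lccc} \textrm{Fields changed} & \delta & f \textrm{ with } w = (1, 1, 1) & f \textrm{ with } w = (1, 3, 1) \\ \textrm{married} & (0, 1, 0) & 1 & 3 \\ \textrm{age and attends} & (1, 0, 1) & 2 & 2 \end{array} $$

With equal weights, changing married is the cheapest repair. If married is considered three times as reliable as the other fields, changing age and attends together becomes cheaper ($2 < 3$), so the weights steer which fields are pointed out as erroneous.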

errorlocate

errorlocate::locate_errors

locate_errors( data.frame( age     = 3
                  , married = TRUE
                  , attends = "kindergarten"
                  )
     , validator( if (married == TRUE) age >= 16
                , if (attends == "kindergarten") age <= 6
                )
     )$errors

errorlocate::replace_errors

replace_errors( 
    data.frame( age     = 3
              , married = TRUE
              , attends = "kindergarten"
              )
  , validator( if (married == TRUE) age >= 16
             , if (attends == "kindergarten") age <= 6
             )
)
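By default `replace_errors()` sets the fields it locates as erroneous to `NA`, so they can be imputed in a later step. The sketch below checks this on the same record; the object names `rules`, `record` and `cleaned` are chosen here for illustration.

```r
library(validate)
library(errorlocate)

rules <- validator(
  if (married == TRUE) age >= 16,
  if (attends == "kindergarten") age <= 6
)

record <- data.frame(
  age     = 3,
  married = TRUE,
  attends = "kindergarten"
)

# Fields located as erroneous are replaced by NA
cleaned <- replace_errors(record, rules)

is.na(cleaned$married)  # the located field is now missing
cleaned$age             # the other fields are left untouched
```

Here `married` is the single field whose change can make the record valid, so it is the one that is blanked out.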

Pipe %>% friendly

The replace_errors function is pipe friendly:

rules <- validator(age < 150)

data_noerrors <- 
  data.frame(age=160, driver_license = TRUE) %>% 
  replace_errors(rules)

errors_removed(data_noerrors) # shows which values were located as errors and removed

Interested?

\begincols \begincol{0.48\textwidth} \includegraphics[width=0.9\textwidth]{img/SDCR.jpg} \endcol

\begincol{0.48\textwidth} \begin{block}{SDCR} M. van der Loo and E. de Jonge (2018) \emph{Statistical Data Cleaning with applications in R} Wiley, Inc. \end{block} \begin{block}{errorlocate} \begin{itemize} \item Available on \href{https://CRAN.R-project.org/package=errorlocate}{\underline{CRAN}} \end{itemize} \end{block} \begin{block}{More theory?} $\leftarrow$ See book \end{block} \endcol \endcols

Thank you for your attention (and enjoy The Hague)!



data-cleaning/errorlocate documentation built on Oct. 1, 2023, 1:04 p.m.