knitr::opts_chunk$set(eval = TRUE)
library(validatetools)

Who am I?

{.plain}

\hspace*{-1cm}\includegraphics[width=1\paperwidth]{img/bad-data}

Data cleaning...

A large part of your job, and ours, is spent on data cleaning:

Desirable data cleaning properties:

{.plain}

\hspace*{-1cm} \includegraphics[width=1\paperwidth]{img/rules.png}

Data Cleaning philosophy

Advantages:

R package validate

With package validate you can formulate explicit rules that data must conform to:

library(validate)
check_that( data.frame(age=160, job = "no", income = 3000), 
  age >= 0, 
  age < 150,
  job %in% c("yes", "no"),
  if (job == "yes") age >= 16,
  if (income > 0) job == "yes"
)
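The same rules can also be stored and reused. The block below is a minimal sketch, assuming the rules are kept in a `validator` object and applied with `confront()`; the rule names (`age_nonneg`, `age_max`, ...) are made up for illustration.

```r
library(validate)

# The same five rules as above, stored under (hypothetical) names for reuse.
rules <- validator(
  age_nonneg  = age >= 0,
  age_max     = age < 150,
  job_values  = job %in% c("yes", "no"),
  working_age = if (job == "yes") age >= 16,
  has_income  = if (income > 0) job == "yes"
)

dat <- data.frame(age = 160, job = "no", income = 3000)

# confront() evaluates every rule on every record.
cf <- confront(dat, rules)
summary(cf)

# One column per rule, one row per record.
values(cf)
```

Naming the rules makes it easier to refer to them later, for example when a tool reports which rule is in conflict or redundant.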

Rules (2)

Many data cleaning packages use validate rules to do their work.

Why-o-why validatetools?

Because we'd like to...

Problem: infeasibility

Problem

One or more rules are in conflict, so every record is invalid! (And yes, that happens when rule sets are large ...)

library(validatetools)
rules <- validator( is_adult = age >= 21
                  , is_child = age < 18
                  )
is_infeasible(rules)

{.plain}

\hspace*{-2cm} \includegraphics[height=1\paperheight]{img/keepcalm-and-resolve.png}

Conflict, and now?

rules <- validator( is_adult = age >= 21
                  , is_child = age < 18
                  )
# Find out which rule(s) to remove to resolve the conflict
detect_infeasible_rules(rules)
# And which rule(s) contradict it
is_contradicted_by(rules, "is_adult")
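Once the conflicting rule is known, the rule set can be repaired. A minimal sketch, assuming it is acceptable to let `make_feasible()` from validatetools pick the rule(s) to drop rather than choosing by hand:

```r
library(validate)
library(validatetools)

rules <- validator( is_adult = age >= 21
                  , is_child = age < 18
                  )

# Drop the smallest set of rules that restores feasibility.
feasible_rules <- make_feasible(rules)
feasible_rules

# The repaired rule set should no longer be in conflict.
is_infeasible(feasible_rules)
```

In practice a domain expert would probably want to review which rule was dropped before using the repaired set.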

Detecting and removing redundant rules

Rule $r_1$ may imply $r_2$, so $r_2$ can be removed.

rules <- validator( r1 = age >= 18
                  , r2 = age >= 12
                  )
detect_redundancy(rules)
remove_redundancy(rules)
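To see why a rule is redundant, it can help to ask which other rule implies it. A small sketch using `is_implied_by()` from validatetools on the same two rules:

```r
library(validate)
library(validatetools)

rules <- validator( r1 = age >= 18
                  , r2 = age >= 12
                  )

# Which rule(s) make r2 superfluous?
is_implied_by(rules, "r2")

# After removal only the stricter rule should remain.
remaining <- remove_redundancy(rules)
names(remaining)
```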

Value substitution

rules <- validator( r1 = if (gender == "male") weight > 50
                  , r2 = gender %in% c("male", "female")
                  )

substitute_values(rules, gender = "male")

Conditional statement

A bit more complex reasoning, but still classical logic:

rules <- validator( r1 = if (income > 0) age >= 16
                  , r2 = age < 12
                  )
# given r2 (age < 12), age >= 16 is always FALSE, so r1 can be simplified
simplify_conditional(rules)

All together now!

simplify_rules applies all simplification methods to the rule set

rules <- validator( r1 = job %in% c("yes", "no")
                  , r2 = if (job == "yes") income > 0
                  , r3 = if (age < 16) income == 0
                  )
simplify_rules(rules, job = "yes")

How does it work?

validatetools:

Rule types

An if statement is a logical implication, as used in modus ponens:

$$ \begin{array}{ll} & \textsf{if } P \textsf{ then } Q \\ \Leftrightarrow & P \implies Q \\ \Leftrightarrow & \lnot P \lor Q \end{array} $$
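The equivalence can be checked by brute force over all truth values. A minimal base R sketch; the data frame `truth` and its column names are made up for illustration:

```r
# Enumerate all truth assignments of P and Q.
truth <- expand.grid(P = c(TRUE, FALSE), Q = c(TRUE, FALSE))

# "if P then Q": when P holds, Q decides; otherwise the statement holds vacuously.
truth$if_then <- ifelse(truth$P, truth$Q, TRUE)

# The disjunctive form: not P, or Q.
truth$not_P_or_Q <- !truth$P | truth$Q

truth

# Both columns agree on every row.
identical(truth$if_then, truth$not_P_or_Q)
```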

Example

rules <- validator(
  example = if (job == "yes") income > 0
)

$$ r_{\textrm{example}}(x) = \textrm{job} \not\in \{\textrm{"yes"}\} \lor \textrm{income} > 0
$$

print(rules)

Interested?

\begin{minipage}[c]{0.5\textwidth} \includegraphics[width=0.9\textwidth]{img/SDCR.jpg} \end{minipage} \begin{minipage}[c]{0.5\textwidth} \begin{block}{SDCR} M. van der Loo and E. de Jonge (2018) \emph{Statistical Data Cleaning with applications in R} Wiley, Inc. \end{block} \begin{block}{validatetools} \begin{itemize} \item Available on \href{https://CRAN.R-project.org/package=validatetools}{\underline{CRAN}} \end{itemize} \end{block} \begin{block}{More theory?} $\leftarrow$ See book \end{block} \end{minipage}

Thank you for your attention! / Köszönöm a figyelmet!

Addendum

Formal logic

Rule set $S$

A validation rule set $S$ is a conjunction of rules $r_i$ which, applied to a record $\la{x}$, returns TRUE (valid) or FALSE (invalid):

$$ S(\la{x}) = r_1(\la{x}) \land \cdots \land r_n(\la{x}) $$
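The conjunction can be sketched in plain R. This is only an illustration of the definition, not how validate evaluates rules; the three rules and the helper `S()` are hypothetical.

```r
# Hypothetical rules, each a function of a record (a named list).
rules <- list(
  r1 = function(x) x$age >= 0,
  r2 = function(x) x$age < 150,
  r3 = function(x) x$job %in% c("yes", "no")
)

# S(x): a record is valid only if every rule holds.
S <- function(x, rules) {
  all(vapply(rules, function(r) r(x), logical(1)))
}

S(list(age = 160, job = "no"), rules)  # violates r2
S(list(age = 40,  job = "no"), rules)  # satisfies all rules
```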

Note

Formal logic (2)

Rule $r_i(x)$

A rule is a disjunction of atomic clauses:

$$ r_i(x) = \bigvee_j C_i^j(x) $$ with:

$$ C_i^j(\la{x}) = \left\{ \begin{array}{l} \la{a}^T\la{x} \leq b \\ \la{a}^T\la{x} = b \\ x_j \in F_{ij} \textrm{ with } F_{ij} \subseteq D_j \\ x_j \not\in F_{ij} \textrm{ with } F_{ij} \subseteq D_j \end{array} \right. $$

Mixed Integer Programming

Each rule set problem can be translated into a mixed integer programming (MIP) problem, which can be solved with a MIP solver.

validatetools uses lpSolveAPI.

$$ \begin{array}{r} \textrm{Minimize } f(\la{x}) = 0; \\ \textrm{s.t. } \la{R}\la{x} \leq \la{d} \end{array} $$ with $\la{R}$ and $\la{d}$ the rule definitions, and $f(\la{x})$ the objective for the specific problem that is solved.
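As an illustration of the idea, the age conflict from earlier can be posed directly as a tiny linear program with lpSolveAPI. This is a hand-written sketch, not the model validatetools builds internally; the helper `age_rules_feasible()` and the tolerance `eps` used to approximate the strict inequality are assumptions.

```r
library(lpSolveAPI)

# Hypothetical helper: TRUE when "age >= lower" and "age < upper" can both hold.
age_rules_feasible <- function(lower, upper, eps = 1e-3) {
  lp <- make.lp(nrow = 0, ncol = 1)          # one decision variable: age
  set.objfn(lp, 0)                           # f(x) = 0: only feasibility matters
  add.constraint(lp, 1, ">=", lower)         # age >= lower
  add.constraint(lp, 1, "<=", upper - eps)   # age < upper, approximated with eps
  solve(lp) == 0                             # status 0 means a solution was found
}

age_rules_feasible(lower = 21, upper = 18)   # is_adult and is_child together
age_rules_feasible(lower = 18, upper = 21)   # a rule pair without conflict
```

With a constant objective the solver only has to decide whether any record can satisfy all constraints, which is the infeasibility question.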



data-cleaning/validate.simplify documentation built on June 15, 2024, 2:54 p.m.