knitr::opts_chunk$set(eval = TRUE) library(validatetools)
whisker
, validate
, errorlocate
, docopt
, tableplot
, chunked
, ffbase
,...\hspace*{-1cm}\includegraphics[width=1\paperwidth]{img/bad-data}
A large part of your and our job is spent in data-cleaning:
getting your data in the right shape (e.g. tidyverse
, recipes
)
checking validity (e.g. validate
, dataMaid
, errorlocate
)
impute values for missing or erroneous data (e.g. VIM
, simputation
, recipes
)
see data changes, improvements (e.g. daff
, diffobj
, lumberjack
)
\hspace*{-1cm} \includegraphics[width=1\paperwidth]{img/rules.png}
validate
With package validate
you can formulate explicit rules that data must conform to:
library(validate) check_that( data.frame(age=160, job = "no", income = 3000), age >= 0, age < 150, job %in% c("yes", "no"), if (job == "yes") age >= 16, if (income > 0) job == "yes" )
A lot of datacleaning packages are using validate rules to facilitate their work.
validate
: validation checks and data quality stats on data. errorlocate
: to find errors in variables (in stead of records)rspa
: data correction under data constraintsdeductive
: deductive correctiondcmodify
: deterministic correction and imputation.validatetools
?validate
, what is the need?detect and resolve problems with rules:
check the rule set using formal logic (without any data!).
One or more rules in conflict: all data incorrect! (and yes that happens when rule sets are large ...)
library(validatetools) rules <- validator( is_adult = age >=21 , is_child = age < 18 ) is_infeasible(rules)
\hspace*{-2cm} \includegraphics[height=1\paperheight]{img/keepcalm-and-resolve.png}
rules <- validator( is_adult = age >=21 , is_child = age < 18 ) # Find out which rule would remove the conflict detect_infeasible_rules(rules) # And its conflicting rule(s) is_contradicted_by(rules, "is_adult")
Rule $r_1$ may imply $r_2$, so $r_2$ can be removed.
rules <- validator( r1 = age >= 18 , r2 = age >= 12 ) detect_redundancy(rules) remove_redundancy(rules)
rules <- validator( r1 = if (gender == "male") weight > 50 , r2 = gender %in% c("male", "female") ) substitute_values(rules, gender = "male")
A bit more complex reasoning, but still classical logic:
rules <- validator( r1 = if (income > 0) age >= 16 , r2 = age < 12 ) # age > 16 is always FALSE so r1 can be simplified simplify_conditional(rules)
simplify_rules
applies all simplification methods to the rule set
rules <- validator( r1 = job %in% c("yes", "no") , r2 = if (job == "yes") income > 0 , r3 = if (age < 16) income == 0 ) simplify_rules(rules, job = "yes")
validatetools
:
reformulates rules into formal logic form.
translates them into a mixed integer program for each of the problems.
$$ \begin{array}{ll} & \textsf{if } P \textsf{ then } Q \ \Leftrightarrow & P \implies Q \ \Leftrightarrow & \lnot P \lor Q \end{array} $$
rules <- validator( example = if (job == "yes") income > 0 )
$$
r_{\textrm{example}}(x) = \textrm{job} \not \in \textrm{"yes"} \lor \textrm{income} > 0
$$
print(rules)
\begin{minipage}[c]{0.5\textwidth} \includegraphics[width=0.9\textwidth]{img/SDCR.jpg} \end{minipage} \begin{minipage}[c]{0.5\textwidth} \begin{block}{SDCR} M. van der Loo and E. de Jonge (2018) \emph{Statistical Data Cleaning with applications in R} Wiley, Inc. \end{block} \begin{block}{validatetools} \begin{itemize} \item Available on \href{https://CRAN.R-project.org/package=validatetools}{\underline{CRAN}} \end{itemize} \end{block} \begin{block}{More theory?} $\leftarrow$ See book \end{block} \end{minipage}
Thank you for your attention! / Köszönöm a figyelmet!
A validation rule set $S$ is a conjunction of rules $r_i$, which applied on record $\la{x}$ returns TRUE
(valid) or FALSE
(invalid)
$$ S(\la{x}) = r_1(\la{x}) \land \cdots \land r_n(\la{x}) $$
a record has to comply to each rule $r_i$.
it is thinkable that two or more $r_i$ are in conflict, making each record invalid.
A rule a disjunction of atomic clauses:
$$ r_i(x) = \bigvee_j C_i^j(x) $$ with:
$$ C_i^j(\la{x}) = \left{ \begin{array}{l} \la{a}^T\la{x} \leq b \ \la{a}^T\la{x} = b \ x_j \in F_{ij} \textrm{with } F_{ij} \subseteq D_j \ x_j \not\in F_{ij} \textrm{with } F_{ij} \subseteq D_j \ \end{array} \right. $$
Each rule set problem can be translated into a mip problem, which can be readily solved using a mip solver.
validatetools
uses lpSolveApi
.
$$ \begin{array}{r} \textrm{Minimize } f(\mathbf{x}) = 0; \ \textrm{s.t. }\mathbf{Rx} \leq \mathbf{d} \ \end{array} $$ with $\la{R}$ and $\la{d}$ the rule definitions and $f(\la{x})$ is the specific problem that is solved.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.