knitr::opts_chunk$set(eval = FALSE) library(errorlocate) library(magrittr)
whisker
, validate
, errorlocate
, docopt
, daff
, tableplot
, ffbase
, chunked
, ...A large part of your job is spent in data-cleaning:
getting your data in the right shape (e.g. tidyverse
, dplyr
)
assessing missing data (e.g. VIM
, datamaid
)
checking validity (e.g. validate
)
locating and removing errors: errorlocate
!
impute values for missing or erroneous data (e.g. simputation
, VIM
, recipes
)
\begin{center} \includegraphics[width=\textwidth]{img/valuechain.pdf} \end{center}
\begin{center} \includegraphics[height=1\paperheight]{img/keep-calm-and-validate} \end{center}
Package validate
allows to:
library(validate) check_that( data.frame(age=160, driver_license=TRUE), age >= 0, age < 150, if (driver_license == TRUE) age >= 16 )
Error localization is a procedure that points out fields in a data set that can be altered or imputed in such a way that all validation rules can be satisfied.
library(validate) check_that( data.frame(age=160, driver_license=TRUE), age >= 0, age < 150, if (driver_license == TRUE) age >= 16 )
It is clear that age
has an erroneous value, but for more complex rule sets
it is less clear.
check_that( data.frame( age = 3 , married = TRUE , attends = "kindergarten" ) , if (married == TRUE) age >= 16 , if (attends == "kindergarten") age <= 6 )
Ok, clear that this is a faulty record, but what is the error?
Find the minimal (weighted) number of variables that cause the invalidation of the data rules.
Makes sense! (But there are exceptions...)
Implemented in errorlocate
(second generation of editrules
).
A rule a disjunction of atomic clauses:
$$ r_i(\la{x}) = \bigvee_j C_i^j(\la{x}) $$ with:
$$ C_i^j(\la{x}) = \left{ \begin{array}{l} \la{a}^T\la{x} \leq b \ \la{a}^T\la{x} = b \ x_j \in F_{ij} \textrm{with } F_{ij} \subseteq D_j \ x_j \not\in F_{ij} \textrm{with } F_{ij} \subseteq D_j \ \end{array} \right. $$
The rules form a system $R(\la{x})$:
$$ R_H(\la{x}) = \bigwedge_i r_i $$ If $R_H(\la{x})$ is true for record $\la{x}$, then the record is valid, otherwise one (or more) of the rules is violated.
Each rule set $R(\la{x})$ can be translated into a mip problem and solved. $$ \begin{array}{r} \textrm{Minimize } f(\mathbf{x}) = 0; \ \textrm{s.t. }\mathbf{Rx} \leq \mathbf{d} \ \end{array} $$
$$ f(\la{x}) = \sum_{i=1}^N w_i \delta_i $$
errorlocate
lpSolveAPI
to solve the problem.errorlocate::locate_errors
locate_errors( data.frame( age = 3 , married = TRUE , attends = "kindergarten" ) , validator( if (married == TRUE) age >= 16 , if (attends == "kindergarten") age <= 6 ) )$errors
errorlocate::replace_errors
replace_errors( data.frame( age = 3 , married = TRUE , attends = "kindergarten" ) , validator( if (married == TRUE) age >= 16 , if (attends == "kindergarten") age <= 6 ) )
The replace_errors
function is pipe friendly:
rules <- validator(age < 150) data_noerrors <- data.frame(age=160, driver_license = TRUE) %>% replace_errors(rules) errors_removed(data_noerrors) # contains errors removed
\begincols \begincol{0.48\textwidth} \includegraphics[width=0.9\textwidth]{img/SDCR.jpg} \endcol
\begincol{0.48\textwidth} \begin{block}{SDCR} M. van der Loo and E. de Jonge (2018) \emph{Statistical Data Cleaning with applications in R} Wiley, Inc. \end{block} \begin{block}{errorlocate} \begin{itemize} \item Available on \href{https://CRAN.R-project.org/package=errorlocate}{\underline{CRAN}} \end{itemize} \end{block} \begin{block}{More theory?} $\leftarrow$ See book \end{block} \endcol \endcols
Thank you for your attention (and enjoy The Hague)!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.