knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
options(tibble.print_min = 6, tibble.print_max = 6)
ruler offers a set of tools for creating tidy data validation reports using the dplyr grammar of data manipulation. It is structured to be flexible and extendable in terms of creating rules and using their output.
To fully use this package, a solid knowledge of dplyr is required. The key idea behind ruler's design is to validate data by modifying regular dplyr code with as little overhead as possible.
Some functionality is powered by the keyholder package. It is highly recommended to use its supported functions during rule construction. All one- and two-table dplyr verbs applied to local data frames are supported and considered the most appropriate way to create rules.
This README is structured as follows:
An example of using ruler to explore how data obeys user-defined rules and to validate it automatically.
A description of ruler's capabilities in more detail.
You can install the current stable version from CRAN with:
install.packages("ruler")
You can also install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("echasnovski/ruler")
# Utility functions
is_integerish <- function(x) {
  all(x == as.integer(x))
}
z_score <- function(x) {
  abs(x - mean(x)) / sd(x)
}

# Define rule packs
my_packs <- list(
  data_packs(
    dims = . %>% summarise(
      nrow_low = nrow(.) >= 10, nrow_high = nrow(.) <= 15,
      ncol_low = ncol(.) >= 20, ncol_high = ncol(.) <= 30
    )
  ),
  group_packs(
    vs_am_num = . %>% group_by(vs, am) %>% summarise(vs_am_low = n() >= 7),
    .group_vars = c("vs", "am")
  ),
  col_packs(
    enough_col_sum = . %>%
      summarise_if(is_integerish, rules(is_enough = sum(.) >= 14))
  ),
  row_packs(
    enough_row_sum = . %>%
      filter(vs == 1) %>%
      transmute(is_enough = rowSums(.) >= 200)
  ),
  cell_packs(
    dbl_not_outlier = . %>%
      transmute_if(is.numeric, rules(is_not_out = z_score(.) < 1)) %>%
      slice(-(1:5))
  )
)

# Expose data to rules
mtcars_exposed <- mtcars %>%
  as_tibble() %>%
  expose(my_packs)

# View exposure
mtcars_exposed %>% get_exposure()

# Assert any breaker
invisible(mtcars_exposed %>% assert_any_breaker())
A rule is a function which converts a data unit of interest (data, group, column, row, cell) to a logical value indicating whether this unit satisfies a certain condition.
A rule pack is a function which combines several rules into one functional block. The recommended way of creating rules is to create packs right away with the use of dplyr and magrittr's pipe operator.
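The two definitions above can be illustrated with a minimal sketch. The names is_positive_mean and dims_pack are hypothetical and are not part of ruler; the example only assumes that dplyr (which re-exports the magrittr pipe) is installed.

```r
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# A rule (hypothetical name): maps one data unit, here a whole column,
# to a single logical value
is_positive_mean <- function(x) {
  mean(x) > 0
}

# A rule pack (hypothetical name) written as a magrittr functional sequence:
# it bundles two data-level rules into one function of a data frame
dims_pack <- . %>% summarise(
  has_rows = nrow(.) > 0,
  has_few_cols = ncol(.) <= 20
)

# Apply the rule to a column and the pack to the whole data frame
column_result <- is_positive_mean(mtcars$mpg)
pack_result <- dims_pack(mtcars)
```

The pack is an ordinary function, so it can be tried on data directly before being handed to expose().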
Exposing data to rules means applying rules to data, collecting the results in a common format and attaching them to the data as an exposure attribute. In this way the actual exposure can be done in multiple steps and can also be part of a general data preparation pipeline.
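A minimal sketch of exposing in more than one step, assuming ruler and dplyr are installed. The pack names row_count and col_count are made up for illustration.

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# Two hypothetical data packs, each holding a single rule
first_packs <- data_packs(
  row_count = . %>% summarise(has_rows = nrow(.) > 0)
)
second_packs <- data_packs(
  col_count = . %>% summarise(has_12_cols = ncol(.) == 12)
)

# Expose in two separate steps; the second call adds to the exposure
# already attached by the first one
exposed <- mtcars %>%
  expose(first_packs) %>%
  expose(second_packs)

# Both packs should be listed in the attached exposure
packs_info <- exposed %>% get_packs_info()
packs_info
```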
Exposure is a format designed to contain uniform information about validation of different data units. For reproducibility it also saves information about applied packs. Basically exposure is a list with two elements:
Information about the applied packs.
A tidy data validation report in a tibble whose var and id columns identify the validated data unit.
There are four basic combinations of var and id values which define five basic data units:
var == '.all' and id == 0: Data as a whole.
var != '.all' and id == 0: Group (var shouldn't be an actual column name) or column (var should be an actual column name) as a whole.
var == '.all' and id != 0: Row as a whole.
var != '.all' and id != 0: Described cell.
With exposure attached to data one can perform different kinds of actions: exploration, assertion, imputation and so on.
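The structure described above can be inspected directly. The following is a sketch assuming ruler and dplyr are installed; the pack and rule names (shape, row_weight, has_12_cols, wt_below_5) are hypothetical.

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# One data pack and one row pack with illustrative names
shape_pack <- data_packs(
  shape = . %>% summarise(has_12_cols = ncol(.) == 12)
)
weight_pack <- row_packs(
  row_weight = . %>% transmute(wt_below_5 = wt < 5)
)

exposed <- mtcars %>% expose(shape_pack, weight_pack)

# First element of the exposure: information about the applied packs
packs_info <- get_packs_info(exposed)

# Second element: the validation report, one row per reported data unit
report <- get_report(exposed)
names(report)

# Data as a whole is reported with var == ".all" and id == 0
data_rows <- report[report$pack == "shape", ]

# Rows as a whole are reported with var == ".all" and a non-zero id
row_rows <- report[report$pack == "row_weight", ]
```

Filtering the report on var and id in this way is one simple route to separating the data units before acting on them.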
# List of two rule packs for checking data properties
my_data_packs <- data_packs(
  # data_dims is a pack name
  data_dims = . %>% summarise(
    # ncol and nrow are rule names
    ncol = ncol(.) == 12,
    nrow = nrow(.) == 32
  ),
  # Data after subsetting should have number of rows in between 10 and 30
  # Rules are applied separately
  vs_1 = . %>%
    filter(vs == 1) %>%
    summarise(
      nrow_low = nrow(.) > 10,
      nrow_high = nrow(.) < 30
    )
)
# List of one nameless rule pack for checking group property
my_group_packs <- group_packs(
  # Name will be imputed during exposure
  . %>% group_by(vs, am) %>% summarise(any_cyl_6 = any(cyl == 6)),
  # One should supply grouping variables for correct interpretation of output
  .group_vars = c("vs", "am")
)
# rules() defines function predicates with necessary name imputations
# List of two rule packs for checking certain columns' properties
my_col_packs <- col_packs(
  sum_bounds = . %>% summarise_at(
    # Check only columns with names starting with 'c'
    vars(starts_with("c")),
    rules(sum_low = sum(.) > 300, sum_high = sum(.) < 400)
  ),
  # In the edge case of checking one column with one rule there is a need
  # for forcing inclusion of names in the output of summarise_at().
  # This is done with naming argument in vars()
  vs_mean = . %>% summarise_at(vars(vs = vs), rules(mean(.) > 0.5))
)
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

# List of one rule pack checking certain rows' property
my_row_packs <- row_packs(
  row_mean = . %>%
    mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    # Check only rows 10-15
    # Values in 'id' column of report will be based on input data (i.e. 10-15)
    # and not on output data (1-6)
    slice(10:15)
)
is_integerish <- function(x) {
  all(x == as.integer(x))
}

# List of two cell packs checking certain cells' property
my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>%
    transmute_if(
      # Check only integer-like columns
      is_integerish,
      rules(is_common = abs(z_score(.)) < 1)
    ) %>%
    # Check only rows 20-30
    slice(20:30),
  # The same edge case as in column rule pack
  vs_side = . %>% transmute_at(vars(vs = "vs"), rules(. > mean(.)))
)
By default, exposing removes obeyers (data units that satisfy the rule) from the report.
mtcars %>%
  expose(my_data_packs, my_group_packs) %>%
  get_exposure()
One can keep obeyers by setting .remove_obeyers to FALSE.
mtcars %>%
  expose(my_data_packs, my_group_packs, .remove_obeyers = FALSE) %>%
  get_exposure()
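To make the effect of .remove_obeyers easier to see, here is a small sketch comparing the two reports on a single pack. It assumes ruler and dplyr are installed; shape_pack and its rule names are hypothetical.

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# Hypothetical pack: on mtcars the first rule holds and the second fails
shape_pack <- data_packs(
  shape = . %>% summarise(
    has_32_rows = nrow(.) == 32,
    has_12_cols = ncol(.) == 12
  )
)

# Default behaviour: only the failing rule remains in the report
breakers_only <- mtcars %>%
  expose(shape_pack) %>%
  get_report()

# Keeping obeyers: both rules appear, with their outcome in 'value'
full_report <- mtcars %>%
  expose(shape_pack, .remove_obeyers = FALSE) %>%
  get_report()

nrow(breakers_only)
nrow(full_report)
```

Keeping obeyers is mostly useful when the report itself is the deliverable, for example to show which checks were run and passed.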
By default expose() guesses the pack type if a 'not-pack' function is supplied. This behaviour has some edge cases but is useful for interactive use.
mtcars %>%
  expose(
    some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
    some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.)))
  ) %>%
  get_exposure()
To write strict and robust code one can set .guess to FALSE.
mtcars %>%
  expose(
    some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
    some_col_pack = . %>% summarise_at(vars(vs = "vs"), rules(is.character(.))),
    .guess = FALSE
  ) %>%
  get_exposure()
General actions are recommended to be done with act_after_exposure(). Besides the data itself, it takes two arguments:
.trigger - a function which takes the data with attached exposure and returns TRUE if some action should be made.
.actor - a function which takes the same argument as .trigger and performs some action.
If the trigger didn't notify, the input data is returned untouched. Otherwise the output of .actor() is returned. Note that act_after_exposure() is often used for creating side effects (printing, throwing an error, etc.), and in that case the actor should invisibly return its input (so that it can be used with the pipe).
trigger_one_pack <- function(.tbl) {
  packs_number <- .tbl %>%
    get_packs_info() %>%
    nrow()

  packs_number > 1
}

actor_one_pack <- function(.tbl) {
  cat("More than one pack was applied.\n")

  invisible(.tbl)
}

mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  act_after_exposure(
    .trigger = trigger_one_pack,
    .actor = actor_one_pack
  ) %>%
  invisible()
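As a further sketch of the trigger/actor pattern, ruler's any_breaker() can serve as the trigger so that the actor runs only when some rule is broken. The actor note_breakers and the pack shape_pack are hypothetical names; the example assumes ruler and dplyr are installed.

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# Hypothetical actor: report how many breakers the attached report holds,
# then return the input invisibly so the pipe can continue
note_breakers <- function(.tbl) {
  n_breakers <- sum(!get_report(.tbl)$value)
  message("Rule breakers found: ", n_breakers)

  invisible(.tbl)
}

# Hypothetical pack whose single rule fails on mtcars (it has 11 columns)
shape_pack <- data_packs(
  shape = . %>% summarise(has_12_cols = ncol(.) == 12)
)

checked <- mtcars %>%
  expose(shape_pack) %>%
  act_after_exposure(.trigger = any_breaker, .actor = note_breakers)
```

Because the actor returns its input, checked is still the exposed data and can be passed on to later pipeline steps.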
ruler has the function assert_any_breaker(), which can notify about the presence of any breaker in the exposure.
mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  assert_any_breaker()
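In a script, the assertion can be wrapped in tryCatch() to decide what happens when breakers are found. This is a sketch under the assumption that assert_any_breaker() signals a regular R error by default; the pack names and the returned string are illustrative only.

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# Hypothetical packs: the first rule fails on mtcars, the second holds
failing_pack <- data_packs(
  shape = . %>% summarise(has_12_cols = ncol(.) == 12)
)
passing_pack <- data_packs(
  shape = . %>% summarise(has_32_rows = nrow(.) == 32)
)

# With a breaker present the assertion is expected to raise an error,
# which is caught here and turned into a plain value
outcome <- tryCatch(
  mtcars %>% expose(failing_pack) %>% assert_any_breaker(),
  error = function(e) "breakers found"
)

# Without breakers the data passes through and the pipeline can continue
passed <- mtcars %>%
  expose(passing_pack) %>%
  assert_any_breaker()
```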
More leaned towards assertions:
More leaned towards validation: