knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
This vignette describes and explains the logic behind common ways of creating rule packs.
A rule is a function that converts a data unit of interest (data, group, column, row, cell) to a logical value indicating whether this unit satisfies a certain condition.
A rule pack is a function that combines several rules for a common data unit into one functional block. The recommended way of creating rules is to create packs right away using dplyr and magrittr's pipe operator.
Some of ruler's functionality is powered by the keyholder package. It is highly recommended to use its supported functions during rule pack construction. All one- and two-table dplyr verbs applied to local data frames are supported and considered the most appropriate way to create rule packs.
As described in the vignette about the design process, a rule pack must have a type because outputs for different data units have different structures. For this reason ruler has a family of *_packs() constructors (where * stands for the name of the data unit): data_packs(), group_packs(), col_packs(), row_packs() and cell_packs().
To check whether the dimensions of mtcars obey some rules, one can write the following dplyr pipeline:
mtcars %>%
  summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30,
    ncol = ncol(.) == 12
  )
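The output is a data frame with a single row and one logical column per rule; the column names (nrow_low, nrow_high, ncol) serve as rule names.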
There is an easy way to transform this pipeline into a function that can be used for any data: mtcars should be replaced with the . character. To indicate that this function is a rule pack for the data unit 'data', it should be wrapped with data_packs().
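As a minimal sketch of this replacement (the name check_dims is hypothetical and used only for illustration), a pipeline that starts with . is an ordinary one-argument function that can be applied to any data frame:

```r
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# Same checks as above, but starting the pipeline with `.` instead of `mtcars`
# turns it into a function (a magrittr functional sequence).
# `check_dims` is a hypothetical name chosen for this sketch.
check_dims <- . %>%
  summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30,
    ncol = ncol(.) == 12
  )

# The result is a regular function of one argument
is.function(check_dims)

# It can be applied to any data frame, not only mtcars
dims_mtcars <- check_dims(mtcars)
dims_iris <- check_dims(iris)

dims_mtcars
dims_iris
```

Wrapping such a function in data_packs() only adds the information that its output should be interpreted as the result of data rules.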
The following code creates a list my_data_packs with one data rule pack named my_data_pack_1. That rule pack defines rules named nrow_low, nrow_high and ncol.
my_data_packs <- data_packs(
  my_data_pack_1 = . %>%
    summarise(
      nrow_low = nrow(.) > 10,
      nrow_high = nrow(.) < 30,
      ncol = ncol(.) == 12
    )
)
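To see what such a pack produces, it can be applied to data with ruler's expose() and inspected with get_report(). The following is a sketch that assumes the default settings of expose(), under which the report keeps only the rules that were not obeyed; the names exposed and data_report are illustrative.

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# The data rule pack from the text
my_data_packs <- data_packs(
  my_data_pack_1 = . %>%
    summarise(
      nrow_low = nrow(.) > 10,
      nrow_high = nrow(.) < 30,
      ncol = ncol(.) == 12
    )
)

# expose() applies the packs and attaches the results to the data
exposed <- mtcars %>% expose(my_data_packs)

# Tidy validation report with columns pack, rule, var, id, value
data_report <- get_report(exposed)
data_report
```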
To check whether certain groups of rows of mtcars obey some rules, one can write the following dplyr pipeline:
mtcars %>%
  group_by(vs, am) %>%
  summarise(any_cyl_6 = any(cyl == 6))
The output has one row per group: the grouping columns (vs and am in this case) followed by one logical column per rule.
The following code creates a list with one nameless group rule pack (the name will be imputed during exposure). This pack contains one rule any_cyl_6, which checks every group defined by the vs and am columns.
my_group_packs <- group_packs(
  . %>%
    group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  .group_vars = c("vs", "am")
)
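A sketch of exposing data to this group pack is given below. It assumes the .remove_obeyers argument of expose(), set to FALSE here so that every group appears in the report; group_report is an illustrative name. The var column of the report holds a label for each group built from the values of vs and am.

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# The group rule pack from the text
my_group_packs <- group_packs(
  . %>%
    group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  .group_vars = c("vs", "am")
)

# Keep obeyers too, so that all groups are present in the report
group_report <- mtcars %>%
  expose(my_group_packs, .remove_obeyers = FALSE) %>%
  get_report()

# Each value of `var` identifies one (vs, am) group
group_report
```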
Notes:
- The result is ungrouped.
- Grouping columns must be supplied explicitly via the .group_vars argument to distinguish them from non-grouping ones.
- Values of the var column in the validation report are created by uniting the grouping columns' values with the default separator ".". In this case the values will be 0.0, 0.1, 1.0, 1.1. To change the separator, supply it with the .group_sep argument.
To check whether certain columns of mtcars obey some rules, one can write the following dplyr pipeline:
is_integerish <- function(x) {
  all(x == as.integer(x))
}

mtcars %>%
  summarise_if(is_integerish, list(mean_low = ~ mean(.) > 0.5))
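The output has a single row with one logical column per combination of validated column and rule; column names join the validated column name and the rule name with the _ separator (for example, cyl_mean_low).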
In general it is hard to automatically separate the output's column names into 'validated column name' and 'rule name' because the default separator _ is a commonly used one. For this reason ruler has the function rules() with the following functionality:
- It imputes missing rule names based on the order of rules()'s arguments.
- It adds the prefix ._. (Morse code for 'R') to rule names. Note that one can change this prefix with the .prefix argument.
The following code creates a list with two elements:
- A column rule pack named my_col_pack_1, which checks obedience of 'integerish' columns to the rule mean_low.
- A nameless column rule pack, which checks obedience of the column vs to a nameless rule (its name will be imputed as rule__1). Note the use of a named argument in vars(vs = "vs"). This is the current way in dplyr's scoped variants of summarise and mutate to force using both column and function names in the output's column name.
my_col_packs <- col_packs(
  my_col_pack_1 = . %>%
    summarise_if(
      is_integerish,
      rules(mean_low = mean(.) > 0.5)
    ),
  . %>%
    summarise_at(vars(vs = "vs"), rules(sum(.) > 300))
)
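The effect of rules() on names can be inspected directly. This is a small sketch: the prefix "rule_" in the second call is a hypothetical choice for illustration, and the object names are not from the package.

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)

# One named and one nameless rule, with the default prefix
default_rules <- rules(mean_low = mean(.) > 0.5, sum(.) > 300)

# Names carry the prefix; the nameless rule gets an imputed name
names(default_rules)

# The same named rule with a custom (hypothetical) prefix
custom_rules <- rules(mean_low = mean(.) > 0.5, .prefix = "rule_")
names(custom_rules)
```

The prefix is what later allows column names such as vs_._.rule__1 to be split unambiguously into the validated column and the rule.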
To check whether certain rows of mtcars are not outliers, one can write the following dplyr pipeline:
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

mtcars %>%
  mutate(rowMean = rowMeans(.)) %>%
  transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
  slice(10:15)
The output has one row per checked row (six here) and one logical column per rule (is_common_row_mean).
A pipeline like the one above is quite common: for every row, compute some value based on all rows and then validate only some of them. However, in the validation report the id column should represent the row index in the original data frame, and this information is lost after applying slice().
This problem is solved by using the keyholder package. Its main purpose is to track information about rows while a data frame is being modified. During exposure, the pack is applied to a keyed version of the input data with the key equal to the row index. Note that to use this feature one should create rule packs using a composition of functions supported by keyholder.
The following code creates a list with one row pack my_row_pack_1. It contains one rule is_common_row_mean that checks 6 rows (from 10 to 15) for not being an outlier (based on information from all rows) in terms of row means.
my_row_packs <- row_packs(
  my_row_pack_1 = . %>%
    mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    slice(10:15)
)
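The row tracking described above can be checked with a short sketch. It assumes the .remove_obeyers argument of expose(), set to FALSE so that all checked rows are reported; row_report is an illustrative name. The id column of the report should then refer to rows of the original mtcars rather than to positions after slice().

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

# The row rule pack from the text
my_row_packs <- row_packs(
  my_row_pack_1 = . %>%
    mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    slice(10:15)
)

# Keep obeyers too, so every checked row is present in the report
row_report <- mtcars %>%
  expose(my_row_packs, .remove_obeyers = FALSE) %>%
  get_report()

# Row indices are those of the original data frame
row_report$id
```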
To check whether certain cells of mtcars are not outliers, one can write the following dplyr pipeline:
mtcars %>%
  transmute_if(
    is_integerish,
    list(is_common = ~ abs(z_score(.)) < 1)
  ) %>%
  slice(20:24)
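The output has one row per checked row and one logical column per combination of validated column and rule, named by joining the column name and the rule name with _ (for example, cyl_is_common).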
Basically, a cell rule pack is a combination of column and row rule packs. This means:
- One should use rules() instead of a pure list in scoped variants of transmute().
- One should create the pack using functions supported by keyholder.
The following code creates a list with one cell pack my_cell_pack_1. It checks cells of every integer-like column in rows 20-24 for not being an outlier within the column.
my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>%
    transmute_if(
      is_integerish,
      rules(is_common = abs(z_score(.)) < 1)
    ) %>%
    slice(20:24)
)
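As a sketch of how both properties show up in practice (cell_report is an illustrative name, and the default settings of expose() are assumed), exposing mtcars to this cell pack should give a report in which var names the validated column and id gives the row index in the original data:

```r
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

is_integerish <- function(x) {
  all(x == as.integer(x))
}

z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

# The cell rule pack from the text
my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>%
    transmute_if(
      is_integerish,
      rules(is_common = abs(z_score(.)) < 1)
    ) %>%
    slice(20:24)
)

# Each reported cell is identified by a column (var) and a row (id)
cell_report <- mtcars %>%
  expose(my_cell_packs) %>%
  get_report()

cell_report
```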