contains_exactly: Check records using a predefined table of (im)possible values
In validate: Data Validation Infrastructure

contains_exactly

R Documentation

Check records using a predefined table of (im)possible values

Description

Given a set of keys or key combinations, check whether all those combinations occur, or check that they do not occur. Supports globbing and regular expressions.

Usage

contains_exactly(keys, by = NULL, allow_duplicates = FALSE)

contains_at_least(keys, by = NULL)

contains_at_most(keys, by = NULL)

does_not_contain(keys)

Arguments

`keys`	A data frame or bare (unquoted) name of a data frame passed as a reference to `confront` (see examples). The column names of `keys` must also occur in the columns of the data under scrutiny.
`by`	A bare (unquoted) variable or list of variable names that occur in the data under scrutiny. The data will be split into groups according to these variables and the check is performed on each group.
`allow_duplicates`	`[logical]` toggle whether key combinations can occur more than once.

Details

`contains_exactly`	dataset contains exactly the key set, no more, no less.
`contains_at_least`	dataset contains at least the given keys.
`contains_at_most`	all keys in the dataset are contained in the given keys.
`does_not_contain`	The keys are interpreted as forbidden key combinations.
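
The four checks can be read as set comparisons between the key combinations found in the data and the given key set. The following is a minimal base-R sketch of that reading, not the package's implementation; the variable names (`data_keys`, `given_keys`, and the result names) are made up for illustration, and duplicates and grouping are ignored.

```r
# Illustration only: base-R set logic, not validate's internal code.
# Key combinations are represented as pasted strings.
data_keys  <- c("2018 Q1", "2018 Q2", "2018 Q3")             # found in the data
given_keys <- c("2018 Q1", "2018 Q2", "2018 Q3", "2018 Q4")  # the key set

# exactly: data and key set hold the same combinations
exactly  <- setequal(data_keys, given_keys)

# at least: every given key occurs in the data
at_least <- all(given_keys %in% data_keys)

# at most: every key in the data occurs in the key set
at_most  <- all(data_keys %in% given_keys)

# does not contain: per record, TRUE when the record's key is not in the key set
not_forbidden <- !data_keys %in% given_keys
```

Here the data lacks "2018 Q4", so the "exactly" and "at least" readings fail while the "at most" reading holds.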

Value

For contains_exactly, contains_at_least, and contains_at_most: a logical vector with one entry for each record in the dataset. Any group not conforming to the test keys will have FALSE assigned to each record in the group (see examples).

For contains_at_least: a logical vector with length equal to the number of records under scrutiny. It is FALSE where key combinations do not match any value in keys.

For does_not_contain: a logical vector with length equal to the number of records under scrutiny. It is FALSE where key combinations match a (forbidden) value in keys.
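
To make the per-group result concrete, here is a small base-R sketch of how a group-level outcome can be spread over every record in the group. It only mimics the behaviour described for `by`; it is not the package's code, and the names `toy`, `required`, `group_ok` and `record_ok` are hypothetical.

```r
# Illustration only: a group-level check copied to each record of the group.
toy <- data.frame(
    variable = c("import", "import", "export")
  , quarter  = c("Q1", "Q2", "Q1")
)
required <- c("Q1", "Q2")

# one TRUE/FALSE per group: does the group hold exactly the required keys?
group_ok <- tapply(toy$quarter, toy$variable, function(q) setequal(q, required))

# copy each group's outcome to all of its records
record_ok <- as.vector(group_ok[as.character(toy$variable)])
record_ok
```

The "export" group lacks "Q2", so its single record is FALSE, while both "import" records are TRUE.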

Globbing

Globbing is a simple method of defining string patterns where the asterisk (*) is used as a wildcard. For example, the globbing pattern "abc*" stands for any string starting with "abc".
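
As a rough illustration of what such a pattern means, base R can translate a glob into a regular expression with `utils::glob2rx()`. Whether validate uses this function internally is not stated here; the sketch only shows the matching semantics, and `candidates` is a made-up vector.

```r
# Illustration only: translate a glob to a regular expression and match it.
pattern    <- "abc*"
regex      <- utils::glob2rx(pattern)        # anchored regex for the glob
candidates <- c("abc", "abcdef", "xabc", "ab")

# TRUE for strings that start with "abc"
matches <- grepl(regex, candidates)
matches
```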

Examples


library(validate)

## Check that data is present for all quarters in 2018-2019
dat <- data.frame(
    year    = rep(c("2018","2019"),each=4)
  , quarter = rep(sprintf("Q%d",1:4), 2)
  , value   = sample(20:50,8)
)

# Method 1: creating a data frame in-place (only for simple cases)
rule <- validator(contains_exactly(
           expand.grid(year=c("2018","2019"), quarter=c("Q1","Q2","Q3","Q4"))
          )
        )
out <- confront(dat, rule)
values(out)

# Method 2: pass the keyset to 'confront', and reference it in the rule.
# this scales to larger key sets but it needs a 'contract' between the
# rule definition and how 'confront' is called.

keyset <- expand.grid(year=c("2018","2019"), quarter=c("Q1","Q2","Q3","Q4"))
rule <- validator(contains_exactly(all_keys))
out <- confront(dat, rule, ref=list(all_keys = keyset))
values(out)

## Globbing (use * as a wildcard)

# transaction data 
transactions <- data.frame(
    sender   = c("S21", "X34", "S45","Z22")
  , receiver = c("FG0", "FG2", "DF1","KK2")
  , value    = sample(70:100,4)
)

# forbidden combinations: if the sender starts with "S",
# the receiver cannot start with "FG"
forbidden <- data.frame(sender="S*",receiver = "FG*")

rule <- validator(does_not_contain(glob(forbidden_keys)))
out <- confront(transactions, rule, ref=list(forbidden_keys=forbidden))
values(out)


## Quick interactive testing
# use 'with':
# wrap the keys in glob() so that "*" is treated as a wildcard
with(transactions, does_not_contain(glob(forbidden)))



## Grouping 

# data in 'long' format
dat <- expand.grid(
  year = c("2018","2019")
  , quarter = c("Q1","Q2","Q3","Q4")
  , variable = c("import","export")
)
dat$value <- sample(50:100,nrow(dat))


periods <- expand.grid(
  year = c("2018","2019")
  , quarter = c("Q1","Q2","Q3","Q4")
)

rule <- validator(contains_exactly(all_periods, by=variable))

out <- confront(dat, rule, ref=list(all_periods=periods))
values(out)

# remove one export record

dat1 <- dat[-15,]
out1 <- confront(dat1, rule, ref=list(all_periods=periods))
values(out1)

validate documentation built on July 4, 2024, 9:07 a.m.