README.md

Val: A Minimum Dataset Validator (for BACPAC)

This library builds validators.

This will eventually be a conforming R package but at present is in an early development stage.

Development

This repo comes with its own development environment in the form of a Dockerfile.

> source docker-aliases.sh
> bu ## build 
> r ## run r
> rss ## run an r studio server on port 8787

Building the Package

> source docker-aliases.sh
> build_package

Shiny Interface (Bare Bones)

A shiny interface to the validator can be started by running:

> ./scripts/val-shiny.sh <port-number>

This script builds an appropriate docker container in the background, so you will need to run build_package first.

Command Line Interface

There is a dockerified command line interface. To build the CLI you first need to build the package itself as above and then:

> docker build . -f Dockerfile.cli -t valcli

You can invoke this at the command line like this:

> ./scripts/val.sh INPUT-FILENAME OUTPUT-FILENAME

You can copy the val script ot your path or add the appropriate directory, eg:

> sudo cp ./scripts/val.sh /usr/local/bin/val

And then access the validation script from anywhere. Alternatively, you can install the package locally and put the script scripts/cli.R somewhere on your path.

> R -e "devtools::install_local('val_0.0.0.9000.tar.gz')"
> sudo cp ./scripts/cli.R /usr/local/bin/val

Running Tests

Tests are defined in the tests/testthat directory. You can run them like so:

> r
>> devtools::test()

Tests should be grouped by context in some semi-meaningful way.

Concepts

The main idea here is chaining up little validation functions, which have the signature:

validation-state -> validation-state

That is, one validation state to another. See R/validation-state.R for details.

A validation state is a collection of a data frame (to be validated), a status (one of "ok", "halted" and "continuable"), messages indicating which validations have been executed and how they went, and finally a set of miscellaneous messages.

The idea here is to support chaining functions in a way that validations can be continued even if the data set is already known to be invalid.

The function validation_chain converts a list of validation functions to a single validation function. However, this validation function "short circuits" if any of the input functions returns "halted".

If a validation function return "continuable" then the rest of the checks are run.

In this way, we can arrange tests sequentially so as to maximize the number of tests run, even on bad data sets.

In other words, only return a validation status of "halted" if it really is impossible to proceed.

Development

The main task here is to map the rules (as represented in the specification object automatically loaded in the package) to validation functions.

For instance:

check_domain_known <- function(domains=unique(specification$Datasets$Dataset)){
    function(state){
        domain <- unique(state$data$DOMAIN);
        if(domain %in% domains)
            extend_state(state, "ok",
                         check_report("Known DOMAIN", T, "Domain %s is valid.", domain))
        else extend_state(state, "halted",
                          check_report("Known DOMAIN", F, "Domain %s is not valid (valid domains %s).",domain,
                                       collapse_commas(domains)));
    }
}

check_domain_known returns a validation function which ensures that the domain in a given dataset is one of the known domains.

Note that for this validation function to be total it must be the case that we've already checked the presence of the DOMAIN column in the data set and that it is homogeneous. These functions already exist in the repository: check_domain_presence and check_domain_homogeneity.

Once we know a given domain we can then write checks for indvidual columns using utility functions like column_is_textual which takes a column name and returns a validation function to assert that columns presence.

NB: because we can still check other columns if a column is missing or otherwise malformed, this validation function returns "continuable" on failure rather than "halted".

Todo



Vincent-Toups/bacpac_val documentation built on Dec. 2, 2022, 10:20 a.m.