README.md
In Vincent-Toups/bacpac_val: Validator for the BACPAC Minimum Data Set

Val: A Minimum Dataset Validator (for BACPAC)

This library builds validators.

This will eventually be a conforming R package but at present is in an early development stage.

Development

This repo comes with its own development environment in the form of a Dockerfile.

> source docker-aliases.sh
> bu ## build 
> r ## run r
> rss ## run an r studio server on port 8787

> source docker-aliases.sh
> build_package

A shiny interface to the validator can be started by running:

> ./scripts/val-shiny.sh <port-number>

This script builds an appropriate docker container in the background, so you will need to run build_package first.

There is a dockerified command line interface. To build the CLI you first need to build the package itself as above and then:

> docker build . -f Dockerfile.cli -t valcli

You can invoke this at the command line like this:

> ./scripts/val.sh INPUT-FILENAME OUTPUT-FILENAME

You can copy the val script ot your path or add the appropriate directory, eg:

> sudo cp ./scripts/val.sh /usr/local/bin/val

And then access the validation script from anywhere. Alternatively, you can install the package locally and put the script scripts/cli.R somewhere on your path.

> R -e "devtools::install_local('val_0.0.0.9000.tar.gz')"
> sudo cp ./scripts/cli.R /usr/local/bin/val

Tests are defined in the tests/testthat directory. You can run them like so:

> r
>> devtools::test()

Tests should be grouped by context in some semi-meaningful way.

Concepts

The main idea here is chaining up little validation functions, which have the signature:

validation-state -> validation-state

That is, one validation state to another. See R/validation-state.R for details.

A validation state is a collection of a data frame (to be validated), a status (one of "ok", "halted" and "continuable"), messages indicating which validations have been executed and how they went, and finally a set of miscellaneous messages.

The idea here is to support chaining functions in a way that validations can be continued even if the data set is already known to be invalid.

The function validation_chain converts a list of validation functions to a single validation function. However, this validation function "short circuits" if any of the input functions returns "halted".

If a validation function return "continuable" then the rest of the checks are run.

In this way, we can arrange tests sequentially so as to maximize the number of tests run, even on bad data sets.

In other words, only return a validation status of "halted" if it really is impossible to proceed.

Development

The main task here is to map the rules (as represented in the specification object automatically loaded in the package) to validation functions.

For instance:

check_domain_known <- function(domains=unique(specification$Datasets$Dataset)){
    function(state){
        domain <- unique(state$data$DOMAIN);
        if(domain %in% domains)
            extend_state(state, "ok",
                         check_report("Known DOMAIN", T, "Domain %s is valid.", domain))
        else extend_state(state, "halted",
                          check_report("Known DOMAIN", F, "Domain %s is not valid (valid domains %s).",domain,
                                       collapse_commas(domains)));
    }
}

check_domain_known returns a validation function which ensures that the domain in a given dataset is one of the known domains.

Note that for this validation function to be total it must be the case that we've already checked the presence of the DOMAIN column in the data set and that it is homogeneous. These functions already exist in the repository: check_domain_presence and check_domain_homogeneity.

Once we know a given domain we can then write checks for indvidual columns using utility functions like column_is_textual which takes a column name and returns a validation function to assert that columns presence.

NB: because we can still check other columns if a column is missing or otherwise malformed, this validation function returns "continuable" on failure rather than "halted".

check_column_entire
check_column_type
check_column_range
check_column_string_lengths
check_column_values_in_codelist
code which maps the above utilities to validation functions based on the specification object for our known domains.

Vincent-Toups/bacpac_val documentation built on Dec. 2, 2022, 10:20 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com