inst/tdda/constraints/tdda_json_file_format.md

The .tdda JSON File Format

The tdda constraints library (repository http://github.com/tdda/tdda, module constraints) stores constraints in a JSON file.

This document seeks to specify that file format.

Purpose

TDDA files describe constraints on a dataset so that the dataset can be verified, that is, checked to see whether any or all of the specified constraints are satisfied.

A dataset is assumed to consist of one or more fields (also known as columns), each of which has a distinct name and a type. Each field has a value for each of zero or more records (also known as rows). In some cases, values may be null (or missing). Even a field consisting entirely of nulls can be considered to have a type.

Familiar examples of datasets include:

In principle, TDDA files are intended to contain any kind of constraint regarding datasets. Today, we are mostly concerned with field types, minimum and maximum values, whether nulls are allowed, whether repeated values are allowed within a field, and the allowed values for a field. We may also be concerned with relations between fields.

Likely extensions might include

The motivation for generating, storing and verifying datasets against such sets of constraints is that they can provide a powerful way of detecting bad or unexpected inputs to or outputs from a data analysis process. They can also be valuable as checks on intermediate results.

Filename and encoding

The preferred extension for TDDA Constraints files is .tdda.

.tdda files must be encoded as UTF-8.

The file must be valid JSON.
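
The two requirements above (UTF-8 encoding, valid JSON) can be checked when reading a file. The following is a minimal sketch using only the Python standard library; `load_tdda` is a hypothetical helper name for illustration and is not part of the tdda library.

```python
import json


def load_tdda(path):
    """Read a .tdda file as UTF-8 and parse it as JSON (hypothetical helper).

    Raises UnicodeDecodeError if the file is not valid UTF-8 and
    json.JSONDecodeError if it is not valid JSON.
    """
    with open(path, encoding="utf-8") as f:
        constraints = json.load(f)
    # The format describes a .tdda file as a dictionary at the top level.
    if not isinstance(constraints, dict):
        raise ValueError("expected a JSON object at the top level")
    return constraints
```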

Example

This is an extremely simple example TDDA file:

{
    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        },
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [
                "one",
                "two"
            ]
        }
    }
}
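
To make the example concrete, here is a rough sketch of how a verifier might test a column of values against a few of the constraints on field a. It is not the tdda library's implementation: `check_field` is a hypothetical function, it covers only `min`, `max`, `max_nulls` and `no_duplicates`, and the semantics it assumes (nulls represented as `None` and excluded from the min, max and duplicate checks) are assumptions made for illustration.

```python
def check_field(values, constraints):
    """Return the names of the constraints that `values` violates.

    Illustrative subset only; `type` and `sign` are not checked here.
    """
    non_null = [v for v in values if v is not None]
    n_nulls = len(values) - len(non_null)
    failures = []

    lo = constraints.get("min")
    if lo is not None and any(v < lo for v in non_null):
        failures.append("min")

    hi = constraints.get("max")
    if hi is not None and any(v > hi for v in non_null):
        failures.append("max")

    max_nulls = constraints.get("max_nulls")
    if max_nulls is not None and n_nulls > max_nulls:
        failures.append("max_nulls")

    if constraints.get("no_duplicates") and len(set(non_null)) != len(non_null):
        failures.append("no_duplicates")

    return failures


field_a = {"type": "int", "min": 1, "max": 9, "sign": "positive",
           "max_nulls": 0, "no_duplicates": True}

print(check_field([1, 5, 9], field_a))        # []
print(check_field([0, 5, 5, None], field_a))  # ['min', 'max_nulls', 'no_duplicates']
```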

General Structure

A .tdda file is a dictionary with two currently-supported top-level keys:

Both top-level keys are optional (though if you have neither, there's not a whole lot of constraining going on!)

In future, we certainly expect to add further top-level keys (e.g. for possible constraints on the number of rows, required or banned fields etc.)

The order of fields in the file is immaterial (of course; this is JSON), though writers may choose to present fields in a particular order, e.g. dataset order or sorted on fieldname.
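
As a small illustration of this point, two files that differ only in field order parse to equal dictionaries in Python, and a writer that prefers sorted output can request it when serializing. The field contents here are made up for the example.

```python
import json

dataset_order = '{"fields": {"b": {"type": "string"}, "a": {"type": "int"}}}'
sorted_order = '{"fields": {"a": {"type": "int"}, "b": {"type": "string"}}}'

# Dictionary equality ignores key order, so both texts mean the same thing.
same = json.loads(dataset_order) == json.loads(sorted_order)
print(same)  # True

# A writer may still choose a presentation order, e.g. sorted on field name.
sorted_text = json.dumps(json.loads(dataset_order), sort_keys=True)
print(sorted_text)
```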

Field Constraints

The value of a field constraints entry is a dictionary keyed on constraint kind. For example, the constraints on field a in the example above are specified as:

"a": {
    "type": "int",
    "min": 1,
    "max": 9,
    "sign": "positive",
    "max_nulls": 0,
    "no_duplicates": true
}

The TDDA library recognizes the following kinds of constraints:

Other constraint libraries are free to define their own, custom kinds of constraints.

The value of a constraint is often simply a scalar value, but can be a list or a dictionary; when it is a dictionary, it should include a value key.

If the value of a constraint (the scalar value, or the value key if the value is a dictionary) is null (Python: None), this is taken to indicate the absence of a constraint. A constraint with a null value should be completely ignored, so that a constraints file including null-valued constraints produces identical results to one omitting those constraints.
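
One way a reader of the format might apply this rule is to drop null-valued constraints before doing anything else. The sketch below is an assumption about how that could be done, not the tdda library's code; `effective_constraints` is a hypothetical name.

```python
def effective_constraints(field_constraints):
    """Return a copy of a field's constraints with null-valued ones removed."""
    active = {}
    for kind, spec in field_constraints.items():
        # A dictionary-valued constraint carries its value under "value".
        value = spec.get("value") if isinstance(spec, dict) else spec
        if value is not None:
            active[kind] = spec
    return active


with_nulls = {"min": 1, "max": None, "max_length": {"value": None}}
print(effective_constraints(with_nulls))  # {'min': 1}
```

With this filtering in place, a file that lists `"max": null` behaves exactly like one that omits `max` altogether.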

The semantics and values of the standard field constraint types are as follows:

Examples are:

 * `{"min": 1}`,
 * `{"min": 1.2}`,
 * `{"min": {"value": 3.4, "precision": "fuzzy"}}`.
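
Since a constraint may be given either as a bare scalar or as a dictionary containing a value key, code that reads the format needs to normalize the two forms. A minimal sketch follows, assuming that any keys other than value (such as precision) are qualifiers to be passed along unchanged; `split_constraint` is a hypothetical helper, not a tdda function.

```python
def split_constraint(spec):
    """Split a constraint into (value, extras).

    A scalar or list is returned as-is with no extras; a dictionary yields
    its "value" entry plus any remaining keys.
    """
    if isinstance(spec, dict):
        extras = {k: v for k, v in spec.items() if k != "value"}
        return spec.get("value"), extras
    return spec, {}


print(split_constraint(1))    # (1, {})
print(split_constraint(1.2))  # (1.2, {})
print(split_constraint({"value": 3.4, "precision": "fuzzy"}))
# (3.4, {'precision': 'fuzzy'})
```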

MultiField Constraints

Multifield constraints are not yet being generated by this implementation and will be documented shortly, as they are added.



noamross/rexr documentation built on May 23, 2019, 9:30 p.m.