Package dqassess: definition data format, description and utilization

Using a definition data format

All controls, and more generally verification process, are focusing around a definition data format. This element contains metadata and all the information necessary to define your data. For example, this definition format was able to answers to the kinds of questions:

To explain how we can make and read this definition data format, we will use as an example, the data from the RECOLAPE datacall. The characteristics of the data collected were described in the following document.

First, we will use a template for creating an empty definition data format. This template, provide the squeleton or the initial architecture of the format definition. To have it, launch the code below:

format_db_empty <- build_template_format_db(format_name = "name_format") 

This function create an R list. If you want to export it (in xlsx or xls format), you could use the function below:

# Don't forget to check function documentation with ?write_format_db_excel or help(write_format_db_excel)
write_format_db_excel(format_db = format_db_empty)

You can see an example of the empty template for the definition data format here and the example of the definition data format for the RECOLAPE data call here.

The last basic command to know is if you want to read an existing definition data format:

format_db <- read_format_db(input_file_path = "path_definition_data_format")

Definition data format architecture

In this section, we will explain the different sheets of the definition data format, though the empty template and the format definition for the RECOLAPE data call.

Sheet "format_infos"

The first sheet, "format_infos", contains the name of the definition data format (column "format_name") and the indication of the format's version (column "format_version").

Even if the column "format_name" does not need a specific explanation, this parameter is the opportunity to introduce the notion of code style or notation style. Good writing style is like using correct punctuation: you can manage without it, but it sure makes things easier to read and especially avoid conflicts. Several rules and framework exist for maximization compatibility of your text. Here, we are only speaking about the most common advice:

Dealing with the colum "format_version", there's no imposed rule, just like the writing style, but here we decide to use specific semantic versioning: .. The first number () is related to a major updated. It's when you have added many new features or conceptual changes impacted whose directly to the user interface (typically the new user interface is not compatible with the previous). The second number () is when you add functionality in a backwards-compatible manner. The third number () is related to bug resolutions and more preciously when you make backwards-compatible bug fixes. If you want to know more about this kind of thing, you should go on Semantic Versioning or X.Org.

You can modify the version throught the parameter "format_version" (default on 0.1.0) in the function "build_format_db".

Sheet "slot"

The second sheet, "slot", contain all the tables of your dataset. Each table is called a slot.

In the RECOLAPE datacall example, the structure of the slot is that:

| slot_name | mandatory | definition_table | |-----------|-----------|------------------| | effort | TRUE | effort_table | | landing | TRUE | landing_table | | sampling | TRUE | sampling_table |

We have 3 slots (=tables), each slot is mandatory (if not we should have FALSE as an argument) and the name of the slot in the definition data format is indicated in the definition_table column. We will see that in the next section, but all these slots need a sheet named according to the definition_table modalities.

Sheet "slots_hierarchy"

This sheet indicates is there being any relations between your slots. This information is related to the cardinality of your data. In database design and more precisely in the relation model, tables can be related as "one to many", "many to many", "one to one", "zero to many", ect... This kind of information leads very powerful constraints and ensure the consistency of your data.

For understanding this specification, which seems complicated for several people foreign in the field of databases. Look at our data as an example. Here we have 3 slots and potentially 0 relation between them or a maximum of 6 relations (if each slot is related to another, on both sides). Like we said before, we can have a lot kind of relationship between data, but we should consider (for simplification) two levels:

For better understand a relation between two, a solution is to try to make a sentence which explain the relation. Let take an example between the effort slot and the landing slot. In our case, an effort data could be related to no landing (if non-catch during a set for example) or several landings (if we have a partial landing). Relation between the effort slot and the landing slot is a "zero to many" type. Similarly, a landing should be associated with one and only one effort (if landing append that mean an effort related to). The relation between the landing and the effort is a "one to many" type (if we generalize, with many equal to one). In the definition data format, this relation can be specified like that:

| link | level_1 | level_2 | level_3 | level_4 | level_5 | level_6 | level_7 | level_8 | |----------------|--------------|-----------|---------|---------|---------|----------------|---------|---------| | landing_effort | flag_country | vessel_id | year | month | area | fishing_level7 | | |

To define a relation in the definition data format, you do not need to specify what kind of relation is but only direction of the link separated by "_" (here landing to effort). The different levels are all the variables involved in the relation.

For now, the following verification functions can use to check "one to many" relations. However the code was thought to be upgradable and it's possible to add more verification type (also specific verification like "one to one" associated with her constraints). Furthermore, the default template of the sheet make available 8 levels of hierarchies, but the function could manage more than 8. Just add columns and keep the current nomenclature ("level_x"). Finally, if your data does not present relationship (lucky you are!), just leave this sheet with the empty template.

With our data, we can underscore two more relations, "one to many":

| link | level_1 | level_2 | level_3 | level_4 | level_5 | level_6 | level_7 | level_8 | |------------------|--------------|-----------|-----------|---------|----------------|----------------|-------------|--------------| | sampling_landing | flag_country | vessel_id | year | area | species | fishing_level7 | | | | sampling_effort | flag_country | year | vessel_id | area | fishing_level7 | species | length_code | length_class |

Here we have:

Sheet slot definition

The definition data format contains at least one slot definition in relation to slot define in the sheet "slot".

For our example, we would describe the slot definition of the slot "sampling". You could also find the slot definition of the other slot in the definition data format of RECOLAPE data call.

For the slot "sampling"" we have the slot definition "sampling_table":

| column_name | nullable | mandatory | pk | type_name | category | |--------------------|----------|-----------|-------|-------------------------|----------| | sampling_type | FALSE | TRUE | TRUE | sampling_type | codelist | | flag_country | FALSE | TRUE | TRUE | country_type | codelist | | year | FALSE | TRUE | TRUE | year_type | numeric | | trip_code | FALSE | TRUE | FALSE | trip_code_type | text | | vessel_id | FALSE | FALSE | TRUE | vessel_type | codelist | | nb_set | FALSE | FALSE | FALSE | nb_set_type | numeric | | day_at_sea | FALSE | TRUE | FALSE | day_sea_type | numeric | | sampling_method | FALSE | TRUE | TRUE | sampling_method_type | codelist | | aggregation_level | FALSE | TRUE | FALSE | aggregation_level_type | codelist | | station_number | FALSE | TRUE | TRUE | station_number_type | text | | catch_registration | FALSE | TRUE | TRUE | catch_registration_type | codelist | | date | FALSE | TRUE | FALSE | date_type | date | | area | FALSE | TRUE | TRUE | area_type | text | | fishing_level6 | FALSE | TRUE | FALSE | fishing_l6_type | codelist | | fishing_level7 | FALSE | TRUE | TRUE | fishing_l7_type | codelist | | species | FALSE | TRUE | TRUE | species_type | codelist | | catch_category | FALSE | TRUE | TRUE | catch_category_type | codelist | | weight | TRUE | TRUE | FALSE | weight_type | numeric | | weight_allspecies | FALSE | TRUE | FALSE | weight_allspecies_type | numeric | | length_code | FALSE | TRUE | TRUE | length_code_type | codelist | | length_class | FALSE | TRUE | TRUE | length_class_type | numeric | | number_at_length | FALSE | TRUE | FALSE | number_at_length_type | numeric |

Her we have 6 column:

All of this information will be used for different checks and verification (see section below).

Categories definitions

In the previous sheet, we have defined several categories types. These categories apply specifications on data associated.

For now, there are 5 categories possible: codelist, numeric, logical, text and date.

Codelist category

In our example, we have several codelist category. A codelist is a list of codes or meanings that represent the only allowed values for a particular data item. All the codelist type are referenced in the sheet "codelist_types" and link column "type_name" of the different sheet slot definition with a column "enumeration_table" and a sheet "enumeration_table". This last sheet contains all the codes/meanings of the codelist associated and a description of it.

On the "sampling_table" use before as an example (the first table is the "codelist_types" and the second table are a focus on "codelist_specie"):

| type_name | enumeration_table | |-------------------------|-----------------------------| | sampling_type | codelist_sampling_type | | country_type | codelist_country | | vessel_type | codelist_vessel | | sampling_method_type | codelist_sampling_method | | aggregation_level_type | codelist_agregation_level | | catch_registration_type | codelist_catch_registration | | fishing_l6_type | codelist_fishing_level6 | | fishing_l7_type | codelist_fishing_level7 | | species_type | codelist_specie | | catch_category_type | codelist_catch_category | | length_code_type | codelist_lenght_code |

| code | description | |------|--------------------| | SWO | Xiphias gladius | | YFT | Thunnus albacares | | SKJ | Katsuwonus pelamis | | BET | Thunnus obesus |

Warning! To increase the generalization of this package and is appropriation by everyone, a choice was made to let the user complete and update the template of the definition data format. In return, it's very important to keep all the structure of the sheet (column name for example). If you add another codelist sheet (in our example we should have 11 codelist sheets, one for each "type_name") be careful to use the same template as the sheet "codelist_example" in the empty definition data format.

Numeric category

In our example, we have 7 numeric categories:

| type_name | is_integer | min | max | |------------------------|------------|------|------| | year_type | TRUE | 1950 | 2018 | | nb_set_type | TRUE | 1 | 50 | | day_sea_type | FALSE | 1 | 90 | | weight_type | FALSE | 10 | 5000 | | weight_allspecies_type | FALSE | 10 | 5000 | | length_class_type | TRUE | 10 | 300 | | number_at_length_type | FALSE | 0,1 | 500 |

Like in the codelist category before, all information about this category is referenced in the sheet "numeric_types" (with a link between with the sheet slot definition throught "type_name"). Furthermore, we have 3 new column:

Logical category

A logical argument in R only contains TRUE or FALSE values. In R:

Like before, all information about the logical categories are referenced in the sheet "logical_types".

In our example, we do not have the kind of category but the template of the sheet "logical_types" could be like that:

| type_name | |-------------| | yes_no_type |

In the sheet, we see that we have only the information of the "type_name" (link to the sheet slot definition associated). We could think that this sheet should not be relevant because we do not perform any check, except the verification of logical format and we could do that directly from the information of the sheet slot definition associated. This chooses was made from a perspective of the evolution of the package: a new function/check could be easier incremented and not need, a priori, a modification of the definition data format.

Text category

All the information about the text categories are referenced in the sheet "text_types".

For our example, we have these data referenced as text:

| type_name | |---------------------| | area_type | | trip_code_type | | station_number_type |

Like in the logical category and for the same reason, we have only the information of the "type_name".

Date category

The last category is the date category. In our example, we have one data category:

| type_name | time_zone_utc | format_1 | format_2 | format_3 | format_4 | |-----------|---------------|----------|----------|----------|----------| | date_type | TRUE | ymd | ym | yQq | ymd_HMS |

We can find:

Like sheet "slots_hierarchy", you can add any number of date format (incrementing by one the formats). Be sure to leave column empty if you do not want to use it (or delete it).



OB7-IRD/dqassess documentation built on June 6, 2019, 10 a.m.