Home

/

GitHub

/

OB7-IRD/dqassess

/

In OB7-IRD/dqassess: Data Quality ASSESSment

Using a definition data format
Definition data format architecture

Package dqassess: definition data format, description and utilization

Using a definition data format

All controls, and more generally verification process, are focusing around a definition data format. This element contains metadata and all the information necessary to define your data. For example, this definition format was able to answers to the kinds of questions:

What is the structure of my data? Do I have several tables?
I there are, what are the relations between my data?
What is the type of my data? numeric, codelist, free text?
...

To explain how we can make and read this definition data format, we will use as an example, the data from the RECOLAPE datacall. The characteristics of the data collected were described in the following document.

First, we will use a template for creating an empty definition data format. This template, provide the squeleton or the initial architecture of the format definition. To have it, launch the code below:

format_db_empty <- build_template_format_db(format_name = "name_format")

This function create an R list. If you want to export it (in xlsx or xls format), you could use the function below:

# Don't forget to check function documentation with ?write_format_db_excel or help(write_format_db_excel)
write_format_db_excel(format_db = format_db_empty)

You can see an example of the empty template for the definition data format here and the example of the definition data format for the RECOLAPE data call here.

The last basic command to know is if you want to read an existing definition data format:

format_db <- read_format_db(input_file_path = "path_definition_data_format")

Definition data format architecture

In this section, we will explain the different sheets of the definition data format, though the empty template and the format definition for the RECOLAPE data call.

Sheet "format_infos"

The first sheet, "format_infos", contains the name of the definition data format (column "format_name") and the indication of the format's version (column "format_version").

Even if the column "format_name" does not need a specific explanation, this parameter is the opportunity to introduce the notion of code style or notation style. Good writing style is like using correct punctuation: you can manage without it, but it sure makes things easier to read and especially avoid conflicts. Several rules and framework exist for maximization compatibility of your text. Here, we are only speaking about the most common advice:

Use significant words: use only short words, not sentence, which is representative of your global idea. For example, a good format name should inform the reader of the field of the definition data format without having to read it.
Use short sentences: shorter it will be easier to understand you text will be!
Avoid specific characters (, ; . : ! ? % & ( ) # / * etc.), diacritics (accent, cedilla, etc.) and empty spaces (replace it by _). These kinds of elements should bring incompatibility and crash functions or code launch.

Dealing with the colum "format_version", there's no imposed rule, just like the writing style, but here we decide to use specific semantic versioning: .. The first number () is related to a major updated. It's when you have added many new features or conceptual changes impacted whose directly to the user interface (typically the new user interface is not compatible with the previous). The second number () is when you add functionality in a backwards-compatible manner. The third number () is related to bug resolutions and more preciously when you make backwards-compatible bug fixes. If you want to know more about this kind of thing, you should go on Semantic Versioning or X.Org.

You can modify the version throught the parameter "format_version" (default on 0.1.0) in the function "build_format_db".

Sheet "slot"

The second sheet, "slot", contain all the tables of your dataset. Each table is called a slot.

In the RECOLAPE datacall example, the structure of the slot is that:

| slot_name | mandatory | definition_table | |-----------|-----------|------------------| | effort | TRUE | effort_table | | landing | TRUE | landing_table | | sampling | TRUE | sampling_table |

We have 3 slots (=tables), each slot is mandatory (if not we should have FALSE as an argument) and the name of the slot in the definition data format is indicated in the definition_table column. We will see that in the next section, but all these slots need a sheet named according to the definition_table modalities.

Sheet "slots_hierarchy"

This sheet indicates is there being any relations between your slots. This information is related to the cardinality of your data. In database design and more precisely in the relation model, tables can be related as "one to many", "many to many", "one to one", "zero to many", ect... This kind of information leads very powerful constraints and ensure the consistency of your data.

For understanding this specification, which seems complicated for several people foreign in the field of databases. Look at our data as an example. Here we have 3 slots and potentially 0 relation between them or a maximum of 6 relations (if each slot is related to another, on both sides). Like we said before, we can have a lot kind of relationship between data, but we should consider (for simplification) two levels:

"One to many" relationship, this is a generalization of the major case (for example, a relation "one to one" is a specific case where many is equal to zero).
"Zero to many" relationship (this is the same argument as the relation before).

For better understand a relation between two, a solution is to try to make a sentence which explain the relation. Let take an example between the effort slot and the landing slot. In our case, an effort data could be related to no landing (if non-catch during a set for example) or several landings (if we have a partial landing). Relation between the effort slot and the landing slot is a "zero to many" type. Similarly, a landing should be associated with one and only one effort (if landing append that mean an effort related to). The relation between the landing and the effort is a "one to many" type (if we generalize, with many equal to one). In the definition data format, this relation can be specified like that:

| link | level_1 | level_2 | level_3 | level_4 | level_5 | level_6 | level_7 | level_8 | |----------------|--------------|-----------|---------|---------|---------|----------------|---------|---------| | landing_effort | flag_country | vessel_id | year | month | area | fishing_level7 | | |

To define a relation in the definition data format, you do not need to specify what kind of relation is but only direction of the link separated by "_" (here landing to effort). The different levels are all the variables involved in the relation.

For now, the following verification functions can use to check "one to many" relations. However the code was thought to be upgradable and it's possible to add more verification type (also specific verification like "one to one" associated with her constraints). Furthermore, the default template of the sheet make available 8 levels of hierarchies, but the function could manage more than 8. Just add columns and keep the current nomenclature ("level_x"). Finally, if your data does not present relationship (lucky you are!), just leave this sheet with the empty template.

With our data, we can underscore two more relations, "one to many":

| link | level_1 | level_2 | level_3 | level_4 | level_5 | level_6 | level_7 | level_8 | |------------------|--------------|-----------|-----------|---------|----------------|----------------|-------------|--------------| | sampling_landing | flag_country | vessel_id | year | area | species | fishing_level7 | | | | sampling_effort | flag_country | year | vessel_id | area | fishing_level7 | species | length_code | length_class |

Here we have:

A sampling could have one and only one landing associated ("one to many" relationship).
A sampling could have one or several effort data associated ("one to many" relationship in case of partial landings).

Sheet slot definition

The definition data format contains at least one slot definition in relation to slot define in the sheet "slot".

For our example, we would describe the slot definition of the slot "sampling". You could also find the slot definition of the other slot in the definition data format of RECOLAPE data call.

For the slot "sampling"" we have the slot definition "sampling_table":

| column_name |--------------------|------ | sampling_type | FALSE | flag_country | FALSE | year | FALSE | trip_code | FALSE | vessel_id | FALSE | nb_set | FALSE | day_at_sea | FALSE | sampling_method | FALSE | aggregation_level | FALSE | station_number | FALSE | catch_registration | FALSE | date | FALSE | area | FALSE | fishing_level6 | FALSE | fishing_level7 | FALSE | species | FALSE | catch_category | FALSE | weight | TRUE | weight_allspecies | FALSE | length_code | FALSE | length_class | FALSE | number_at_length | FALSE | nullable | mandatory | pk | type_name | category | ----|-----------|-------|-------------------------|----------| | TRUE | TRUE | sampling_type | codelist | | TRUE | TRUE | country_type | codelist | | TRUE | TRUE | year_type | numeric | | TRUE | FALSE | trip_code_type | text | | FALSE | TRUE | vessel_type | codelist | | FALSE | FALSE | nb_set_type | numeric | | TRUE | FALSE | day_sea_type | numeric | | TRUE | TRUE | sampling_method_type | codelist | | TRUE | FALSE | aggregation_level_type | codelist | | TRUE | TRUE | station_number_type | text | | TRUE | TRUE | catch_registration_type | codelist | | TRUE | FALSE | date_type | date | | TRUE | TRUE | area_type | text | | TRUE | FALSE | fishing_l6_type | codelist | | TRUE | TRUE | fishing_l7_type | codelist | | TRUE | TRUE | species_type | codelist | | TRUE | TRUE | catch_category_type | codelist | | TRUE | FALSE | weight_type | numeric | | TRUE | FALSE | weight_allspecies_type | numeric | | TRUE | TRUE | length_code_type | codelist | | TRUE | TRUE | length_class_type | numeric | | TRUE | FALSE | number_at_length_type | numeric |

Her we have 6 column:

"column_name", names of the column in our data.
"nullable", data in the column could be null (TRUE) or not (FALSE).
"mandatory", column are mandatory (TRUE) or not (FALSE).
"pk", the column is a primary key (TRUE) or not (FALSE). In relational databases, a primary key is one or a concatenation of variables that uniquely specify an element (a row) in a slot. A primary key could not be repeated and have to be unique. This is a very important parameter in the database model.
"type_name", this is the name of the type of data present in the associated column.
"category", category of data present in the associated column.

All of this information will be used for different checks and verification (see section below).

Categories definitions

In the previous sheet, we have defined several categories types. These categories apply specifications on data associated.

For now, there are 5 categories possible: codelist, numeric, logical, text and date.

Codelist category

In our example, we have several codelist category. A codelist is a list of codes or meanings that represent the only allowed values for a particular data item. All the codelist type are referenced in the sheet "codelist_types" and link column "type_name" of the different sheet slot definition with a column "enumeration_table" and a sheet "enumeration_table". This last sheet contains all the codes/meanings of the codelist associated and a description of it.

On the "sampling_table" use before as an example (the first table is the "codelist_types" and the second table are a focus on "codelist_specie"):

| type_name | enumeration_table | |-------------------------|-----------------------------| | sampling_type | codelist_sampling_type | | country_type | codelist_country | | vessel_type | codelist_vessel | | sampling_method_type | codelist_sampling_method | | aggregation_level_type | codelist_agregation_level | | catch_registration_type | codelist_catch_registration | | fishing_l6_type | codelist_fishing_level6 | | fishing_l7_type | codelist_fishing_level7 | | species_type | codelist_specie | | catch_category_type | codelist_catch_category | | length_code_type | codelist_lenght_code |

| code | description | |------|--------------------| | SWO | Xiphias gladius | | YFT | Thunnus albacares | | SKJ | Katsuwonus pelamis | | BET | Thunnus obesus |

Warning! To increase the generalization of this package and is appropriation by everyone, a choice was made to let the user complete and update the template of the definition data format. In return, it's very important to keep all the structure of the sheet (column name for example). If you add another codelist sheet (in our example we should have 11 codelist sheets, one for each "type_name") be careful to use the same template as the sheet "codelist_example" in the empty definition data format.

Numeric category

In our example, we have 7 numeric categories:

| type_name | is_integer | min | max | |------------------------|------------|------|------| | year_type | TRUE | 1950 | 2018 | | nb_set_type | TRUE | 1 | 50 | | day_sea_type | FALSE | 1 | 90 | | weight_type | FALSE | 10 | 5000 | | weight_allspecies_type | FALSE | 10 | 5000 | | length_class_type | TRUE | 10 | 300 | | number_at_length_type | FALSE | 0,1 | 500 |

Like in the codelist category before, all information about this category is referenced in the sheet "numeric_types" (with a link between with the sheet slot definition throught "type_name"). Furthermore, we have 3 new column:

"is_integer", does data stock in the column are integers (decimals not allowed, TRUE) or not (FALSE)?
"min", minmum value (included) of our data. You have to fill NA in the cell if not applicable.
"max", maximum value (included) of our data. You have to fill NA in the cell if not applicable.

Logical category

A logical argument in R only contains TRUE or FALSE values. In R:

TRUE values could be T, TRUE, True or true.
FALSE values could be F, FALSE, False or false.

Like before, all information about the logical categories are referenced in the sheet "logical_types".

In our example, we do not have the kind of category but the template of the sheet "logical_types" could be like that:

| type_name | |-------------| | yes_no_type |

In the sheet, we see that we have only the information of the "type_name" (link to the sheet slot definition associated). We could think that this sheet should not be relevant because we do not perform any check, except the verification of logical format and we could do that directly from the information of the sheet slot definition associated. This chooses was made from a perspective of the evolution of the package: a new function/check could be easier incremented and not need, a priori, a modification of the definition data format.

Text category

All the information about the text categories are referenced in the sheet "text_types".

For our example, we have these data referenced as text:

| type_name | |---------------------| | area_type | | trip_code_type | | station_number_type |

Like in the logical category and for the same reason, we have only the information of the "type_name".

Date category

The last category is the date category. In our example, we have one data category:

| type_name | time_zone_utc | format_1 | format_2 | format_3 | format_4 | |-----------|---------------|----------|----------|----------|----------| | date_type | TRUE | ymd | ym | yQq | ymd_HMS |

We can find:

"data_type", link with the sheet slot definition.
"time_zone_utc", data are stock in UTC (Coordinated Universal Time, TRUE) or not (FALSE). For now all the data in date format have to be in UTC (by default in the verification function associated). However, it could be possible, if necessary, to update the function and add a dynamic parameter for a specific time zone.
"format_x", specification of the date format. The date verification use the function "parse_date_time" of the package lubridate. This function have several specific formats and inherited formats in relation to the function "strptime" of the base package. For more details and format specifications, you could see the help of two functions. Furthermore, the list below contains the most common format used with large pelagic data:
- y or Y, the first one is year without century (00-99 or 0-99) and the second is year with century (only years 0:9999 are accepted). In the function used here, year matches with century so you can use either indifferently.
- m, month as decimal number (01-12 or 1-12).
- b, abbreviated or full-month names in the current locale. Be careful because these parameters could be affected by R options for localization. Here only English month names are understood.
- d, day of the month as decimal number (01-31 or 0-31).
- H, hours as decimal number (00-24 or 0-24).
- M, minute as decimal number (00–59 or 0–59).
- S, second as decimal number (00–61 or 0–61), allowing for up to two leap seconds (but POSIX-compliant in R implementations will ignore leap seconds).
- q, quarter of the month (1-4). A specification was made for format composed format with quarter. For now, the most common was implemented and should follow template like "yQq" or "yqq" (with always the last q the quarter of the month, for example 2018Q1 or 2018q1).

Like sheet "slots_hierarchy", you can add any number of date format (incrementing by one the formats). Be sure to leave column empty if you do not want to use it (or delete it).

OB7-IRD/dqassess documentation built on June 6, 2019, 10 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

OB7-IRD/dqassess
Data Quality ASSESSment

In OB7-IRD/dqassess: Data Quality ASSESSment

Package dqassess: definition data format, description and utilization

Using a definition data format

Definition data format architecture

Sheet "format_infos"

Sheet "slot"

Sheet "slots_hierarchy"

Sheet slot definition

Categories definitions

Codelist category

Numeric category

Logical category

Text category

Date category

R Package Documentation

Browse R Packages

We want your feedback!

OB7-IRD/dqassess Data Quality ASSESSment

In OB7-IRD/dqassess: Data Quality ASSESSment

Package dqassess: definition data format, description and utilization

Using a definition data format

Definition data format architecture

Sheet "format_infos"

Sheet "slot"

Sheet "slots_hierarchy"

Sheet slot definition

Categories definitions

Codelist category

Numeric category

Logical category

Text category

Date category

R Package Documentation

Browse R Packages

We want your feedback!

OB7-IRD/dqassess
Data Quality ASSESSment