This section explains the methodology and the process behind the checks and controls implemented in the package dqassess.
For now, all the source code of the package is stored on GitHub, a web-based hosting service for version control and change tracking. First, you need to install and load the library in R (you need an internet connection):
    # Devtools is a necessary package
    # If it is not installed, run the following line
    install.packages("devtools")
    # Install the package from GitHub
    devtools::install_github("https://github.com/OB7-IRD/dqassess.git",
                             build_opts = c("--no-resave-data", "--no-manual"))
    # Load the library
    library(dqassess)
    # You can access the package documentation with the following line
    ?dqassess
    # If you want the documentation of a specific package function use the same syntax,
    # for example for the function build_template_format_db
    ?build_template_format_db
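The installation step can be made repeatable by installing devtools only when it is missing. The following is a minimal sketch using base R; the helper names `needs_install` and `install_dqassess` are hypothetical and are not part of dqassess.

```r
# Return TRUE when a package is not yet installed locally.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Install devtools only if it is missing, then install dqassess from GitHub.
# Wrapped in a function so that nothing is downloaded until you call it.
install_dqassess <- function() {
  if (needs_install("devtools")) {
    install.packages("devtools")
  }
  devtools::install_github(
    "https://github.com/OB7-IRD/dqassess.git",
    build_opts = c("--no-resave-data", "--no-manual")
  )
}
```

Call `install_dqassess()` once, then load the package with `library(dqassess)` as shown above.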
As an example of controls, we use a fictitious dataset built to be close to the RECOLAPE data call. This dataset is stored in the data directory of the package source and is composed of 3 files:
Several errors were introduced in the dataset to provide a range of different output reports and their explanations. The errors are highlighted in red in the two Excel files.
The definition data format used was built according to the data call of the RECOLAPE project (you can find it here). For confidentiality reasons we cannot give full access to the project data, but the whole package was tested on them.
To launch the checking of data, you have to run the following lines:
result_checking <- checking_data(obj, format_db, ignore_case_in_codelist, report, report_dir, text_file_sep, text_file_dec, file_name_slot)
As explained in the function documentation (run ?checking_data in the R console), you have to fill in 8 parameters:
In the following sections, we will run different scenarios and check in detail the output report.
For the first example, we use a dataset composed of 2 slots (effort and landing) from an xlsx file (test_fictive_data1.xlsx). With this data we use the definition data format built during the RECOLAPE project (recolape_definition_data_format.xlsx).
Now, run the following lines (you need to adapt parameters, especially paths, to your configuration):
result_checking1 <- checking_data(obj = "path_test_fictive_data1.xlsx", format_db = "path_recolape_definition_data_format.xlsx", report_dir = "path_output_directory")
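Since the paths have to be adapted to your configuration, it can help to build them in one place and to check that the input files exist before calling the function. This is only a sketch: the folder `path/to/example_files` is a placeholder, and using a temporary folder as report directory is an assumption made for the example.

```r
# Hypothetical folder holding local copies of the example files; adapt it.
data_dir <- file.path("path", "to", "example_files")

obj_path <- file.path(data_dir, "test_fictive_data1.xlsx")
format_path <- file.path(data_dir, "recolape_definition_data_format.xlsx")

# Output directory for the reports (a temporary folder here, as an assumption).
report_dir <- file.path(tempdir(), "dqassess_reports")
dir.create(report_dir, showWarnings = FALSE, recursive = TRUE)

# Check that the inputs exist before calling the package.
input_paths <- c(obj = obj_path, format_db = format_path)
missing_inputs <- input_paths[!file.exists(input_paths)]

if (length(missing_inputs) > 0) {
  message("Missing input file(s): ", paste(missing_inputs, collapse = ", "))
} else {
  result_checking1 <- dqassess::checking_data(
    obj = obj_path,
    format_db = format_path,
    report_dir = report_dir
  )
}
```

With the placeholder folder left unchanged, the script only prints the list of missing files and does not call `checking_data`.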
It is very important to look at the R console to see what happens and whether the function returns information, warnings or errors. For our example, you should see this in the R console:
    (INFO) Empty enumeration table (code list) "codelist_vessel" in the format definition file.
    (INFO) Empty enumeration table (code list) "codelist_vessel" in the format definition file.
    (INFO) Empty enumeration table (code list) "codelist_vessel" in the format definition file.
    Correct import of the format file definition
    Correct import of data
    Slot effort found
    Checking in progress, be patient or take a coffee
    Slot landing found
    Checking in progress, be patient or take a coffee
    Slot sampling not found
The first line gives us some information. When you pass a definition data format to the function, it is verified through a sub-function, "read_format_db". This sub-function checks whether the definition data format is valid. Here, the sheet "codelist_vessel" is empty but is cited in a slot definition. Furthermore, the information is repeated 3 times. If we check in our file, we can see that "codelist_vessel" is specified in 3 slot definitions: effort_table, landing_table and sampling_table.
Lines 4 and 5 say that the function successfully imported a definition data format and the data.
The following lines explain to users what the function is doing. Here we can see that 2 slots were found (effort and landing) and 1 was missing (sampling). Indeed, the data file provided contains only these 2 slots. This case could happen, for example, when you have a very large dataset. If you run all your data through the function at once, you could saturate R and crash it. It could be a better option to split your data into multiple datasets (we will see that in the second example).
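One way to split the data is to write each slot to its own csv file and to check the files one after the other. The sketch below uses base R only; the slot tables and their column names are invented for illustration, and the commented call assumes that `file_name_slot` takes the slot name, as in the second example.

```r
# Hypothetical slot tables; in practice these would be your own data.
slots <- list(
  effort = data.frame(vessel = c("V1", "V2"), days_at_sea = c(10, 12)),
  landing = data.frame(vessel = c("V1", "V2"), weight_kg = c(500, 750))
)

# Folder receiving one csv file per slot (a temporary folder, as an assumption).
out_dir <- file.path(tempdir(), "dqassess_split")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Write each slot to "<slot name>.csv" and keep the resulting paths.
slot_files <- vapply(names(slots), function(slot_name) {
  path <- file.path(out_dir, paste0(slot_name, ".csv"))
  write.csv(slots[[slot_name]], path, row.names = FALSE)
  path
}, character(1))

# Each file could then be checked in turn, for example:
# for (slot_name in names(slot_files)) {
#   dqassess::checking_data(obj = slot_files[[slot_name]],
#                           format_db = "path_recolape_definition_data_format.xlsx",
#                           report_dir = "path_output_directory",
#                           file_name_slot = slot_name)
# }
```

`write.csv` writes a comma separator and a dot as decimal mark, so the parameters `text_file_sep` and `text_file_dec` may need to be set accordingly.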
Now let's check our outputs. They are in the reporting directory (specified in the parameter "report_dir" of the function call above) and their names are built by concatenating the data name, the time when the function began to run and information on the overall content of the file. In any case, there are 4 kinds of csv files:
Structure and data reports have the same file structure:
In this example, all the mistakes included in our data are highlighted. Look at the output reports str.csv and data.csv to identify where the problems are and, if necessary, use the slot report to focus on the associated data.
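To find the reports quickly, the files in the reporting directory can be listed and sorted by kind. This sketch assumes that report names end in meta.csv, str.csv, data.csv or slot_<name>.csv; the function `classify_reports` and the demonstration file names are hypothetical.

```r
# Classify report files by the suffix of their name.
# Assumption: names end in meta.csv, str.csv, data.csv or slot_<name>.csv.
classify_reports <- function(report_dir) {
  files <- basename(list.files(report_dir, pattern = "\\.csv$"))
  kind <- rep("other", length(files))
  kind[grepl("meta\\.csv$", files)] <- "meta"
  kind[grepl("str\\.csv$", files)] <- "structure"
  kind[grepl("data\\.csv$", files)] <- "data"
  # Slot reports are tested last so that "slot_..." always wins.
  kind[grepl("slot_.*\\.csv$", files)] <- "slot"
  data.frame(file = files, kind = kind, stringsAsFactors = FALSE)
}

# Demonstration on empty placeholder files in a temporary folder.
demo_dir <- file.path(tempdir(), "dqassess_demo_reports")
dir.create(demo_dir, showWarnings = FALSE, recursive = TRUE)
demo_names <- c("run_meta.csv", "run_str.csv", "run_data.csv", "run_slot_effort.csv")
invisible(file.create(file.path(demo_dir, demo_names)))

reports <- classify_reports(demo_dir)
print(reports)
```

Replace `demo_dir` with your own `report_dir` to list the reports of a real run.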
For this second example, we use the same definition data format as before (recolape_definition_data_format.xlsx), but associated with data in a csv file (by analogy composed of one slot, "sampling").
For this example, run these lines in the R console:
result_checking2 <- checking_data(obj = "path_test_fictive_data2.csv", format_db = "path_recolape_definition_data_format.xlsx", report_dir = "path_output_directory", file_name_slot = "sampling")
As before, the R console provides useful information about what happens:
    (INFO) Empty enumeration table (code list) "codelist_vessel" in the format definition file.
    (INFO) Empty enumeration table (code list) "codelist_vessel" in the format definition file.
    (INFO) Empty enumeration table (code list) "codelist_vessel" in the format definition file.
    Correct import of the format file definition
    Correct import of data
    Slot effort not found
    Process for the next slot available
    Slot landing not found
    Process for the next slot available
    Slot sampling found
    Checking in progress, be patient or take a coffee
As expected, only the table "sampling" was found. For a description of the other outputs, refer to the previous section (Checking of the R console example 1).
In this case, we can see 4 outputs: meta.csv file, str.csv file, data.csv file and one file called slot_sampling.csv.
As in the previous example, the mistakes included in the "sampling" data are identified. For example, there is a problem with the codelist of the column "flag_country" (the code "TOF" is not valid according to the associated definition data format). As before, look at the output reports str.csv and data.csv to identify where the problems are and, if necessary, use the slot report to focus on the associated data.
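To make the idea of a codelist control concrete, the following sketch compares the values of a column with a list of allowed codes. It is an illustration in base R, not the implementation used by dqassess: the function `invalid_codes`, the code list and the column values are all hypothetical, and the `ignore_case` argument only mirrors what the parameter name `ignore_case_in_codelist` suggests.

```r
# Return the distinct values that do not appear in the list of allowed codes.
invalid_codes <- function(values, codelist, ignore_case = FALSE) {
  if (ignore_case) {
    not_allowed <- !(toupper(values) %in% toupper(codelist))
  } else {
    not_allowed <- !(values %in% codelist)
  }
  unique(values[not_allowed])
}

# Hypothetical code list and column values, for illustration only.
flag_country_codes <- c("FRA", "ESP", "PRT")
flag_country <- c("FRA", "TOF", "esp", "PRT")

# Case-sensitive comparison: "TOF" and "esp" are reported.
invalid_codes(flag_country, flag_country_codes)

# Case-insensitive comparison: only "TOF" is reported.
invalid_codes(flag_country, flag_country_codes, ignore_case = TRUE)
```

In this toy data, "TOF" is rejected in both cases because it never appears in the code list, while "esp" is accepted only when case is ignored.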