Fishery data have to be inspected to detect errors before they are used in the stock assessment procedure [@chen2003]. Finding and correcting errors is one of the first tasks to perform on a dataset. Currently these checks are made at the national level, mainly with manual methods (based on graphs and numerical summaries, see @vigneau2007 for an example). When the amount of data is large (as it will become with the implementation of the regional database), manual methods are (1) too time-consuming and (2) difficult to track over time (when and how the corrections were made). Hence automated procedures are needed, and they contribute directly to building data quality.
This document is a tutorial on the use of the R fishPifct package to assess the quality of fishery sampling data.
The fishPifct package was developed for work package 4 of the fishPi project (project DG-MARE 2014/19 WP4).
Its main objective is to provide end users with a framework to assess the quality of fishery sampling data.
This framework concerns sampling data and relies on the csPi format for the data structure.
The csPi format is a format under development, based on the fishFrame format.
The fishFrame format is used in the Regional Database and by the COST packages (a collection of tools to deal with data compilation, @cost2009). Its complete definition is given in @fishframe2009.
An update of this format, called csPi, has been under discussion since 2014 [@wkrdb2014], and version 2.1 of this format is used in this report.
Tools to export fishFrame files to csPi are provided.
To ensure a wide range of applications, most of the package's functions work on both csPi and fishFrame objects.
This framework follows the recommendations of the reproducible research statement [@gentleman2004]. Consequently this report is self-contained: the code used to process and analyse the sampling data is embedded in the report itself. An effort was made to select computing tools that give users the ability to reproduce the analyses using only a computer and an internet connection (mainly for installation purposes). Therefore all the tools are open source software, available free of charge, and running on the three main operating systems available nowadays (Linux, Windows and Mac OS).
Coding and analyses are carried out using the R environment [@r2016]. R is a free software environment for statistical computing and graphics. The reproducibility of the results presented in this report relies on Pandoc's dialect of the Markdown language for word processing, combined with the knitr R package. Markdown is a plain text formatting syntax designed so that it can optionally be converted to HTML using a tool by the same name. Pandoc is a document converter whose Markdown dialect extends the conversion capability to word processing files (docx and odt), HTML and PDF, among other formats. Pandoc understands a number of Markdown syntax extensions, including document metadata (title, author, date), footnotes, tables, figures and references. knitr is an R package (a set of functions extending the R capabilities). With this package, the R code used to process and analyze the data is included directly in the report. Results are then produced dynamically. This framework has demonstrated the capacity to improve the conduct and the presentation of data analysis in a way that another person can understand and replicate [@baumer2014].
For example, if the computation of 1+1 is needed, the code to compute it is written in the report inside a specially delimited code chunk, as in this simple example:
'''{r test00,warn=FALSE,cache=TRUE,echo=TRUE}
#comment: addition example.
1+1
'''
This code is evaluated during the compilation of the report by the knitr command and it prints the following result:
#comment: addition example.
1+1
The result is 2. In this tutorial all the numerical values, tables and figures are produced following this procedure. The scripts and the report can be written in a single integrated development environment (IDE) called RStudio. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, debugging and writing reports. Consequently, all the tools and code presented here are already available to the end user.
This package is available in the fishPifct repository on Github.
The installation procedure is as simple as:
install.packages("devtools")
library(devtools)
install_github("ldbk/fishPifct")
Technical support for the installation process (R version, missing packages...) is beyond the scope of this tutorial. In case of problems, please contact your IT support.
Some users reported issues with the installation of the openxlsx package (needed to import and export csPi and csData objects as Excel files). Please read carefully the error messages R gives you: they explain how to fix these errors. The usual procedure to fix them should be something like:
install.packages("installr")
installr::installr("Rtools")
During the installation, tick the PATH modification option. Then, restart your computer.
If needed, the COST-related packages (for Windows) can be found here :
and here for Unix systems :
The COST manual can be downloaded here :
In this tutorial, only the main characteristics of this format are illustrated.
A detailed version of the csPi format specification is given in @wkrdb2014 and in the help page of the csPi function.
csPi is an S4 object containing 10 slots:
library(pander)
library(fishPifct)
pander(format_definition_csPi$slots,split.table=Inf)
The slots desc, popData and design are not mandatory and serve as descriptive fields for future applications.
The slot classVersion provides the version number of the csPi format. This format is still in development, and recording the format version will ensure backward compatibility with future developments of the package.
The remaining slots hold the sampling information: the description of the sampling events (se), the trip information (tr), the haul characteristics (hh), the species sampled (sl) and the corresponding length measurements (hl), and the biological parameters (ca).
Each of these slots is a data.frame that lists the different parameters requested for each sample.
The type of the vessel, its characteristics, the fishing location and the quantity landed, the scientific name of the sampled species, the length class of the fish, the age, etc. are reported in these tables. These variables can be numeric, text or codelist.
For each table, a group of variables represents the primary key and ensures the links with the other tables.
The next figure gives an overview of the structure of the tables.
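To make the role of the primary key concrete, here is a minimal base R sketch on toy tables. It does not use the fishPifct package: the tables `toy_tr` and `toy_hh`, their reduced set of columns (including `staNum` as part of the haul key) and the helper `has_unique_key` are assumptions made for illustration only.

```r
# Toy trip and haul tables with a reduced, assumed set of columns
toy_tr <- data.frame(trpCode = c("T1", "T2", "T3"),
                     year    = c(2014L, 2014L, 2015L),
                     stringsAsFactors = FALSE)
toy_hh <- data.frame(trpCode = c("T1", "T1", "T2"),
                     staNum  = c(1L, 2L, 1L),
                     foDur   = c(120, 95, 240),
                     stringsAsFactors = FALSE)

# Hypothetical helper: a primary key is valid when no combination
# of key values appears more than once
has_unique_key <- function(tab, key) {
  !any(duplicated(tab[, key, drop = FALSE]))
}
tr_key_ok <- has_unique_key(toy_tr, "trpCode")
hh_key_ok <- has_unique_key(toy_hh, c("trpCode", "staNum"))

# The shared key column links each haul to its trip
hh_with_trip <- merge(toy_hh, toy_tr, by = "trpCode")
```

In this sketch the trip T3 has no haul, so it does not appear in `hh_with_trip`; this kind of mismatch is what the consistency checks described later are meant to report.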
The data are generated from the sole dataset coming from the COST package. The fishFrame COST format is exported to the csPi format using the function csDataTocsPi:
library(fishPifct)
data(sole)
sole <- csDataTocsPi(sole.cs)
head(sole)
The csPi object is named sole in our example.
A collection of methods and functions gives the user the ability to explore and visualize a csPi object:
methods(class="csPi")
#subset csPi object
#?csSubset
Their behaviour is similar to that of the generic functions (e.g. dim gives the dimensions of all the csPi slots).
Fishery data usually reside in national databases. Correcting data in these systems can be tedious and is poorly suited to quick corrections (during working groups, or to harmonize datasets between countries, for example).
Manual corrections are difficult, and spreadsheets are nowadays the common tool for correcting data locally.
A local import/export procedure is available to export the csPi object to an Excel file.
The user can thus make corrections to the tables in a spreadsheet and then import the corrected tables directly into a csPi object in R.
In this package, the import and export functions make these transformations easy:
export(sole,file="sole.xlsx",type="xlsx")
#use a spreadsheet to open the sole.xlsx file and do some corrections if needed
#save the file, and import it in R with:
solecorrected<-importxlsx(file="sole.xlsx")
A fundamental step in data quality assessment is to check the structure of the data. The structure check verifies, in order, the slots, the tables they contain, and the type of each variable.
The data structure definition for csPi objects is given by the list format_definition_csPi.
This list is built from the Excel file format_definition_csPi.xlsx in the data directory of the installation directory of the package. Providing the Excel file gives the end user the possibility to modify the data structure check (for example the lower and upper limits of the length class, or a restricted list of metiers).
A slot definition is a table reporting the characteristics of a given slot :
| slot_name | mandatory | definition_table |
|-----------|-----------|------------------|
| base      | TRUE      | slot_base        |
Here the slot named base is mandatory and its definition is given by the table slot_base. During the structure check, each slot is checked against the definition given in the structure definition list.
A table definition is a table reporting the characteristics of a given table.
For example, here are the first 8 lines of the tr table definition:
library(pander)
library(fishPifct)
pander(format_definition_csPi$slot_tr[1:8,],split.table=Inf)
Each column of the table is checked against its definition. For example, the trpCode variable is non-nullable, mandatory, and part of the primary key of the tr table. It is a text variable (category), and its category definition is referenced in the type_trpCode element of the definition file (or the Excel sheet with this name).
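The following base R sketch illustrates what checking a table against its definition can look like. It is not the package's implementation: the definition table `toy_definition`, its column names (`name`, `mandatory`, `nullable`) and the toy data are assumptions chosen for the example.

```r
# Hypothetical definition table: one row per expected column
toy_definition <- data.frame(
  name      = c("trpCode", "year", "vslFlagCtry"),
  mandatory = c(TRUE, TRUE, TRUE),
  nullable  = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Toy table to validate: one column is absent, one key value is missing
toy_tr <- data.frame(trpCode = c("T1", "T2", NA),
                     year    = c(2014L, 2014L, 2015L),
                     stringsAsFactors = FALSE)

# Mandatory columns that are absent from the table
missing_cols <- setdiff(toy_definition$name[toy_definition$mandatory],
                        names(toy_tr))

# Non-nullable columns that are present but contain missing values
present <- toy_definition$name[!toy_definition$nullable &
                                 toy_definition$name %in% names(toy_tr)]
has_na <- vapply(present, function(v) anyNA(toy_tr[[v]]), logical(1))
cols_with_na <- names(has_na)[has_na]
```

Here `missing_cols` lists the mandatory column that is absent (`vslFlagCtry`) and `cols_with_na` lists the non-nullable column holding a missing value (`trpCode`).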
After the table definition, each variable is checked according to its type.
For example, in the previous table vslFlagCtry is non-nullable, mandatory and not included in the primary key. The variable's type is a codelist, and the corresponding authorized values are registered in the codelist_type list of the format description, namely the list codelist_ctry (here the first 10 lines):
pander(format_definition_csPi$codelist_ctry[1:10,],split.table=Inf)
This list contains the ISO 3166-1 alpha-3 country codes. Restricting this list strengthens the data quality check, according to the end user's needs.
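A codelist check amounts to testing whether each value belongs to the authorized set. The sketch below shows the idea in base R on made-up values; the restricted list `toy_allowed` and the vector `toy_flag` are assumptions for illustration and are not taken from the package.

```r
# Assumed restricted list of ISO 3166-1 alpha-3 codes
toy_allowed <- c("BEL", "DNK", "FRA", "GBR", "NLD")

# Made-up flag countries, two of which are not valid alpha-3 codes
toy_flag <- c("FRA", "GBR", "FR", "NLD", "fra")

# Values outside the codelist are reported with their row number
bad <- !(toy_flag %in% toy_allowed)
rejected <- data.frame(row   = which(bad),
                       value = toy_flag[bad],
                       stringsAsFactors = FALSE)
```

The comparison is case-sensitive, so a lower-case code such as "fra" is rejected as well.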
For variables with a numeric type, the numeric_type list of the data definition gives the numerical limits and states whether the numbers are integer (number of samples, age...) or real (probability...).
For example, here are the first 8 lines of the numeric_type table definition:
library(pander)
library(fishPifct)
pander(format_definition_csPi$numeric_type[1:8,],split.table=Inf)
In this example, year is an integer between 1900 and 2020.
As previously stated, the data structure definition can be modified to suit the end user's needs.
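A numeric check of this kind can be sketched in a few lines of base R. The definition row `toy_def`, its column names and the helper `check_numeric` are hypothetical; the limits for year are the ones quoted above. For simplicity, missing values are counted as failures here, whereas the package treats nullability as a separate property.

```r
# Hypothetical one-row numeric definition for the variable year
toy_def <- data.frame(name = "year", min = 1900, max = 2020,
                      integer = TRUE, stringsAsFactors = FALSE)

# Made-up values: out of range, non-integer and missing cases included
toy_year <- c(2014, 2015, 1850, 2016.5, NA)

# Hypothetical helper: TRUE when a value is within the limits and,
# if the definition requires it, is a whole number
check_numeric <- function(x, def) {
  in_range <- !is.na(x) & x >= def$min & x <= def$max
  is_whole <- !is.na(x) & x == round(x)
  in_range & (is_whole | !def$integer)
}

year_ok      <- check_numeric(toy_year, toy_def)
failing_rows <- which(!year_ok)
```

With these toy values the third, fourth and fifth entries fail: one is below the lower limit, one is not an integer and one is missing.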
The data structure checks were developed for fishFrame objects by the sister project of fishPi covering the Mediterranean area (https://git.outils-is.ird.fr/billet/SDEFQuality/wikis/home). Consequently, this data structure check is applicable to any object structure, and it could be extended to landings or effort files in the near future, for example.
The results of the data structure checks are given in a report summarizing all the checks, whether each check passes and, if not, why.
Using the sole dataset previously loaded:
#generating a report in an R object
structurecheck<-validateData(obj=sole,formatDb=format_definition_csPi,report="list")
The meta information related to the check is:
pander(structurecheck$meta,split.table=Inf)
The first 10 lines of the slot checks are:
pander(structurecheck$struct[1:10,],split.table=Inf)
The first 10 lines of the variable checks are:
pander(structurecheck$data[1:10,],split.table=Inf)
The tables are self-explanatory and do not need any comment. To generate a complete report in pdf or html format:
#generating a pdf report
renderValidationReport(obj=sole,formatDb=format_definition_csPi,
                       title="test",reportFormat="pdf")
#a copy of this report can be found in
system.file('data', 'dataValidationReport_20160518_235140_29d51c808f9b.pdf', package='fishPifct')
In this section, the consistency of the information between the csPi slots is checked, e.g. the identification of trips without fishing operations.
To do so, the function consistency performs hierarchical anti-joins between related tables and generates a simple table reporting the trpCode values that have to be checked between the tables:
#consistency check
consistencycheck<-consistency(sole)
pander(consistencycheck,split.table=Inf)
In our sole example, no consistency errors were detected.
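An anti-join keeps the keys of one table that have no match in another. The base R sketch below shows the principle on toy tables with a single key column; it is only an illustration under these assumptions, and the helper `anti_join_keys` is hypothetical rather than part of the package.

```r
# Toy trip and haul tables; T3 has no haul and T9 has no trip
toy_tr <- data.frame(trpCode = c("T1", "T2", "T3"),
                     stringsAsFactors = FALSE)
toy_hh <- data.frame(trpCode = c("T1", "T1", "T2", "T9"),
                     staNum  = c(1L, 2L, 1L, 1L),
                     stringsAsFactors = FALSE)

# Hypothetical helper: key values of x that never occur in y
anti_join_keys <- function(x, y, key) {
  setdiff(unique(x[[key]]), y[[key]])
}

trips_without_hauls <- anti_join_keys(toy_tr, toy_hh, "trpCode")
hauls_without_trip  <- anti_join_keys(toy_hh, toy_tr, "trpCode")
```

Running the anti-join in both directions catches parents without children (trips without hauls) as well as orphan records (hauls attached to an unknown trip).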
The literature on outliers is extensive and covers all areas of science, but determining whether or not an observation is an outlier is ultimately a subjective exercise, which makes automation a difficult task.
Here we use the definition of an outlier given by @barnett1994:
"an outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs".
Outlier detection methods can be divided into univariate methods (looking at only one variable) and multivariate methods (looking at more than one variable and their relationships). For example, univariate methods can spot landings reported in tons instead of kilos, while multivariate methods can identify wrong weights in a size-weight relationship.
Outlier detection methods can also be categorized into parametric (statistical) methods and non-parametric methods that are model-free.
Parametric methods either assume a known underlying distribution of the data or, at least, are based on statistical estimates of unknown distribution parameters. Observations that deviate from the model assumptions are flagged as outliers.
Here we focus on two generic non-parametric methods, for numerical and non-numerical univariate data.
The function outliers performs the outlier detection for a csPi object.
The adjusted outlyingness index is used to detect outliers. It is a non-parametric method adapted to skewed data.
The function adjOutlyingness of the package robustbase is used. More details on this method can be found in the help page of this function.
An example on the fishing duration (variable foDur of slot hh):
tabaoutlier<-outliers(sole,slot="hh",var="foDur")
High fishing duration values are flagged as outliers, as presented in the figure. The function output gives the user the complete lines that include the outliers:
#10 first lines and 5 first columns of the outliers
pander(tabaoutlier[1:10,1:5],split.table=Inf)
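To give a feel for univariate, model-free outlier flagging without installing anything, the sketch below swaps in a simpler technique than the one used by the package: Tukey's boxplot rule from base R (`boxplot.stats`), which, unlike the adjusted outlyingness, does not correct for skewness. The durations are made-up values.

```r
# Made-up fishing durations in minutes, one suspiciously large
toy_foDur <- c(120, 180, 180, 240, 240, 240, 300, 300, 360, 2880)

# Tukey's rule: values beyond 1.5 times the inter-hinge range
# from the hinges are reported in the 'out' component
flagged_values <- boxplot.stats(toy_foDur, coef = 1.5)$out

# Row numbers of the flagged observations
flagged_rows <- which(toy_foDur %in% flagged_values)
```

On strongly skewed variables this simpler rule tends to flag too many values in the long tail, which is the motivation for a skewness-adjusted method.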
For a non-numerical variable, outliers are detected using the frequency of occurrence of each modality, expressed as a percentage, and a threshold (1% by default). If a modality occurs less often than this threshold, it is flagged as an outlier. The threshold can be set by the user. Here is an example using the statistical rectangle fished:
tabaoutlier<-outliers(sole,slot="hh",var="rect")
Rarely fished rectangles are flagged as outliers. The function output gives the user the complete lines that include the outliers:
#10 first lines and 5 first columns of the outliers
pander(tabaoutlier[1:10,1:5],split.table=Inf)
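The frequency rule for non-numerical variables can be written in a few lines of base R. This is a sketch of the principle on made-up rectangle codes, not the code of the outliers function; the vector `toy_rect` and the variable names are assumptions.

```r
# Made-up statistical rectangles: 200 hauls, one in a rare rectangle
toy_rect <- c(rep("30E9", 120), rep("31F0", 79), "45G2")

# Frequency of each modality, as a percentage of all observations
freq_pct <- 100 * prop.table(table(toy_rect))

# Modalities occurring less often than the threshold (1% as default)
threshold <- 1
rare_modalities <- names(freq_pct)[as.vector(freq_pct) < threshold]

# Row numbers of the observations to inspect
flagged_rows <- which(toy_rect %in% rare_modalities)
```

A rare modality is not necessarily an error: the rule only points at the lines worth a second look.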
These functions are generic and can be applied to the whole set of variables of a csPi object.
The function csMap lets the user easily map a variable over the ICES statistical rectangles.
csMap(sole,"hh","foDur","tile","sum")
csMap(sole,"hh","foCatEu5","bubble")
csMap(sole,"hl","lenNum","tile","sum")
csMap(sole,"tr","daysAtSea","bubble","sum")
csMap(sole,"tr","vslId","bubble","n_distinct")
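Before a variable can be drawn on a map, it has to be summarized per statistical rectangle. The base R sketch below illustrates the two kinds of summary named in the calls above (a sum and a count of distinct values) on a toy flat table; the table `toy_hauls` and its layout are assumptions, and the sketch does not reproduce how csMap joins the slots or draws the map.

```r
# Toy flat table: one row per haul, with rectangle, vessel and duration
toy_hauls <- data.frame(
  rect  = c("30E9", "30E9", "31F0", "31F0", "31F0"),
  vslId = c("V1", "V2", "V1", "V1", "V3"),
  foDur = c(120, 95, 240, 60, 180),
  stringsAsFactors = FALSE
)

# Total fishing duration per rectangle
sum_by_rect <- tapply(toy_hauls$foDur, toy_hauls$rect, sum)

# Number of distinct vessels per rectangle
vessels_by_rect <- tapply(toy_hauls$vslId, toy_hauls$rect,
                          function(v) length(unique(v)))
```

Each resulting value corresponds to one rectangle, which is the quantity a tile or a bubble would then represent.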
The function lengthHist plots histograms of the length frequency data from the hl table of a csPi object.
lengthHist(sole)
The function agelenPlot plots age given length from the ca table of a csPi object.
agelenPlot(sole)
This vignette was built using the vignette engine knitr::rmarkdown in the knitr package. You can find the source in the fishPifct repository on Github or, if the fishPifct package is installed on your computer, with:
system.file('doc', 'tutorial.Rmd', package='fishPifct')