knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This R package is dedicated to the construction of fishery-dependent data container objects. The mechanism used to build such containers is generic, in order to make it easier to answer fishery-dependent data calls and to tackle the problem of providing the same information in different formats. The construction intrinsically ensures the quality of the data these objects contain. In this framework, the first part of this document addresses the definition of data quality. The document then provides some practical examples of the construction of specific data containers.
Fishery data are usually collected at the national level, following national work plans and using ad hoc infrastructures and database formats. Upon agreement, these data are then transmitted to the Regional Fishery Management Organizations (RFMOs). The RFMOs' data calls define the type of data countries have to provide, the form these pieces of information have to follow, how the data are transmitted and the deadlines.
This document focuses only on fishery-dependent data: data collected from the fishery, not those collected during scientific surveys or other activities not based on fishermen's activities. A complete description of the information collectable from the fisheries is outside the scope of this document. This document will focus mainly on fishery characteristics (vessel characteristics, fishing areas, fishing periods, metiers used, fishing effort), the catches (landings and discards of the species), and the population descriptors of the catches: the numbers at length, the numbers at age and some biological parameters (maturity, sex...) by species.
Fishery-dependent data are one of the inputs of stock assessment models. Besides this critical use, these data by themselves provide information about the states and behaviours of the fisheries: a standard prerequisite to understanding the fishing pressure on a stock and its influence on the stock dynamics. The demands on fishery-dependent data by RFMOs have increased in the past few years. For example, the data calls issued by the International Council for the Exploration of the Sea (ICES) increased threefold for the stocks identified as data-limited in 2016 and 2017. For such stocks, the data provided have to cover three years instead of one, logically multiplying the processing time of the information by three. Moreover, different RFMOs can ask for the same information in different formats. This is the case for the European countries having fishing fleets in the Mediterranean Sea: both the General Fisheries Commission for the Mediterranean (GFCM) and the Joint Research Centre (JRC, a research institution belonging to the European Union) request the same fishery-dependent data from these countries but through different data calls (deadlines and file formats differ profoundly).
The quality of a dataset is a vague concept. A definition of this concept has to mix philosophical and practical considerations. Philosophy helps to understand the link between reality (the fishery activities and their impact on stocks) and the data representing this process. Floridi (2005) discusses the link between semantic information and meaningful data in a general way. Practical considerations look first at the actual use of the data collected. Wang and Strong (1996) define "data quality" as data that are fit for use by data consumers. Between these two extreme views, this document relies on the data quality framework proposed by the GFCM (GFCM, 2019). This choice does not imply that this framework is better than any other, but it has the merit of being a complete one available for fishery data. Data quality indicators, their definitions and the corresponding practical implementation are defined on pages 82 and 83 of that document. There are seven indicators (timeliness, completeness, adequacy, conformity, stability, consistency and accuracy), which we redefine here in a broader sense.
The timeliness, completeness and adequacy indicators will not be considered in this document. Their assessment is specific to each data call and requires extra information that does not belong to the construction of the data container. However, considering the "fit for use" definition proposed by Wang and Strong (1996), they are probably the most important in terms of quality: if no data, partial data or inadequate data are provided to answer a data call, what is the meaning of the related data collection?
The conformity, stability, consistency and accuracy indicators are data quality indicators that can be verified while the data are prepared at the national level. Among these, conformity is intrinsically linked to the data container. For example, a conformity indicator that checks whether a landing value is a positive number implies that the column of the data table containing the landing value has to be defined as a positive number. If the data container of this data table is well defined, the user cannot enter anything other than a positive number for this variable.
This document will focus on the construction of data containers that are conformant by definition. Shorter examples will illustrate the implementation of some of the three other indicators (stability, consistency and accuracy). These indicators are less critical in terms of data quality. The stability indicator signals a drastic change in a parameter, but this change can be a real change in the fishery dynamics. This signal can be worth pointing out to the end-user, as stock assessment models can be sensitive to a significant variation in their input, but per se we consider stability as a secondary step in data quality assessment. The consistency indicator depends on the format of the data calls. It has to be checked in data calls where redundant information is requested. For example, if catches, landings and discards are all requested, the total catches have to be equal to the sum of the total landings and the total discards. This example seems trivial, but at the national level the answers to data calls can involve different institutes (to extend the previous example, institute A providing the discard estimates and institute B the census of the landings), and this kind of check is rarely done across institutes.
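The consistency check described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the three tables, their column names and the figures are hypothetical and do not come from any data call.

```r
# Hypothetical answers prepared for the same strata by two institutes.
landings <- data.frame(year = c(2016, 2017), landings = c(1200, 950))  # institute B
discards <- data.frame(year = c(2016, 2017), discards = c(130, 80))    # institute A
catches  <- data.frame(year = c(2016, 2017), catches  = c(1330, 1000)) # reported totals

# Join the three tables on their shared key, then compare the totals.
chk <- merge(merge(landings, discards, by = "year"), catches, by = "year")
chk$consistent <- abs(chk$catches - (chk$landings + chk$discards)) < 1e-6

# Rows that fail the consistency check (here, the year 2017).
chk[!chk$consistent, ]
```

A small numerical tolerance is used instead of strict equality because the totals may come from different estimation procedures and be stored as floating-point numbers.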
The accuracy indicator covers a large body of statistical practice. Like the stability indicator, its implementation can lead to false positives, because the definition of a realistic value of a fishery parameter in the context of a given data call is open to discussion. However, like stability, this indicator can alert the user to errors or to changes in the fishery or stock dynamics, but per se we consider accuracy as a secondary step in data quality assessment.
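As an illustration of why stability checks can produce false positives, here is a minimal sketch of one possible stability indicator: the relative change between consecutive years compared with a threshold. The landings figures and the 50 % threshold are assumptions chosen for the example, not values taken from the GFCM framework.

```r
# Hypothetical annual landings (tonnes) for one stock.
years    <- 2014:2017
landings <- c(1000, 1100, 1050, 300)

# Relative change from one year to the next.
rel_change <- diff(landings) / head(landings, -1)

# Assumed threshold: flag any change larger than 50 % in absolute value.
threshold <- 0.5
flagged <- years[-1][abs(rel_change) > threshold]
flagged
```

The flagged year only signals that something changed. Whether it is a data error or a real change in the fishery still has to be decided by someone who knows the fishery.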
The need for a versatile tool able to build different data containers for the same information stems from the multiplicity of the data formats requested by the RFMOs. A data format is a formal definition of how the data are transmitted to the RFMOs. The definition includes the file formats and the codification of the data in these files. For European countries, depending on the distribution of the stocks their fleets target, a short list could be:
Other RFMOs exist (CCAMLR, SPF...) and are not considered here, as they are less relevant to European fleets. For European countries, the main RFMOs (ICES, JRC, GFCM) require handling at least four data formats to answer the regular data calls. All these formats are documented:
An interesting common feature of these documents is their evolution over time. All of them have experienced significant changes during the past three years (FDI, GFCM) or are experiencing them now (ICES). These changes affect the basic structure of the files exchanged with the countries and, at a less critical level, the definitions of the reference lists of some individual parameters. As an example, the interested reader can compare the data structure described in Jansen et al. (2016) with the one under development in ICES (2018a).
To conclude this short overview of fishery-dependent data, the growing number of data calls and the instability of the data formats requested by the RFMOs highlight the need for a tool able to cope with these constraints while guaranteeing the quality of the data transmitted to the RFMOs.
\newpage
The construction principle of fishery-dependent data containers is illustrated using the Fishframe data format (see Fig. 1 from Jansen et al. 2016).
\newpage
A careful inspection of this format highlights essential properties:
The table trip locates the trips in their landing harbours (variable harbour), the table haul locates the fishing haul three times (by its geographical coordinates, the ICES statistical rectangle and the ICES area)...

These comments apply to other RFMOs' data formats. In summary, time and space references are essential information, and information on other fishery objects (fishing vessel description, characteristics of a haul, sampling reference) is requested by all the data calls. Within each data call, the labelling and the aggregation of the information can change. For example, the temporal unit of the GFCM data call is the year, while for ICES Intercatch related data it can be the month, the quarter or the year. For fixed information, the codification can change while the parameter conveys the same information. The metier lists are not the same for the ICES, GFCM or FDI data calls, but a 25-metre trawler remains the same trawler even if it is labelled OTB_DEF_70-99_0 for ICES, T12 for GFCM and TRAWL/OTB/70D100 for the FDI data call.
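The renaming of one metier across data calls can be sketched with a transparent reference list. This is a minimal sketch: the three codes are the ones quoted above, while the national label, the table layout and the function name translate_metier are hypothetical and are not part of the package.

```r
# Reference list linking a national label to the code used by each data call.
# The national label and the layout of this table are hypothetical.
metier_ref <- data.frame(
  national = "bottom_trawl_demersal",
  ICES     = "OTB_DEF_70-99_0",
  GFCM     = "T12",
  FDI      = "TRAWL/OTB/70D100",
  stringsAsFactors = FALSE
)

# Translate national labels into the code expected by one data call.
translate_metier <- function(x, target, ref = metier_ref) {
  out <- ref[[target]][match(x, ref$national)]
  if (anyNA(out)) stop("unknown national metier label")
  out
}

translate_metier(c("bottom_trawl_demersal", "bottom_trawl_demersal"), "GFCM")
```

Stopping on an unknown label, rather than returning NA, keeps a missing entry in the reference list from silently reaching the exported file.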
Considering these properties, the main task in translating national data into data-call answers (apart from the estimation work, which this document does not address) is a translation in time and space, plus a renaming of the information properties using a different convention. To do this, this package provides some generic data containers for time, for space and for general fishery objects. These data containers embed strict definitions of the data types and intrinsically ensure the conformity of the information.
The data containers are built using S4 class objects. Advanced knowledge of R and an understanding of object-oriented (OO) programming can help to understand how the data containers are built and translated into data files that conform to the data-call specifications. The book by H. Wickham (2014) covers all these topics. S4 objects in R have two fundamental properties relevant to fishery data containers: a formal definition, which ensures conformity, and inheritance, which simplifies the construction of complex data containers.
The data container has to record information at its finest resolution. If a trip is located in space by the vessel trajectory, the data container will contain this trajectory. If, for another trip, only the ICES division is known, then this information is recorded in the data container. The same consideration applies to time and to all the other fishery information: one record per parameter at the finest resolution available. The R classes and objects will then ensure the translation of the data into the data-call requirements using methods and transparent reference lists and algorithms.
A data-call asks for the monthly total landings for all the fishing areas where vessels operate.
To answer this data call, an object Landing is created, containing (1) the landings values, (2) the corresponding dates (in this case the months) and (3) the areas where fishing occurs.
The object Landing contains three entities:
(1) a vector of landings values,
(2) a Time object,
(3) a Space object.

The objects (2) and (3) refer only to the time and the spatial characteristics of the landing events. They are built independently, and their conformity is checked during the building process. The same construction principles apply to the object (1): in this simple example, this object is a positive numerical vector. To build the final Landing object, the objects (1), (2) and (3) are associated in a way that keeps their internal validation mechanisms. The Landing object inherits the conformity checks and the object types of the objects from which it was built. In the R language, the definition of the objects, their validity and their inheritance are based on class definitions, here S4 classes.
\newpage
The classes identified as elementary are presented in the next section. They are implemented in the present package, but their implementations are subject to change according to user needs and the scope of the data calls. We believe that the Time and the Space classes are generic enough to be needed across all data calls.
The library is then loaded:
library(CLEFRDB)
\newpage
Time class

Temporal references are defined by a type (TimeType) and a date (TimeDate).
new("Time")
The type gives the temporal resolution at which the information is recorded, and the date gives the central date of the temporal event, in POSIXct format.
Five TimeType values are available:
print(Timetype)
corresponding respectively to annual, quarterly, monthly, daily and date with time events.
The definition of a Time object for a regular date (a night haul in March 2011, for example) is:
hauldate<-as.POSIXct(strptime("2011-03-27 01:30:03", "%Y-%m-%d %H:%M:%S"))
new("Time",TimeType="date",TimeDate=hauldate)
Why add such a complex definition of a temporal event when adding columns for year, month, day or quarter could be more straightforward? For two reasons:
the POSIXct object guarantees the conformity of the date, and gives the user the possibility to extract time information using R methods without any error,

To illustrate these properties, let us look at a simple example with a set of four different temporal events recorded at different time scales:
haultime<-c("2011-03-27 01:30:03", "2011-03-27 12:00:00",
            "2011-03-15 12:00:00", "2011-02-14 00:00:00")
haultime<-as.POSIXct(strptime(haultime, "%Y-%m-%d %H:%M:%S"))
haultime<-new("Time",TimeType=c("date","day","month","quarter"),
              TimeDate=haultime)
print(haultime)
The four haul dates are recorded respectively at the date, the day, the month and the quarter scale. The corresponding dates are the dates at the middle of the day, the month and the quarter. If a data call requests dates by year, the conversion is easy to compute:
substr(haultime@TimeDate,1,4)
More complex time manipulations are available using the lubridate R package.
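For the common aggregation levels, base R is often enough. The sketch below works on a plain POSIXct vector rather than on a Time object, so it does not depend on the package. The UTC time zone is an assumption made for the example.

```r
# Four dates created directly as POSIXct (UTC chosen for the example).
d <- as.POSIXct(c("2011-03-27 01:30:03", "2011-03-27 12:00:00",
                  "2011-03-15 12:00:00", "2011-02-14 00:00:00"), tz = "UTC")

year    <- as.integer(format(d, "%Y"))  # calendar year
month   <- as.integer(format(d, "%m"))  # month number, 1 to 12
quarter <- quarters(d)                  # "Q1" to "Q4"

data.frame(year, month, quarter)
```

Using format() with an explicit conversion specification avoids relying on the character positions of the printed date, which is what substr() does.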
The conformity of the Time object is checked during the object construction.
A Time class object is checked as follows:
TimeDate has to be POSIXct,
TimeType has to be a character vector whose values are present in the Timetype object,

These commands will generate an error:
new("Time",TimeDate=c(NA),TimeType=c("day"))
new("Time",TimeDate=.POSIXct(c(NA,NA)),TimeType=c("day","day"))
new("Time",TimeType="date",TimeDate="2011-03-27")
hauldate<-as.POSIXct(strptime("2011-03-27 01:30:03", "%Y-%m-%d %H:%M:%S"))
new("Time",TimeDate=hauldate,TimeType=c("a day"))
new("Time",TimeDate=hauldate)
This behaviour is not user-friendly, but it simply asks the user to be consistent for a very simple field (time!). Addressing formatting errors at the very beginning of the data preparation is the best way to propagate conformity throughout a dataset.
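To show how checks of this kind can be expressed, here is a minimal sketch of a Time-like S4 class with a validity function. It is not the package's implementation: the class name TimeSketch and the vector allowed_types are hypothetical, and the five type names are an assumption based on the resolutions listed above.

```r
library(methods)

# Assumed list of allowed resolutions; the package keeps its own in `Timetype`.
allowed_types <- c("year", "quarter", "month", "day", "date")

setClass("TimeSketch",
         slots = c(TimeType = "character", TimeDate = "POSIXct"),
         prototype = prototype(TimeType = character(),
                               TimeDate = .POSIXct(numeric(0), tz = "UTC")),
         validity = function(object) {
           msg <- character()
           if (length(object@TimeType) != length(object@TimeDate))
             msg <- c(msg, "TimeType and TimeDate must have the same length")
           if (anyNA(object@TimeDate))
             msg <- c(msg, "TimeDate must not contain NA")
           if (!all(object@TimeType %in% allowed_types))
             msg <- c(msg, "TimeType must be one of the allowed types")
           # A validity function returns TRUE or a vector of messages.
           if (length(msg)) msg else TRUE
         })

when <- as.POSIXct("2011-03-27 12:00:00", tz = "UTC")
ok   <- new("TimeSketch", TimeType = "day", TimeDate = when)
bad  <- tryCatch(new("TimeSketch", TimeType = "a day", TimeDate = when),
                 error = function(e) conditionMessage(e))
bad
```

Declaring the slot as POSIXct means that a character date such as "2011-03-27" is rejected by the slot type check before the validity function even runs.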
Suppose a data call requests a data table containing the time series of the total landings.
The S4 object related to this demand can be defined using the Time class as the container for the temporal definition, adding a new slot for the landings values.
In this case, the new object class named Landings will inherit the Time class properties. The conformity check and the other methods belonging to the Time class will be passed on to the new Landings object.
The function initinherit of this package copies the parameters from one object to another based on their typology. For the Landings class containing the Time class, the Time parameters will be copied inside the Landings object.
#a Time object related to four hauls
haultime<-c("2011-03-27 01:30:03", "2011-03-27 12:00:00",
            "2011-03-15 12:00:00", "2011-02-14 00:00:00")
haultime<-as.POSIXct(strptime(haultime, "%Y-%m-%d %H:%M:%S"))
haultime<-new("Time",TimeType=c("date","day","month","quarter"),
              TimeDate=haultime)
# the value of the total landings of the four hauls
w<-c(10000,3000,2000,10)
#Definition of the Landings class
#(the prototype entry is named after the slot, `landings`)
setClass(Class="Landings",
         slots=c("landings"="numeric"),
         contains=c("Time"),
         prototype=prototype(landings=numeric(),
                             Time=new("Time"))
)
#a new Landings object
new("Landings")
#a new Landings object initialized with the previous values
initinherit(new("Landings"),landings=w,Time=haultime)
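The statement that a child class inherits the conformity checks of its parent can be verified with base S4 alone. The sketch below uses two hypothetical classes, YearSketch and LandingsSketch, which are stand-ins and not the package's Time and Landings classes.

```r
library(methods)

# Parent class: a hypothetical stand-in for a time container.
setClass("YearSketch",
         slots = c(year = "integer"),
         validity = function(object) {
           if (any(object@year < 1900L)) "year must be >= 1900" else TRUE
         })

# Child class: adds a landings slot, inherits the `year` slot and its check.
setClass("LandingsSketch",
         contains = "YearSketch",
         slots = c(landings = "numeric"))

good <- new("LandingsSketch", year = 2011L, landings = 10)

# The parent's validity function also runs when the child object is built.
res <- tryCatch(new("LandingsSketch", year = 1800L, landings = 10),
                error = function(e) conditionMessage(e))
res
```

No validity function is defined for the child class here, yet the invalid year is rejected, because validObject() runs the validity functions of the superclasses as well.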
The Space class follows the same structure as the Time class.
Spatial references are defined by a type (SpaceType) and a place (SpacePlace).
new("Space")
The type gives the kind of spatial unit at which the information is recorded, and the place gives the name of the spatial area corresponding to the type. Both parameters are character vectors.
Four SpaceType values are available:
print(Spacetype)
corresponding respectively to the ICES division (e.g. 27.7.g), the ICES rectangle
(e.g. 50H6), the harbour in UNLOCODE (e.g. DEBRB) and the GSA area (e.g. GSA07).
The object defspace of class sf gives the spatial geometry of these different spatial entities, using the simple features ISO standard definition.
head(defspace)
A plot of this object shows a map of the available geometries (here, the first ten rows):
plot(defspace[1:10,"type"],axes=TRUE)
The construction of a simple Space object containing nine spatial objects (three ICES divisions, three ICES rectangles, two GSA areas and a harbour) is:
tripgeo<-new("Space",
             SpacePlace=c("27.7.f","27.7.g","27.7.h",
                          "27E8","27F0","28F0",
                          "GSA07","GSA08",
                          "FRRTB"),
             SpaceType=c("ICESdiv","ICESdiv","ICESdiv",
                         "ICESrect","ICESrect","ICESrect",
                         "GSA","GSA",
                         "harbour"))
print(tripgeo)
The map method attached to the Space class maps the geometry of the resulting object:
map(tripgeo,axes=TRUE)
Validation mechanisms operate on the SpacePlace and SpaceType slots.
The following commands will return an error:
#misspelled type ("ICEdiv") and unknown place ("GSA078")
new("Space",SpacePlace=c("27.7.h","GSA078"),
    SpaceType=c("ICEdiv","GSA"))
#four places but only three types
new("Space",SpacePlace=c("27.7.h","GSA07","FRRTB","DEBRB"),
    SpaceType=c("ICESdiv","GSA","harbour"))
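The kind of lookup behind such errors can be sketched as a plain function that checks each place against a reference list for its type. Everything here is hypothetical: space_ref holds only the codes quoted in this document, and check_space is not a function of the package.

```r
# Hypothetical reference list: a few valid codes per spatial type.
space_ref <- list(ICESdiv  = c("27.7.f", "27.7.g", "27.7.h"),
                  ICESrect = c("27E8", "27F0", "28F0"),
                  GSA      = c("GSA07", "GSA08"),
                  harbour  = c("FRRTB", "DEBRB"))

# Returns TRUE, or a message describing the first problem found.
check_space <- function(place, type, ref = space_ref) {
  if (length(place) != length(type))
    return("place and type differ in length")
  if (!all(type %in% names(ref)))
    return("unknown spatial type")
  known <- vapply(seq_along(place),
                  function(i) place[i] %in% ref[[type[i]]],
                  logical(1))
  if (!all(known))
    return("place not found in the reference list of its type")
  TRUE
}

check_space(c("27.7.h", "GSA078"), c("ICEdiv", "GSA"))
check_space(c("27.7.h", "GSA07", "FRRTB", "DEBRB"), c("ICESdiv", "GSA", "harbour"))
```

A function that returns TRUE or a message has the shape expected of an S4 validity function, so it could be wrapped in one.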
To illustrate the use of the Space class in an operational example, the previous example is extended. The data call now demands the landings located in time and space. To define the new object, as previously, a Time object is built for the dates at which the four hauls were made, and a Space object referencing the areas where the hauls occurred is added. Then the new object is defined as a new S4 class containing the landings values (the landings slot), the dates of the hauls (the Time slot) and their positions (the Space slot). The values of haultime and haulspace are copied into the new Landings object using the initinherit function. There is no need to check the conformity of the haultime and haulspace objects; this property was checked during their construction.
#a Time object related to four hauls
haultime<-c("2011-03-27 01:30:03", "2011-03-27 12:00:00",
            "2011-03-15 12:00:00", "2011-02-14 00:00:00")
haultime<-as.POSIXct(strptime(haultime, "%Y-%m-%d %H:%M:%S"))
haultime<-new("Time",TimeType=c("date","day","month","quarter"),
              TimeDate=haultime)
# the value of the total landings of the four hauls
w<-c(10000,3000,2000,10)
#the area where the hauls were located
haulspace<-new("Space",
               SpacePlace=c("27.7.f","27.7.g",
                            "27E8","GSA07"),
               SpaceType=c("ICESdiv","ICESdiv",
                           "ICESrect","GSA"))
#Definition of the Landings class
setClass(Class="Landings",
         slots=c("landings"="numeric"),
         contains=c("Time","Space"),
         prototype=prototype(landings=numeric(),
                             Time=new("Time"),
                             Space=new("Space"))
)
#a new Landings object
new("Landings")
#a new Landings object initialized with the previous values
haullanding<-initinherit(new("Landings",landings=w),Time=haultime,Space=haulspace)
print(haullanding)
map(haullanding)
#dimclass(haullanding)
Regarding the conformity checks of the Landings object, one issue remains.
The initinherit function checks that the numbers of objects in each class are the same, and the conformity of the Time and Space objects was checked during their construction, but the condition of having only positive values in the landings slot is not tested.
To implement this new conformity check, a validity function can be defined and associated with the Landings class.
The function used here tests that the landings are not negative.
For a more advanced discussion on how to define this validation mechanism, see the R help page of the setValidity function.
#validity of the Landings object:
#return TRUE, or a message describing the problem
validLandings<-function(object){
  if(any(object@landings<0)){"negative landings !!"}else{TRUE}
}
#associate the validation function with the Landings class
setValidity("Landings",validLandings)
#the original object is still valid
initinherit(new("Landings"),landings=w,Time=haultime,Space=haulspace)
#a negative value in the landings slot will throw an error
## w<-c(10000,3000,-2000,10)
## initinherit(new("Landings",landings=w),Time=haultime,Space=haulspace)
In this section, two examples are provided. The first example shows how to build objects from the ones available in this package, together with some new objects, in order to generate the table TR in the Fishframe format (ICES, 2018b). The second example uses these objects to generate the Fishing Trip table of the new RDBES format (ICES, 2019).
Trip in Fishframe

This table contains information regarding the fishing trip (identifier of the trip, number of hauls, days at sea, harbour of landing), the vessel (identifier, flag, length, power, size, type), and the sampling (type, year, method, sampling country, project name).
To complete the information needed in this table, the classes Space and Time will be used to provide the harbour and the year, while the Vessel and Sampling classes are built from scratch to provide the requested information.
Vessel class

The Vessel class contains all the information related to the vessel. The six slots of this class record the vessel identifier, its flag, type, length, power and size.
To keep this example light, the conformity of the object is not addressed here. Only the types of the slots (character, integer, numeric...) are checked. To validate the conformity of the object, the previous example shows how to define a validation function and associate it with the class.
The information of a fictional vessel populates this new object.
#Vessel class
setClass(Class="Vessel",
         slots=c(VesselId="character",
                 VesselFlag="character",
                 VesselType="character",
                 VesselLength="integer",
                 VesselPower="integer",
                 VesselSize="integer"),
         prototype=prototype(VesselId=character(),
                             VesselFlag=character(),
                             VesselType=character(),
                             VesselLength=integer(),
                             VesselPower=integer(),
                             VesselSize=integer())
)
#a fictional vessel
myVessel<-new("Vessel",
              VesselId="AAAA123",
              VesselFlag="FRA",
              VesselType=c("Trawler"),
              VesselLength=as.integer(20),
              VesselPower=as.integer(5000),
              VesselSize=as.integer(1000))
A new class Sampling is defined following the same approach.
#Sampling class
setClass(Class="Sampling",
         slots=c(SamplingId="character",
                 SamplingType="character",
                 SamplingMethod="character",
                 SamplingProject="character",
                 SamplingCountry="character"),
         prototype=prototype(SamplingId=character(),
                             SamplingType=character(),
                             SamplingMethod=character(),
                             SamplingProject=character(),
                             SamplingCountry=character())
)
#a fictional sample
mySampling<-new("Sampling",
                SamplingId="SAMP2199",
                SamplingType="S",
                SamplingMethod="Observer",
                SamplingProject="SamplingProjectFRA",
                SamplingCountry="FRA")
To complete the trip information, two Time and two Space objects are defined for this trip. They refer to the location and the date of the beginning and the end of the fishing trip:
mySpaceDep<-new("Space",SpacePlace=c("FRCOC"),SpaceType=c("harbour"))
mySpaceArr<-new("Space",SpacePlace=c("FRCOC"),SpaceType=c("harbour"))
myTimeDep<-new("Time",
               TimeDate=as.POSIXct(strptime("2011-03-15 12:16:00","%Y-%m-%d %H:%M:%S")),
               TimeType="date")
myTimeArr<-new("Time",
               TimeDate=as.POSIXct(strptime("2011-03-19 04:00:32","%Y-%m-%d %H:%M:%S")),
               TimeType="date")
Then, a new class corresponding to the Trip table is defined:
# a new Trip class
setClass(Class="Trip",
         slots=c(nbhaul="integer",
                 daysatsea="integer",
                 TripId="character",
                 TimeDep="Time",
                 TimeArr="Time",
                 SpaceDep="Space",
                 SpaceArr="Space"),
         contains=c("Vessel",
                    "Sampling"),
         prototype=prototype(TripId=character(),
                             nbhaul=integer(),
                             daysatsea=integer(),
                             Vessel=new("Vessel"),
                             Sampling=new("Sampling"),
                             TimeDep=new("Time"),
                             TimeArr=new("Time"),
                             SpaceDep=new("Space"),
                             SpaceArr=new("Space"))
)
myTrip<-initinherit(new("Trip",
                        TripId="FR872",
                        nbhaul=as.integer(12),
                        daysatsea=as.integer(5)),
                    Sampling=mySampling,
                    Vessel=myVessel,
                    TimeDep=myTimeDep,
                    TimeArr=myTimeArr,
                    SpaceDep=mySpaceDep,
                    SpaceArr=mySpaceArr)
Then, exporting this object into the Trip format requested by Fishframe is only a matter of formatting (we assume that all the fields were checked for conformity, with the right validation function associated with each object). Here is an example of such formatting:
FishframeTrip<-data.frame(`Record type`="TR",
                          `Sampling type`=myTrip@SamplingType,
                          `Landing country`=myTrip@SamplingCountry,
                          `Vessel flag country`=myTrip@VesselFlag,
                          `Year`=substr(myTrip@TimeArr@TimeDate,1,4),
                          `Project`=myTrip@SamplingProject,
                          `Trip code`=myTrip@TripId,
                          `Vessel length`=myTrip@VesselLength,
                          `Vessel power`=myTrip@VesselPower,
                          `Vessel size`=myTrip@VesselSize,
                          `Vessel type`=myTrip@VesselType,
                          `Harbour`=myTrip@SpaceArr@SpacePlace,
                          `Number of hauls`=myTrip@nbhaul,
                          `Days at sea`=myTrip@daysatsea,
                          `Vessel identifier`=myTrip@VesselId,
                          `Sampling country`=myTrip@SamplingCountry,
                          `Sampling method`=myTrip@SamplingMethod)
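One detail matters when such a table is written to a file: by default data.frame() passes column names through make.names(), so a name such as `Record type` becomes Record.type. The sketch below shows how check.names = FALSE keeps the names as typed. The toy table reuses three column names from the example above; whether an upload facility expects names with spaces is an assumption to verify against the format documentation.

```r
# Toy one-row table with three of the column names used above.
tr <- data.frame(`Record type`   = "TR",
                 `Vessel length` = 20L,
                 `Days at sea`   = 5L,
                 check.names = FALSE)   # keep the names exactly as typed
names(tr)

# Write the table to a temporary CSV file and read back the header line.
out <- tempfile(fileext = ".csv")
write.csv(tr, out, row.names = FALSE)
readLines(out)[1]
```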
Fishing Trip in the RDBES

Starting from the Trip object, the formatting for the Fishing Trip table of the RDBES is done using the same information. Some parameters are not available in the original Trip object, but the user can add them as needed. For example, the probability of sampling can be added to the Sampling object in order to complete the information requested by the RDBES.
The export of the original Trip object to the RDBES format is:
RDBESTrip<-data.frame(`FishingTripID`=myTrip@SamplingId,
                      `OnShoreEventID`="",
                      `VesselSelectionID`="",
                      `VesselDetailsID`=myTrip@VesselId,
                      `SampleDetailsId`=myTrip@SamplingId,
                      `Record type`="FT",
                      `National Fishing trip Code`=myTrip@TripId,
                      `Stratification`="",
                      `Trip Stratum`="",
                      `FishingTripsClustering`="",
                      `FishingTripsClusterName`="",
                      `Sampler`="",
                      `NumberOfHauls`=myTrip@nbhaul,
                      `Departure location`=myTrip@SpaceDep@SpacePlace,
                      `Departure date`=substr(myTrip@TimeDep@TimeDate,1,10),
                      `Departure time`=substr(myTrip@TimeDep@TimeDate,12,19),
                      `Arrival location`=myTrip@SpaceArr@SpacePlace,
                      `Arrival date`=substr(myTrip@TimeArr@TimeDate,1,10),
                      `Arrival time`=substr(myTrip@TimeArr@TimeDate,12,19),
                      `FishingTrips Total`="",
                      `FishingTrips Sampled`=1,
                      `FishingTripSampleProbability`="",
                      `Selection method`="",
                      `Selection Method Cluster`="",
                      `FishingTripsTotClusters`="",
                      `FishingTripsSampClusters`="",
                      `FishingTripsClustersProb`="")
This R library provides an implementation of classes of objects currently used for fishery-dependent data. The definition of S4 classes and their strict validation mechanism lead to objects that strictly conform to their description (variable type, numerical range, compliance with a lookup table...). The inheritance among S4 objects gives the user the ability to build more complex objects from smaller ones while preserving the conformity of these smaller objects. These objects have no prerequisite in terms of format: they can be created from the national data available in each country. Exporting these objects into the multiple formats requested by the different RFMOs is then straightforward, and these objects, by construction, will pass the conformity checks implemented on the various upload facilities.
While the two examples presented in this document generate raw sampling datasets, other data calls request data raised to the population level (GFCM, ICCAT, ICES Intercatch). There is therefore a strong need for methods associated with these objects to raise the sampling data to the population level. Future work will implement such methods, including generic, statistically sound raising methods, from ratio estimators to probability-based sampling estimators. The main difficulty will be to free the raising methods from any data format. A significant benefit should be the smooth transmission and adaptation of these methods to the RDBES format, and possibly to other fishery-dependent data formats.
\newpage
http://www.ices.dk/sites/pub/Publication%20Reports/Expert%20Group%20Report/acom/2018/WKRDB/wkrdb-model_2018.pdf