knitr::opts_chunk$set(
  collapse = TRUE,
  error = TRUE,
  comment = "#>",
  fig.width=8, fig.height=8
)
library(prepRECA)

Preparing data for Catch At Age estimation

This packages contains functions for preparing data for Catch At Age estimation, using the package 'Reca'. For common use-cases it takes care of complex data conversion, and facilitates back-conversion when needed.

Reca is provided by the Norwegian Computing Centre, and is available at: https://github.com/NorskRegnesentral/Reca . It implements a Bayesian model for estimating Aatch At Age, and is designed as a component in StoX, an estimation system used at the Institute of Marine Research (IMR) in Norway. Data checks, and data conversion are handled by other StoX components, and are closely tied to the data models supported by StoX. This package is being developed in order to ease adaptation of other data models to Reca, and provides conversion functions and wrapper functions. For purposes of illustration, example data sets from Norwegian port sampling, and aggregated Norwegian sales-notes data is included. These are formatted according to a development version of the data model for the regional data base and estimation system (RDBES). For details see data set documentation:

?prepRECA::CLCodHadNOR #sales notes
?prepRECA::NORportsampling2018 #port sampling

Because this data format was chosen for the examples, the package also contains a few functions specific to this format.

The key functions for configuring and running the model are: * rEcaDataReport * prepRECA * runRECA

Reca first runs a parameter estimation to estimate the psoterior distribution of parameter values. This is a Metropolis-Hasting simulation. Afterwards it runs a prediction where posterior distributions are sampled to calculate estimates of Total Catch At Age. The results are provided as a distribution of Catch At Age estimates, and can be further inspected and processed to show point estimates, credible intervals etc, using functions in this package: * plotAgeTraces * makeResultTableRECA * plotCatchAtAge

The general workflow is to use 'rEcaDataReport' to inspect what model configurations are reasonable for your samples, then use 'prepRECA' to specify model configuration and to convert data to the data format accepted by Reca, before finally running Reca using 'runReca'. In what follows, I will present an example of such a workflow for estimation of Haddock catches in Norwegian coastal waters.

Reca model configuration

Reca is an implementation of ECA, a Bayesian Framework for estimating catch age age, descirbed in Hirst et al. (2004, 2005 and 2012). Briefly, fisheries are broken down into cells, for which total landings can be obtained (via sales-notes or similar census), and for which appropriate meta-information is available for allocating samples to cells. The covariates that define the cells are typically Area, Season and Gear, but are configurable in Reca and can sometimes be better defined based on knowledge of the sample designs incorporated. Reca allows covariates beside those defining the cells to be included as random effects if appropriate meta-information is available for the samples used.

Reca discriminates three kinds of effects for covariates: fixed effect, random effect and CAR-effect (conditional autoregressive effect). The terminology does not correspond directly to the way the terms 'fixed' and 'random' effect are used in frequentialist modelling. The key distinction between the effects is that random effects are modelled in a more complex way, which allows them to be simulated if they are not sampled in some cells. The CAR-effect is a special kind of random effect which allows for additional control for how the covariate is dealt with if it is not sampled in some cells.

In pratical terms, in case not all cells are sampled, Reca allows for modelling cell-covariates as random effects. This typically have little effect on total estimates when samples are only missing in cells with low volume of landings, but should be otherwise be used with caution, and grouping of different covariate values should be considered(e.g. grouping different gears).

The CAR-effect provides a more advanced adaptation to cells not sampled. It requires that some definition of adjacency can be obtained for the covariate modelled as CAR-effect. It is typically used for spatial effects, for instance Area, when it is easy to determine which areas are adjacent.

Additional covariates not in cells must always be configured as random effects.

A reasonable sampling program will sample cells with low volume of landings with low probability. So for a fine-cell structure unsampled cells are to be expected, and deciding on when to group and when to configure are commonly the most important tasks of configuring the model.

inspecting data

In order to decide how to configure the model, we need to inspect the data. This package provides the function 'rEcaDataReport' for that purpose.

rEcaDataReport expect samples formatted as a data table (package data.table) with one row for each fish, and landings formated as a data table with the total landings partioned over the rows, with columns identifying the desired cell definition. For both landings and samples, some specific columns are also required (see ?rEcaDataReport). The example sales-notes already almsot conform to this expectation, but the samples are provided in a hiearchical format, with one table for different part of the hiearchical sampling. We need to first convert this to a flat table. For estimation with Reca, zero-samples are not important, so we start by restricting or samples to those with Haddock present (samples are represented by the RDBES SA-table):

  # select samples of target species, aphia code for haddock is 126437
  SA <- prepRECA::NORportsampling2018$SA[prepRECA::NORportsampling2018$SA$SAsppCode == "126437",]

Fish measurements are coded in the RDBES BV-table for our samples (RDBES data model v 1.17), so we need to extract those. This package provides a function for that:

  # extract biological measurements corresponding to the selected samples
  BV <- prepRECA::NORportsampling2018$BV[prepRECA::NORportsampling2018$BV$SAid %in% SA$SAid,]
  samples <- extractBV(BV, c("Age", "Length", "Weight"), c("integer", "numeric", "numeric")) # set names as required by rEcaDataReport

'extractBV' does simple extration, and does not check for units etc. Checks on data heterogenity should be performed:

table(BV$BVunitVal)
table(SA$SApres)

We have heterogenoeus presentation of fish. Weight is an auxiliary variable, that Reca does not necesarily require complete observations of, so I'll try removing weights of gutted fish for the sake of this example:

samples[samples$SApres == "Gutted", "Weight"] <- NA

We note that weights and lengths are given in grams and mm. Reca expects kg and cm.

samples$Weight <- samples$Weight / 1000
samples$Length <- samples$Length / 10

Then we merge in information from the rest of the hiearchy:

  # merge in needed upper levels.
  # SS, SD and DE is not strictly needed,
  # but included for checking for stratification and clustering at these levels.
  samples <- merge(samples, SA, by="SAid")
  samples <- merge(samples, prepRECA::NORportsampling2018$SS, by="SSid")
  samples <- merge(samples, prepRECA::NORportsampling2018$LE, by="LEid", suffixes = c("", ".LE"))
  samples <- merge(samples, prepRECA::NORportsampling2018$VD, by="VDid")
  samples <- merge(samples, prepRECA::NORportsampling2018$OS, by="OSid")
  samples <- merge(samples, prepRECA::NORportsampling2018$SD, by="SDid")
  samples <- data.table::as.data.table(merge(samples, prepRECA::NORportsampling2018$DE, by="DEid"))

I will let each landing represent a catch, and set 'catchId' and 'sampledId' (as required by 'rEcaDataReport') accordingly:

  samples$catchId <- samples$LEid
  samples$sampleId <- samples$SAid

This sampling program covers almost exclusively day-catches, so sampling date is a good proxy for the date of catch (required by 'rEcaDataReport'):

 samples$date <- samples$OSsamDate

The data table 'samples' now represent our samples of Haddock from the portsampling program in 2018, in a format that conforms to 'rEcaDataReport'. For the landings (sales-notes), we only have to filter by species, and for the population of catches we want to estimate for (which will be Gillnet, Longline and Demershal Seining) in area 27.2.a and 27.1.b. This is a coarse approximation to the sampling frame for this program:

 landings <- prepRECA::CLCodHadNOR
 landings <- landings[landings$Species == "126437",]
 landings <- landings[landings$FishingActivityCategoryEuropeanLvl5 %in% c("GNS_DEF", "LLS_DEF", "LX_DEF", "SSC_DEF"),]
 landings <- landings[landings$Area %in% c("27.2.a.2", "27.1.b"),]

'rEcaDataReport' expects total weights to be identified by the column 'LiveWeightKG':

 landings$LiveWeightKG <- landings$OfficialLandingsWeight

The data table 'landings' now represent the total landings of Haddock from Norwegian vessels in 2018 for the gear and areas described above, in a format that conforms to 'rEcaDataReport'. Now, we want to run 'rEcaDataReport' to see which columns in landings can represent Reca-cells, which should be modelled as random effects, and which should be fixed effects, and which columns in 'samples' may serve as additional covariates. 'rEcaDataReport' does however also expect that columns that represent the same in samples and landings are named and formatted equally. So we have to inspect how information is coded for columns we are interested in, and see if such correspondances exists, and if any can be made. First we check that no columns match without actually representing the same:

any(names(samples) %in% names(landings))

Then we will consider the covariates commonly used in Reca estimation: season, area and gear. For season I will use quarter, which occur in landings as a column 'Quarter' and in samples as a stratification variable 'OSstratum'. Both registrations are complete. They are however formatted differently:

#registrations complete ?
all(!is.na(landings$Quarter))
all(!is.na((samples$OSstratum)))

#format of column
table(landings$Quarter)
table(samples$OSstratum)

I will rename and reformat so they are in correspondance:

landings$quarter <- paste("Q", landings$Quarter, sep="")
samples$quarter <- samples$OSstratum

For area, both formats provide ICES area, in columns LEarea and Area, respectively. I will rename:

 samples$area <- samples$LEarea
 landings$area <- landings$Area

 #check complete
 all(!is.na(landings$area))
 all(!is.na((samples$area)))

For gear, we can use the Level 5 metier, after renaming:

 samples$metier5 <- samples$LEmetier5
 landings$metier5 <- landings$FishingActivityCategoryEuropeanLvl5

 #check complete
 all(!is.na(landings$metier5))
 all(!is.na((samples$metier5)))

We are then ready to inspect the data with respect to model configuration:

prepRECA::rEcaDataReport(samples, landings, c("area", "quarter", "metier5"))

First we note that we have a sample that does not exist in landings (notice warning). One single trawl-sample, that we have defined out of the population to estimate. We will remove this:

 samples <- samples[samples$metier5 != "OTB_DEF",]

We see that not all cells are sampled, so these can not all be configureed as fixed effects. We also find that some cells with considerable amount of landings are not sampled, so simply setting to random is a bit risky. We might also consider some of these gears to be quite similar and group them. For instance, one might assume from familiarity with these fisheries, and from the conversion scheme used, that the large volume of landings in "LX_DEF" indicate that many set lines are coded as "LX". grouping these with "LLS" would solve the first and second cell with zero samples:

landings[landings$metier5 == "LX_DEF", "metier5"] <- "LSS_LX_DEF"
landings[landings$metier5 == "LLS_DEF", "metier5"] <- "LSS_LX_DEF"
samples[samples$metier5 == "LX_DEF", "metier5"] <- "LSS_LX_DEF"
samples[samples$metier5 == "LLS_DEF", "metier5"] <- "LSS_LX_DEF"

prepRECA::rEcaDataReport(samples, landings, c("area", "quarter", "metier5"))

Remaining issues are mostly related to Q3 and SSC (for which we have samples in Q3). Will group Q3 and Q2

landings[landings$quarter == "Q2", "quarter"] <- "Q2-Q3"
landings[landings$quarter == "Q3", "quarter"] <- "Q2-Q3"
samples[samples$quarter == "Q2", "quarter"] <- "Q2-Q3"
samples[samples$quarter == "Q3", "quarter"] <- "Q2-Q3"

prepRECA::rEcaDataReport(samples, landings, c("area", "quarter", "metier5"))

Remaining issues are now in cells with low volume of landings. I will deal with these by configuring quarter as a random effect. We can then proceed to preparing the data for Reca:

converting data and running Reca

In order to prepare data for Reca, we run the function 'prepRECA'. In addition to converting data, it provides some data checks, and will give an error on our first attempt:

recaObj <- prepRECA::prepRECA(samples, landings, fixedEffects = c("area", "metier5"), randomEffects = c("quarter"), NULL, minAge = 1, maxAge = 20, lengthResolution = 1, quarter = landings$Quarter)

This error arises because there are a few catches in the data which are sampled several times. In such cases Reca needs an estimate of how many fish each of these samples represent. This is only needed for catches that are sampled more than once (otherwise Reca assumes sample is representative of entire catch), and is therefore supported by passing the fish count estimates as an optinal argument.

In this case these are not replicate samples, but reflect some kind of undocumented stratification, with estimated total weights for each strata coded in the column 'SAtotalWtLive'. I will provide an estimate for all samples, even if it is not needed for more than a few:

meanWeights <- aggregate(list(meanweight=samples$Weight), by=list(sampleId=samples$sampleId), FUN=mean, na.rm=T)
nFish <- merge(unique(samples[,c("sampleId", "SAtotalWtLive")]), meanWeights)
nFish$count <- nFish$SAtotalWtLive / nFish$meanweight
nFish$SAtotalWtLive <- NULL
nFish$meanweight <- NULL

We can then try prepRECA again:

recaObj <- prepRECA::prepRECA(samples, landings, fixedEffects = c("area", "metier5"), randomEffects = c("quarter"), NULL, minAge = 1, maxAge = 20, lengthResolution = 1, quarter = landings$Quarter, nFish = nFish)

We can then proceed to run Reca:

recaResult <- runRECA(recaObj, nSamples=500, burnin = 5000)

Process results

To inspect the result in tabular form, we run 'makeResultTableRECA:

makeResultTableRECA(recaResult$prediction)

Before spending too much time interpreting the result, we should inspect for signs of non-convergence. One way of doing that is to look at the traces of the simulation. These are run after prediction, and should preferably look like independent random samples from the distribution. Based on the table above, I will group ages 12 and up in one group.:

plotAgeTraces(recaResult$prediction, plusGroup = 12)

Beware of plots showing spikes or autocorrelations. Those are signs of non-converged parameterization. In those cases consider re-running with higher value for the parameter 'burnin'. Low-abundance groups, typically take longer to converge, as evident from age 1, in the plot above. Non-convergence of these estimates are also of less concern, as long as one can justify from other sources that they really are of low abundance. For age-group 1, low abundance in catches is expected from gear selectivity.

Finally, I extract means and credible intervals, and plot them using 'plotCatchAtAge':

plotCatchAtAge(recaResult$prediction, plusGroup = 12, title="Catch At Age, Haddock example, 95% CI")

For this particular example, we have run Reca for a single sampling program, for an approximately correct sampling frame. That does not do the problem justice, as one commonly have to provide estimates beyond our sampling frame as well. For this purpose, Reca admits samples from different sampling programs, with covariates configured for the clustering introduced by each (e.g. vessel if sampling programs selecting samples by vessels are introduced).

While the port-sampling analysed here is not strictly probabilistic, it is intended to give approximately random access to catches in the sampling frame, through expert judgement sampling. I am therefore curious about how the results would look, if we merged all cells and run with only the non-configurable cathc / haul as covariate:

recaObjNoCells <- prepRECA::prepRECA(samples, landings, NULL, NULL, NULL, minAge = 1, maxAge = 20, lengthResolution = 1, quarter = landings$Quarter, nFish = nFish)
recaResultNoCells <- runRECA(recaObjNoCells, nSamples=500, burnin = 5000)
plotAgeTraces(recaResultNoCells$prediction, plusGroup = 12, nclust = 6)

Because the results were a bit hard to see, I have increased the number of plots used for the trace-inspection here.

plotCatchAtAge(recaResultNoCells$prediction, plusGroup = 12, title="Catch At Age, Haddock no-cells, 95% CI")

The overall impression from this simpler analysis is in quite good agreement with the one with proper cell definitions.

References



edvinf/prepRECA documentation built on Nov. 11, 2019, 6:30 a.m.