knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(RstoxFDA)
This vignette explains how to use RstoxFDA functions in the StoX user-interface to prepare and run analytical catch-at-age estimates. Some data preparation may be necessary, and the vignette StoX FDA data preparation (baseline) introduces some common data-preparation tasks for catch at age estimation, with formats commonly used at the Institute of Marine Research (IMR). In order to make informed decisions on data preparation and model configuration, it will be necessary to get an overview of the fisheries and how it is covered by the available samples, for instance using the functions introduced in Stox Fisheries overview (report).
This document introduces some problems and tasks, and how RstoxFDA-functions may be applied to solve them. The details of how to use the functions introduced here are provided as function documentation viewable in the StoX "function description"-tab, or in an R-console via '?', e.g:
?RstoxFDA::AnalyticalPopulationEstimate
The "function description"-tab in StoX have some limitations and may not display equations. To render the equation, inspect the function documentation in R or consult the RstoxFDA manual.
Documentation for data formats are provided in the same way. E.g.:
?RstoxFDA::AnalyticalPopulationEstimateData
StoX is composed of several R-packages. Functions will be referred to by their package name with the notation package::function. In the StoX user-interface, the package that functions belong to are not visible, so the function denoted here as package::function, will be available as just function. Any data formats are similarly denoted package::format.
RstoxFDA is an optional packages in StoX. In order to use the functions introduced in this vignette in the StoX user-interface, make sure RstoxFDA is installed. See installation instructions on github: StoXProject/RstoxFDA. To install the StoX user-interface, see instructions on github: StoXProject/StoX.
For testing pre-releases of StoX, obtain the stox packages from the testing repo:
install.packages(c("RstoxData", "RstoxBase", "RstoxFramework", "RstoxFDA"), repos = c("https://stoxproject.github.io/testingRepo/"))
This will install latest pre-release available for your version of R and operating system. Make sure you have R 4.3.
After you have finished testing, you may want to revert to the production packages. This may require explicitly uninstalling the packages first, as they have higher version numbers than production packages:
remove.packages(c("RstoxData", "RstoxBase", "RstoxFramework", "RstoxFDA")) install.packages(c("RstoxData", "RstoxBase", "RstoxFramework", "RstoxFDA"), repos = c("https://stoxproject.github.io/repo/"))
In order to construct StoX projects like the example below, sampling parameters must be obtained from processes outside of StoX and formatted as described in the documentation for RstoxFDA::DefinePSUSamplingParameters. Examples provided for testing purposes can be located internally at IMR (https://git.imr.no/pelagic/internal/fangstprovelotteriet/samplingparameters/-/tree/main/formatted_files).
Below is provided an overview of the functions that can be used to perform analytical estimates from fisheries dependent sampling.
RstoxFDA provides a generic framework for analytical estimates of multi-stage sampling data, through a 2-stage approximation. The sampling design is conceptually divided into two steps: Selection of primary sampling units (PSUs), and selection of individuals within each sampling unit. If selection probabilities can be inferred for each of these two steps:
In most circumstances, only a small fraction of the PSUs available in a stratum is actually sampled, and typically this provide a good approximation to the sampling variance, even if within-PSU variance is not accounted for.
As an example, we will demonstrate estimation with data from the catch lottery for North-sea Herring in 2022. This is an unequal probability sampling program, which selects hauls in real-time from an active fishery, with selection probabilities proportional to the first reports of catch size. The sampling frame covers Norwegian vessel larger than 15 m which has declared that they are targeting herring when leaving port, and has declared North-Sea herring in their live catch reports.
See the vignette StoX FDA data preparation (baseline) for how to read data. We use RstoxData::ReadBiotic and RstoxData::StoxBiotic to get the samples in the format we like. And then we use RstoxData::AddToStoxBiotic to add the column 'serialnumber' which is needed for identifying which data records belong to the which sampling parameters. The result of those operations is included as an example data set in RstoxFDA, called 'CatchLotteryExample'. Similarly, we obtain landings data with RstoxData::ReadLanding, RstoxData::StoxLanding, and RstoxData::FilterLanding. An example data set with landings of North Sea Herring from Norwegian vessels larger than 15 meters is included as an example data set in RstoxFDA, called 'CatchLotteryLandingExample'.
We will start by reading these sampling parameters, which has been prepared outside of StoX:
DefinitionMethod = "ResourceFile" FileName = system.file("testresources", "lotteryParameters", "lotteryDesignNSHstrata.txt", package="RstoxFDA") PSUparameters <- RstoxFDA::DefinePSUSamplingParameters(DefinitionMethod=DefinitionMethod, FileName=FileName)
Note that this files contains selection and inclusion probabilites for every haul selected for sampling:
head(PSUparameters$SelectionTable)
The column 'SamplingUnitId' identifies the data records for selected hauls that have been sampled. This corresponds to 'serialnumber' in the sample data. We also have records for selected hauls that was not sampled, which gives us a basis for non-response corrections. These have missing values in the column SamplingUnitId. You may find tabulated below, the sampling parameters for all sampled hauls:
head(PSUparameters$SelectionTable[is.na(PSUparameters$SelectionTable$SamplingUnitId),])
The samples are all selected from the same stratum, called "Nordsjo". Later on, we will need to put this stratum in correspondance with census data (sales-notes). We have therefore added a defining columns to the table 'StratificationVariables':
PSUparameters$StratificationVariables
This tells use that the Stratum "Nordsjo" is defined by the Species code "061104", which we will later find in the landings data. You can use other or additional columns i stead, if that is more appropriate for your fishery or the way you prepare the landings data. You may also use StoX functions for adding columns to the landings, to make sure that the strata sampled are identifiable in the landings, and you may edit the resource-file for sampling parameters to make sure they are identifiable in the samples. For some kinds of estimation the landings are not necessary. In these cases it is sufficient that the strata is encoded in the column "Stratum", and additional stratification variables are not necessary. For purposes of estimation, the case when sampling is not stratified is the same as if we have only one stratum. So the Stratum column can always be encoded.
We also need to know the sampling parameters for each fish in each haul. RstoxFDA provides support for calculating those for some common sampling schemes. We can infer sampling parameters for observations of individual fish within a haul with the function RstoxFDA::DefineIndvidiualSamplingParameters. In the case of the catch lottery sampling, we will assume that fish that are selected for measurement is a simple random sample of the catch (DefinitionMethod="SRS"). Sometimes different sampling schemes are employed for different parameters, so we have to specify a set of parameter which signals that the specimen is selected. We will specify the selected individuals as those where either sex or age is observed, as sex and age is always observed together in this program, and specifying sex rather than just age allows us to catch missing age records (due to for instance readability):
StoxBioticData = RstoxFDA::CatchLotteryExample DefinitionMethod = "SRS" Parameters = c("IndividualSex", "IndividualAge") indParameters <- RstoxFDA::DefineIndividualSamplingParameters(StoxBioticData = StoxBioticData, DefinitionMethod = DefinitionMethod, Parameters=Parameters)
Note that we have obtained sampling parameters for individual fish sampled for sex or age, the haul is identified by the column 'SampleId' that corresponds to 'Haul' in StoxBioticData, and the individual is identified by 'IndividualId' which corresponds to 'Individual' in StoxBioticData. Within-haul stratification is also inferred:
head(indParameters$SelectionTable)
We will now put data and selections in correspondence with the function RstoxFDA::AssignPSUsamplingParameters. We will Identify hauls with the StoxBioticData variable Haul, which is how the PSU is identified in the specification of sampling parameters for individual fish (individualParametersData, computed above):
PSUSamplingParametersData = PSUparameters StoxBioticData = RstoxFDA::CatchLotteryExample SamplingUnitId = "serialnumber" DataRecordId = "Haul" DefinitionMethod = "MissingAtRandom" PSUassigment = RstoxFDA::AssignPSUSamplingParameters(DefinitionMethod=DefinitionMethod, StoxBioticData=StoxBioticData, PSUSamplingParametersData=PSUparameters, SamplingUnitId=SamplingUnitId, DataRecordId=DataRecordId)
The procedure has corrected sampling parameters for non-response, and removed non-respondents from the table of selections. We have also changed the content of the column 'SamplingUnitId', so that it now corresponds to the column used to identify hauls in StoxBiotic:
head(PSUassigment$SelectionTable)
Since these estimators are based on the sampling design, they are generic with respect to variables of interest. For all numeric variables, they can provide an estimate of totals and means by some specified domain definition. In addition total abundance in domain and the frequency of occurrence in domain is estimated. A domain is any defined group of fish in the catches. Of primary interest is groups defined by fish measurements and observations, such as age or length, but caught fish are also distinguished by fishing activity variables, such as the gear used to catch them, or the time of catch. For purposes of estimation we need to distinguish between domain variables that are exclusive to higher level selection and those that are not. Age is for example available for any fish, so estimates of abundance in an age group is backed by all samples. Gear is however exclusive to a catch so estimates of abundance in gear groups are only backed by the samples that are in that gear-domain. In this example we will only estimate domains that are based on observations of fish, but in general this text and the documentation for functions refer to PSU domains for the domains that are exclusive to a primary sampling unit (gear, quarter, area, etc.).
We first estimate abundance, frequencies, totals and means for each PSU. We need to specify the domain we want to estimate for. We provide age so that we get estimates for each age group. We also need to specify which variables we want means and totals for. We will specify weight and length:
StoxBioticData = RstoxFDA::CatchLotteryExample IndividualSamplingParametersData = indParameters DomainVariables = c("IndividualAge") Variables = c("IndividualRoundWeight", "IndividualTotalLength") psuEstimates <- RstoxFDA::AnalyticalPSUEstimate(StoxBioticData=StoxBioticData, IndividualSamplingParametersData = IndividualSamplingParametersData, DomainVariables=DomainVariables, Variables=Variables)
We now have estimates of abundance and frequency for each age group for each PSU/haul:
head(psuEstimates$Abundance)
Note that domains in this table are just identified by a text-variable. As for stratification, more detailed information about the domain is stored in a separate table:
psuEstimates$DomainVariables
This reveals that the domain "1" is defined by fish with the value 1 for the variable "IndividualAge". Note that fish with missing ages have become their own domain called 'NA'.
We also have estimates of totals and means of provided variables:
head(psuEstimates$Variables)
For domains / age groups with estimated zero abundance, totals are 0, and means are NaN.
We can no proceed with estimating these parameters for the total popuation of catches. We use RstoxFDA::AnalyticalPopulationEstimate:
PSUSamplingParametersData = PSUassigment AnalyticalPSUEstimateData = psuEstimates populationEstimates = RstoxFDA::AnalyticalPopulationEstimate(PSUSamplingParametersData, AnalyticalPSUEstimateData = AnalyticalPSUEstimateData)
Which gives abundance, frequencies, means and totals for the entire population:
head(populationEstimates$Abundance) head(populationEstimates$Variables)
Note that the names of domains have changed, to incorporate any domains that vary between PSUs (none in this case). The definition is however still the same, as can be confirmed by inspecting the table of DomainVariables:
populationEstimates$DomainVariables
We also have full sampling covariance matrices, which will give a basis for reporting sampling variances:
head(populationEstimates$AbundanceCovariance)
We have not yet utilized the census information on landings. We can improve our estimate with a ratio estimator that corrects the abundance with the ratio between estimated total weight across all domains / age groups and the reported total landings. The function RstoxFDA::AnalyticalRatioEstimate supports this, but needs the strata or PSU-domains to be in correspondence with variables in the landings, which is why we wanted the PSUs stratification to be defined by 'Species'. The ratio estimate function automatically recognizes identical column names between stratification columns and domain columns in the estimate and the columns available for partitioning the landings. If no corresponding column names are found, or if they have no overlapping codes, an error will be raised. In this case our PSU domain cover the entire strata, which cover the entire sampling frame. We perform the ratio estimate:
AnalyticalPopulationEstimateData = populationEstimates StoxLandingData = RstoxFDA::CatchLotteryLandingExample WeightVariable = "IndividualRoundWeight" Method = "TotalDomainWeight" ratioEst <- RstoxFDA::AnalyticalRatioEstimate(AnalyticalPopulationEstimateData=AnalyticalPopulationEstimateData, StoxLandingData = StoxLandingData, WeightVariable = WeightVariable, Method = Method)
Comparing this with the estimate above, the frequencies should be nearly identical, but some corrections are expected for the abundances:
merge(ratioEst$Abundance, populationEstimates$Abundance, by=c("Stratum", "Domain"), suffix=c(".ratioEst", ".designEst"))
The function RstoxFDA::AnalyticalRatioEstimate also supports other kinds of ratio estimation based on only frequencies and mean weights, so that a total abundance estimate can be obtained even for sampling programs with less detailed information about the sampling parameters.
Finally we can report the domain estimate:
AnalyticalPopulationEstimateData = ratioEst PlusGroup = 9 IntervalWidth = .9 Decimals = 0 caaReport <- ReportAnalyticalCatchAtAge(AnalyticalPopulationEstimateData = AnalyticalPopulationEstimateData, PlusGroup = PlusGroup, IntervalWidth = IntervalWidth, Decimals = Decimals, "10^6 individuals") caaReport$NbyAge
Note that the domain of fish with missing age is preserved and assigned the minimum age 0, and we have added a plus group with minimum age 9. Downstream use may not support undefined age groups, and NAs may have to be dealt with. RstoxFDA have limited support for this. The easiest is to remove NA individuals, which corresponds to the somewhat unrealistic assumption that ages are missing at random. This table is compatible with those produced by other estimation procedures, such as Reca, so that we can use the same plotting functions:
ReportFdaCatchAtAgeData = caaReport PlotCatchAtAgeTotals(ReportFdaCatchAtAgeData = ReportFdaCatchAtAgeData)
For the sake of illustration we will redo the computation with domains that also include other variables beside age. We will first add another fish-observation to the domain defintion, namely sex:
# Redo psu estimates with new domain definition DomainVariables = c("IndividualAge", "IndividualSex") psuEstimates <- RstoxFDA::AnalyticalPSUEstimate(StoxBioticData=StoxBioticData, IndividualSamplingParametersData = IndividualSamplingParametersData, DomainVariables=DomainVariables, Variables=Variables) # Redo population estimate AnalyticalPSUEstimateData = psuEstimates populationEstimates = RstoxFDA::AnalyticalPopulationEstimate(PSUSamplingParametersData, AnalyticalPSUEstimateData = AnalyticalPSUEstimateData) # Redo ratio estimate AnalyticalPopulationEstimateData = populationEstimates ratioEst <- RstoxFDA::AnalyticalRatioEstimate(AnalyticalPopulationEstimateData=AnalyticalPopulationEstimateData, StoxLandingData = StoxLandingData, WeightVariable = WeightVariable, Method = Method) # Report AnalyticalPopulationEstimateData = ratioEst caaReport <- ReportAnalyticalCatchAtAge(AnalyticalPopulationEstimateData = AnalyticalPopulationEstimateData, PlusGroup = PlusGroup, IntervalWidth = IntervalWidth, Decimals = Decimals, "10^6 individuals") # Plot ReportFdaCatchAtAgeData = caaReport PlotCatchAtAgeTotals(ReportFdaCatchAtAgeData = ReportFdaCatchAtAgeData)
As before, note that the exact variables and values that define the domains are documented on the 'DomainVariabes'-table:
head(AnalyticalPopulationEstimateData$DomainVariables)
We can also add PSU-domains, such as quarter. Some of these may be put in correspondence with landing-statistics, and provide a higher-resolution allocation of landings in the ratio-estimation, although typically at the cost of somewhat lower precision in domain estimates.
We will first use standard StoX tools to add the quarter to both landings and samples:
quarter <- RstoxFDA::DefinePeriod(TemporalCategory = "Quarter") StoxBioticDataWquarter <- RstoxFDA::AddPeriodStoxBiotic(StoxBioticData, quarter, "Quarter") StoxLandingDataWquarter <- RstoxFDA::AddPeriodStoxLanding(StoxLandingData, quarter, "Quarter")
And then specify the PSU domains, when inferring sampling parameters for individuals:
DomainVariables = c("IndividualAge") PSUDomainVariables <- c("Quarter") StoxBioticData <- StoxBioticDataWquarter StoxLandingData <- StoxLandingDataWquarter psuEstimates <- RstoxFDA::AnalyticalPSUEstimate(StoxBioticData=StoxBioticData, IndividualSamplingParametersData = IndividualSamplingParametersData, DomainVariables=DomainVariables, Variables=Variables, PSUDomainVariables = PSUDomainVariables)
Otherwise the estimation is exactly as before:
# Redo psu estimates with new domain definition DomainVariables = c("IndividualAge") PSUDomainVariables <- c("Quarter") StoxBioticData <- StoxBioticDataWquarter StoxLandingData <- StoxLandingDataWquarter psuEstimates <- RstoxFDA::AnalyticalPSUEstimate(StoxBioticData=StoxBioticData, IndividualSamplingParametersData = IndividualSamplingParametersData, DomainVariables=DomainVariables, Variables=Variables, PSUDomainVariables = PSUDomainVariables) # Redo population estimate AnalyticalPSUEstimateData = psuEstimates populationEstimates = RstoxFDA::AnalyticalPopulationEstimate(PSUSamplingParametersData, AnalyticalPSUEstimateData = AnalyticalPSUEstimateData) # Redo ratio estimate AnalyticalPopulationEstimateData = populationEstimates ratioEst <- RstoxFDA::AnalyticalRatioEstimate(AnalyticalPopulationEstimateData=AnalyticalPopulationEstimateData, StoxLandingData = StoxLandingData, WeightVariable = WeightVariable, Method = Method) # Report AnalyticalPopulationEstimateData = ratioEst caaReport <- ReportAnalyticalCatchAtAge(AnalyticalPopulationEstimateData = AnalyticalPopulationEstimateData, PlusGroup = PlusGroup, IntervalWidth = IntervalWidth, Decimals = Decimals, "10^6 individuals") # Plot ReportFdaCatchAtAgeData = caaReport PlotCatchAtAgeTotals(ReportFdaCatchAtAgeData = ReportFdaCatchAtAgeData)
In summary a minimal StoX project implementing this kind of analysis, could look something like this:
| Process name | Function | Description | |------|---------|--------| | ReadSamples | ReadBiotic | Reads sample data | | StoxBiotic | StoxBiotic | Convert sample data to StoxBiotic | | AddSamplingId | AddToStoxBiotic | Add serialnumber to StoxBiotic | | ReadPSUsamplingParameters | DefinePSUSamplingParameters | Defines sampling parameters for each PSU | | DefineINDsamplingParameters | DefineIndividualSamplingParameters | Defines sampling parameters for each individual, within PSU | | AssignPSUSamplingParameters | AssignPSUSamplingParameters | Matches selections to data records, handles non-response corrections | | EstimatePSU | AnalyticalPSUEstimate | Estimates abundance, frequencies, totals and means for each PSU | | EstimatePopulation | AnalyticalPopulationEstimate | Estimates abundance, frequencies, totals and means for the population of interest | | ReadLandings | ReadLandings | Reads census information | | StoxLanding | StoxLanding | Convert census data to StoxLanding | | RatioEstimateTotal | AnalyticalRatioEstimate | Improves estimates with census information. |
The following functions provides reporting from analytical estimates:
|Report function | Description | |:------|:-----| | ReportAnalyticalCatchAtAge | Total catch of each age group | | PlotCatchAtAgeTotals | Plots catch of each age group |
In addition some functions are relevant in this context, that may also be used with other estimation approaches, such as Reca:
|Report function | Description | |:------|:-----| | ReportFdaSampling | Produces overview of sampling coverage. Useful for exploring sampling. | | ReportFdaLandings | Produces summaries of landings. | | ReportFdaSOP | Performs Sum-Of-Products tests, checking that the products of mean weight at age and estimated catch at age sum to total landed weight. |
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.