# devtools::install_github("rstudio/rmarkdown@v1.10")
knitr::opts_chunk$set(echo = FALSE,
                      message = FALSE,
                      warning = FALSE,
                      dev = "png",
                      dpi = 300,
                      out.width = "100%")
knitr::opts_knit$set(root.dir = "../..")
options(scipen = 999) # turn off scientific notation to simplify printing
library(SimSurvey)
library(plotly)
library(raster)
library(lattice)
library(viridis)
library(data.table)
library(magrittr)
library(here)
Sys.setenv("MAPBOX_TOKEN" = "pk.eyJ1IjoicGF1bHJlZ3VsYXIiLCJhIjoiY2pueGIxejMxMGM0MjNqbjR1aXdwYm92bSJ9.R-t1SCa0VHKf4N-PQmYI5Q")
source("analysis/paper/helper_functions.R")

replot <- FALSE # plot the figures again? (acts as a manual cache)
is_html <- knitr::is_html_output()


Paul M. Regular^1^*, Gregory J. Robertson^1^, Keith P. Lewis^1^, Jonathan Babyn^1^, Brian Healey^1^, Fran Mowbray^1^

^1^ Fisheries and Oceans Canada, Northwest Atlantic Fisheries Center, 80 East White Hills, St. John's, Newfoundland and Labrador, A1C 5X1, Canada

*Corresponding author
E-mail: Paul.Regular@dfo-mpo.gc.ca (PMR)

cat("# Things to manually fix in word  \n")

cat("1. Adjust font size of code strings (title and headers, and convert links to sections with code to code font [Using SimSurvey, test_surveys])  \n")
cat("2. Decrease font size of tables  \n")
cat("3. Indent closures in Table \\@ref(tab:fun)  \n")
cat("4. Replace pluses (`+`) with spaces in Tables \\@ref(tab:abundance) to \\@ref(tab:survey)  \n")
cat("5. Adjust size of footnote below Table \\@ref(tab:abundance)  \n")
cat("6. Adjust size of Figure \\@ref(fig:real-sim-comp)  \n")
cat("7. Place supplements in separate documents  \n")
cat("8. Remove this section  \n")

Abstract

Populations often show complex spatial and temporal dynamics, creating challenges in designing and implementing effective surveys. Inappropriate sampling designs can potentially lead to both under-sampling (reducing precision) and over-sampling (through the extensive and potentially expensive sampling of correlated metrics). These issues can be difficult to identify and avoid in sample surveys of fish populations as they tend to be costly and comprised of multiple levels of sampling. Population estimates are therefore affected by each level of sampling as well as the pathway taken to analyze such data. Though simulations are a useful tool for exploring the efficacy of specific sampling strategies and statistical methods, there are a limited number of tools that facilitate the simulation testing of a range of sampling and analytical pathways for multi-stage survey data. Here we introduce the R package SimSurvey, which has been designed to simplify the process of simulating surveys of age-structured and spatially-distributed populations. The package allows the user to simulate age-structured populations that vary in space and time and explore the efficacy of a range of built-in or user-defined sampling protocols to reproduce the population parameters of the known population. SimSurvey also includes a function for estimating the stratified mean and variance of the population from the simulated survey data. We demonstrate the use of this package using a case study and show that it can reveal unexpected sources of bias and be used to explore design-based solutions to such problems. In summary, SimSurvey can serve as a convenient, accessible and flexible platform for simulating a wide range of sampling strategies for fish stocks and other populations that show complex structuring. Various statistical approaches can then be applied to the results to test the efficacy of different analytical approaches.

Introduction

Good survey design is a critical prerequisite for obtaining accurate and precise estimates of a population of interest. The need for sound sampling techniques is as relevant today as it was in the 1930's when much of the theory and principles behind the collection of scientific data were developed by statisticians such as R.A. Fisher and Jerzy Neyman [@fienberg1996]. Then and now, we are experiencing a period of increasing human activity and robust methods are required to detect the impacts of our actions on the natural world [@nichols2006]. Survey quality is especially important for conservation-oriented science as monitoring data are used to identify species at risk of extinction [@martin2007], to assess the efficacy of recovery plans [@campbell2002], and to determine precautionary levels of exploitation [@sutherland2001]. Surveys with a solid basis in sampling theory have therefore become a core component of many ecological monitoring programs throughout the world.

One of the biggest challenges associated with surveys is the cost of planning and implementing effective and efficient sampling designs [@reynolds2011]. Despite their costs, surveys have become a mainstay in the management of dynamic fish stocks as survey data are not easily replaced by other sources of information [@Pennington1998]. For instance, large population declines can be overlooked by relying solely on catch and effort data from commercial fisheries [@rose1999]. Such difficulties have contributed to the growing reliance on fisheries-independent surveys to monitor fish populations. Planning or improving such surveys is an important but non-trivial task because of the inherent complexity of both the survey and target population. Specifically, fisheries-independent surveys usually involve multiple levels of sampling and fish populations often show size-structured spatial and temporal dynamics [@Nelson2014]. These features are not easily captured by simple statistical models, which makes it difficult to determine optimal sampling and sub-sampling schemes [@Pennington1994; @Pennington2002; @Stewart2014]. In such situations it is common for practitioners to resort to simulations to test various sampling and analytical pathways (e.g. [@thomas2010]). However, such tests are rare in fisheries research (but see [@schnute2003; @puerta2019]), perhaps because of the complexities associated with fisheries-independent surveys.

Here we document SimSurvey, an R package designed to facilitate realistic simulations of fisheries-independent trawl surveys. In short, the package allows for the simulation of random or stratified-random surveys of an age-structured population that varies in space and time. The package has two main components: the first focuses on mimicking realistic fish stocks by simulating a spatially and age-correlated population distributed across a habitat gradient, and the second component focuses on simulating various surveys of these virtual fish stocks. Although constructed to assess the design of fisheries-independent trawl surveys, this package can be used to model any population that shows complex spatial and age structure.

This simulation framework has similarities to those presented by Schnute and Haigh [-@schnute2003] and Puerta et al. [-@puerta2019], however, we focused our efforts on developing a series of general and accessible functions to simplify the process of testing multiple sampling scenarios and analytical pathways. The steps taken to simulate surveys of spatial, age-structure populations are outlined below. We first outline the equations underlying the package ([Model structure] section) and then we demonstrate how to use its core functions ([Core functions] section). The core functions of the package are largely demonstrated using default settings, and these settings are based on a case study (see for [S1 Appendix][S1 Appendix: Case study] details on the case study, and see [S2 Appendix][S2 Appendix: Parameterisation] for guidance on how to modify default settings to suit specific needs). Several of the results from the case study are described and discussed in the [Interpretation] section as they highlight one use case of the package. Finally, we discuss the broader research opportunities and future directions of the package ([Discussion] section).

Model structure

In this section, we describe the framework currently implemented in the SimSurvey package. With this framework, we tried to strike a balance between realism, simplicity, generality and computational feasibility. The framework follows four general steps: 1) simulate a spatially-aggregated age-structured population; 2) distribute the population throughout a spatial grid, imposing correlation across space, time and age; 3) sample the population using random sampling; and, 4) obtain population estimates using a design-based analysis. Though there is a degree of flexibility in each of these steps, users can circumvent specific components by applying user defined equations, inputs and/or analyses. Details on how to use the package and, if desired, circumvent some aspects of its structure are outlined in the [Core functions] section.

Simulate abundance

The simulation starts with an exponential decay cohort model where the abundance at age $a$ in year $y$ ($N_{a,y}$) equals the abundance of that cohort in the previous year multiplied by the associated survival rate, which is expressed in terms of total mortality ($Z_{a,y}$):

$$N_{a,y} = N_{a-1,y-1} e^{-Z_{a-1,y-1}}$$

Here, numbers at age in the first year are filled via exponential decay, $N_{a,1} = N_{a-1,1} e^{-Z_{a-1,1}}$, numbers at age 1 (i.e. recruits) vary around a baseline value, $log(N_{1,y}) = log(\mu_r) + \epsilon_{y}$, and total mortality is set to a baseline level plus process error, $log(Z_{a,y}) = log(\mu_Z) + \delta_{a,y}$. The error around the recruitment process was set to follow a random walk, $\epsilon_{y} \sim N(\epsilon_{y - 1}, \sigma_{r}^{2})$, and the process error was simulated using the covariance structure described in Cadigan [-@Cadigan2016], $\delta_{a,y} \sim N(0, \Sigma_{a,y})$. The covariance across ages and years is controlled by a process error variance parameter ($\sigma_{\delta}^2$) along with age and year correlation parameters ($\varphi_{\delta, \mathrm{age}}$ and $\varphi_{\delta, \mathrm{year}}$, respectively). This structure allows for autocorrelation in process errors across ages and years (i.e. total mortality can be made to be more similar for fish that are closer together in age and/or time). Note that a plus group is not modeled as the number of ages can easily be extended to include groups with zero fish.

In practice, abundance at age is often inferred from length data as it is easier to collect. Abundance at length is therefore simulated from abundance at age using the original von Bertalanffy growth curve [@von1938]:

$$log(L) = log(L_{\infty} - (L_{\infty} - L_0) e^{-Ka}) + \varepsilon$$

Where $L_{\infty}$ is the mean asymptotic length, $L_0$ is length at birth, $K$ is the growth rate parameter and the error is assumed to follow the normal distribution, $\varepsilon \sim N(0, \sigma_{L}^2)$. Numbers at age are distributed across discrete length groups following a lognormal distribution by calculating the probability of being in a specific length group given age, $\phi_{a,l}$. These probabilities are calculated using the standard normal cumulative density function $\Phi$ for a sequence of length groups $l$ from length 0 to 10 times the maximum predicted length $L$ at an interval of $l_{group}^{N}$:

$$ \phi_{a,l} = \Phi \left ( \frac{log(L_l)}{\sigma_{L}} \right ) - \Phi \left ( \frac{log(L_{l-1})}{\sigma_{L}} \right )$$

Overall, this formulation facilitates the simulation of a dynamic length and age structured population. Though some typical relationships have yet to be implemented (e.g. stock-recruitment), sufficient information can be simulated to assess survey performance across a range of abundance levels across years, lengths and ages.

Simulate spatial distribution

Rather than developing a full spatially-explicit model, population and spatial dynamics are modeled as independent processes for simplicity. The complexities of spatial population dynamics - such as larval dispersal, spatial differences in growth and population connectivity - are not explicitly accounted for and, as such, the model is a necessary simplification of reality. Despite this limitation, the approach taken facilitates the simulation of spatial, age-structured populations with sufficient complexity for testing the efficacy of various survey designs. The simplicity also limits the number of unknown parameters that need to be specified to simulate a population. Parameter estimates from spatially-aggregated age-structured models, which are commonly used in stock assessments, can therefore be used to simulate a population using the cohort model and the resultant abundance at age values can be distributed across a spatial grid. Here, a grid of $s$ cells is generated where each cell has an area of $S$ and depth $d$; depth is defined using a sigmoid curve, applied across one spatial axis, with a depth range of $[d_{\mathrm{min}}, d_{\mathrm{max}}]$, shelf depth of $d_{\mathrm{shelf}}$ and a shelf width of $w_{\mathrm{shelf}}$. We use depth as our main stratification variable, but note that any other appropriate stratification variables could be used. The grid can be divided into two hierarchical levels, management divisions and habitat-based survey strata. For demonstration purposes, we envision these levels as part of a stratified-random survey within international fishery divisions, i.e., $H_{\mathrm{strat}}$ depth-based strata within $H_{\mathrm{div}}$ divisions (e.g. NAFO or ICES divisions, or any other geographically bounded area). The simulated population is distributed through the grid by simulating spatial-temporal noise controlled by a parabolic relationship with depth and covariance between ages, years and space. This noise term, $\eta_{a,y,s}$, is scaled to sum to 1 to ensure that the total population of each age for each year through the grid equals the number simulated by the cohort model:

$$\begin{aligned}N_{a,y,s} &= N_{a,y} \frac{\eta_{a,y,s}}{\sum_{s = 1}^{N_s} \eta_{a,y,s}} \ \eta_{a,y,s} &= \frac{(d_s - \mu_{d}) ^ 2}{2 \sigma_d^2} + \xi_{a,y,s} \end{aligned}$$

Where $d_s$ is the depth in a specific cell of the grid, $\mu_d$ is the mean depth where abundance is typically highest and $\sigma_d$ controls the width or dispersion of abundance around the mean depth. Residual noise $\xi_{a,y,s}$ is added to this depth relationship using a combination of Matérn covariance, to control the level of spatial aggregation within ages and years, and a two dimension AR1 age-year covariance described in Cadigan [-@Cadigan2016], to control the level of similarity in distributions across ages and years. The rate at which point-to-point spatial correlation decays with distance is controlled by a smoothing ($\lambda$) and a scaling parameter ($\kappa$) (here $\kappa$ is approximated from the range parameter ($r$), $\kappa = \sqrt{8 \lambda} / r$; @Blangiardo2015]) and correlation across ages and years is controlled by $\varphi_{\xi, \mathrm{age}}$ and $\varphi_{\xi, \mathrm{year}}$, respectively. The overall variance of the spatial process is controlled by $\sigma_{\xi}^2$ (see [S3 Appendix][S3 Appendix: Age-year-space covariance] for a more detailed description of the space-age-year covariance structure). In short, this formulation allows control of depth preferences, the level of spatial aggregation and the degree of age and year specific clustering.

Simulate survey

The final step in the simulation is to sample the simulated population over the age-year-space array generated. The sampling is random or stratified random, emulating surveys conducted by many research institutions around the world. The area of each strata $A_{\mathrm{strat}}$ is calculated and this is used to define the number of sampling stations $H_{\mathrm{sets}}$, hereafter referred to as sets, allocated to one or more strata under a particular set density, $D_{\mathrm{sets}}$ (i.e. $H_{\mathrm{sets}} = A_{\mathrm{strat}}~D_{\mathrm{sets}}$). The allocated number of cells are randomly selected in each strata and the number of fish caught in each set is calculated by applying binomial sampling of the fish in each sampled cell by the proportion of the area covered by the trawl and the catchability of each age:

$$n_{a,y,s} \sim Bin(N_{a,y,s}, \frac{A_{\mathrm{trawl}}}{A_{\mathrm{cell}}} q_a )$$

Where $n_{a,y,s}$ is the number of fish of age $a$ in year $y$ sampled by a set at location $s$, $A_{\mathrm{trawl}}$ indicates the area covered by the trawl, $A_{\mathrm{cell}}$ is the area of a grid cell, and $q_a$ is the catchability coefficient of each age (i.e. the ability of the trawling gear to catch specific age groups). Here, catchabilities were defined using a logistic curve controlled by a steepness, $k$, and midpoint parameter, $x_0$. In cases where there are multiple sets in one cell, the population in that cell is divided across the sets. While this means that numbers caught in an isolated simulation cannot exceed the numbers in the population, keep in mind that the survey, no matter how intense, is assumed to have no impact on the population from one year to the next.

Once catches are simulated, lengths of the fish sampled by each set are simulated using the von Bertalanffy growth equation found above in the [Simulate abundance] section. Sub-sampling is then conducted whereby a subset of fish are sampled for length measurements and a subset of this subset are sampled for age determination. Specifically, a maximum number of lengths are measured per set, $M_{\mathrm{lengths}}$, and a maximum number of ages, $M_{\mathrm{ages}}$, are sampled per length group, $l_{\mathrm{group}}$, per division, strata or set $s_{\mathrm{group}}$. Such sub-sampling is common in fisheries-independent surveys as it is costly, impractical and unnecessary to sample every fish captured. Age determination is especially time-consuming, which is why otoliths for age-determination tend to be sub-sampled by length-bin to obtain a representative age sample across a wider range of lengths than would be obtained via random sampling.

Stratified analysis

While there are many model-based options for obtaining an abundance index from survey data (e.g. [@Thorson2015]), design-based approaches, such as stratified analyses, are often used. Here we apply formulae presented in Smith and Somerton [-@Smith1981] (equations are replicated in [S4 Appendix][S4 Appendix: Stratified analysis equations]) to calculate year $y$ and simulation $j$ specific stratified estimates of total abundance ($\hat{I}{y,j}$), abundance at length ($\hat{I}{l,y,j}$) and abundance at age ($\hat{I}{a,y,j}$). Note that estimates of total abundance are based on total numbers caught at each set, $n_i$, while abundance at length requires the sub-sampled length frequencies at each set, $m{l, i}$, to be scaled up using set-specific ratios of the number of fish measured, $m_{i}$, to numbers caught, $n_i$:

$$n_{l, i} = m_{l, i} \times \frac{n_i}{m_i}$$

Likewise, age frequencies need to be calculated to obtain stratified estimates of abundance at age. This is done by constructing an age-length key, which is the proportion of fish in each length bin that fall into specific age classes. Once these proportions are calculated, they are applied to the bumped up length frequencies, $n_{l,i}$, to approximate age frequencies, $n_{a,i}$, that is:

$$n_{a,i} = \sum_{l = 1}^{L} n_{l,i} \times \frac{m^{\prime}{a,l,g}}{m^{\prime}{l,g}}$$

Here, the prime symbol $^{\prime}$ indicates that these values are tertiary sampling units as they represent fish sub-sampled for age determination from those sub-sampled for length measurements. Within this sub-sample, $m^{\prime}{l,g}$ represents the number of fish within length bin $l$ and from area $g$, and $m^{\prime}{a,l,g}$ represents the number of fish within age-class $a$, length bin $l$ and area $g$. The area $g$ can be division, strata or set specific. Though age-length keys are typically constructed and applied over large spatial scales (e.g. division), Aanes and Vølstad [@Aanes2015] recommends calculating proportions at finer scales to better account for hierarchical sample designs.

We used root-mean-squared error (RMSE) as a measure of the precision and bias of the abundance at age estimates from each survey:

$$RMSE = \sqrt{\frac{ \sum_{a = 1}^{A} \sum_{y = 1}^{Y} \sum_{j = 1}^{J} (\hat{I}{a,y,j} - I{a,y})^2}{A \times Y \times J}}$$

Where $A$, $Y$, and $J$ are the number of ages, years and simulations, respectively, and $I_{a,y}$ is the true abundance available to the survey (i.e. catchability corrected abundance; $I_{a,y} = q_a N_{a,y}$). RMSE was also calculated for abundance at length estimates, where the above formula is indexed by length groups $l$, and total abundance, which lacks a group index of $a$ or $l$.

Core functions

The SimSurvey package was written in the programming language R [@R] and it holds a series of functions for 1) simulating the abundance and distribution of virtual fish populations with correlation across space, time and age (sim_abundance, sim_distribution), 2) simulating surveys with a range of sampling strategies and intensities (sim_survey), and 3) estimating the stratified mean and variance of simulated survey data (run_strat; Table \@ref(tab:fun)). SimSurvey relies heavily on functions from the data.table [@dowle2017], raster [@hijmans2016] and plotly [@sievert2018] packages for their efficient data processing, geographic and plotting facilities, respectively. Package documentation has been published online using pkgdown (https://paulregular.github.io/SimSurvey/) and all the source R code behind SimSurvey is available on GitHub (https://github.com/PaulRegular/SimSurvey). SimSurvey can be installed via GitHub using the remotes package:

install.packages("remotes")
remotes::install_github("PaulRegular/SimSurvey")

Table: (#tab:fun) Names and descriptions of the key functions of SimSurvey. Functions in bold font are core functions and those in medium font are designed for use inside the core functions. The latter are typically closures, which are functions that contain data and return functions [@wickham2014]; here they are used to store parameter values and return functions that require dimensions, such as ages or years, to be supplied.

| Function | Description | |:---------|:------------------------------| | sim_abundance | Simulate a basic age-structured population dynamics model | | sim_R, sim_Z, sim_N0, sim_vonB | Closures, to use inside sim_abundance, for simulating recruitment, total mortality, initial abundance, and growth, respectively | | sim_distribution | Simulate spatial and temporal distribution of an age-structured population | | sim_ays_covar, sim_parabola | Closures, to use inside sim_distribution, for simulating age-year-space covariance and parabolic relationships with covariates (e.g. depth), respectively | | make_grid | Make a basic depth stratified square grid to use inside sim_distribution | | sim_survey | Simulate a survey of a spatial, age-structured population | | sim_logistic | Closure, to use inside sim_survey, for simulating age-specific catchability as a logistic curve | | run_strat | Run a stratified analysis on simulated survey data | | strat_error | Calculate the error of stratified estimates (e.g. root mean squared error of stratified estimates from true values) | | test_surveys | Test the sampling design of multiple surveys using a stratified analysis (internally loops over sim_survey, run_strat and strat_error) | | expand_surveys | Create a data frame, for use in test_surveys, with all combinations of supplied survey settings |


The equations behind the functions listed in Table \@ref(tab:fun) are detailed in the [Model structure] section. Note that several of the core equations are implemented using "closures", which are functions that contain data and return functions [@wickham2014]. For example, sim_R returns a function that holds the supplied parameter values and requires a sequence of years to be supplied.

R_fun <- sim_R(log_mean = log(500), log_sd = 0.5)
R_vec <- R_fun(years = 1:100)

Here, the R_vec object holds 100 years of simulated recruitment values and each run of the R_fun function will result in different simulated values using the internal formulation and the parameters supplied to sim_R. The other closures included in the package operate in a similar way in that parameter inputs are supplied to the closure and the functions returned by the closure requires inputs such as ages and/or years. This was done to avoid the repeated specifications of key arguments, such as ages and years. Moreover, this approach provides an option for advanced R users to inspect and modify the closures implemented in the package to supply custom closures with alternate equations (see https://paulregular.github.io/SimSurvey/articles/custom_closures.html for a short vignette on creating custom closures). Also note that each of the closures implemented in the package includes a plot argument such that quick visuals can be obtained using a line of code like this: sim_R(log_mean = log(500), log_sd = 0.5, plot = TRUE)(years = 1:100)).

sim_abundance

Table: (#tab:abundance) Default sim_abundance function call, with descriptions, default values and associated parameter symbols of key arguments.

| Function call | Description | Symbol | |:------------------------------------------- |:---------------------------------------------------------- | :---------------------------------- | | sim_abundance( |   |   | | + ages = 1:20, | Ages | $a$ | | + years = 1:20, | Years | $y$ | | + R = sim_R(log_mean = log(30000000), | Mean log recruitment^1^ | $\mu_r$ | | + log_sd = 0.5), | Standard deviation of log recruitment | $\sigma_r$ | | + Z = sim_Z(log_mean = log(0.5), | Mean log total mortality ^2^ | $\mu_Z$ | | + log_sd = 0.2, | Standard deviation of log total mortality | $\sigma_Z$ | | + phi_age = 0.9, | Correlation across ages in error around total mortality | $\varphi_{\delta, \mathrm{age}}$ | | + phi_year = 0.5), | Correlation across years in error around total mortality | $\varphi_{\delta, \mathrm{year}}$ | | + growth = sim_vonB(Linf = 120, | Mean asymptotic length (cm) | $L_{\infty}$ | | + L0 = 5, | Length in birth year (cm) | $L_0$ | | + K = 0.1, | Growth rate | $K$ | | + log_sd = 0.1, | Standard deviation of the von Bertalanffy growth curve | $\sigma_L$ | | + length_group = 3)) | Length group bin size for abundance at length (cm) | $l_{\mathrm{group}}^{N}$ |

^1^ Can be a vector of means with a length equal to the number of years in the simulation.
^2^ Can be a matrix of means with number of rows and columns equaling the number of ages and years in the simulation, respectively.


Abundance at age and length is simulated using the sim_abundance function and a default function call is described in Table \@ref(tab:abundance) along with associated symbols from the equations outlined in the [Simulate abundance] section. This function has a simple structure and requires the specification of a series of ages and years along with a series of closures, such as sim_R, sim_Z and sim_vonB, for simulating recruitment (R), total mortality (Z) and growth (growth), respectively. Overall, the function provides a simple tool for simulating a range of dynamic age-structured populations. For instance, below we provide examples where we simulate a relatively long and short lived species (note that default variance, starting abundance and growth settings were used in both simulations).

set.seed(438)
long <- sim_abundance(ages = 1:20,
                      R = sim_R(log_mean = log(3e+07)),
                      Z = sim_Z(log_mean = log(0.2)))
short <- sim_abundance(ages = 1:6,
                       R = sim_R(log_mean = log(1e+10)),
                       Z = sim_Z(log_mean = log(0.8)))

The sim_abundance function returns a list with the sequence of ages (ages), sequence of years (years), sequence of lengths (lengths), numbers of recruits across all years (R), numbers at age in the first year (N0), total mortality matrix (Z), abundance at age matrix (N), abundance at length matrix (N_at_length) and the function supplied to the growth argument (sim_length). The growth function is retained for later use in sim_survey to simulate lengths given simulated catch at age in a simulated survey.

The package also includes several plotting functions for making quick plotly-based [@sievert2018] interactive visuals of the simulated population. For instance, the plot_surface function can be used to make quick visuals of matrices contained within the list returned by sim_abundance. As an example, we display the abundance at age matrix (object named N in the list produced by sim_abundance; Figure \@ref(fig:surface)); other names can be supplied to the mat argument to visualize a different matrix from the sim_abundance list, such as Z.

plot_surface(long, mat = "N")
plot_surface(short, mat = "N")
pa <- plot_surface(long, mat = "N", scene = "scene1", showscale = FALSE) %>% 
  add_labels(x = 0.05, y = 0.9, text = "a)")
pb <- plot_surface(short, mat = "N", scene = "scene2", showscale = FALSE) %>% 
  add_labels(x = 0.05, y = 0.9, text = "b)")

p <- subplot(pa, pb) %>%
  layout(scene = list(domain = list(x = c(0, 0.5), y = c(0, 1))),
         scene2 = list(domain = list(x = c(0.5, 1), y = c(0, 1)),
                       xaxis = list(title = "Age"),
                       yaxis = list(title = "Year"),
                       zaxis = list(title = "N")))

if (replot) export_plot(p, file = "analysis/paper/figures/plot_surface.png", width = 800, height = 370)
if (is_html) {
  p
} else {
  knitr::include_graphics(here("analysis", "paper", "figures", "plot_surface.png"))
}

sim_distribution

Table: (#tab:distribution) Default sim_distribution function call, with descriptions and associated parameter symbols of key arguments.

| Function call | Description | Symbol | |:------------------------------------------------------- |:------------------------------------------------------------ | :-------------------------------------- | | sim_distribution( |   |   | | + sim, | Simulated population from sim_abundance |   | | + grid = make_grid(x_range = c(-140, 140), | Range of grid in the x dimension |   | | + y_range = c(-140, 140), | Range of grid in the y dimension |   | | + res = c(3.5, 3.5), | Grid resolution in x and y dimensions (km) - i.e. cell area | $A_{\mathrm{cell}}$ | | + shelf_depth = 200, | Shelf depth (m) | $d_{\mathrm{shelf}}$ | | + shelf_width = 100, | Shelf width (km) | $w_{\mathrm{shelf}}$ | | + depth_range = c(0, 1000), | Depth range from coast to slope (m) | $[d_{\mathrm{min}}, d_{\mathrm{max}}]$ | | + n_div = 1, | Number of divisions | $H_{\mathrm{div}}$ | | + strat_splits = 2, | Number of strata within each depth class |   | | + strat_breaks = seq(0, 1000, 40)), | Series of depth breaks for defining strata |   | | + ays_covar = sim_ays_covar(sd = 2.8, | Standard deviation of age-year-space distribution | $\sigma_{\xi}$ | | + range = 300, | Range of spatial correlation (km) | $r$ | | + lambda = 1, | Smoothness of spatial correlation | $\lambda$ | | + phi_age = 0.5, | Correlation across ages in spatial distribution | $\varphi_{\xi, \mathrm{age}}$ | | + phi_year = 0.9, | Correlation across years in spatial distribution | $\varphi_{\xi, \mathrm{year}}$ | | + group_ages = 5:20, | Make space-age-year noise equal across these ages^1^ |   | | + group_years = NULL), | Make space-age-year noise equal across these years^1^ |   | | + depth_par = sim_parabola(mu = 200, | Depth at which abundance is typically highest (m) | $\mu_d$ | | + sigma = 70)) | Dispersion around depth of peak abundance (m) | $\sigma_d$ |

^1^ All ages or years are independent if argument values is NULL.


The equations outlined in the [Simulate spatial distribution] section are used in the make_grid, sim_ays_covar and sim_parabola functions, and these functions are used within sim_distribution to distribute a population simulated using sim_abundance throughout a grid (Table \@ref(tab:distribution)). The output from make_grid is a raster object [@hijmans2016] with four layers: depth, cell, division and strat. If a more detailed and realistic grid is required, users can manually generate their own survey grid using real data and this grid can be supplied as a raster to sim_distribution if the same structure and correct projection is used. The package includes a manually constructed survey grid of NAFO Subdivision 3Ps off the southern coast of Newfoundland (named survey_grid) and the data-raw folder in the GitHub directory includes the data and code used to construct this grid. However, for simplicity, we use make_grid to construct a square grid for a default run of sim_distribution. Below we generate and plot (Figure \@ref(fig:grid)) a default grid, another grid with the number of strata increased by increasing the number of strat_splits, and another with the number of divisions increased using n_div and a linear depth gradient (the sigmoid curve is forced to be linear when shelf_width is set to zero).

a <- make_grid(n_div = 1, strat_splits = 2, shelf_depth = 200,
               shelf_width = 100, depth_range = c(0, 1000))
b <- make_grid(n_div = 1, strat_splits = 3, shelf_depth = 200, 
               shelf_width = 100, depth_range = c(0, 1000))
c <- make_grid(n_div = 4, strat_splits = 1, shelf_depth = 500, 
               shelf_width = 0, depth_range = c(0, 1000))
plot_grid(a)
plot_grid(b)
plot_grid(c)
a <- make_grid() %>% 
  plot_grid(showscale = TRUE) %>% 
  add_labels(x = 0, y = 1, text = "a)")
b <- make_grid(strat_splits = 3) %>% 
  plot_grid(showscale = FALSE) %>% 
  add_labels(x = 0.04, y = 1, text = "b)")
c <- make_grid(n_div = 4, shelf_depth = 500, shelf_width = 0, depth_range = c(0, 1000)) %>% 
  plot_grid(showscale = FALSE) %>% 
  add_labels(x = 0.08, y = 1, text = "c)")
p <- subplot(a, b, c)

if (replot) export_plot(p, file = "analysis/paper/figures/plot_grid.png", width = 800, height = 320)
if (is_html) {
  p
} else {
  knitr::include_graphics(here("analysis", "paper", "figures", "plot_grid.png"))
}

In addition to supplying objects produced by sim_abundance and make_grid, the sim_distribution function requires two closures that describe the age-year-space covariance and the relationship with depth. Here we use sim_ays_covar and sim_parabola to control these relationships and a wide range of age and year specific distributions can be obtained by adjusting a few parameters in these closures (Figure \@ref(fig:distribution)). While here we only describe one option for simulating a spatial noise (see [S3 Appendix][S3 Appendix: Age-year-space covariance] for details), custom closures can be used that leverage simulation models provided by packages such as RandomFields [@schlather2020] or INLA [@rue2009] (see the code behind sim_ays_covar_sped [https://github.com/PaulRegular/SimSurvey/blob/master/R/sim_dist_spde.R] for an example of how the sim_ays_covar closure was modified to apply a Stochastic Partial Differential Equation approach using the INLA package). Below we run a default sim_distribution call, which generates a population that forms tight clusters that are more strongly correlated across years than ages, and another call that generates a population that is more diffuse (i.e. wider range) and exhibits stronger correlation across ages than years (i.e. lower phi_year and higher phi_age). Distributions can also be forced to be the same across ages and years by using the group_ages and group_years arguments, respectively, in the sim_ays_covar closure. Variance in the size of the clusters can also be modified by changing the sd argument in the sim_ays_covar function. In other words, these parameters can be modified to control the degree of age-specific clustering and inter-annual site-fidelity exhibited by the simulated population. Note that the resolution of the default grid is high and, as such, the simulations below may take minutes to complete. Also note that the key functions in the SimSurvey package have been set-up to be pipe (%>%; [@bache2014]) friendly such that values from one function call are forwarded to the next function (i.e. output from the two calls below are functionally the same though the approach is slightly different).

set.seed(438)
a <- sim_distribution(sim = sim_abundance(), # nested approach
                      ays_covar = sim_ays_covar(range = 300, # clustered
                                                phi_year = 0.9,
                                                phi_age = 0.5))
b <- sim_abundance() %>%  # pipe approach
  sim_distribution(ays_covar = sim_ays_covar(range = 2000, # diffuse
                                             phi_year = 0.2,
                                             phi_age = 0.9))

The sim_distribution function retains all the data simulated by sim_abundance and adds a data.table [@dowle2017], named sp_N, with abundance (N) split by age, year and cell. The function also retains the grid object and converts these data into a data.table, named grid_xy, with headers x, y, depth, cell, division and strat. The sp_N object can be merged with the grid_xy data by cell to associate abundance with specific locations, depth, divisions or strata. The plot_distribution function can be used to provide a quick visual of the distribution across ages and years. The code below will generate interactive plots with an Age-Year slider, however, for this paper we present a facet plot of the simulated data (Figure \@ref(fig:distribution)).

plot_distribution(a, ages = 1:3, years = 1:3, type = "heatmap")
plot_distribution(b, ages = 1:3, years = 1:3, type = "heatmap")
facet_distribution <- function(data) {
  ay <- expand.grid(age = 1:3, year = 1:3)
  plot_list <- vector("list", nrow(ay))
  for (i in 1:3) {
    for (j in 1:3) {
      p <- plot_distribution(data, ages = i, years = j, type = "heatmap") %>% 
        hide_colorbar() %>% 
        layout(autosize = FALSE, width = 100, height = 100,
               margin = list(l = 0, r = 0, b = 0, t = 0, pad = 0))
      plot_list[[which(ay$age == i & ay$year == j)]] <- p
    }
  }
  subplot(plot_list, nrows = 3, shareX = TRUE, shareY = TRUE, margin = 0.01) %>% 
    layout(autosize = FALSE, width = 300, height = 300,
           margin = list(t = 20, l = 20, b = 10, r = 10))
} 

a_plots <- facet_distribution(a)
b_plots <- facet_distribution(b)

p <- subplot(a_plots, b_plots) %>% 
  add_labels(text = rep(c("Age 1", "Age 2", "Age 3"), 2), x = c(0.04, 0.20, 0.4, 0.6, 0.8, 0.95), y = rep(0.995, 6), yshift = 20) %>% 
  add_labels(text = rep(c("Year 3", "Year 2", "Year 1"), 2), x = c(rep(0.005, 3), rep(0.54, 3)), y = c(0.08, 0.5, 0.92), xshift = -20, textangle = -90) %>% 
  add_labels(text = "a)", x = 0, y = 1, yshift = 20, xshift = -20) %>% 
  add_labels(text = "b)", x = 0.54, y = 1, yshift = 20, xshift = -20)

if (replot) export_plot(p, file = "analysis/paper/figures/plot_distribution.png", width = 600, height = 300)

a_plots <- plot_distribution(a, ages = 1:3, years = 1:3, type = "heatmap") %>% 
  add_labels(x = 0.08, y = 1, text = "a)")
b_plots <- plot_distribution(b, ages = 1:3, years = 1:3, type = "heatmap") %>% 
  add_labels(x = 0.08, y = 1, text = "b)")
cap <- "Distribution plots of simulated populations that form a) tight clusters with stronger correlation through years than ages (default settings), and b) relatively diffuse clusters with stronger correlation through ages than years."
if (is_html) {
  cap <- paste(cap, "These plots were produced by `plot_distribution` when supplied simulations from `sim_distribution`.")
} else {
  cap <- paste(cap, "This plot is a facet of plots produced by `plot_distribution` when supplied simulations from `sim_distribution`.")
}
a_plots
if (is_html) {
  b_plots
} else {
  knitr::include_graphics(here("analysis", "paper", "figures", "plot_distribution.png"))
}

sim_survey

Table: (#tab:survey) Default sim_survey function call, with descriptions and associated parameter symbols of key arguments.

| Function call | Description | Symbol | |:----------------------------------- |:--------------------------------------------------------------------------- | :---------------------- | | sim_survey( |   |   | | + sim, | Simulated spatial population from sim_distribution |   | | + n_sims = 1 | Number of times to repeat the survey |   | | + q = sim_logistic(k = 2, | Steepness of logistic curve of catchability | $k$ | | + x0 = 3), | Midpoint of logistic curve of catchability (age) | $x_0$ | | + trawl_dim = c(1.5, 0.02), | Trawl dimensions (distance towed, trawl width; km) - i.e. area trawled | $A_{\mathrm{trawl}}$ | | + min_sets = 2 | Minimum number of sets to conduct per strata |   | | + set_den = 2/1000, | Set density (km^-2^) | $D_{\mathrm{sets}}$ | | + lengths_cap = 500, | Maximum number of lengths to collect / set | $M_{\mathrm{lengths}}$ | | + ages_cap = 10, | Maximum number of ages to sample / length group / spatial group ^1^ | $M_{\mathrm{ages}}$ | | + age_sampling = "stratified", | Controls whether age sampling is length "stratified" or "random" |   | | + age_length_group = 1, | Length group bin size for stratified age sampling (cm) | $l_{\mathrm{group}}$ | | + age_space_group = "division") | Spatial scale of stratified age sampling ("division", "strat", "set") | $s_{\mathrm{group}}$ |

^1^ Length group and spatial group are defined using the age_length_group and age_space_group arguments, respectively. These arguments are ignored if age_sampling is set to "random" and the value supplied to ages_cap represents the maximum number of ages to sample per set.


The function sim_survey can be used to simulate data from one survey over a population created using sim_distribution. A default function call is described in Table \@ref(tab:survey). The sim_survey function simulates the sampling process of the survey and, as such, requires a closure for defining catchability as a function of age and survey protocol settings. Specifically, the q argument requires a closure, such as sim_logistic, for defining the probability of catching specific age groups, trawl dimensions are defined in the trawl_dim argument, and set, length and age sampling effort are defined using the set_den, lengths_cap and ages_cap arguments, respectively. Also note that the min_sets argument imposes a minimum of number of sets to conduct per strata, regardless of its allocation given strata area and set density. This argument imposes a useful constraint for generating data to be analyzed using design-based approaches that require more than one value for the calculation of a mean.

Like sim_abundance and sim_distribution, custom closures can be supplied to sim_survey to impose alternate parametric curves for catchability at age (i.e. a closure including an equation for a dome can be constructed and used in lieu of sim_logistic to impose a dome-shaped catchability). Multiple simulations of the same survey can be run using the n_sims argument, however, requesting large numbers of simulations can be computationally demanding depending on the processing capacity available. Below we use sim_survey to simulate two surveys over a default population, of which one is set-up to have higher set density (set_den) than the other.

set.seed(438)
pop <- sim_abundance() %>% 
  sim_distribution()
a <- pop %>% 
  sim_survey(n_sims = 5,
             set_den = 1 / 1000,
             lengths_cap = 100,
             ages_cap = 5)
b <- pop %>% 
  sim_survey(n_sims = 5,
             set_den = 5 / 1000,
             lengths_cap = 500,
             ages_cap = 25)

Again, this function retains all the objects listed in the output of sim_distribution and adds data.tables that detail the set locations (setdet) and sampling details (samp). Catchability corrected abundance matrices (abundance at age matrix multiplied by survey catchability), named I and I_at_length, are also produced and added to the output; these matrices are useful for comparing the true abundance available to the survey to abundance estimates obtained using design-based or model-based analyses of the simulated survey data. Specific surveys can be explored using the plot_survey function, which uses plotly [@sievert2018] and crosstalk [@cheng2016] in the background to link the bubble plot of aggregate set catch to the histogram of lengths and ages sampled to facilitate explorations of set-specific catches (Figure \@ref(fig:survey)).

plot_survey(a, which_sim = 1, which_year = 20)
plot_survey(b, which_sim = 1, which_year = 20)
plot_a <- plot_survey(a, which_sim = 1, which_year = 20) %>% 
  add_labels(text = "a)", x = 0, y = 1, yshift = 5, xshift = -5) %>% 
  layout(margin = list(l = 10, r = 10, t = 10, b = 10))
plot_b <- plot_survey(b, which_sim = 1, which_year = 20) %>% 
  add_labels(text = "b)", x = 0, y = 1, yshift = 5, xshift = -5) %>% 
  layout(margin = list(l = 10, r = 10, t = 10, b = 10))

## plotly::subplot was not amenable to combining these two plots, I therefore used ImageMagick to combine a and b
if (replot) {
  paths <- c(a = "analysis/paper/figures/plot_survey_a.png", b = "analysis/paper/figures/plot_survey_b.png")
  export_plot(plot_a, file = paths["a"], width = 800, height = 400)
  export_plot(plot_b, file = paths["b"], width = 800, height = 400)

  image_a <- magick::image_read(paths["a"])
  image_b <- magick::image_read(paths["b"])
  p <- magick::image_append(c(image_a, image_b), stack = TRUE)
  magick::image_write(p, path = "analysis/paper/figures/plot_survey.png")
}
plot_a
if (is_html) {
  plot_b
} else {
  knitr::include_graphics(here("analysis", "paper", "figures", "plot_survey.png"))
}

As noted above, available RAM may limit the utility of the sim_survey function for running thousands of simulations of the same survey. The sim_survey_parallel function was therefore constructed to facilitate this process. This function is set-up to run multiple sim_survey calls in parallel using the doParallel package [@weston2015] and, as such, multiple loops can be run using the n_loops argument and, within each loop, multiple simulations can be run (controlled using the n_sims argument). Total simulations will be the product of n_loops and n_sims arguments. If more than one core (cores) is specified, then the simulations will be run in parallel to speed up the process. Low numbers of n_sims and high numbers of n_loops will be easier on RAM, but may be slower. The optimum ratio of n_sims to n_loops will depend on the amount of RAM and number of cores in a given computer. In any case, this function simplifies the process of running thousands of simulations of the same survey and the simulated data can then be supplied to survey-based or model-based analyses that require simulation testing.

run_strat

Stratified estimates of abundance are obtained by supplying the output from sim_survey to the run_strat function. There are only four arguments in the run_strat function: length_group, alk_scale, strat_data_fun and strat_means_fun. The length_group argument defines the size of the length frequency bins for both abundance at length calculations and age-length key construction; this argument is set to "inherit", by default, to utilize the length_group value defined inside the closure supplied to the growth argument in sim_abundance. The alk_scale argument defines the scale at which to construct and apply age-length keys ("division" [default], "strat" or "set"). Finally, strat_data_fun and strat_means_fun allow users to supply custom functions for processing the simulated data and calculating stratified means, respectively; the built in functions strat_data and strat_means are supplied by default. RMSE of the stratified estimates from run_strat can then be calculated using the strat_error function. Results and error of a stratified analysis of one survey over a population are obtained using the following code (using default values):

set.seed(438)
sim <- sim_abundance() %>% 
  sim_distribution() %>% 
  sim_survey() %>% 
  run_strat() %>% 
  strat_error()

The returned object will include all the objects accumulated through the sim_abundance to strat_error. The run_strat function adds three data.tables called total_strat, length_strat and age_strat that include stratified estimates of total abundance, abundance at length, and abundance at age, respectively. To this, strat_error adds data.tables ending with _strat_error or _strat_error_stats. The _strat_error objects simply contain stratified estimates of abundance (column named I_hat) with corresponding true values of abundance available to the survey (column named I) and the strat_error_stats data.frame includes metrics of mean error (ME), mean absolute error (MAE), mean-squared error (MSE) and root-mean-squared error (RMSE).

test_surveys

Assuming a stratified analysis as the default method for obtaining an index of abundance, a series of survey protocols can be tested using the test_surveys function. Provided a simulated population from sim_distribution and a series of survey protocols from expand_surveys, this function will simulate and analyze data from each survey using the sim_survey, run_strat and strat_error functions. Like sim_survey_parallel, this function operates in parallel and allows the specification of n_sims and n_loops, and the product of these two arguments equals the number of times each survey is simulated. Keep in mind that low numbers of n_sims and high numbers of n_loops will be less demanding on RAM, but may be slower, especially if the work is spread across few cores. Because most of the default settings of the functions match the case study settings, the code below will replicate the results from our case study (see [S1 Appendix][S1 Appendix: Case study] for more detail). The expand_surveys function sets up a series of r nrow(expand_surveys()) surveys to test (i.e. all possible combinations of the set_den, lengths_cap and ages_cap vectors) and the test_surveys function will run 1000 simulations of each survey and compare stratified estimates of abundance to the true abundance available to the survey. While test_surveys is set-up for testing key sampling effort settings (set_den, lengths_cap and ages_cap), other options can be assessed using independent calls of test_surveys as it accepts several arguments from sim_survey (q, trawl_dim, min_sets, age_sampling, age_length_group and age_space_group) and run_strat (length_group and alk_scale).

set.seed(438)
pop <- sim_abundance() %>%
  sim_distribution()

surveys <- expand_surveys(set_den = c(0.0005, 0.001, 0.002, 0.005, 0.01),
                          lengths_cap = c(5, 10, 20, 50, 100, 500, 1000),
                          ages_cap = c(2, 5, 10, 20, 50))

tests <- test_surveys(pop, surveys = surveys,
                      n_sims = 5, n_loops = 200, cores = 3)
load("analysis/cod_sim_exports/2018-10-26_age_clust_test/test_output.RData")
tests <- sim

Processing time will be system (i.e. amount of RAM and number of cores) and setting (i.e. n_loops and n_sims ratio) dependent. The test_survey function will print a progress bar, generated using the progress package [@csardi2016], which details percent completion and will also include an estimate time of arrival (eta) after the first step of the loop completes. The test_surveys function therefore includes an option for exporting intermediate results to a local directory, via the export_dir argument, and the resume_test function can be used to resume a test_surveys run that had to be stopped part way through the process. The final object produced will be a list that includes all objects from sim_abundance and sim_distribution with the table of survey designs tested (named surveys) and tables produced by strat_error that end with the names _strat_error and _strat_error_stats. These tables include a survey column to allow merging of the survey protocol table with the error tables. Objects produced by sim_survey (set and sampling details) and run_strat (full stratified analysis results) are not retained to minimize object size.

Like other core functions, some convenience functions are included in SimSurvey for creating interactive plots of the results from test_surveys. For instance a series of plotting functions ending in _fan produces fan charts where stratified estimates of abundance from each simulated survey are converted into a series of quantiles to depict the probability that estimates fall within a particular range. True values of abundance available to the survey are overlaid on the series of probability envelopes. These plots help visually assess the level of precision and bias from a specific set of survey protocol; ideally, the probability envelopes will be tightly centered around the true values. The three lines of code below will produce interactive fan charts for stratified estimates of total abundance, abundance at length and abundance at age, respectively (e.g. Figure \@ref(fig:fan1), \@ref(fig:fan2), \@ref(fig:fan3)).

plot_total_strat_fan(tests)
plot_length_strat_fan(tests, years = 1:20, lengths = 1:100)
plot_age_strat_fan(tests, years = 1:20, ages = 1:10)
surveys <- tests$surveys
set_dens <- c(0.0005, 0.002, 0.01)
i <- surveys$survey[surveys$set_den == set_dens[1] & surveys$lengths_cap == 10 & surveys$ages_cap == 10] # lengths_cap and ages_cap does not matter here
a <- plot_total_strat_fan(tests, surveys = i)
i <- surveys$survey[surveys$set_den == set_dens[2] & surveys$lengths_cap == 10 & surveys$ages_cap == 10]
b <- plot_total_strat_fan(tests, surveys = i)
i <- surveys$survey[surveys$set_den == set_dens[3] & surveys$lengths_cap == 10 & surveys$ages_cap == 10]
c <- plot_total_strat_fan(tests, surveys = i)

p <- subplot(a, b, c, nrows = 1, shareY = TRUE, titleX = FALSE) %>% 
  add_labels(x = 0.5, y = 0, text = "Year", yshift = -40) %>% 
  add_labels(x = c(0.1, 0.5, 0.9), y = rep(1, 3), yshift = 20, 
             text = paste("D<sub>sets</sub> =", set_dens))

if (replot) export_plot(p, file = "analysis/paper/figures/plot_total_strat_fan.png", width = 700, height = 250)

``r}$. The thick black line indicates the true trend in the total population available to the survey and the yellow to purple color gradient represents a range of probability envelopes from 10% to 90%. This plot is a facet of plots produced byplot_total_strat_fanwhen supplied results fromtest_surveys`."}

if (is_html) { p } else { knitr::include_graphics(here("analysis", "paper", "figures", "plot_total_strat_fan.png")) }

```r

facet_fan <- function(data, ages = NULL, lengths = NULL, ylim = NULL, set_dens = c(0.0005, 0.002, 0.01), lengths_caps = c(10, 100, 1000)) {
  surveys <- data$surveys
  s <- expand.grid(set_den = set_dens, lengths_cap = lengths_caps)
  plot_list <- vector("list", nrow(s))
  for (i in seq_along(set_dens)) {
    for (j in seq_along(lengths_caps)) {
      s_i <- surveys$survey[surveys$set_den == set_dens[i] & surveys$lengths_cap == lengths_caps[j] & surveys$ages_cap == 10]
      if (is.null(ages)) {
        p <- plot_length_strat_fan(data, surveys = s_i, years = 7, lengths = lengths) %>% 
         layout(yaxis = list(range = ylim))
      } else {
        p <- plot_age_strat_fan(data, surveys = s_i, years = 7, ages = ages) %>% 
          layout(yaxis = list(range = ylim))
      }
      plot_list[[which(s$set_den == set_dens[i] & s$lengths_cap == lengths_caps[j])]] <- p
    }
  }
  rm(data)
  subplot(plot_list, nrows = length(set_dens), shareX = TRUE, shareY = TRUE, titleX = FALSE, titleY = FALSE)
} 

p <- facet_fan(tests, lengths = 1:100, ylim = c(0, 21000000)) %>% 
  layout(margin = list(r = 30)) %>% 
  add_labels(x = c(0.1, 0.5, 0.9), y = rep(1, 3), yshift = 20, 
             text = paste("D<sub>sets</sub> =", c(0.0005, 0.002, 0.01))) %>% 
  add_labels(x = rep(1, 3), y = c(0.05, 0.5, 0.95), xshift = 30, textangle = 90,
             text = paste("M<sub>lengths</sub> =", rev(c(10, 100, 1000)))) %>% 
  add_labels(x = 0.5, y = 0, text = "Length", yshift = -40) %>% 
  add_labels(x = 0, y = 0.5, text = "Abundance index", xshift = -60, textangle = -90)

if (replot) export_plot(p, file = "analysis/paper/figures/plot_length_strat_fan.png", width = 700, height = 600)

``r}$, and the maximum number of length samples, $M_{\\mathrm{lengths}}$. The thick black line indicates the true trend in the total population available to the survey and the yellow to purple color gradient represents a range of probability envelopes from 10% to 90%. This plot is a facet of plots produced byplot_total_strat_fanwhen supplied results fromtest_surveys`."}

if (is_html) { p } else { knitr::include_graphics(here("analysis", "paper", "figures", "plot_length_strat_fan.png")) }

```r

p <- facet_fan(tests, ages = 1:10, ylim = c(0, 51000000)) %>% 
  layout(margin = list(r = 30)) %>% 
  add_labels(x = c(0.1, 0.5, 0.9), y = rep(1, 3), yshift = 20, 
             text = paste("D<sub>sets</sub> =", c(0.0005, 0.002, 0.01))) %>% 
  add_labels(x = rep(1, 3), y = c(0.05, 0.5, 0.95), xshift = 30, textangle = 90,
             text = paste("M<sub>length</sub> =", rev(c(10, 100, 1000)))) %>% 
  add_labels(x = 0.5, y = 0, text = "Age", yshift = -40) %>% 
  add_labels(x = 0, y = 0.5, text = "Abundance index", xshift = -60, textangle = -90)

if (replot) export_plot(p, file = "analysis/paper/figures/plot_age_strat_fan.png", width = 700, height = 600)

``r}$, and the maximum number of length samples, $M_{\\mathrm{lengths}}$. Number of ages sampled per length group, $M_{\\mathrm{ages}}$, was 10 in all scenarios. The thick black line indicates the true trend in the total population available to the survey and the yellow to purple color gradient represents a range of probability envelopes from 10% to 90%. This plot is a facet of plots produced byplot_total_strat_fanwhen supplied results fromtest_surveys`."}

if (is_html) { p } else { knitr::include_graphics(here("analysis", "paper", "figures", "plot_age_strat_fan.png")) }

The relative performance of the surveys tested can be compared using `plot_survey_rank` and `plot_error_surface`. The `plot_survey_rank` function produces a divergent dot plot of the results which ranks the surveys by RMSE, where lower values indicate sampling strategies that minimize bias and maximize precision. Using the `which_strat` argument, the plot can be focused on total, length or age based stratified results (Figure \@ref(fig:rank)). The `plot_error_surface` displays the age based stratified results by plotting a surface of RMSE (z-axis) by set (drop down selection), length (y-axis) and age (z-axis) sampling effort. The sampling effort axes can be rule or sample size based (`plot_by = "rule"` or `plot_by = "samples"`, respectively; Figure \@ref(fig:rmse-surface)).

```r

plot_survey_rank(tests, which_strat = "length")
plot_error_surface(tests, plot_by = "rule")
sim <- tests
sim$surveys <- sim$surveys[sim$surveys$set_den %in% c("0.01", "0.002", "0.0005")]

rank_p <- plot_survey_rank(sim, which_strat = "length")
if (replot) export_plot(rank_p, file = "analysis/paper/figures/plot_survey_rank.png", width = 450, height = 400)

totals <- sim$samp_totals[, list(n_sets = mean(n_sets), n_caught = mean(n_caught),
                                 n_measured = mean(n_measured), n_aged = mean(n_aged)),
                          by = "survey"]
errors <- merge(sim$surveys, sim$age_strat_error_stats, by = "survey")
d <- merge(errors, totals, by = "survey")

split_d <- split(d, d$set_den)
x <- sort(unique(d$ages_cap))
y <- sort(unique(d$lengths_cap))

scene <- list(
  xaxis = list(title = "Ages cap"),
  yaxis = list(title = "Lengths cap"),
  zaxis = list(title = "RMSE"),
  camera = list(eye = list(x = 1.5, y = 1.5, z = 1.5))
)

xshift <- c(-30, 0, 30)
p_list <- lapply(seq_along(split_d), function(i) {
  scene <- paste0("scene", i)
  z <- stats::xtabs(RMSE ~ ages_cap + lengths_cap, data = split_d[[i]], subset = NULL)
  plot_ly(x = x, y = y, scene = scene) %>% add_surface(z = t(z), showscale = FALSE) %>% 
    add_labels(x = 0.5, y = 1, text = paste("D<sub>sets</sub> =", names(split_d[i])), 
               yshift = -20, xshift = xshift[i])
})

surface_p <- subplot(p_list) %>% 
  layout(scene = c(list(domain = list(x = c(0, 1/3), y = c(0, 1))), scene),
         scene2 = c(list(domain = list(x = c(1/3, 2/3), y = c(0, 1))), scene),
         scene3 = c(list(domain = list(x = c(2/3, 1), y = c(0, 1))), scene),
         autosize = is_html, margin = list(t = 0, l = 10, r = 10, b = 10))

if (replot) export_plot(surface_p, file = "analysis/paper/figures/plot_error_surface.png", width = 1090, height = 320)

surface_p <- plot_error_surface(tests, plot_by = "rule")

```r}$] and length measurements [$N_{\mathrm{measured}}$]), under various sampling protocols (set density [$D_{\mathrm{sets}}$] and maximum number of lengths measured per set [$M_{\mathrm{lengths}}$]). Records are ranked by lowest to highest RMSE score. Within each plot, a yellow to purple color gradient is applied from lowest to highest value as an additional visual aid. Note that the exponent format of the axes defaults to SI unit symbols (e.g. M for million)."}

if (is_html) { rank_p } else { knitr::include_graphics(here("analysis", "paper", "figures", "plot_survey_rank.png")) }

```r
cap <- "Surface plots of root-mean-squared error (RMSE) from an array of surveys with different sampling protocol. Panels represent surveys with different set densities ($D_{\\mathrm{sets}}$), x-axes represent the maximum sampling effort of lengths per set ($M_{\\mathrm{lengths}}$), and y-axes represent the maximum number of ages to collect per length group ($M_{\\mathrm{ages}}$). This plot is a facet of plots produced by `plot_error_surface` when supplied results from `test_surveys`. Note that RMSE scales are different across $D_{\\mathrm{sets}}$ panels. Note that the exponent format of the axes defaults to SI unit symbols (e.g. M for million)."
if (is_html) {
  cap <- gsub(" a facet of plots ", " ", cap)
}
if (is_html) {
  surface_p
} else {
  knitr::include_graphics(here("analysis", "paper", "figures", "plot_error_surface.png"))
}

Interpretation

Results shown in the [Core functions] section demonstrates one use of the SimSurvey package that is focused on evaluating the efficacy of increasing the sampling effort of a stratified random survey of a cod population (see [S1 Appendix][S1 Appendix: Case study] for details). These results largely align with expectations from sampling theory that 1) design-based estimators for stratified random surveys are unbiased, and 2) precision is increased by increasing the number of primary sampling units [@Cochran1977]. Specifically, estimates of total abundance and abundance at length are centered around true values and their probability envelopes tighten as set density increases (Figures \@ref(fig:fan1), \@ref(fig:fan2)). Our case study results also echo the growing body of literature which concludes that extra sub-sampling is an ineffective means of improving estimates relative to sampling more locations [@Pennington1994; @Pennington2002; @Stewart2014; @Bogstad1995; @Coggins2013; @Zhang2013]. This is exemplified by the relatively large drops in RMSE when set density is increased compared to when sub-sampling effort is increased (Figures \@ref(fig:rank), \@ref(fig:rmse-surface)). Moreover, it appears that it is more advantageous to measure fewer total fish at more locations than measuring many fish at fewer locations (Figure \@ref(fig:rank)). Again, this result was expected because the size-structured spatial clustering imposed by the simulation causes set-to-set variation to be greater than within-set variation in characteristics such as length and age. This structure was imposed because fish caught together tend to have similar characteristics and it is well known that this reduces the effective sample size [@Pennington1994]. Increasing sub-sampling effort, however, should not lead to poorer population estimates like those observed for abundance at age (Figure \@ref(fig:fan3), \@ref(fig:rmse-surface)). Bias, in particular, appears to be a problem with stratified estimates of abundance at age under certain scenarios (Figure \@ref(fig:fan3)); in contrast, there appears to be little bias in the stratified estimates of abundance at length (Figure \@ref(fig:fan2)). This result was unexpected as these estimates of abundance at age were presumed to be unbiased. The contrast between the length and age based analysis indicates that the problem lies with the intervening age-length-key.

Age-length keys have long been used in conjunction with length frequency tables to estimate the age distribution of fish populations over large scales [@fridriksson1934]. By default, sim_survey spreads length-stratified age sampling across the division and run_strat constructs and applies age-length keys at the division scale. This default was chosen because it is standard protocol for the actual survey the case study is based on. There is, however, a potential cost to the spatial scale of the key. Namely, it is unlikely that one age-length key is representative for the whole region as the probability of being a specific age given length varies in space [@Berg2012]. Because different age groups sometimes occur in different places, the translation of lengths to ages may be biased by the samples used to generate the age-length key. In short, multi-stage sampling of a clustered population violates the assumption that the age-length key is generated from a simple random sample [@Aanes2015]. Bias introduced by this assumption can be resolved by explicitly accounting for the hierarchical sampling by using designed-based estimators [@Aanes2015]. For the case study, a proper design-based estimator will need to account for cluster sampling in a stratified random survey. Following recommendations in Aanes and Vølstad [-@Aanes2015], we used SimSurvey to 1) conduct a survey with concurrent length and age sampling at every set, 2) construct and apply age-length keys on a set-by-set basis, and 3) weight the resultant age frequencies at each set using stratified random estimators. This test was accomplished using this code:

alt_surveys <- expand_surveys(set_den = 2 / 1000,
                              lengths_cap = c(5, 10, 20, 50, 100, 500, 1000),
                              ages_cap = c(1, 2, 3, 5, 10))

alt_tests <- test_surveys(pop, surveys = alt_surveys,
                          n_sims = 5, n_loops = 200, cores = 3,
                          age_length_group = 3,
                          age_space_group = "set", 
                          alk_scale = "set")
load("analysis/cod_sim_exports/2020-02-10_set_alk/test_output.RData")
alt_tests <- sim

This is a minor modification of the code used in the [test_surveys] section whereby the same simulated population was used (pop object) but set density scenarios were reduced to one option for simplicity, and age sampling protocol and stratified analysis options were changed. That is, 1) age_length_group and age_space_group were changed to simulate a survey that conducts concurrent length and age sampling at every set (e.g. collect age samples from a maximum of 1 fish per 3 cm length group per set), 2) alk_scale was changed to "set" to construct and apply age-length keys on a set-by-set basis, and 3) the resultant age frequencies at each set are weighted accordingly using the design-based estimators built into run_strat. Again, results are visualized using plot_age_strat_fan and plot_error_surface:

plot_age_strat_fan(alt_tests, years = 1:20, ages = 1:10)
plot_error_surface(alt_tests, plot_by = "rule")
a <- plot_age_strat_fan(alt_tests, surveys = 19, ages = 1:10, years = 7) %>% 
  layout(yaxis = list(range = c(0, 51000000))) %>% 
  add_labels(x = -0.15, y = 1.05, text = "a)")

sim <- alt_tests
totals <- sim$samp_totals[, list(n_sets = mean(n_sets), n_caught = mean(n_caught),
                                 n_measured = mean(n_measured), n_aged = mean(n_aged)),
                          by = "survey"]
errors <- merge(sim$surveys, sim$age_strat_error_stats, by = "survey")
d <- merge(errors, totals, by = "survey")

x <- sort(unique(d$ages_cap))
y <- sort(unique(d$lengths_cap))

scene <- list(
  xaxis = list(title = "Ages cap"),
  yaxis = list(title = "Lengths cap"),
  zaxis = list(title = "RMSE"),
  camera = list(eye = list(x = 1.5, y = 1.5, z = 1.5))
)

z <- stats::xtabs(RMSE ~ ages_cap + lengths_cap, data = d, subset = NULL)

b <- plot_ly(x = x, y = y) %>% 
  add_surface(z = t(z)) %>% 
  layout(scene = scene) %>% 
  hide_guides() %>% 
  add_labels(x = 0.05, y = 1.05,  text = "b)")

p <- subplot(a, b, shareX = FALSE, shareY = FALSE, titleX = TRUE, titleY = TRUE) %>% 
  layout(xaxis1 = list(domain = list(x = c(0, 0.5), y = c(0, 1))),
         scene = list(domain = list(x = c(0.5, 1), y = c(0, 1))),
         autosize = is_html)

if (replot) export_plot(p, file = "analysis/paper/figures/alt_tests_plots.png", width = 800, height = 390)

```r}$) of 0.002 sets / km^2^, maximum number of length samples ($M_{\mathrm{lengths}}$) of 100, and maximum number of age samples per 3 cm length group per set ($M_{\mathrm{ages}}$) of 3 (the thick black line indicates the true trend in the total population available to the survey and the yellow to purple color gradient represents a range of probability envelopes from 10% to 90%); and b) is a surface plot of root-mean-squared error (RMSE) from an array of surveys with set density ($D_{\mathrm{sets}}$) fixed to 0.002 sets / km^2^ but different sub-sampling effort, where the x-axes represent the maximum number of lengths to measure per set ($M_{\mathrm{lengths}}$), and y-axes represent the maximum number of ages to collect per 3 cm length group per set ($M_{\mathrm{ages}}$)."}

if (is_html) { p } else { knitr::include_graphics(here("analysis", "paper", "figures", "alt_tests_plots.png")) }

Unlike the default approach to collecting and analyzing age samples, results from this test appear to be unbiased and additional sub-sampling effort beyond a certain point appears to have little to no effect on the estimates (Figure \@ref(fig:alt)). This is an important result with real-world implications as they imply that 1) inference from the actual survey can be improved by altering the sampling and analysis of ages and, 2) valuable survey time can be saved by cutting back on sub-sampling effort. This example demonstrates how **`SimSurvey`** can be used to reveal problems and explore solutions to the sampling designs of fisheries-independent surveys.


# Discussion

Though somewhat narrow in scope, results from our case study demonstrate why it is important to evaluate the design and analysis of complex surveys. Specifically, we tested the design and analysis of a multi-staged stratified random survey of a cod population. This type of sampling is common in fisheries research, however, the clustered nature of the samples are rarely taken into consideration [@Nelson2014]. Instead, it is commonly assumed that individuals are randomly drawn from the target populations even though such sampling is virtually impossible to accomplish in practice. The assumption of simple random sampling implies that the characteristics of fish caught at the same location are independent and, if estimators that assume independence are used, this can introduce bias and underestimate variance [@Nelson2014; @Aanes2015]. Moreover, this assumption may contribute to the perception that more sub-samples will lead to better population estimates. Increasing sub-sampling effort, however, has been shown to be an ineffective means of improving estimates because samples at the same location are usually correlated; the best way to increase the number of independent samples is to sample more locations [@Pennington1994; @Pennington2002; @Stewart2014; @Bogstad1995; @Coggins2013; @Zhang2013]. Results from the case study reiterate these points and indicate that there is room for improvements to the actual survey the case study was based on. Specifically, it appears that valuable human resources may be wasted by collecting excessive samples of correlated metrics, and abundance at age estimates may be biased by the assumption that age samples are from a simple random sample. We have used **`SimSurvey`** to identify solutions to these problems and, given current results, we suggest that time can be saved by decreasing sub-sampling effort and bias can be reduced by using an approach that account for the hierarchical design of the survey.

## Research opportunities

The case study used here provides one example of how **`SimSurvey`** can be used to simulation test the design of fisheries-independent trawl surveys. Default settings can, of course, be modified to emulate data from an array of different surveys of different populations (see **[S2 Appendix][S2 Appendix: Parameterisation]** for guidance on how to modify default settings to suit specific needs). The package therefore facilitates the exploration of a range of research questions. Below we outline some examples where the **`SimSurvey`** package may aid future research efforts.

### Design or model-based approach

The analysis of data from fisheries-independent surveys have generally been confined to design-based mean and variance estimates of abundance using standard formulae (e.g. [@Cochran1977]). Nevertheless, there has long been interest in using model-based approaches to improve abundance estimates (e.g. [@smith1990; @Berg2014; @Thorson2015]). **`SimSurvey`** can serve as a convenient tool for simulation testing mean and variance estimates provided by a range of different approaches (design-based analyses, bootstrap estimates, generalized additive models, geostatistical models, etc.). Moreover, the full analytical pathway for obtaining age-disaggregated estimates of abundance has rarely been simulation tested. Existing and future approaches for calculating age-based indices of abundance can be simulation tested using **`SimSurvey`**.

### Growth analyses

Assessing ages for a large number of fish is very time-consuming and, as such, length-stratified sampling is often used to estimate age frequencies of fish populations. The resultant sub-sample is used to construct an age-length key (i.e. the probability a fish is a specific age given length) and age frequencies are obtained by applying this key to length-frequencies obtained via more expansive random sampling. One age-length key is typically assumed to be representative of the whole stock area, however, spatial variability in the relationship may introduce bias in abundance-at-age estimates [@Berg2012; @Aanes2015]. Results from the case study (see **[S1 Appendix][S1 Appendix: Case study]**) reiterate this point and **`SimSurvey`** may serve as a platform for testing potential model-based solutions to this problem (e.g. [@Berg2012]).

### Random or stratified sampling

**`SimSurvey`** can be used to compare the precision and bias of population estimates obtained using random or stratified sampling. Simple random sampling can be implemented using a grid with one strata (e.g. `make_grid(depth_range = c(0, 1000), strat_breaks = c(0, 1000), strat_split = 0)`). Sub-sampling of ages can also be random rather than length-stratified by setting the `age_sampling` argument in the `sim_survey` function to `"random"` rather than `"stratified"`. This can facilitate research similar to work presented in Puerta et al. [-@puerta2019].


## Future directions

Up to now, the package has focused on the effects of sampling design on the precision and bias of population estimates obtained from fisheries-independent surveys; however, the costs associated with sampling has yet to be considered. In future iterations of **`SimSurvey`**, we hope to add options for integrating data on the time and monetary costs associated with each level of sampling (sets, length measurements, age determination) to facilitate cost-benefit analyses. We also realize that a single fisheries-independent survey may have multiple goals as data obtained are often used to assess multiple species or to conduct community analyses. We will therefore endeavor to add functions for simulating multi-species surveys. Another limitation of the package is that we have yet to implement alternatives to random or stratified random survey designs (e.g. systematic sampling); expanding these options would allow for a more comprehensive evaluations of various designs. Finally, it would be useful to add an option for testing the consequences of surveys with partial coverage of a population as survey coverage is a frequent concern in stock assessment.

## Assumptions

Like any model, this simulation is a simplification of a much more complex reality. For instance, the population is assumed to aggregate by age-class and be uniformly distributed within a cell, instead fish may aggregate by length and form finer-scale clusters. The survey is also an instantaneous snapshot of the population, meaning that the population is assumed to be in the same location from the beginning to the end of the survey. Also, fish are aged at random within length bins and ages are estimated without error. Finally, area trawled is assumed to be perfectly standard. These assumptions, plus a range of others, will surely under-represent the natural variability of fish populations and survey protocol. Nevertheless, the **`SimSurvey`** package provides a relatively complex and flexible operating model for simulating stratified-random survey data from a population that varies across age, year and space dimensions.

## Summary

The **`SimSurvey`** package serves as a tool for simulating sample surveys of dynamic populations that vary across ages, time and space. The core of the simulation is based on the widely used cohort equation and, even though the processes that define recruitment and total mortality are simple, a wide range of stock dynamics can be simulated by changing a few parameters. This base population can then be distributed through a grid and relationships with depth and correlation across ages, years and space can be defined. Together, two functions (`sim_abundance` and `sim_distribution`) are capable of simulating a wide range of populations with different life histories, depth associations and spatial properties. The next step to generating data similar to actual observations is to conduct a survey. In this package we implement a function, `sim_survey`, that conducts a survey of the population. The sampling process is governed by the area covered by the trawl as well as age-specific catchability. Sub-sampling protocol (length and age sampling) can also be varied. As such, data from a wide range of surveys can be simulated. 

A large number of statistical models that may be tested using these simulated data, but, implementing a variety of analytical approaches was outside the scope of this package. Instead we focus on analyzing simulated stratified-random survey data using a design-based stratified analysis. A stratified analysis is facilitated using the `run_strat` function and the precision and accuracy of the results (e.g. RMSE) can be calculated using `strat_error`. This is a simple and widely-used analysis, and the speed at which it runs allows for a wide range of survey designs to be tested, via the `test_surveys` function, in a reasonable time-frame.

Simulation testing is an important tool in the field of fisheries science as the inferred status of fish stocks hinge on the data and models used to assess fish populations. Simulations provide an opportunity to explore survey and model performance, and such explorations are becoming increasingly important as model complexity increases. It is also important to continually assess the efficacy and efficiency of sampling programs given their costs and the constant scrutiny of the value added by such surveys. These are some of the reasons multiple simulation frameworks, including **`SimSurvey`**, have been developed to test the design and analyses of complex surveys. We have made **`SimSurvey`** as open and accessible as possible to allow the broader community to validate, reuse and improve this package. We hope that open-source sharing will extend the value of such simulation frameworks and we encourage users to extend the package for their own needs and contribute to future versions.


# Acknowledgements

This work has benefited from valuable feedback from numerous colleagues, including Aaron Adamack, Alejandro Buren, Noel Cadigan, Karen Dwyer, Geoff Evans, Paul Higdon, Danny Ings, Mariano Koen-Alonso, Joanne Morgan, Derek Osborne, Pierre Pepin, Dwayne Pittman, Don Power, Craig Purchase, Martha Robertson, Mark Simpson, Brad Squires, Don Stansbury and Peter Upward. We also thank Dave Cote and Joanne Morgan for providing constructive comments on a previous version of this manuscript. Finally, the package and the manuscript were greatly improved by feedback from editor Daniel Duplisea and two anonymous reviewers. This work was supported by the NSERC visiting-fellow program and Fisheries and Oceans Canada.


# S1 Appendix: Case study

The construction of the **`SimSurvey`** package was motivated by a need to assess the sampling design of trawl surveys conducted along the Newfoundland and Labrador shelf by Fisheries and Oceans Canada. As a tangible first step, we focused our efforts on emulating data from the survey of cod in NAFO Subdivision 3Ps since it is one of the most dynamic and heavily sampled stocks in the region. Function settings were iteratively modified until it was difficult to distinguish the simulated data from real survey data; these values were set as the defaults for the main functions in the package (see **[S2 Appendix][S2 Appendix: Parameterisation]** for more details on how these defaults were chosen and for guidance on how to modify these settings to suit specific needs). Here we evaluate the efficacy of alternate sampling protocols by assessing deviation of stratified estimates of abundance [@Smith1981] from the true abundance available to the survey. We also evaluate designed-based estimators recommended by Aanes and Vølstad [@Aanes2015] that account for cluster sampling of ages. Although assessing model-based analyses was an option, we focused on the design-based analyses for simplicity and because of its widespread use. The code below displays the core defaults and will replicate the results of this case study. The last few lines of code will also replicate a simulation configured to test an alternate age sampling and analysis approach. Results produced here are the same as those produced and displayed in a series of figures in the **[`test_surveys`]** and **[Interpretation]** sections of the paper.

```r

# install.packages("remotes")
# remotes::install_github("PaulRegular/SimSurvey")
# library(SimSurvey)

set.seed(438)

abundance <- sim_abundance(
  ages = 1:20,
  years = 1:20,
  R = sim_R(log_mean = log(30000000),
            log_sd = 0.5),
  Z = sim_Z(log_mean = log(0.5),
            log_sd = 0.2,
            phi_age = 0.9,
            phi_year = 0.5),
  growth = sim_vonB(Linf = 120, 
                    L0 = 5, 
                    K = 0.1,
                    log_sd = 0.1, 
                    length_group = 3)
)

distribution <- sim_distribution(
  abundance,
  grid = make_grid(x_range = c(-140, 140),
                   y_range = c(-140, 140),
                   res = c(3.5, 3.5),
                   shelf_depth = 200,
                   shelf_width = 100,
                   depth_range = c(0, 1000),
                   n_div = 1,
                   strat_breaks = seq(0, 1000, by = 40),
                   strat_splits = 2),
  ays_covar = sim_ays_covar(sd = 2.8,
                            range = 300,
                            phi_age = 0.5,
                            phi_year = 0.9,
                            group_ages = 5:20),
  depth_par = sim_parabola(mu = 200,
                           sigma = 70)
)

surveys <- test_surveys(
  distribution,
  surveys = expand_surveys(set_den = c(0.5, 1, 2, 5, 10) / 1000,
                           lengths_cap = c(5, 10, 20, 50, 100, 500, 1000),
                           ages_cap = c(2, 5, 10, 20, 50)),
  n_sims = 5, n_loops = 200, cores = 3,
  q = sim_logistic(k = 2, x0 = 3)
)

alt_surveys <- test_surveys(
  distribution,
  surveys = expand_surveys(set_den = 2 / 1000,
                           lengths_cap = c(5, 10, 20, 50, 100, 500, 1000),
                           ages_cap = c(1, 2, 3, 5, 10)),
  n_sims = 5, n_loops = 200, cores = 3,
  q = sim_logistic(k = 2, x0 = 3),
  age_length_group = 3,
  age_space_group = "set", 
  alk_scale = "set"
)

Results

The default settings of sim_abundance and sim_distribution produces a dynamic population with a patchy and age-clustered distribution (Figure \@ref(fig:surface)a, \@ref(fig:distribution)a). In general, the function defaults dictate that all ages tend to aggregate in fairly dense clusters and that low-density zones are relatively wide-spread. Surveys, therefore, tend to be characterized by regions with large catches interspersed with regions of no catch (Figure \@ref(fig:survey)). The samples within each successful set are also variable and rarely include all length or age groups. This is because only moderate correlation ($\varphi_{\xi, \mathrm{age}}$ of 0.5) was imposed on the distributions of ages 1-5+, allowing individuals of different ages to occupy different locations, as is typically observed in the actual survey. These settings were chosen to simulated data that roughly correspond to actual survey data of cod from NAFO Subdivision 3Ps, where each age group between 1-4 tend to be caught at different locations while cod older than 4 years tend to be found at similar locations. In short, simulated data roughly emulate actual survey data with clustered catches and length samples with clear intra-haul correlation (Figure \@ref(fig:real-sim-comp)).

knitr::include_graphics(here("analysis", "paper", "figures", "real_sim_comp.png"))

Using the test_surveys function, a series of surveys were run over the simulated population and stratified analyses were run on the data obtained. The stratified analyses provided estimates of total abundance, abundance at length and abundance at age. Different stages of sampling affect these estimates, specifically, set density affects all estimates, length sampling affects estimates of abundance at length and abundance at age, and age sampling affects abundance at age estimates.

The effect of set density on estimates of total abundance, abundance at length and abundance at age are clear across all figures generated from the test_surveys results (Figures \@ref(fig:fan1), \@ref(fig:fan2), \@ref(fig:fan3), \@ref(fig:rank), \@ref(fig:rmse-surface)). Across all fan plots, it is clear that the probability envelopes tighten (i.e. estimates are more precise) as set density is increased (Figures \@ref(fig:fan1), \@ref(fig:fan2), \@ref(fig:fan3)). Clear declines in RMSE are also apparent in estimates of abundance at length as set density is increased (Figure \@ref(fig:rank)) and there are notable differences in the scale of RMSE in estimates of abundance at age as set density is increased (Figure \@ref(fig:rmse-surface)).

Compared to increases in set density, the effects of increased length sampling effort on abundance at length estimates are less clear. Regarding estimates of abundance at length, RMSE appears to reach a plateau at when length sampling effort is increased to 100 measurements per set (Figure \@ref(fig:rank)). Increasing the sampling rule above 100 measurements per set results in many more fish being measured, however, the increase in sampling effort is not matched with substantive declines in RMSE (Figure \@ref(fig:rank)). Further, it appears that measuring fewer total fish at more locations is more beneficial than measuring many fish at fewer locations (Figure \@ref(fig:rank)).

All levels of sampling affect the abundance at age estimates as set catches define the magnitude of the population, length sampling define the length distribution and data from the length-stratified age sampling is used to construct an age-length-key to convert length frequencies to age frequencies. Like the abundance at length estimates, the greatest improvements to RMSE come from increasing set density rather than sub-sampling effort (Figure \@ref(fig:rmse-surface)). Specifically, decreasing the age sampling protocol below 10 ages sampled per length group per division appears to result in relatively small increases to RMSE (Figure \@ref(fig:rmse-surface)). Length sampling effort, in contrast, has an uneven impact depending on the set density scenario. At low set densities ($D_{\mathrm{sets}}$ = 0.0005 sets / km^2^), RMSE declines when length sampling effort is increased from around 5 to around 100 measurements per set, and RMSE starts to increase as length sampling effort is increased beyond ~100 measurements per set; in fact, RMSE values appear higher at the highest length sampling scenario (1000 measurements / set) than the lowest (5 measurements / set; Figure \@ref(fig:rmse-surface)). A similar pattern is apparent under the medium set density scenario ($D_{\mathrm{sets}}$ = 0.002 sets / km^2^), however, RMSE values under the lowest and highest length sampling scenarios are of similar magnitude. Finally, RMSE continues to decline with increased length sampling effort under the high set density scenarios ($D_{\mathrm{sets}}$ = 0.01 sets / km^2^; Figure \@ref(fig:rmse-surface)).

Bias in stratified estimates of abundance at age is minimal when length-stratified age sampling is conducted and age-length keys are applied on a set-by-set basis (Figure \@ref(fig:alt)a). Under this scenario, there also appears to be a plateau, rather than a valley, in RMSE values as length and age sampling effort is increased (Figure \@ref(fig:alt)b). Like the default approach (division-wide age sampling and age-length key), RMSE declines when length sampling effort is increased from around 5 to 100 measurement per set, however, beyond 100 there is little change in the RMSE values (Figure \@ref(fig:alt)b). In contrast to the length sampling axis, there is relatively little change in the age sampling axis; a small decline in RMSE is apparent as age sampling per length group per set is decreased from 1 to 3, with little change above this level of effort (Figure \@ref(fig:alt)b).

Discussion

The simple case study presented here revealed some expected, but also a few unexpected, patterns. First, and not surprisingly, there were clear improvements to precision in all population estimates as the number of sets were increased [@sutherland2006]. Second, stratified estimates of abundance at age were often biased and, in some cases, estimates were poorer when sub-sampling effort was increased. This result was unexpected because design-based estimators, such as the stratified analysis applied here, are expected to be unbiased [@smith1990] and large increases to sub-sampling effort could be expected to be relatively ineffective, but not detrimental. Results from stratified estimates of abundance at length, in contrast, better align with these expectations. By deduction, these results indicate that the issue stems from the intervening age-length key and not the design-based estimator.

Using an age-length key in conjunction with the length distribution to estimate abundance at age is standard procedure in the analysis of fisheries-independent survey data as only a small fraction of the catch are typically aged. The aging procedure is costly and time-consuming whereas length measurements are relatively easy to obtain. Ages are generally determined from length-stratified sub-samples of the catch and raw proportions of age-at-length are used to assign ages to fish in specific length groups. Age sampling is often spread across large spatial scales and age-length-keys are usually applied at this scale (i.e. age sampling may not occur at every set). Here we construct one age-length key for the division, as is done for the analysis of 3Ps cod. There is, however, a potential cost to the spatial scale of the key. Namely, it is unlikely that one age-length key is representative for the whole region because the probability of being a specific age given length varies in space [@Berg2012]. This would not be an issue if there was no size (age) specific clustering, however, this may not always be the case because different size groups often occur in different places because of ontogenetic habitat shifts [@dahlgren2000; @galaiduk2017]. For instance, it is not uncommon for populations to form distinct nursery and spawning areas [@marteinsdottir2000; @booth2000]. Because different age groups sometimes occur in different places, the translation of lengths to ages may be biased by the samples used to generate the age-length key. This bias is perhaps compounded by the length-stratified sampling of ages such that the ages sampled may be skewed towards sets with the most catch, especially under scenarios where length sampling effort is high. If this is the case, then the representativeness of the age-length key to the whole population could be diminished by excessive length sampling. In short, the primary cause of the problem is that the classic age-length key method assumes that fish are drawn from a simple random sample; this assumption, however, is rarely satisfied in practice [@Aanes2015].

While there are potential model-based solutions to bias introduced by large scale age-length keys (e.g. spatial age-length keys [@Berg2012]), a design-based solution is to utilize estimators that account for multi-stage cluster sampling designs [@Aanes2015]. This approach was tested here and, indeed, it appears to resolve the bias in the stratified estimates of abundance at age. This implies that inference from the actual survey of 3Ps cod can be improved by a simple adjustment of the sampling design and analytical approach. In the context of stock assessment, and unbiased estimator is certainly the preferred option because under-estimates may lead to unnecessary cuts to a fishery and over-estimates may contribute to ill-advised quota increases and potentially avoidable population declines.

While much work has yet to be done, results from the simple simulation echo the growing body of literature which concludes that extra sub-sampling is an ineffective means of improving estimates relative to sampling more locations [@Pennington1994; @Pennington2002; @Stewart2014; @Bogstad1995; @Coggins2013; @Zhang2013]. In general, the most significant source of variation in fisheries-independent surveys stems from set-to-set variation, and not variability from individual sub-samples. This is largely because fish caught together tend to be more similar than those in the general population [@Pennington1994]. Such cluster sampling can also introduce bias if the hierarchical sampling design is not explicitly accounted for [@Aanes2015]. Therefore, if the goal of a trawl survey is to maximize information, it is likely better to stop collecting correlated samples and, instead, focus efforts the next set and/or other species. This is simply one example of the questions that can be explored using SimSurvey.

S2 Appendix: Parameterisation

The default parameter values used in this package were chosen to emulate survey data from a specific cod population (see [S1 Appendix][S1 Appendix: Case study] for details) and, as such, these values ought to provide a reasonable place to start for another population of cod or similar groundfish. In this section we outline a series of recommended steps to take in order to tailor the default settings to suit other systems. Throughout this outline, we also provide some context on how default parameter values were chosen as this information may help guide a users’ choices.

1. Define survey area and protocol

Parameterising the survey design should be the easiest place to start as the area and protocol should be clearly defined. The survey area can be defined using make_grid or it can be manually constructed. For the case study, make_grid arguments were modified to create a grid with similar dimensions to the stratified design actually applied to the case study area (i.e. total area, shelf depth, number of strata, etc.). Also note that the resolution (res argument) of the grid was defined to be similar to the resolution at which sets are allocated in the survey. A square grid was chosen in lieu of the actual survey grid to create an area that avoids potential issues associated with land as a barrier while being sufficiently complex to test survey designs. When setting up a grid, be careful not to drastically increase the dimensions (x_range or y_range arguments) or resolution (res argument) as simulating the spatial noise over a large field can be computationally demanding.

With a survey grid defined, the settings of sim_survey can be modified to match the protocol of a users' survey of interest. For the case study, the trawl dimensions (trawl_dim) were modified to match the dimensions of a standard trawl conducted in the region. Likewise, area is used to define the number of sets to allocate to a strata and the value used by default for set_den matches the protocol, as does the min_sets setting (i.e. at least two sets are conducted in each strata to facilitate stratified analyses). Finally, length and age sub-sampling settings were modified to match the protocol for the survey of cod in the case study area. The most difficult parameter to define in the sim_survey function is q. A logistic curve is used by default to define these unknown values and the sim_logistic parameters were tweaked to approximate the catchability curve supplied to the survey-based model that was used to assess the case study population [@ings2019]. A similar tactic can be applied to other cases with estimates of survey catchability.

2. Define abundance and growth parameters

Abundance and growth parameters are more difficult to define than the survey protocol as the true values are unknown. Nevertheless, spatially-aggregated age-structured models are frequently applied to assess fish populations and estimates from such models can be used to define sim_abundance argument settings. Alternatively, key values from these models (e.g. Z and N matrices) can be used instead of simulated values. Default parameters used in sim_abundance were largely inspired by estimates from the survey-based model used to assess the cod stock the case study was based on [@ings2019]. Based on this work, average total mortality was roughly centered around 0.5 ($\mu_Z$; argument log_mean in sim_Z closure) and we allowed it to make similar changes in magnitude through time and across ages by setting the standard deviation of log total mortality to 0.2 ($\sigma_Z$; argument log_sd in sim_Z closure) and the correlations to 0.9 and 0.5 in the age and year dimensions ($\varphi_{\delta, \mathrm{age}}$ and $\varphi_{\delta, \mathrm{year}}$; phi_age and phi_year), respectively. Likewise, sim_R values were modified to generate similar patterns in recruitment as those estimated by the model. We simulated data for ages 1 to 20 ($a$; ages argument) to capture the tails of the population, and 20 years were simulated ($y$; years argument) to capture some dynamics. Finally, von Bertalanffy growth parameters were roughly based on estimates from cod in the region [@Cadigan2016b]. While the choices made are obviously imperfect and ad hoc, it is not necessary to generate a perfect representation of the population as the goal here is to generate a dynamic population to survey. What is key is that a known population is simulated from which the performance of specific survey designs and analyses can be evaluated.

3. Define spatial parameters

Ideally this step would be no more difficult than step 2 as the inputs required could come from a spatially-explicit age-structured model fit to real data. Such estimates, however, are rare and we can confirm that no such model has been developed and fit to data from our case study population. Unless a user has access to such estimates, parameter estimates will need to be manually defined using visual comparisons of real and simulated data. A user will have an initial simulation of some survey data by following steps 1 and 2, and the plotting functions provided with the package will provide a means of visually exploring these data. For comparison purposes, actual survey data will need to be processed and visual patterns in the spatial distribution of each age will need to be inspected. It was in this step that we noticed that ages 1 to 4 tend to occupy different areas while ages 5+ occupy similar space for the cod population we focused on. Observing these patterns in real data motivated the coupling of the spatial noise of ages 5+ (group_ages argument in the sim_ays_covar closure), which forces this component of the population to occupy the same space, and the moderate value of 0.5 for the correlation in the spatial distribution across ages ($\varphi_{\xi, \mathrm{age}}$; phi_age argument in the sim_ays_covar closure). Year-to-year plots of the real data also revealed considerable inertia in the space occupied by each age group; this observation motivated a fairly high value of 0.9 for the correlation in the spatial distribution across years ($\varphi_{\xi, \mathrm{year}}$; phi_year argument in the sim_ays_covar closure). Finally, the range of the spatial correlation ($r$; range argument in the sim_ays_covar closure) was iteratively modified until the simulated population exhibited a similar level of patchiness as the real data, and the standard deviation of the age-year-space process ($\sigma_{\xi}$; sd argument in the sim_ays_covar closure) was increased to generate set catches with tails similar to those observed in the real data (cod catches in the area tend to be zero-inflated and heavy-tailed). Finally, visual inspection of the real data revealed that the largest catches tended to be centered on sets conducted in 200 m of water; a parabolic relationship, defined using the sim_parabola closure, with a mean of 200 m and standard deviation of 70 m approximated this pattern.

S3 Appendix: Age-year-space covariance

The simulation applied in this paper was set-up to control covariance across ages, years and space. To do this we used a combination of Matérn covariance, to control the level of spatial aggregation, and the age-year covariance described in Cadigan [-@Cadigan2016], to control the level of similarity in distributions across ages and years. As described in Appendix A in Cadigan [-@Cadigan2016], the age-year covariance can be broken down into a series of AR(1) processes. We integrate Matérn covariance into this series of equations:

$$\xi_{a,y,s} \sim \left{\begin{matrix} MVN\left (0, \frac{\sigma_\xi^2}{(1 - \varphi_{\mathrm{age}}^2)(1 - \varphi_{\mathrm{year}}^2)} \boldsymbol{R}s \right ) & a = 1, y = 1 & \ MVN\left (\varphi{\mathrm{year}} \xi_{1, y - 1, s}, \frac{\sigma_\xi^2}{(1 - \varphi_{\mathrm{age}}^2)} \boldsymbol{R}s \right ) & a = 1, y > 1 & \ MVN\left (\varphi{\mathrm{age}} \xi_{a - 1, 1, s}, \frac{\sigma_\xi^2}{(1 - \varphi_{\mathrm{year}}^2)} \boldsymbol{R}s \right ) & a > 1, y = 1 & \ MVN\left (\varphi{\mathrm{year}} \xi_{a, y - 1, s} + \varphi_{\mathrm{age}} (\xi_{a - 1, y, s} - \varphi_{\mathrm{year}} \xi_{a - 1, y - 1, s}), \sigma_\xi^2 \boldsymbol{R}_s \right ) & a > 1, y > 1 & \end{matrix}\right.$$

Where $MVN$ indicates the multivariate normal distribution, $\sigma_\xi^2$ controls the variance of the process, $\varphi_{\xi, \mathrm{age}}$ and $\varphi_{\xi, \mathrm{year}}$ control correlation in the age and year dimension and $\boldsymbol{R}_s$ is defined by Matérn correlation:

$$\boldsymbol{R}s = \frac{2^{1-\lambda}}{\Gamma(\lambda)} (\kappa \left |s_i - s_j \right|) K{\lambda}(\kappa \left |s_i - s_j \right|)$$ where $\left |s_i - s_j \right|$ is the Euclidean distance between two locations, $\Gamma$ is the gamma function, $K_{\lambda}$ denotes the modified Bessel function of the second kind, and $\lambda$ and $\kappa$ control the smoothness and scale of the spatial process [@Blangiardo2015]. With this structure, simulated error is correlated across ages, years and space.


S4 Appendix: Stratified analysis equations

Standard notation for use in the analysis of stratified-random survey data (modified from [@Smith1981]):

| Equation | Description | |:---------- |:----- | | $H$ | Number of strata ($h = 1, 2, ....., H$) | | $A_h$ | Area of the $h^{th}$ stratum | | $A_{\mathrm{trawl}}$ | Area covered by a standard trawl | | $N_h = \frac{A_h}{A_{\mathrm{trawl}}}$ | Total number of sample units in the $h^{th}$ stratum | | $n_h$ | Total number of units sampled in the $h^{th}$ stratum ($i = 1, 2, ....., n_h$) | | $N = \sum^{H}{h = 1} N_h$ | Total number of sample units in the survey | | $n = \sum^{H}{h = 1} n_h$ | Total number of observations in the survey | | $W_h = \frac{N_h}{N}$ | Stratum weight | | $f_h = \frac{n_h}{N_h}$ | Sampling fraction in the $h^{th}$ stratum | | $\overline{I}h = \sum^{n_h}{i = 1} \frac{I_{h, i}}{n_h}$ | Sample mean in the $h^{th}$ stratum, where $I_{h, i}$ is the $i^{th}$ observation in the $h^{th}$ stratum | | $s^2_h = \sum^{n_h}{i = 1} \frac{(I{h, i} - \overline{I}h)}{(n_h - 1)}$ | Sample variance in the $h^{th}$ stratum | | $\overline{I} = \sum^{H}{h = 1} W_h \overline{I}h$ | Estimate of the population mean per unit (i.e. stratified mean catch per tow) | | $s^2(\overline{I}) = \frac{1}{N^2} \sum^{H}{h=1} N_h (N_h - n_h) \frac{s^2_h}{n_h}$ | Estimate of the variance of the stratified mean | | $\widehat{I} = N \overline{I}$ | Estimate of the population total over the survey area |

References



PaulRegular/SimSurvey documentation built on Sept. 21, 2023, 7:42 p.m.