Load libraries and scripts:
library(maps)       # load before loading purrr
library(tidyverse)
library(modelr)
library(broom)
# source in convenient functions
source("R/00_main.R")
library(patchwork)
theme_set(theme_grey())
Load data:
hh <- read_rds("data/ns-ibts_hh.rds")
ca <- read_rds("data/ns-ibts_ca.rds")
hl <- read_rds("data/ns-ibts_hl.rds")
Length data from a haul are relatively cheap to record in situ. More detailed information, such as age, requires reading annual rings from hard structures (normally otoliths), which often involves laborious preparatory work. Hence, in surveys the length measurements are usually far more numerous than the more detailed measurements. The sampling objective for the length frequency measurements is to estimate with good precision the length distribution of the catch in each haul. This is achieved either by a full census of the catch or, if sub-sampling is warranted, by taking a random sample from the catch^[NOTE: This should maybe have been emphasized in the chapter on length].
The a priori sampling strategy for the more detailed measurements on each fish is most often different from the sampling strategy for the length frequency measurements. It could, e.g., be based on obtaining some minimum number of measurements within predetermined length bins (1 cm, 5 cm, ...) per station or over some adjacent hauls (substrata), the objective being to cover the full length span of the species in question. The bottom line is that the detailed data are not necessarily a random subsample of the length distribution within the catch.
Among the objectives of collecting the detailed measurements are:

- establishing the length-weight relationship
- estimating the maturity ogive
- constructing an age-length key
These relationships can either be derived empirically or established using some form of a statistical model. Once established, the metrics derived from the detailed data are then used in combination with the length frequency data to obtain estimates of various metrics of the actual catch.
In this section we are going to:

- estimate length-weight relationships and use them to convert catch in numbers to catch in weight
- estimate maturity ogives, both empirically and via statistical models
- construct age-length keys and use them to convert length distributions to age distributions
Let's take a peek at the detailed data:
glimpse(ca)
The variable "id" is a reference to the the haul id, the variable being a link to the more detailed information about the haul that reside in the hh-dataframe. Take note that within the ca-data each row does not represent one fish unless the variable "n" is one. Rather, the observations (rows) are the number of fish (n) observed of a species, sex and maturity and age measured within a haul. Ergo, within a haul for a particular species we could have more than one observations within a length group, this being because we may have different age, sex and maturities within the length group. As an example^[NOTE: find a better one - i.e. where n > 1.] note these set of observations (here ignoring individual weights):
ca %>%
  filter(latin == "Pleuronectes platessa",
         id == "2010_3_ARG_GOV_5",
         length == 26) %>%
  select(length, age, sex, maturity, n) %>%
  distinct() %>%
  arrange(age, sex, maturity)
I.e. in this haul we have 8 combinations of age, sex and maturity for Pleuronectes platessa in the 26 cm length class.
NOTE: What does the weight represent when n > 1?? E.g.:
ca %>%
  filter(n > 1,
         !is.na(wgt),
         latin == "Gadus morhua") %>%
  glimpse()
In the previous chapter we provided a code flow to determine the catch in numbers per length class per standardized tow. In this subsection we describe the process of getting the estimated weight of the catch. Before doing so we first introduce how one can estimate the parameters that describe weight as a function of length.
One of the most common tasks in fisheries science is to convert the length frequency estimates of the catch (abundance) to weights. The mathematical relationship of weight (W) as a function of length (L) is often described as:
$$W = \alpha L^{\beta}$$

Normally the parameters $\alpha$ and $\beta$ are estimated based on the log-transformed data^[NOTE: This has more to do with normalizing the variance in the dependent variable along the independent variable than linear models being easier to fit (in the old days).], i.e.:
$$\ln(W) = \ln(\alpha) + \beta \ln(L) + \epsilon$$

Let's pick some data from a survey for some species:
Latin <- c("Merlangius merlangus", "Clupea harengus", "Melanogrammus aeglefinus", "Pleuronectes platessa", "Sprattus sprattus", "Gadus morhua", "Trisopterus esmarkii", "Pollachius virens", "Eutrigla gurnardus") hh2 <- hh %>% filter(year == 2018, quarter == 1) lw <- hh2 %>% select(id) %>% left_join(ca, by = "id") %>% filter(latin %in% Latin, # only individually recorded fish n == 1) %>% select(latin, length, wgt) %>% drop_na()
We can obtain a quick visualization of the data with ggplot:
lw %>%
  group_by(latin) %>%
  # only 500 measurements per species (hence grouping above)
  sample_n(size = 500) %>%
  ggplot(aes(length, wgt, colour = latin)) +
  geom_point(size = 0.2) +
  geom_smooth(method = "lm") +
  scale_colour_brewer(palette = "Set1") +
  scale_x_log10(breaks = c(5, 10, 25, 50, 100)) +
  scale_y_log10(breaks = c(5, 10, 100, 500, 2500, 5000)) +
  theme(legend.position = c(0.8, 0.3)) +
  labs(x = "Length [cm]", y = "Weight [g]")
We observe that the relationship is roughly linear on the log-log scale and that the linear trend lines differ slightly between species (although not necessarily statistically).
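Whether the trend lines really differ between species could be checked more formally. A minimal sketch, comparing a common-slope model against one with species-specific slopes (the model formulation and object names here are illustrative assumptions, not part of the original flow):

# compare a common-slope model with one that allows a separate slope per species
fit_common  <- lm(log(wgt) ~ log(length) + latin, data = lw)
fit_species <- lm(log(wgt) ~ log(length) * latin, data = lw)
anova(fit_common, fit_species)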
The code for estimating the parameters in R for one species (here for Pleuronectes platessa) is something like:
lw.one <- lw %>%
  filter(latin == "Pleuronectes platessa") %>%
  mutate(l = log(length),
         w = log(wgt))
fit <- lm(w ~ l, data = lw.one)
We have generated an object fit that has the following class:
class(fit)
In base-R one can get details of the model fit via:
summary(fit)
Besides observing that the fit is pretty good (all length-weight relationships have a good fit :-), the summary statistics give us a glimpse of the estimated parameter values. To extract the coefficients in base-R one normally calls (not run):
coefficients(fit)
But since we are focusing on the tidyverse code flow, let's check out what is in store for us in the broom-package:
tidy(fit)
Here the tidy-function returns a long tibble, where the parameters are listed in the variable "term" and the estimated values are in the variable "estimate". We also get additional associated statistics for each parameter as a bonus, all neatly arranged in a tidy dataframe.
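As an aside, broom also provides the glance-function, which returns the model-level statistics (R-squared, AIC, ...) as a one-row tibble:

glance(fit)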
For the species in question we could now use the parameters estimated to convert catch in numbers to catch in weight.
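As a minimal sketch of that conversion for a single fish (ignoring the bias correction sometimes applied when back-transforming from the log scale), the predicted weight of e.g. a 26 cm plaice would be:

a <- exp(coefficients(fit)[[1]])    # alpha on the natural scale
b <- coefficients(fit)[[2]]         # beta
a * 26^b                            # predicted weight [g] of a 26 cm fish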
NOTE: This case scenario (lw-coefficient estimates for a bunch of species) may not be the most stimulating (in terms of wanting to learn) - think of an alternative when introducing the concepts of multimodels.
We may want to get the length-weight coefficients for a bunch of species. Fortunately, the tidyverse gang has come up with a very nice approach for doing just that:
lw_model <- function(df) {
  lm(wgt ~ length, data = df)
}
lw %>%
  mutate(length = log(length),
         wgt = log(wgt)) %>%
  group_by(latin) %>%
  nest() %>%
  #            | For each element in ...
  #            |   | this list variable ...
  #            |   |     | apply this function.
  #            |   |     |
  mutate(fit = map(data, lw_model),
         param = map(fit, tidy)) %>%
  unnest(param, .drop = TRUE)
Now, I guess a little explaining may be in order here. It will though be kept to a bare minimum, the reader being referred to chapters 21.5 The map functions and 25 Many models in the book R for Data Science for fuller details.
The process above can be split into three steps:
Step 1:
In this step we basically "split up" the length and weight data into separate tibbles for each species, these being stored in the variable data, which is a list column within the tibble "step1" (as can be seen in the printout of step1 below).
step1 <- lw %>%
  mutate(length = log(length),
         wgt = log(wgt)) %>%
  group_by(latin) %>%
  nest()
step1
The tibble "step1" has only 9 records (one for each species). Take note in the outprint above that the tibble's in the "data" list-variable all contain two variables (columns) but different number of records (rows). E.g. for Clupea harengus we have 4083 records.
As with any list, we can access individual elements by index (here the first one, corresponding to Clupea harengus):
step1$data[[1]]
Step 2a:
In this step we fit the length-weight model for each species. Let's run the code and print the output:
step2a <- step1 %>%
  #            | For each element in ...
  #            |   | this list variable ...
  #            |   |     | apply this function.
  #            |   |     |
  mutate(fit = map(data, lw_model))
step2a
We have generated a new variable (that is what mutate in principle always does). Again this variable (fit) is a list, whose elements are of class "lm" (the same class as when we fitted the model for a single species above). Let's just check the first list element:
step2a$fit[[1]]
Here we have the parameter values of the length-weight relationship for the first species in the tibble (Clupea harengus).
The map-function above is a special function that operates on lists, the first argument being a reference to the data to be used and the second argument being the function to be applied to those data. In this specific case the function had been defined above (repeated here for convenience):
lw_model <- function(df) {
  lm(wgt ~ length, data = df)
}
In the code execution above each element in the list-column data (i.e. each species) is passed to the lw_model-function (where internally within the function it is referred to as "df"), a function that is a wrapper around the length-weight model.
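As an aside, instead of defining a named wrapper one can use purrr's formula shorthand, which should give the same fits:

# equivalent to the mutate/map call above (assumption: identical results)
step2a <- step1 %>%
  mutate(fit = map(data, ~ lm(wgt ~ length, data = .x)))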
Step 2b:
Since the elements in the list-column fit are not tidy we apply an additional coding step to achieve that:
step2b <- step2a %>%
  #              | For each element in ...
  #              |   | this list variable ...
  #              |   |    | apply this function.
  #              |   |    |
  mutate(param = map(fit, tidy))
step2b
In this step we generate one more variable, named param, this time a list of tibbles, each one having 2 rows and 5 columns. Let's take a peek into the first element:
step2b$param[[1]]
Here we have the parameter value estimates and associated statistics for the first species (again Clupea harengus).
Step 3:
In the final step we want to get one tibble for the parameter estimates of all species.
step3 <- step2b %>%
  unnest(param, .drop = TRUE)
step3
The unnest-function is the inverse of the nest-function. The first argument is the list variable we want to un-nest and the outcome is a neat tibble (all tibbles are neat :-) giving us the parameter estimates and associated statistics for each species.
One more step may be warranted before we proceed: What we need is a tibble with one record per species where the parameters $\alpha$ and $\beta$ are separate variables:
lw.parameters <- step3 %>%
  # additional code may be needed, for clarity
  mutate(term = ifelse(term == "length", "b", "a")) %>%
  select(latin:estimate) %>%
  spread(term, estimate) %>%
  mutate(a = exp(a))
lw.parameters
We are now ready to proceed with estimating the catch weight per length per tow for the selected species using the length-frequency data:
Latin <- c("Merlangius merlangus", "Clupea harengus", "Melanogrammus aeglefinus", "Pleuronectes platessa", "Sprattus sprattus", "Gadus morhua", "Trisopterus esmarkii", "Pollachius virens", "Eutrigla gurnardus") hh2 <- hh %>% filter(haulval == "V", year == 2018, quarter == 1) %>% select(id, year, quarter, lon = shootlong, lat = shootlat) hl2 <- cpue_per_length_per_haul(hh2, hl, Latin) hl2 <- hl2 %>% left_join(lw.parameters, by = "latin") %>% mutate(wgt = (n * a * (length + 0.5)^b) / 1e3)
Ergo, in addition to the numbers (the n variable) we have added weights (the wgt variable) by species, length class and haul. Let's check the overall mean distribution by numbers versus weight:
hl2 %>%
  group_by(id, length) %>%
  summarise(n = sum(n),
            wgt = sum(wgt)) %>%
  group_by(length) %>%
  summarise(n = mean(n),
            wgt = mean(wgt)) %>%
  gather(variable, value, n:wgt) %>%
  filter(length <= 110) %>%
  ggplot(aes(length, value)) +
  geom_line() +
  facet_grid(variable ~ ., scale = "free_y")
We can now easily summarize the data by haul, including the coordinate positions, and make some spatial plots of the catch weights:
d <- hl2 %>%
  drop_na() %>%
  group_by(id, lon, lat, latin) %>%
  summarise(bio = sum(wgt))
xlim <- range(d$lon)
ylim <- range(d$lat)
d %>%
  ggplot() +
  geom_polygon(data = map_data("world", xlim = xlim, ylim = ylim),
               aes(long, lat, group = group), fill = "grey") +
  geom_point(aes(lon, lat, size = bio/1e3), colour = "red", alpha = 1/3) +
  scale_size_area(max_size = 20) +
  coord_quickmap(xlim = xlim, ylim = ylim) +
  facet_wrap(~ latin) +
  scale_x_continuous(NULL, NULL) +
  scale_y_continuous(NULL, NULL) +
  labs(size = "t hr-1")
Estimation of spawning stock biomass depends on having accurate maturity data. In the stock assessment world this stock component is normally estimated as the sum-product of stock in numbers, weight and the maturity ogive for a given length or age class. The ogive is a value between 0 and 1 for a given length or age. In recent times it is most often based on measurements from surveys; earlier it was frequently based on measurements from the commercial catch.
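Written out, this sum-product is (a sketch, with $a$ indexing the age or length class, $N_a$ the stock in numbers, $W_a$ the mean weight and $O_a$ the maturity ogive):

$$SSB = \sum_a N_a W_a O_a$$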
Let's set the stage by filtering out some detailed data for selected species from a survey:
Latin <- c("Merlangius merlangus", "Clupea harengus", "Melanogrammus aeglefinus", "Pleuronectes platessa", "Gadus morhua", "Pollachius virens","Eutrigla gurnardus") ca2 <- hh2 %>% select(id) %>% left_join(ca) %>% filter(latin %in% Latin) %>% mutate(length = floor(length))
If we look at the maturity count we have:
ca2 %>%
  group_by(maturity) %>%
  summarise(n = sum(n))
According to the ICES vocabulary the scale we have here is a 6-stage scale where the keys represent the following:

- 61: Immature
- 62: Maturing
- 63: Spawning
- 64: Spent
- 65: Resting
- 66: Abnormal
By convention, when using maturity to estimate spawning potential, the maturing, spawning and spent stages are counted as mature. Here we will create a new variable, mature, that gets the value FALSE (equivalent to zero when doing summation statistics) if the maturity stage is 61 and TRUE (equivalent to one) if the maturity stage is any one of 62, 63 and 64. Any other values (resting, abnormal or missing observations) will be assigned NA:
ca2 <- ca2 %>%
  mutate(mature = case_when(maturity == "61" ~ FALSE,
                            maturity %in% c("62", "63", "64") ~ TRUE,
                            TRUE ~ NA))
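A quick tabulation (just a sanity check) of how the original maturity codes map onto the new variable:

# counts of fish (weighted by n) by original code and the new mature variable
ca2 %>% count(maturity, mature, wt = n)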
Empirical estimations can easily be derived within the tidyverse code flow.
Length based
Basically we want to:

1. Tally the number of mature and immature fish for each species and length class.
2. Calculate the proportion mature within each species and length class.
The code flow below is something like:
d <- ca2 %>%
  select(latin, length, n, mature) %>%
  drop_na() %>%
  group_by(latin, length, mature) %>%
  summarise(n = sum(n))
glimpse(d)
So far, we have only done the first step itemized above. There are different ways to code the latter step. What is done below is to expand the counts into one row per fish (uncount) and then calculate the proportion mature within each species and length class:
ogive.by.length <- d %>%
  uncount(n) %>%
  group_by(latin, length) %>%
  summarise(p = sum(mature) / n())
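As an aside, the same proportions can be obtained without expanding the counts into one row per fish (the uncount step), which may be preferable when the counts are very large. A sketch (the object name ogive.by.length2 is just illustrative):

# proportion mature as a weighted sum of the tallied counts
ogive.by.length2 <- d %>%
  group_by(latin, length) %>%
  summarise(p = sum(n[mature]) / sum(n))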
Visually we have this:
ogive.by.length %>%
  ggplot(aes(length, p)) +
  geom_point(colour = "red") +
  geom_line(linetype = 5) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(breaks = c(0, 0.25, 0.50, 0.75, 1.0)) +
  facet_wrap(~ latin) +
  labs(x = "Length [cm]", y = "Mature O-give")
Age based:
To derive the maturity ogive by age, one just switches variables, substituting the length variable with the age variable in the code flow^[NOTE: In a succinct coding style where a function is used one could call a "switch" with respect to whether the length or the age variable is to be tallied.] (not run):
ogive.by.age <- ca2 %>%
  select(latin, age, n, mature) %>%
  drop_na() %>%
  group_by(latin, age, mature) %>%
  summarise(n = sum(n)) %>%
  uncount(n) %>%
  group_by(latin, age) %>%
  summarise(p = sum(mature) / n())
ogive.by.age %>%
  ggplot(aes(age, p)) +
  geom_point(colour = "red") +
  geom_line(linetype = 5) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(breaks = c(0, 0.25, 0.50, 0.75, 1.0)) +
  facet_wrap(~ latin) +
  labs(x = "Age", y = "Mature O-give")
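As hinted at in the footnote above, the two code flows could be wrapped into a single helper that takes the variable to tally as an argument. A minimal sketch, assuming a tidyverse version that supports the curly-curly ({{ }}) tidy-evaluation syntax (the function name empirical_ogive is just illustrative):

empirical_ogive <- function(data, var) {
  data %>%
    select(latin, {{ var }}, n, mature) %>%
    drop_na() %>%
    group_by(latin, {{ var }}, mature) %>%
    summarise(n = sum(n)) %>%
    uncount(n) %>%
    group_by(latin, {{ var }}) %>%
    summarise(p = sum(mature) / n())
}
# usage: empirical_ogive(ca2, length)  or  empirical_ogive(ca2, age)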
If we have representative samples across either length or age, the empirical approach may be sufficient. As seen, e.g., in the length-based saithe profile, we obviously have some observation error, most likely because of small sample sizes within some length classes. We will also encounter missing length classes in the detailed data. Take, e.g., the maturity length profile of haddock for the largest length classes:
ogive.by.length %>%
  filter(latin == "Melanogrammus aeglefinus") %>%
  filter(length >= 60) %>%
  knitr::kable()
Here we have no measurements for length classes 63 through 66. We will look further into this "missingness" in the subsection dealing with converting length to age below. The bottom line is that using a statistical model may both allow us to smooth over any sampling error (read: too few samples taken) and make predictions for cases where we have missing observations.
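As a rough illustration of the latter point, one could fit a logistic model to the haddock data filtered above and predict the ogive for every length class, including those with no observations. A sketch only (the object names had and fit.had are illustrative):

had <- ca2 %>%
  filter(latin == "Melanogrammus aeglefinus") %>%
  select(length, mature, n) %>%
  drop_na() %>%
  uncount(n)
fit.had <- glm(mature ~ length, data = had, family = binomial)
# predictions on a regular 1 cm grid, covering the missing length classes
newdata <- tibble(length = min(had$length):max(had$length))
newdata %>%
  mutate(p = predict(fit.had, newdata = newdata, type = "response"))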
Below, instead of estimating the maturity ogive for different species in one survey and time period, we will do the estimation for one species over time.
Length based
model_ogive_length <- function(df) {
  glm(mature ~ length, data = df, family = binomial)
}
Latin <- "Pleuronectes platessa"
hh2 <- hh %>%
  filter(year >= 1990)
# NOTE: Need to double check code below
mat <- hh2 %>%
  select(id, year, quarter) %>%
  left_join(ca %>% filter(latin %in% Latin), by = "id") %>%
  mutate(length = floor(length),
         # take historical maturity scale into account
         mature = case_when(maturity %in% c("1", "61") ~ FALSE,
                            maturity %in% c("2", "3", "4", "62", "63", "64") ~ TRUE,
                            TRUE ~ NA)) %>%
  select(year, quarter, latin, length, mature, n) %>%
  drop_na() %>%
  group_by(year, quarter, latin, length, mature) %>%
  summarise(n = sum(n)) %>%
  uncount(n) %>%
  group_by(year, quarter, latin) %>%
  nest() %>%
  #            | For each element in ...
  #            |   | this list variable ...
  #            |   |     | apply this function.
  #            |   |     |
  mutate(fit = map(data, model_ogive_length),
         param = map(fit, tidy))
mat
We have now fitted 53 models, one for each year and quarter combination. Let's check the worst fits:
mat %>%
  # tidy table of the parameter estimates
  unnest(param, .drop = TRUE) %>%
  arrange(desc(p.value)) %>%
  slice(1:10) %>%
  knitr::kable()
So it seems that the years 1991 and 1996 in quarter 1 do not have a very good fit. Let's take a peek:
x <- mat %>%
  filter(year %in% c(1991, 1996),
         quarter == 1)
d <- x %>%
  unnest(data) %>%
  group_by(year, quarter, length) %>%
  summarise(p = sum(mature) / n())
x %>%
  mutate(pred = map2(data, fit, add_predictions)) %>%
  unnest(pred) %>%
  # need to transform the predictions into probabilities
  mutate(pred = plogis(pred)) %>%
  ggplot() +
  geom_point(data = d, aes(length, p)) +
  geom_line(aes(length, pred)) +
  facet_wrap(~ year)
So, a data issue.
mat %>%
  mutate(pred = map2(data, fit, add_predictions)) %>%
  unnest(pred) %>%
  mutate(pred = plogis(pred)) %>%
  filter(length == 25) %>%
  ggplot(aes(year, pred, colour = factor(quarter))) +
  geom_point() +
  geom_line() +
  scale_color_brewer(palette = "Set1") +
  ylim(0, 1) +
  labs(x = "Year", y = "Proportion mature", colour = "Quarter",
       title = "Pleuronectes platessa",
       subtitle = "Proportion mature at 25 cm length")
So, which quarter represents the true maturity of the population?
Now the formula for the l50 (the length at which 50% of the fish are mature) is $-\alpha/\beta$, where $\alpha$ is the intercept and $\beta$ the slope, which we can calculate from the output of the tidy function:
mat %>%
  unnest(param) %>%
  select(year:estimate) %>%
  spread(term, estimate) %>%
  mutate(l50 = -`(Intercept)` / length) %>%
  ggplot(aes(year, l50, colour = factor(quarter))) +
  geom_point() +
  geom_line() +
  coord_cartesian(ylim = c(0, 35)) +
  scale_color_brewer(palette = "Set1")
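As an aside, for any single fit the l50 and its standard error can also be obtained with the dose.p-function in the MASS-package (here, as an illustration, for the first year-quarter combination in mat):

# l50 with a standard error for one of the fitted logistic models
MASS::dose.p(mat$fit[[1]], p = 0.5)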
Age based:
model_ogive_age <- function(df) {
  glm(mature ~ age, data = df, family = binomial)
}
mat <- hh2 %>%
  select(id, year, quarter) %>%
  left_join(ca %>% filter(latin %in% Latin), by = "id") %>%
  mutate(length = floor(length),
         # take historical maturity scale into account
         mature = case_when(maturity %in% c("1", "61") ~ FALSE,
                            maturity %in% c("2", "3", "4", "62", "63", "64") ~ TRUE,
                            TRUE ~ NA)) %>%
  select(year, quarter, latin, age, mature, n) %>%
  drop_na() %>%
  group_by(year, quarter, latin, age, mature) %>%
  summarise(n = sum(n)) %>%
  uncount(n) %>%
  group_by(year, quarter, latin) %>%
  nest() %>%
  #            | For each element in ...
  #            |   | this list variable ...
  #            |   |     | apply this function.
  #            |   |     |
  mutate(fit = map(data, model_ogive_age),
         param = map(fit, tidy)) %>%
  mutate(pred = map2(data, fit, add_predictions)) %>%
  unnest(pred) %>%
  mutate(pred = plogis(pred))
mat %>%
  filter(quarter == 1,
         year >= 1998,
         age > 0) %>%
  ggplot(aes(year, pred, group = age)) +
  geom_line(col = "grey") +
  geom_text(aes(label = age), angle = 45) +
  labs(x = "Year", y = NULL,
       title = Latin,
       subtitle = "Maturity ogive from quarter 1 survey")
Let's first look at the structure of the age dataframe:
glimpse(ca)
Let's generate code for an age-length key for a particular species and survey from the age data, ignoring sex and maturity:
Year <- 2014
Quarter <- 3
Latin <- "Gadus morhua"
hh2 <- hh %>%
  filter(year == Year, quarter == Quarter)
ca2 <- hh2 %>%
  select(id) %>%
  left_join(ca, by = "id") %>%
  filter(latin == Latin) %>%
  mutate(length = floor(length),
         age = ifelse(age > 9, 9, age)) %>%
  group_by(age, length) %>%
  summarise(n = sum(n)) %>%
  ungroup() %>%
  drop_na()
alk.empirical <- ca2 %>%
  group_by(length) %>%
  mutate(p = n / sum(n, na.rm = TRUE)) %>%
  select(-n) %>%
  drop_na()
alk.empirical %>%
  ggplot(aes(length, p, colour = factor(age))) +
  geom_point() +
  geom_line(lwd = 0.3, linetype = 5) +
  scale_color_brewer(palette = "Set3") +
  labs(x = "Length", y = "Probability", colour = "Age",
       title = "Probability of age at length")
So for each length class in which we have observations (points) we have calculated the probability that a fish belongs to a particular age class. The sum of the probabilities within a length group is by definition always one. Let's take the 70 cm length class as an example:
alk.empirical %>%
  filter(length == 70) %>%
  knitr::kable()
Here we have 5 age classes within the length class, the most numerous ones being age 3 (50%), then 4 (22%) and 5 (17%), while around 6% of the fish are of ages 2 and 6.
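A quick sanity check (a sketch): within every length class the probabilities should add up to one, so the following should return an empty tibble:

alk.empirical %>%
  group_by(length) %>%
  summarise(total = sum(p)) %>%
  filter(abs(total - 1) > 1e-6)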
Let's jointly visualize the length frequency and the age-length key for one species in one survey:
by.haul <- cpue_per_length_per_haul(hh2, hl, Latin)
lfs <- by.haul %>%
  group_by(length) %>%
  summarise(n = sum(n)) %>%
  ungroup() %>%
  # here only take positives
  filter(n > 0)
p1 <- lfs %>%
  ggplot(aes(length, n)) +
  theme_grey() +
  geom_col() +
  labs(x = NULL, subtitle = "Length frequency")
p2 <- alk.empirical %>%
  mutate(age = ifelse(age > 9, 9, age)) %>%
  ggplot(aes(length, p, fill = factor(age))) +
  theme_grey() +
  scale_fill_brewer(palette = "Set3") +
  geom_col() +
  labs(x = "Length", y = "Proportion", fill = "Age",
       subtitle = "Proportion of age at length")
p1 + p2 + plot_layout(ncol = 1)
In principle we want to split each length class of the length distribution (top graph) into separate age groups based on the age-length key probabilities (lower graph). Visually this may be easy for the first mode, where almost all fish in each length class belong to age group 0. Beyond that, each length class is most often composed of more than one age group.
Let's look at the count tally in the 70 cm length class:
lfs %>%
  filter(length == 70) %>%
  knitr::kable()
Here we have 36.5 fish counted. In the age-length key for the same length class (as already shown above) we have:
alk.empirical %>%
  filter(length == 70) %>%
  knitr::kable()
We use the probabilities to split the length frequency within each length class into ages. Here we simply have to join the length-frequency and the alk dataframes and split the count tally (n) in each length group into the number of fish at each age (n.age) by multiplying the number measured at length with the probability:
d <- lfs %>%
  left_join(alk.empirical, by = "length") %>%
  mutate(n.age = p * n)
If we check the 70 cm length class again we now have an estimate of the number of fish by each age-class:
d %>%
  filter(length == 70) %>%
  knitr::kable()
Ergo, the tally of a total of 36.5 fish in the 70 cm length category has been split up into the 5 age classes (the variable "n" should actually be dropped, just to avoid confusion). Taking all the data, we visually have the following:
d %>%
  select(-n) %>%
  ggplot(aes(length, n.age, fill = factor(age))) +
  theme_grey() +
  geom_col() +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Length", y = "n", fill = "Age",
       subtitle = "Length frequency by age")
Rather than first grouping all the length measurements from a survey and then applying the age-length key as done above, one could use the latter to split the length distribution in each haul into ages:
by.haul.age <- by.haul %>%
  left_join(alk.empirical) %>%
  group_by(id, age) %>%
  summarise(n = sum(p * n)) %>%
  ungroup() %>%
  left_join(hh %>%
              filter(year == 2014, quarter == 3) %>%
              select(id, shootlong, shootlat))
So now we have for each haul the estimated number of fish in each age group per 60 minute haul. E.g.:
by.haul.age %>%
  # NOTE: Need to look into the NA's
  slice(1:11) %>%
  knitr::kable()
Since we have also included the coordinates we can easily make a visual representation of abundance distribution by age:
xlim <- range(by.haul.age$shootlong)
ylim <- range(by.haul.age$shootlat)
by.haul.age %>%
  filter(age %in% 0:3) %>%
  ggplot() +
  geom_polygon(data = map_data("world", xlim = xlim, ylim = ylim),
               aes(long, lat, group = group), fill = "grey") +
  geom_point(aes(shootlong, shootlat, size = n), colour = "red", alpha = 0.2) +
  scale_size_area(max_size = 10) +
  coord_quickmap(xlim = xlim, ylim = ylim) +
  facet_wrap(~ age) +
  labs(x = NULL, y = NULL, size = "Abundance per hour")
The overall age distribution in the survey can then be visualized by:
by.haul.age %>%
  group_by(age) %>%
  summarise(n = mean(n)) %>%
  ggplot(aes(age, n)) +
  geom_col() +
  scale_x_continuous(breaks = 0:10) +
  labs(x = "Age", y = "Mean numbers per 1 hour haul")