In einarhjorleifsson/datrasdoodle: Does not matter.

Catch composition data {#length}

Preamble

Length data from a haul are relatively cheap to record in situ. For more detail information, such as age one needs to read annual rings from hard structure (normally otoliths) which require often laborious preparatory work first. Hence in surveys the length measurements are frequently more numerous than some more detail measurements. The sampling objective for the length frequency measurements are to estimate with good precision the length distribution of the catch in each haul. This is either achieved by a full census measurement of the catch or if sub-sampling is warranted it is supposed to be based on a random sample from the catch.

In this chapter we show ...

Load libraries and data:

library(tidyverse)
library(lubridate)
library(tidyices)
library(ggridges)

hh <- read_rds("data/ns-ibts_hh.rds")
hl <- read_rds("data/ns-ibts_hl.rds")

Catch per haul (Abundance)

Since in the tidying process all length data counts were standardized to 60 minutes haul time it is relatively straight forward to calculate the abundance per 60 minute haul. Lets do this for some species of choice:

Latin <- "Pleuronectes platessa"
by.haul.positive <- 
  hl %>% 
  select(-sex) %>% 
  filter(latin == Latin) %>% 
  # NOTE: I am not sure if the next step is kosher
  #       Basically dropping rows where length and/or n is NA
  #       Check with the DATRAS experts
  drop_na() %>% 
  group_by(id) %>% 
  summarise(n = sum(n))

In the code above we have basically collapsed the abundance by length to abundance by haul. The number of records in the by.haul dataframe are r nrow(by.haul.positive) while the number of hauls in the haul dataframe are r nrow(hh). The difference are hauls where no r Latin was caught.

In order to include the stations with zero catch we:

Take the haul data and do a left join with the length data which have some catch recorded. A demonstration of the join-function family is provided using some simple data in On joins.
This results in the zero stations having NA in the n variable. Hence the mutate in combination with replace_na comes to the rescue, setting a value of zero as count for stations where no fish was caught otherwise retain the count value.

by.haul <-
  hh %>% 
  select(id, year, quarter, haulval) %>% 
  left_join(by.haul.positive) %>%
  mutate(n = replace_na(n, 0))

We now have a dataframe that looks like this:

glimpse(by.haul)

A quick count:

table(by.haul$haulval, by.haul$n == 0, useNA = "ifany")

reveals that we have r by.haul %>% filter(haulval == "V", n > 0) %>% nrow() valid hauls with r Latin and r by.haul %>% filter(haulval == "V", n == 0) %>% nrow() valid hauls where no r Latin was caught.

Since we are here we may as well look at the temporal trend in the mean abundance using only the valid hauls:

by.haul %>% 
  filter(haulval == "V") %>% 
  group_by(year, quarter) %>% 
  summarise(m = mean(n)) %>% 
  ungroup() %>% 
  mutate(year = year + 0.25 * quarter - 0.125) %>% 
  ggplot(aes(year, m, colour = factor(quarter), group = quarter)) +
  geom_point() +
  geom_line() +
  scale_color_brewer(palette = "Set1") +
  labs(x = NULL, y = NULL,
       colour = "Quarter",
       title = paste0(Latin, ": Mean number per 1 hour haul"))

Now the trend in the abundance, particularly in the earlier part of the time-series may be confounded with the survey coverage and sampling design over time (TODO: More on that later).

Using the mean may not be appropriate here, given the general distribution of abundance by haul one observes in groundfish surveys:

by.haul %>% 
  filter(haulval == "V") %>% 
  # put the values above the 99%th percentile to 99%th percentile
  #  just for display purpose
  mutate(n = ifelse(n > quantile(n, 0.99), quantile(n, 0.99), n)) %>% 
  ggplot(aes(n)) +
  geom_histogram() +
  labs(x = "Abundance per haul",
       y = "Number of stations")

Here most hauls are with relatively small or zero catches with occasional hauls with very large catches resulting in a highly negatively scewed distribution. The ggplot2 package has a very nice summary statistics (bootstrap mean and standard error) that comes to the rescue:

by.haul %>% 
  filter(haulval == "V") %>% 
  mutate(year = year + 0.25 * quarter - 0.125) %>% 
  ggplot(aes(year, n, colour = factor(quarter))) +
  stat_summary(fun.data = "mean_cl_boot", size = 0.1) +
  expand_limits(y = 0) +
  scale_color_brewer(palette = "Set1") +
  labs(x = NULL, y = NULL,
       colour = "Quarter",
       title = paste("Mean number of", Latin, "per 1 hour haul"),
       subtitle = "Bootstrap mean and confidence intervals")

The above code could easily be amended if one were interested in exploring the trend for lets say only the larger sized fish (only one line needs to be added).

More species

With some minor tweaking we can actually do this for all species. Here we will though limit ourselves to some "common" species:

Latin <- c("Clupea harengus", "Sprattus sprattus", "Gadus morhua",
           "Melanogrammus aeglefinus", "Merlangius merlangus", "Pollachius virens",
           "Trisopterus esmarkii", "Scomber scombrus", "Pleuronectes platessa",
           "Solea solea", "Anarhichas lupus", "Lophius piscatorius")

by.haul.positive <- 
  hl %>% 
  select(-sex) %>% 
  filter(latin %in% Latin) %>% 
  drop_na() %>% 
  # additional code: added latin to the grouping
  group_by(id, latin) %>% 
  summarise(n = sum(n))

Now, since we have more than one species the following will not work for us (see on missingness in the Auxiliary chapter, explaining why using some simplified haul and count dataframes as examples):

# Not run
by.haul <-
  hh %>% 
  select(id, year, quarter, haulval) %>% 
  left_join(by.haul.positive) %>% 
  mutate(n  = replace_na(n,  0))

The simplest way to include implicitly missing species is to first generate a dataframe that contains all the hauls and species combination before one does the left join with the count data. Here we use the function crossing (again see on missingness):

all <- 
  hh %>% 
  filter(haulval == "V") %>% 
  # Lets only carry forward variables "needed"
  select(id, year, quarter, shootlong, shootlat, faoarea) %>% 
  crossing(latin = Latin)
by.haul <- 
  all %>% 
  left_join(by.haul.positive) %>% 
  mutate(n = replace_na(n, 0))
glimpse(by.haul)

Now the number of records in the "all" dataframe is r nrow(all). This is equivalent to the number of valid hauls in the "hh" dataframe (r hh %>% filter(haulval == "V") %>% nrow() records) times the number of species we chose to work with (r length(Latin) species). Take also note, that the "by.haul" dataframe has the same number of records - so the left-join does not really add more records, just adds the abundance and the weight variables to the proper haul and species variable combination.

So we are now able to calculate and plot the bootstrap means and confidence interval:

by.haul %>% 
  mutate(year = year + 0.25 * quarter - 0.125) %>% 
  ggplot(aes(year, n, colour = factor(quarter))) +
  stat_summary(fun.data = "mean_cl_boot", size = 0.1) +
  expand_limits(y = 0) +
  scale_color_brewer(palette = "Set1") +
  labs(x = NULL, y = NULL,
       colour = "Quarter",
       title = "Mean number of fish per 1 hour haul",
       subtitle = "Bootstrap mean and confidence interval") +
  facet_wrap(~ latin, scale = "free_y")

Spatial distribution

Just because we can, it is easy to plot a spatial distribution of abundance from the above data. E.g. to plot the distribution of catch by species in the most recent survey all that is needed is:

m <- map_data("world", xlim = range(by.haul$shootlong), ylim = range(by.haul$shootlat))
by.haul %>% 
  filter(year == 2018,
         quarter == 1) %>% 
  ggplot() +
  geom_polygon(data = m, aes(long, lat, group = group), fill = "grey") +
  geom_point(aes(shootlong, shootlat, size = n), colour = "red", alpha = 0.5) +
  scale_size_area(max_size = 10) +
  scale_x_continuous(NULL, NULL) +
  scale_y_continuous(NULL, NULL) +
  coord_quickmap(xlim = range(by.haul$shootlong), ylim = range(by.haul$shootlat)) +
  facet_wrap(~ latin,
             nrow = 2) +
  labs(size = "Number hr-1",
       title = "Catch per standardized haul",
       subtitle = "NS-IBTS 2018, quarter 1")

Or one could select one species and plot different years:

Latin <- "Anarhichas lupus"
by.haul %>% 
  filter(year %in% 2010:2018,
         latin == Latin) %>% 
  ggplot() +
  geom_polygon(data = m, aes(long, lat, group = group), fill = "grey") +
  geom_point(aes(shootlong, shootlat, size = n), colour = "red", alpha = 0.5) +
  scale_size_area(max_size = 10) +
  scale_x_continuous(NULL, NULL) +
  scale_y_continuous(NULL, NULL) +
  coord_quickmap(xlim = range(by.haul$shootlong), ylim = range(by.haul$shootlat)) +
  facet_grid(quarter ~ year) +
  labs(size = "Fish hr-1",
       title = paste0(Latin, ": Catch per standardized haul"),
       subtitle = "NS-IBTS quarter 1 and 3")

Similarly as one rasterized temperature in the haul chapter we could rasterize the abundance by say statistical rectangles and plot it according to:

by.square <-
  by.haul %>% 
  mutate(long = gisland::grade(shootlong, 1),
         lat  = gisland::grade(shootlat, 0.5)) %>% 
  group_by(year, quarter, latin, long, lat) %>% 
  summarise(n = mean(n))
Latin <- "Pleuronectes platessa"
by.square %>% 
  filter(latin == Latin,
         quarter == 1,
         year >= 1975) %>% 
  mutate(n = ifelse(n > quantile(n, 0.50), quantile(n, 0.50), n),
         # for zero squares - plot as grey
         n = ifelse(n == 0, NA_real_, n)) %>% 
  ggplot(aes(long, lat)) +
  geom_raster(aes(fill = n)) +
  scale_fill_viridis_c(option = "B", direction = -1) +
  geom_polygon(data = m, aes(group = group), fill = "grey") +
  coord_quickmap(xlim = range(hh$shootlong), ylim = range(hh$shootlat)) +
  facet_wrap(~ year) +
  scale_x_continuous(NULL, NULL) +
  scale_y_continuous(NULL, NULL) +
  labs(fill = "Fish hr-1",
       title = paste0(Latin, ": Mean catch per square"),
       subtitle = "NS-IBTS, quarter 1")

NOTE: Must say that for the historical part the zero catch squares (for plaice) are a bit intriguing - is it a bug in the code, the way things were recorded in different cruises or a reflection of biological truth?

library(GGally)
n.glyph <-
  by.square %>% 
  filter(latin == "Pleuronectes platessa") %>% 
  #mutate(n = ifelse(n > 0 & n > quantile(n, 0.40), quantile(n, 0.40), n)) %>% 
  glyphs(x_major = "long", 
         y_major = "lat",
         x_minor = "year", 
         y_minor = "n", 
         width = 1, 
         height = 0.5)
n.glyph %>% 
  mutate(pos = ifelse(n != 0, TRUE, FALSE),
         base = lat-0.25,
         gy = ifelse(n == 0, gy + 0.005, gy)) %>% 
  ggplot() +
  theme_bw() +
  geom_polygon(data = m, aes(long, lat, group = group), fill = "grey") +
  geom_linerange(aes(x = gx, ymin = base, ymax = gy, colour = pos)) +
  labs(x = "", y = "") +
  coord_quickmap(xlim = range(by.square$long), ylim = range(by.square$lat)) +
  scale_x_continuous(breaks = seq(-20, 20, by = 1)) +
  scale_y_continuous(breaks = seq(40, 70, by = 0.5)) +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "none")

CPUE per length per haul

In section above we started off by collapsing all size structured information. However, we are often interested in exploring and using the more detailed measurements by length. Lets start by looking at only one species

Latin <- "Pleuronectes platessa"

by.haul.positive <- 
  hl %>% 
  filter(latin == Latin) %>% 
  # Ignore sex, but tally up numbers by length
  group_by(id, latin, length) %>% 
  summarise(n = sum(n)) %>% 
  ungroup() %>% 
  drop_na()

Lets take some two "random" station and look at the data by length:

set.seed(31)
IDS <-
  hh %>% 
  filter(year == 2018, 
         id %in% unique(by.haul.positive$id)) %>% 
  sample_n(2) %>% 
  pull(id)
by.haul.positive %>% 
  filter(id %in% IDS) %>% 
  select(id, length, n) %>% 
  spread(id, n) %>% 
  knitr::kable()

We observe that the fish reported in different haul do not "cover" the same length classes. Now although this is expected, when summarizing the data we need to take this into account. The "missingness" means implicitly that the value are effectively zero.

We also observe that we have a mixture of measurements, some (likely) only measuring the fish to the nearest centimeter in other cases to the nearest millimeter. So before proceeding lets "collapse" the length measurements to the nearest centimeter:

Latin <- "Pleuronectes platessa"
by.haul.positive <- 
  hl %>% 
  filter(latin == Latin) %>% 
  mutate(length = floor(length)) %>% 
  group_by(id, length) %>% 
  summarise(n = sum(n)) %>% 
  drop_na()

Note: The value of the length variable is to be understood as the lower boundary of a 1 centimeter bin. I.e. a length value of 10 represents a fish between 10-10.99... centimeters.

To take into account the "missingness" we could take the same approach as when we were coding for more than one species, i.e. we generate a full matrix of lengths at each station.

all <- 
  hh %>% 
  filter(haulval == "V") %>% 
  # Lets only carry forward variables "needed"
  select(id, year, quarter, shootlong, shootlat, faoarea) %>% 
  crossing(length = c(min(by.haul.positive$length):max(by.haul.positive$length)))

So we started with a haul dataframe with r hh %>% filter(haulval == "V") %>% nrow() observations and ending up with the "all" dataframe of some r nrow(all) records. This is a result of the span of the length classes for the selected species being r length(min(by.haul.positive$length):max(by.haul.positive$length)).

We now merge the dataframes so that we have a "complete matrix" of length, latin and number of fish for each haul:

by.haul <- 
  all %>% 
  left_join(by.haul.positive, by = c("id", "length")) %>% 
  mutate(n = replace_na(n, 0))

Lets now look at the two hauls we peeked at earlier (only 34 first records):

by.haul %>% 
  filter(id %in% IDS) %>% 
  select(id, length, n) %>% 
  spread(id, n) %>% 
  slice(1:34) %>% 
  knitr::kable()

We have now a complete record for all lengths classes reported in the survey and for each of the length class we have the number of observations including zero counts. This includes hauls were no fish for a particular species was measured. As an example (only top and bottom part shown):

by.haul %>% 
  filter(id == "2018_1_58G2_GOV_10") %>% 
  select(id, length, n) %>% 
  slice(c(1:10, 58:67)) %>% 
  knitr::kable()

Hence any statistical analysis done based on this data structure becomes consquently simple.

We can now calculate the mean catch per length class per year:

cpue <- 
  by.haul %>% 
  group_by(year, quarter, length) %>% 
  summarise(hauls = n(),
            n = mean(n))

and create a plot

cpue %>% 
  filter(year >= 2004) %>% 
  ggplot(aes(length, n, colour = factor(quarter))) +
  geom_line() +
  scale_color_brewer(palette = "Set1") +
  facet_grid(year ~ .) +
  labs(x = "Length [cm]",
       y = "Mean number of fish",
       colour = "Quarter",
       title = paste0(Latin, ": Mean number of fish per 1 hour"))

Take note that we can easily repeat the calculation of catch-per-unit-effort by year as done in the first subsection above using the "by.haul" dataframe that does not have any "missingness" created in this subsection (not run):

by.haul %>% 
  group_by(id, year, quarter) %>% 
  summarise(n = sum(n)) %>% 
  ungroup() %>% 
  mutate(year = year + 0.25 * quarter - 0.125) %>% 
  ggplot(aes(year, n, colour = factor(quarter))) +
  stat_summary(fun.data = "mean_cl_boot", size = 0.1) +
  expand_limits(y = 0) +
  scale_color_brewer(palette = "Set1") +
  labs(x = NULL, y = NULL,
       colour = "Quarter",
       title = "Mean catch [fish] per 1 hour haul",
       subtitle = "Bootstrap mean and confidence interval")

A little convenient function

In the above we got a little appreciation for not forgetting to take into account "missingness" in the DATRAS dataframes. It basically mean a little bit of extra coding where we both have to take the haul data and the length data into consideration. Since we may be doing this quite frequently a little convenient function, that encapsulates the above code may be in order:

cpue_per_length_per_haul <- function(hh, hl, Latin) {

  by.haul.positive <- 
    hl %>% 
    filter(latin == Latin) %>% 
    mutate(length = floor(length)) %>% 
    # Note: we are collapsing sex and maturity
    group_by(id, latin, length) %>% 
    summarise(n = sum(n)) %>% 
    drop_na()

  all <- 
    hh %>% 
    filter(haulval == "V") %>% 
    # Lets only carry forward variables "needed"
    select(id, year, quarter) %>% 
    crossing(length = c(min(by.haul.positive$length):max(by.haul.positive$length))),
             latin = unique(by.haul.positive$latin))

  by.haul <- 
    all %>% 
    left_join(by.haul.positive) %>% 
    mutate(n = replace_na(n, 0))

  return(by.haul)

}

The above code emulates to a large degree the calculation provide by the "DATRAS service product" referred to as "CPUE per length per hour haul". The code is available in the R/00_main.R script and hence in "production mode" one could source it (and other functions that may reside in the file) by:

source("R/00_main.R")

For any species we can now just do:

Latin <- "Solea solea"
d <-
  cpue_per_length_per_haul(hh, hl, Latin)

d %>% 
  group_by(year, quarter, length) %>% 
  summarise(hauls = n(),
            n = mean(n)) %>% 
  filter(year %in% 1994:2018,
         quarter == 1) %>% 
  ggplot(aes(length, n)) +
  geom_line() +
  facet_wrap(~ year) +
  labs(x = "Length [cm]",
       y = "Mean number of fish",
       colour = "Quarter",
       title = paste0(Latin, ": Mean number of fish per 1 hour"))

... or even a little bootstrap on abundance by length :-)

d %>% 
  filter(year %in% 2018,
         quarter == 1) %>% 
  ggplot(aes(length, n)) +
  stat_summary(fun.data = "mean_cl_boot", size = 0.1)

Recap

Starting from scratch (raw DATRAS exchange data) one can obtain the CPUE per length per hour haul via (not run):

Latin <- "Gadus morhua"

library(tidyverse)
source("R/00_main.R")
species <- read_csv("ftp://ftp.hafro.is/pub/reiknid/einar/datras_worms.csv")
raw <- read_rds("data-raw/datras/ns-ibts_raw.rds")
hh <-
  raw$hh %>% 
  tidy_hh()
hl <-
  raw$hl %>% 
  tidy_hl(hh, species)
d <-
  cpue_per_length_per_haul(hh, hl, Latin)