knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE,
  autodep = TRUE,
  fig.path = "pcod-knitr-figs/",
  cache.path = "pcod-knitr-cache/"
)

Setup

If you don't already have the package installed, then run:

# install.packages("devtools")
devtools::install_github("seananderson/gfplot")

First, we will load the package along with ggplot2, dplyr, and rstan, since we will use them later.

library(gfplot)
library(ggplot2)
library(dplyr)
library(rstan)

An overview of gfplot

The package has a series of get_*() functions for extracting data on the fly.

The complete list is:

fns <- ls("package:gfplot")
sort(fns[grepl("get", fns)])
sort(fns[grepl("tidy", fns)])
sort(fns[grepl("fit", fns)])
sort(fns[grepl("plot", fns)])

Here's how the various functions fit together: https://github.com/seananderson/gfplot/blob/master/inst/function-web.pdf

All of the plotting functions have associated get_*() functions to extract data, and tidy_*() and/or fit_*() functions to tidy that data or fit a model to be plotted.
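
In other words, a typical workflow chains one function from each family. A minimal sketch, assuming database access (below we will instead work from cached copies of the same data):

# get_survey_samples("pacific cod") %>%
#   tidy_ages_raw() %>%
#   plot_ages()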

There are options for a number of the get_*() functions in terms of what gets extracted. Many of the columns that you might want to filter on are simply retained so that you can filter yourself with, for example, dplyr::filter(dat, x == "y").

For initial descriptions of the default filtering, see the end of the (very much in-progress) working paper: https://sean.updog.co/gf-synopsis-wp.pdf

which references this SQL code: https://github.com/seananderson/gfplot/tree/master/inst/sql

As an example, we could extract the data with the following function call if we were on a DFO laptop, with appropriate database permissions, and on the PBS network:

get_survey_samples("pacific cod")

You can specify the species by code or by name, in a variety of formats. The following all do the same thing:

get_survey_samples("pacific cod")
get_survey_samples("Pacific cod")
get_survey_samples("PaCiFiC cOD")
get_survey_samples("222")
get_survey_samples(222)

You can also extract multiple species at once:

get_survey_samples(c("pacific ocean perch", "pacific cod"))
get_survey_samples(c(396, 222))

Caching the data from the SQL servers

In addition to the individual get_*() functions, there is a function cache_pbs_data() that runs all the get_*() functions and caches the data in a folder that you specify. I'm going to use this approach here because I'm not currently on the PBS network and it saves running the SQL queries each time the R Markdown document is recompiled.

The helper function cache_pbs_data() will extract all of the data as a series of .rds files in whatever folder you specify via the path argument. I'll wrap it in a quick check to make sure we don't download the data twice if we build this document again.

if (!file.exists(file.path("pcod-cache", "pbs-survey-sets.rds"))) { # quick check for speed
  cache_pbs_data("pacific cod", path = "pcod-cache")
}
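
Since the get_*() functions accept multiple species at once, cache_pbs_data() should as well; for example (an untested sketch, with a hypothetical folder name):

# cache_pbs_data(c("pacific ocean perch", "pacific cod"), path = "multi-cache")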

Let's read those data files in to work with here.

cache             <- file.path("pcod-cache")
d_survey_sets     <- readRDS(file.path(cache, "pbs-survey-sets.rds"))
d_survey_samples  <- readRDS(file.path(cache, "pbs-survey-samples.rds"))
d_comm_samples    <- readRDS(file.path(cache, "pbs-comm-samples.rds"))
d_catch           <- readRDS(file.path(cache, "pbs-catch.rds"))
d_cpue_spatial    <- readRDS(file.path(cache, "pbs-cpue-spatial.rds"))
d_cpue_spatial_ll <- readRDS(file.path(cache, "pbs-cpue-spatial-ll.rds"))
d_survey_index    <- readRDS(file.path(cache, "pbs-survey-index.rds"))
d_age_precision   <- readRDS(file.path(cache, "pbs-age-precision.rds"))
d_cpue_index      <- readRDS(file.path(cache, "pbs-cpue-index.rds"))

Now let's create a folder to hold some PDF versions of our figures:

figs <- file.path("pcod-figs")
dir.create(figs, showWarnings = FALSE)

Age and length frequencies

Let's start by looking at the age and length frequencies. For this example, we used the caching function for convenience and saved the data frame into d_survey_samples.

glimpse(d_survey_samples)

If we pass a data frame into the function tidy_ages_raw() we will get a 'tidied' data frame that is ready for plotting with plot_ages(). Here, we have fin-based ages, which are ageing method 6:

unique(d_survey_samples$ageing_method)
tidy_ages_raw(d_survey_samples, ageing_method_codes = 6)

The functions in the gfplot package are pipe (%>%) friendly. This just means that the first argument of nearly every function takes data, so we can pipe the output from the tidying functions directly into the plotting functions.

g <- tidy_ages_raw(d_survey_samples, ageing_method_codes = 6) %>%
  plot_ages()
g

The plotting functions all return a ggplot object, which means we need to print them to see them on the screen. It also means we can save them into objects, as I did above with g, and then build on them to adjust labels, change the colour scales, or add our own theme. And if we want to save them, we can use the function ggsave():

ggsave(file.path(figs, "ages-raw.pdf"), plot = g, width = 13, height = 6)

Most plotting functions have arguments that let you control many aspects of how the plots look. Similarly, most of the tidying functions have some options for how the data are manipulated.

In all cases, the get_*() and tidy_*() functions can operate on one or multiple species in the same data frame. All of the fitting and plotting functions, however, expect data from a single species.
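
With a multi-species extract, that means filtering to a single species before fitting or plotting. A sketch, assuming the samples carry a species_common_name column (commented out here since it needs database access):

# d_multi <- get_survey_samples(c(396, 222))
# d_multi %>%
#   filter(species_common_name == "pacific cod") %>%
#   tidy_ages_raw(ageing_method_codes = 6) %>%
#   plot_ages()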

Let's make our previous age frequency plot but just for the synoptic surveys:

g <- tidy_ages_raw(d_survey_samples,
  survey = c("SYN WCHG", "SYN HS", "SYN QCS", "SYN WCVI"),
  ageing_method_codes = 6
) %>%
  plot_ages()
g

Because the output from the plotting functions is a ggplot object we can do things like add our own theme:

g + theme_grey()
g + theme_light()
g + ggthemes::theme_excel() + scale_color_manual(values = c("M" = "cyan", "F" = "red"))

Excel, eh?

Or modify the plot by, say, suppressing the legend, changing the title, or changing the colours:

g + guides(colour = FALSE) +
  labs(
    title = "Pacific cod age frequencies", subtitle = "Synoptic surveys only",
    caption = paste("Made on", Sys.Date())
  ) +
  scale_color_manual(values = c("M" = "blue", "F" = "red"))

That plot was with unweighted age frequencies. If instead we want to create age frequencies with samples weighted according to stratum density and area, we can use the function tidy_ages_weighted(). In that case, we also need to supply a data frame that specifies the survey stratum information. We previously cached that data frame into d_survey_sets, but we could generate it with get_survey_sets("pacific cod").

g <- tidy_ages_weighted(d_survey_samples,
  survey = c("SYN WCHG", "SYN HS", "SYN QCS", "SYN WCVI"),
  ageing_method_codes = 6,
  dat_survey_sets = d_survey_sets
) %>%
  plot_ages()
g

Or we can do the same for the commercial samples with the following code, which makes use of the cached data frames d_comm_samples and d_catch. The latter is used as part of the weighting. (The chunk is left commented out here.)

# g <- tidy_ages_weighted(d_comm_samples,
#   ageing_method_codes = 6,
#   area_grep_pattern = ".*", # ".*" matches all areas ("*" alone is not valid regex)
#   sample_type = "commercial",
#   dat_catch = d_catch
# ) %>%
#   plot_ages()
# g

In the above, the area_grep_pattern argument supplies the search string to grep() to filter the data for specific statistical area codes. For example, "5[CD]+" would find all areas 5C and 5D.
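
To check what a pattern will match before running an extraction, you can test it directly with grep() on some example codes:

areas <- c("3C", "3D", "5A", "5B", "5C", "5D", "5E")
grep("5[CD]+", areas, value = TRUE) # "5C" "5D"
grep(".*", areas, value = TRUE)     # matches everything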

Let's do the same thing but for the length frequencies:

g <- tidy_lengths_raw(d_survey_samples,
  bin_size = 2,
  year_range = c(2002, Inf)
) %>%
  plot_lengths()
g
ggsave(file.path(figs, "lengths-raws.pdf"), width = 9, height = 9)

Or weighted:

g <- tidy_lengths_weighted(d_survey_samples,
  bin_size = 2,
  year_range = c(2002, Inf),
  dat_survey_sets = d_survey_sets
) %>%
  plot_lengths()
g
ggsave(file.path(figs, "lengths-weighted.pdf"), width = 9, height = 9)

For commercial samples, weighted:

g <- tidy_lengths_weighted(d_comm_samples,
  bin_size = 2,
  year_range = c(2002, Inf),
  sample_type = "commercial",
  dat_catch = d_catch
) %>%
  plot_lengths()
g
ggsave(file.path(figs, "lengths-weighted-commercial.pdf"), width = 9, height = 9)

By default, we are discarding commercial "keeper" samples.

Ageing precision

We can look at ageing precision. By default, this selects a random sample of 250 fish that have both primary and precision ageing values.

tidy_age_precision(d_age_precision, ageing_method_codes = 6) %>%
  plot_age_precision()
ggsave(file.path(figs, "age-precision.pdf"), width = 5, height = 5)

Commercial catch

Catch:

tidy_catch(d_catch) %>%
  plot_catch()
ggsave(file.path(figs, "catch.pdf"), width = 6, height = 3)

Survey relative biomass indices

The survey indices with bootstrapped confidence intervals:

tidy_survey_index(d_survey_index) %>%
  plot_survey_index()
ggsave(file.path(figs, "surv-index.pdf"), width = 6, height = 5)

Again, we could just select some surveys:

tidy_survey_index(d_survey_index,
  surveys = c(
    "West Coast Haida Gwaii Synoptic Survey",
    "Hecate Strait Synoptic Survey", "Queen Charlotte Sound Synoptic Survey",
    "West Coast Vancouver Island Synoptic Survey"
  ),
  survey_names = c("WCHG", "HS", "QCS", "WCVI")
) %>%
  plot_survey_index()

Overall biological sample availability

We can look at the biological sample (number of specimens) availability from commercial and survey sources:

tidy_sample_avail(d_comm_samples, ageing_method_codes = 6) %>%
  plot_sample_avail(title = "Commercial samples", year_range = c(1994, 2017))
ggsave(file.path(figs, "comm-samp-avail.pdf"), width = 6, height = 1.75)

tidy_sample_avail(d_survey_samples, ageing_method_codes = 6) %>%
  plot_sample_avail(title = "Survey samples", year_range = c(1994, 2017))
ggsave(file.path(figs, "surv-samp-avail.pdf"), width = 6, height = 1.75)

Growth and maturity

We can fit a von Bertalanffy growth model and plot it. We could fit this to just the survey data, just the commercial data, or the two combined with the function bind_samples().

# TODO FIX ME!!!!!!!! memory mapping problem
model_file <- system.file("stan", "vb.stan", package = "gfplot")
mod <- rstan::stan_model(model_file)
combined_samples <- bind_samples(d_survey_samples, d_comm_samples)

vb_m <- fit_vb(combined_samples, sex = "male", method = "mpd", ageing_method_codes = 6)
vb_f <- fit_vb(combined_samples, sex = "female", method = "mpd", ageing_method_codes = 6)
plot_vb(object_female = vb_f, object_male = vb_m)
ggsave(file.path(figs, "vb.pdf"), width = 5, height = 4)

lw_m <- fit_length_weight(combined_samples, sex = "male", method = "rlm")
lw_f <- fit_length_weight(combined_samples, sex = "female", method = "rlm")
plot_length_weight(object_female = lw_f, object_male = lw_m)
ggsave(file.path(figs, "length-wt.pdf"), width = 5, height = 4)

Similarly, we can fit a logistic maturity ogive. I'll use the combined survey and commercial data, but you could just as easily substitute d_survey_samples or d_comm_samples for combined_samples in the following code chunk.

mat_age <- combined_samples %>%
  fit_mat_ogive(
    type = "age",
    months = seq(1, 12),
    ageing_method_codes = 6
  )
plot_mat_ogive(mat_age)
ggsave(file.path(figs, "age-mat-ogive.pdf"), width = 6, height = 3.5)

mat_length <- combined_samples %>%
  fit_mat_ogive(
    type = "length",
    months = seq(1, 12)
  )
plot_mat_ogive(mat_length)
ggsave(file.path(figs, "length-mat-ogive.pdf"), width = 6, height = 3.5)

There is also functionality to specify random intercepts for the sample IDs (this will take much longer to fit):

mat_length <- d_survey_samples %>%
  fit_mat_ogive(
    type = "length",
    months = seq(1, 12),
    sample_id_re = TRUE
  )
plot_mat_ogive(mat_length)

And look at maturity by month (for the surveys right now):

tidy_maturity_months(d_survey_samples) %>%
  plot_maturity_months()
ggsave(file.path(figs, "maturity-months.pdf"), width = 6, height = 4)

Commercial CPUE maps

We can plot the trawl and hook-and-line CPUE on maps. Note that the hook-and-line "CPUE" is just kg per fishing event for now, since quantifying effort consistently across all stocks might be troublesome.

filter(d_cpue_spatial, year >= 2012) %>%
  plot_cpue_spatial(bin_width = 7, n_minimum_vessels = 3) +
  ggtitle("Trawl CPUE") +
  labs(subtitle = "Since 2012; including discards; 3-vessel minimum")
ggsave(file.path(figs, "cpue-spatial.pdf"), width = 6, height = 4.75)

filter(d_cpue_spatial_ll, year >= 2008) %>%
  plot_cpue_spatial(
    bin_width = 7, n_minimum_vessels = 3,
    fill_lab = "CPUE (kg/fe)"
  ) +
  ggtitle("Hook and line CPUE") +
  labs(subtitle = "Since 2008; excluding discards; 3-vessel minimum")
ggsave(file.path(figs, "cpue-spatial-ll.pdf"), width = 6, height = 4.75)

Commercial CPUE index standardization

We can fit an index standardization delta-lognormal GLM to the commercial trawl CPUE data.

# d_cpue_index <- get_cpue_index(gear = "bottom trawl") # on the PBS network; we already read the cached version above

I'll define the "fleet" as any vessel that has made at least 100 tows that caught some of the species over all years and has at least 4 years with 4 trips that caught some of the species. I will run this for the major statistical areas 5CDE.

pcod_fleet <- tidy_cpue_index(d_cpue_index,
  species_common = "pacific cod",
  year_range = c(1996, 2017),
  area_grep_pattern = "5[CD]+",
  min_positive_fe = 100,
  min_positive_trips = 4,
  min_yrs_with_trips = 4
)
names(pcod_fleet)

How many vessels make up the "fleet"?

length(unique(pcod_fleet$vessel_name))
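
Before fitting, it can be useful to summarize the fleet data. A quick sketch; it assumes a year column, and uses the pos_catch, spp_catch, and hours_fished columns that appear in the model formulas below:

pcod_fleet %>%
  group_by(year) %>%
  summarise(
    n_fishing_events = n(),
    prop_positive = mean(pos_catch),
    mean_cpue = mean(spp_catch / hours_fished)
  )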

We can fit the CPUE in the index standardization model with whatever formulas we want for the binomial and lognormal components:

f() in the next chunk is a custom helper function that generates sequential factor values that can be safely passed to TMB. Anything that should be treated as a factor (except year) should be wrapped in it. It also lets you control the 'base' or 'reference' level for the factor. This affects the final CPUE index slightly because the binomial and lognormal models are combined on the natural scale, not the link scale. The default is to take the most common factor level (e.g. the most common month or depth bin).

m_cpue <- fit_cpue_index(pcod_fleet,
  formula_binomial =
    pos_catch ~ year_factor + f(month) + f(vessel) +
      f(locality) + f(depth) + f(latitude),
  formula_lognormal =
    log(spp_catch / hours_fished) ~ year_factor + f(month) +
      f(vessel) + f(locality) + f(depth) + f(latitude)
)

We can plot the coefficients from the models. Note that the coefficient values for the factor predictors are all relative to the reference (base) level of each factor, which is set to 0 in the model.

plot_cpue_index_coefs(m_cpue)
ggsave(file.path(figs, "trawl-commercial-cpue-coefs.pdf"), width = 11, height = 10)

We can plot the resulting index in its raw form (which depends somewhat on what the base levels of the factors are):

predict_cpue_index(m_cpue) %>%
  plot_cpue_index()

Or we can center them so that the combined and lognormal indices have a geometric mean of 1 and the binomial index has a mean of 0 in logit link space (i.e. centered on 0.5 in probability space):

predict_cpue_index(m_cpue, center = TRUE) %>%
  plot_cpue_index()
ggsave(file.path(figs, "trawl-commercial-cpue-trends.pdf"), width = 8, height = 3)

If you just want the combined model:

predict_cpue_index(m_cpue, center = TRUE) %>%
  plot_cpue_index(all_models = FALSE)

There is a built-in function to jackknife out each of the predictors to assess the sensitivity of the model to each predictor. In the following plot, the dashed grey line represents the standardized and centered index. Each of the coloured indices represents removing that predictor.

plot_cpue_index_jk(m_cpue)
ggsave(file.path(figs, "trawl-commercial-cpue-jk.pdf"), width = 9, height = 6)

Modelling relative biomass from the surveys spatially

There are also functions to fit spatial models to the survey data with Stan. They currently only work for the synoptic surveys.

Let's see what is available:

table(d_survey_sets$survey_series_desc, d_survey_sets$year)

As an example, I will fit a delta-lognormal GLMM with spatial random fields to the Queen Charlotte Sound Synoptic Survey data using our glmmfields package via Stan. The model uses depth as a predictor along with a residual spatial random effect surface (the random field).

m_qcs <- fit_survey_sets(d_survey_sets,
  survey = "Queen Charlotte Sound Synoptic Bottom Trawl",
  years = 2017, chains = 3, iter = 1000
)

The following plotting function is a work in progress. The coordinates are UTMs (UTM zone 9).

First is a plot of the combined binary and lognormal components:

plot_survey_sets(m_qcs$predictions, m_qcs$data, fill_column = "combined")
ggsave(file.path(figs, "qcs-survey-map.pdf"), width = 7, height = 5)

And here are the binary and lognormal components on their own:

plot_survey_sets(m_qcs$predictions, m_qcs$data, fill_column = "bin") +
  scale_fill_gradient2(
    midpoint = 0.5,
    low = scales::muted("blue"), high = scales::muted("red")
  )
plot_survey_sets(m_qcs$predictions, m_qcs$data, fill_column = "pos") +
  viridis::scale_fill_viridis(option = "C")

Raw data

Finally, let's look at the raw data we extracted. You can do whatever you want with it.

# glimpse(d_cpue_spatial) # commented out for privacy
# glimpse(d_cpue_spatial_ll) # commented out for privacy
# glimpse(d_cpue_index) # commented out for privacy
glimpse(d_survey_sets)
glimpse(d_survey_samples)
glimpse(d_comm_samples)
glimpse(d_catch)
glimpse(d_survey_index)
glimpse(d_age_precision)
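
If you want to work with any of these data frames outside of R, you can export them in the usual ways, for example:

write.csv(d_survey_index, file.path("pcod-cache", "survey-index.csv"),
  row.names = FALSE)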

COSEWIC / SARA

There is also a bonus function that we plan to use for the description at the top. Look up the SARA/COSEWIC status of your favourite species! The data are scraped dynamically from tables on http://www.registrelep-sararegistry.gc.ca/sar/index/default_e.cfm.

d <- get_sara_dat()
glimpse(d)
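
For example, to pull out a single species (the column name here is a guess; check names(d) for the actual columns):

# filter(d, grepl("Pacific Cod", common_name)) # `common_name` is hypothetical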

CPUE exploration across three areas

I started using Pacific cod as an example to iterate over areas for CPUE index standardization. If this ends up being useful, I might add cleaned-up versions of these functions to the package.

First, let's fit models for 3CD, 5CD, and 5AB separately. We will make the reference factor level the most common factor level (e.g. a month or depth bin) from all fishing events that caught Pacific cod.

areas <- c("3[CD]+", "5[CD]+", "5[AB]+")

cpue_models <- lapply(areas, function(area) {
  message("Determining qualified fleet for area ", area, ".")
  fleet <- tidy_cpue_index(d_cpue_index,
    year_range = c(1996, 2017),
    species_common = "pacific cod",
    area_grep_pattern = area,
    min_positive_fe = 100,
    min_positive_trips = 4,
    min_yrs_with_trips = 4,
    lat_band_width = 0.2,
    depth_band_width = 50,
    clean_bins = TRUE,
    depth_bin_quantiles = c(0.02, 0.98),
    lat_bin_quantiles = c(0.01, 0.99)
  )

  pos_catch_fleet <- filter(fleet, pos_catch == 1)
  base_month    <- get_most_common_level(pos_catch_fleet$month)
  base_depth    <- get_most_common_level(pos_catch_fleet$depth)
  base_lat      <- get_most_common_level(pos_catch_fleet$latitude)
  base_vessel   <- get_most_common_level(pos_catch_fleet$vessel)
  base_locality <- get_most_common_level(pos_catch_fleet$locality)

  message("Fitting standardization model for area ", area, ".")
  m_cpue <- fit_cpue_index(fleet,
    formula_binomial = pos_catch ~ year_factor +
      f(month, base_month) +
      f(vessel, base_vessel) +
      f(locality, base_locality) +
      f(depth, base_depth) +
      f(latitude, base_lat),
    formula_lognormal = log(spp_catch / hours_fished) ~
      year_factor +
      f(month, base_month) +
      f(vessel, base_vessel) +
      f(locality, base_locality) +
      f(depth, base_depth) +
      f(latitude, base_lat)
  )
  list(model = m_cpue, fleet = fleet, area = gsub("\\[|\\]|\\+", "", area))
})

indices <- purrr::map_df(cpue_models, function(x) {
  p <- predict_cpue_index(x$model, center = FALSE)
  p$area <- x$area
  p
})

indices_centered <- purrr::map_df(cpue_models, function(x) {
  p <- predict_cpue_index(x$model, center = TRUE)
  p$area <- x$area
  p
})

Coefficients from the models:

coef_plots <- lapply(cpue_models, function(x) {
  plot_cpue_index_coefs(x$model) + labs(title = x$area)
})
ignore <- lapply(coef_plots, print)

Standardized indices:

plot_cpue_facet <- function(dat, scales = "free_y") {
  dat %>%
    ggplot(aes(year, est, ymin = lwr, ymax = upr, fill = model)) +
    geom_ribbon(alpha = 0.3) +
    geom_line() +
    facet_grid(model ~ area, scales = scales) +
    theme_pbs() +
    ylab("Estimate") + xlab("Year") +
    guides(fill = FALSE)
}

plot_cpue_facet(indices) +
  scale_fill_brewer(palette = "Set2")
plot_cpue_facet(filter(indices_centered, model == "Combined")) +
  facet_wrap(~area, scales = "free_y", ncol = 1) +
  scale_fill_manual(values = "black")
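
As with the other figures, these are ggplot objects, so we can save them in the same way:

ggsave(file.path(figs, "cpue-three-areas.pdf"), width = 9, height = 6)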

Sensitivity:

jks <- lapply(cpue_models, function(x) {
  plot_cpue_index_jk(x$model, terms = c(
    "f(month, base_month)",
    "f(vessel, base_vessel)",
    "f(locality, base_locality)",
    "f(depth, base_depth)",
    "f(latitude, base_lat)")) +
      labs(title = x$area)
})
ignore <- lapply(jks, print)

