In seananderson/gfplot: Data Extraction and Plotting for DFO PBS Groundfish Stocks

Here we will compare the age frequency weighting functions against the frequency weighting code previously implemented by Rowan.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 8 / 1.618
)

library(gfplot)
library(dplyr)
library(ggplot2)

Let's cache all the data for redstripe rockfish using gfplot:

cache_pbs_data("redstrip rockfish", path = "data-cache")

Survey samples

First let's compare the weighting with the survey data.

Let's read in the gfplot data and apply the weighting:

survey_samples <- readRDS("data-cache/pbs-survey-samples.rds")
survey_sets <- readRDS("data-cache/pbs-survey-sets.rds")

survey_samples <- survey_samples %>%
  filter(survey_series_desc == "Queen Charlotte Sound Synoptic Bottom Trawl" )

dd <- tidy_comps_survey(survey_samples, survey_sets, value = age)
dd$age <- paste(ifelse(dd$sex == 1, "M", "F"), dd$age)
dd$sex <- NULL

surv_ages <- weight_comps(dd)
surv_ages <- surv_ages %>%
  mutate(sex = substr(value, 1, 1),
    value = as.numeric(as.character(gsub("[A-Z ]+", "", value)))) %>%
  mutate(value = ifelse(value >= 40, 40, value))

surv_ages %>%
  ggplot(aes(year, value, size = weighted_prop)) +
  geom_point(pch = 21) +
  theme_pbs() +
  facet_wrap(~sex) +
  scale_size_continuous(range = c(0, 8)) +
  ylim(0, 39)

Rowan's weighted data:

rh <- read.csv("awatea-wpa439(den)-ssid(1)-major(567).csv",
  stringsAsFactors = FALSE, strip.white = TRUE)

rh <- rh %>%
  select(-nsid, -age1, -ageN) %>%
  reshape2::melt(id.vars = c("series", "year"), value.name = "weighted_prop",
    variable.name = "value") %>%
  mutate(sex = toupper(substr(value, 1, 1)),
    value = as.numeric(as.character(gsub("[a-z]+", "", value)))) %>%
  mutate(weighted_prop = ifelse(weighted_prop == 0, NA, weighted_prop)) %>%
  select(-series)

ggplot(rh, aes(year, value, size = weighted_prop)) +
  geom_point(pch = 21) + theme_pbs() +
  facet_wrap(~sex) +
  scale_size_continuous(range = c(0, 8)) +
  ylim(0, 39)

Plot the two together:

all <- rbind(mutate(rh, type = "RH"),
  mutate(surv_ages, type = "gfplot"))
all <- as.data.frame(na.omit(all))

ggplot(all, aes(year, value, size = weighted_prop, colour = type)) +
  geom_point(pch = 21) + theme_pbs() +
  facet_wrap(~sex) +
  scale_size_continuous(range = c(0, 8)) +
  ylim(0, 39)

And also add the raw data:

raw <- survey_samples %>%
  select(year, sex, age) %>%
  filter(!is.na(age), sex %in% c(1, 2)) %>%
  group_by(year, age, sex) %>%
  summarise(freq = n()) %>%
  group_by(year) %>%
  mutate(weighted_prop = freq / sum(freq)) %>%
  select(-freq) %>%
  mutate(sex = ifelse(sex == 1, "M", "F")) %>%
  rename(value = age) %>%
  mutate(type = "raw")

all <- rbind(all, as.data.frame(raw))

ggplot(all, aes(year, value, size = weighted_prop, colour = type)) +
  geom_point(pch = 21) + theme_pbs() +
  facet_wrap(~sex) +
  scale_size_continuous(range = c(0, 8)) +
  ylim(0, 39) +
  scale_colour_manual(
    values = c("raw" = "grey40", "RH" = "red", "gfplot" = "blue"))

Note that we are starting with different source data here. My understanding is that Rowan's data includes samples from tows regardless of their usability code. The data from gfplot makes use of Norm's procedure used when documenting the relative biomass indices and so only includes fishing events with usable usability codes.

As we will see below, the weighting is almost identical for the commercial samples and both use the same weighting function meaning that the differences here are almost certainly from the slightly different criteria used when filtering the source data.

Commercial samples

Next let's do the same thing for the commercial samples.

gplot data first. Elise wrote some SQL to match the criteria used by Rowan when extracting the commercial sample data:

sql_query <- readLines("get-comm-samples-rowan.sql")
com_samples <- gfplot::run_sql("GFBioSQL", query = sql_query)
saveRDS(com_samples, "redstripe-comm-samples-rowan.rds")

com_samples <- readRDS("redstripe-comm-samples-rowan.rds") %>%
  dplyr::as.tbl()
names(com_samples) <- tolower(names(com_samples))
com_samples <- mutate(com_samples,
  species_science_name = tolower(species_science_name),
  species_common_name = tolower(species_common_name))
com_samples <- mutate(com_samples, year = lubridate::year(trip_start_date))
com_samples <- filter(com_samples, year >= 1996)
assertthat::assert_that(sum(duplicated(com_samples$specimen_id)) == 0)

com_catch <- readRDS("data-cache/pbs-catch.rds")

gfplot weighting:

dd <- tidy_comps_commercial(com_samples, com_catch, value = age)
dd$age <- paste(ifelse(dd$sex == 1, "M", "F"), dd$age)
dd$sex <- NULL

com_ages <- weight_comps(dd)
com_ages <- com_ages %>%
  mutate(sex = substr(value, 1, 1),
    value = as.numeric(as.character(gsub("[A-Z ]+", "", value)))) %>%
  mutate(value = ifelse(value >= 40, 40, value))

com_ages %>%
  ggplot(aes(year, value, size = weighted_prop)) +
  geom_point(pch = 21) +
  theme_pbs() +
  facet_wrap(~sex) +
  scale_size_continuous(range = c(0, 8)) +
  ylim(0, 39)

Rowan's weighted data:

rh <- read.csv("awatea-wpa439(cat)-tt(145)-major(34567).csv",
  stringsAsFactors = FALSE, strip.white = TRUE) %>%
  filter(year >= 1996)

rh <- rh %>%
  select(-ntid, -age1, -ageN) %>%
  reshape2::melt(id.vars = c("series", "year"), value.name = "weighted_prop",
    variable.name = "value") %>%
  mutate(sex = toupper(substr(value, 1, 1)),
    value = as.numeric(as.character(gsub("[a-z]+", "", value)))) %>%
  mutate(weighted_prop = ifelse(weighted_prop == 0, NA, weighted_prop)) %>%
  select(-series)

ggplot(rh, aes(year, value, size = weighted_prop)) +
  geom_point(pch = 21) + theme_pbs() +
  facet_wrap(~sex) +
  scale_size_continuous(range = c(0, 8)) +
  ylim(0, 39)

Compare the two visually:

all <- rbind(mutate(rh, type = "RH"),
  mutate(com_ages, type = "gfplot"))
all <- as.data.frame(na.omit(all))

ggplot(all, aes(year, value, size = weighted_prop, colour = type)) +
  geom_point(pch = 21) + theme_pbs() +
  facet_wrap(~sex) +
  scale_size_continuous(range = c(0, 8)) +
  ylim(0, 39)

This looks nearly perfect to me. There are a small number of 29, 30, and 31 year-old redstripe in 2013 in Rowan's data that aren't in ours (see the blue dots that stand out). When we go back to the databases themselves, we can't find any redstripe rockfish of those ages in those years regardless of filtering criteria (actually I believe we found one that gets filtered out). The most likely scenario is that some data have been corrected in the database since Rowan pulled his version of the data.

A plot with raw frequencies added in grey:

raw <- com_samples %>%
  select(year, sex, age) %>%
  filter(!is.na(age), sex %in% c(1, 2)) %>%
  group_by(year, age, sex) %>%
  summarise(freq = n()) %>%
  group_by(year) %>%
  mutate(weighted_prop = freq / sum(freq)) %>%
  select(-freq) %>%
  mutate(sex = ifelse(sex == 1, "M", "F")) %>%
  rename(value = age) %>%
  mutate(type = "raw")

all <- rbind(all, as.data.frame(raw))

ggplot(all, aes(year, value, size = weighted_prop, colour = type)) +
  geom_point(pch = 21) + theme_pbs() +
  facet_wrap(~sex) +
  scale_size_continuous(range = c(0, 8)) +
  ylim(0, 39) +
  geom_vline(xintercept = c(2009, 2004)) +
  scale_colour_manual(
    values = c("raw" = "grey40", "RH" = "red", "gfplot" = "blue"))

So we can see here that the weighting does make a difference (albeit usually relatively minor), but the two weighting implementations look effectively identical.

When I go back further than 1996 that looks like there are some small differences in the weighted frequencies. This most likely comes down to the catch data that is used.

seananderson/gfplot documentation built on April 13, 2025, 6:18 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com