knitr::opts_chunk$set(
  collapse = TRUE,
  comment = ">",
  fig.width = 10,
  fig.height = 8
)

In 2020 only the first 20 hooks were evaluated, so those data are not easily imported into GFbio. Going to incorporate into gfiphc here. Likely need this as a template for future years (resave this file with new year, and change all 2020's to the subsequent year -- and go through the code somewhat manually to check the output as you go along). This code includes some manual checks to make sure the data look okay.

For comparison first look at 2013 data included in gfiphc:

load_all()
setData2013
countData2013

We want to get the new data into the same format as those (columns with same names and classes, even though in retrospect some classes aren't ideally chosen, but also retaining retrieved and observed hooks for the set data). Two data sets are needed because later gfiphc code summarises catches of a particular species at the station level, and needs to create counts of zeros for the species of interest (and such zeros are not included in IPHC output).

Set-level information

Maria was sent the file 2020 IPHCtoDFO_dataExtraction-Maria.xls for set details, but this is multiple sheets and more complex than needed. Will try extracting directly from IPHC website (which they want us to do in the future anyway), using the following instructions:

Go to https://www.iphc.int/data/fiss-data-query and select the following options:

  1. Year Range -- 2020 to 2020.

  2. Area 2B

  3. Purpose Codes -- All

  4. IPHC Charter Regions -- All

  5. Maps -- Nothing

  6. Select non-Pacific halibut species -- deselect All.

Download tab on bottom right (see instructions above question 4), and select CrossTab. Select "Set and Pacific Halibut data" and .xlsx format (I tried .csv format but it didn't save with commas, strangely). Save in this folder as set-and-halibut-data-2020.xlsx. Open in Excel and Export as .csv, set-and-halibut-data-2020.csv, and when trying to quit Excel say no to save changes (not sure if that matters).

Repeat but with all non-halibut data (select All in number 6), and save as non-halibut-data-2020.xlsx and export as .csv in Excel, non-halibut-data-2020.csv. Importantly, this file (but not the first one) contains the numbers of observed hooks, needed in our calculations.

sets_raw <- readr::read_csv("set-and-halibut-data-2020.csv") %>%
  dplyr::mutate_if(is.character, factor)
sets_raw
summary(sets_raw)
testthat::expect_equal(unique(sets_raw$"IPHC Reg Area"), as.factor("2B")) # Check just BC
testthat::expect_equal(unique(sets_raw$Year), 2020)
#testthat::expect_equal(length(unique(sets_raw$Station)),
#                       length(sets_raw$Station))

Understand any issues raised above

Uncomment those three testthat commands when looking at new data each year. If any of fail then have to comment it out and figure out what it means here.

This is for 2020 (check for future years), to look for station(s) that was fished twice:

length(unique(sets_raw$Station))
length(sets_raw$Station)
dplyr::count(sets_raw, Station) %>% dplyr::filter(n > 1)
twice_fished <- dplyr::count(sets_raw, Station) %>%
  dplyr::filter(n > 1) %>%
  select(Station) %>%
  as.numeric()
# If there's more than a single station then adapt later code
as.data.frame(dplyr::filter(sets_raw,
                            Station == twice_fished))

So Station r twice_fished had two vessels fishing the same station (which the code below originally caused a total of four rows for that station, explaining the 200 rows I had in original setData2020 before fixing the issue). Interestingly the halibut catches were almost double for one vessel than the other (but were 6 days apart):

Note that one of those entries has 'Vessel code' HAN, but HAN only appears once in the whole data set (as seen in summary(sets_raw) above:

summary(sets_raw$"Vessel code")

So given we want to exclude one of the duplicates, makes sense to exclude HAN. (Also, Dana mentioned some gear comparison studies for 2020). Do this and simplify down to what's needed and rename, based on iphc2013data.Rnw (need to include the 'purpose' column, unlike 2013):

sets_simp <- dplyr::filter(sets_raw, `Vessel code` != "HAN") %>%
  dplyr::select(year = Year,
                station = Station,
                lat = "MidLat fished",
                lon = "MidLon fished",
                avgDepth = "AvgDepth (fm)",
                skatesHauled = "No. skates hauled",
                effSkateIPHC = "Effective skates hauled",
                soakTimeMinutes = "Soak time (min.)",  # Joe might want
                usable = Eff,
                purpose = Purpose,
                U32halibut = "U32 Pacific halibut count",
                O32halibut = "O32 Pacific halibut count") %>%
  arrange(station) %>%
  dplyr::mutate(year = as.integer(year),
                station = as.character(station),
                avgDepth = as.integer(avgDepth),
                usable = as.character(usable))
sets_simp

For future years check the HAN issue and remove that first line if necessary (especially if HAN is used in the survey).

Then change purpose to standard (Y/N) to match 2018 data (Y for the standard grid). Here purpose takes three values, and we need to convert to standard:

summary(sets_simp$purpose)

sets_simp_std <- dplyr::mutate(sets_simp,
                               standard_tmp = (purpose == "Standard grid"))

standard <- as.character(sets_simp_std$standard_tmp)  # to get the right length
standard[sets_simp_std$standard_tmp] = "Y"
standard[!sets_simp_std$standard_tmp] = "N"
length(standard)

sets_simp_std <- cbind(sets_simp_std,
                   standard) %>%
  as_tibble() %>%
  dplyr::select(-c("standard_tmp"))
summary(sets_simp_std)

Look at data and show map to understand changing definition of standard station from 2018 to 2020.

It seems that the definition of 'standard grid' has changed from 2018 (when first needed due to the expanded grid) to 2020. Simply equating them as above is not sufficient.

This section figures out the two problems and corrects them. However, to replicate the original analysis we need to revert the second correction (the correction changes station 2343 to be non-standard, even though it originally was classed as standard -- see later):

hooks_with_bait_revert <- hooks_with_bait

# This should be commented out for 2021 survey analysis in iphc-2021-data.Rmd,
#  since the problem is fixed:
hooks_with_bait_revert$set_counts[hooks_with_bait_revert$set_counts$year == 2018 &
                           hooks_with_bait_revert$set_counts$station == 2343,
                           ]$standard = "Y"

filter(hooks_with_bait$set_counts, year == 2018, station == 2343) %>%
  as.data.frame()      # saved version
filter(hooks_with_bait_revert$set_counts, year == 2018, station == 2343) %>%
  as.data.frame()      # reverted version

Plotting four years, with crosses showing 'non-standard'. (2020 is coloured different since no hooks with bait data yet, but the important bit is the crosses).

sets_2020 <- dplyr::select(sets_simp_std,
                           -c(U32halibut, O32halibut))
                           # not the same structure as sets_2018, below

plot_iphc_map_panel(hooks_with_bait_revert$set_counts,
               sp = "Hooks with bait",
               years = 2017:2019,
               indicate_standard = TRUE)

plot_iphc_map(sets_2020,
              sp = NULL,
              years = 2020,
              indicate_standard = TRUE)

Clearly 2020 has a few less stations just north of Vancouver Island, but not enough to worry about greatly. The curious ones are the ones way in in the inlets that are no longer flagged as non-standard.

Need to look and plot values:

sets_2018 <- filter(hooks_with_bait_revert$set_counts,
                    year == 2018)
not_std_2018 <- filter(sets_2018,
                       standard == "N")$station


not_std_2020 <- filter(sets_2020,
                       standard == "N")$station

# Not standard in both:
not_std_2018_and_2020 <- intersect(not_std_2018, not_std_2020)
not_std_2018_and_2020

length(not_std_2018)
length(not_std_2020)
length(not_std_2018_and_2020)

# 2018 has some east of the map, all non-standard:
filter(hooks_with_bait_revert$set_counts, year == 2018, lon > -124)$standard
nrow(filter(hooks_with_bait_revert$set_counts, year == 2018, lon > -124))

std_in_2018_but_not_std_in_2020 <- intersect(filter(sets_2018,
                                                    standard == "Y")$station,
                                             not_std_2020)
std_in_2018_but_not_std_in_2020

not_std_in_2018_but_std_in_2020 <- intersect(not_std_2018,
                                             filter(sets_2020,
                                                    standard == "Y")$station)
not_std_in_2018_but_std_in_2020

# setdiff(x, y) - elements in x but not in y
# setdiff(not_std_2018, not_std_2020) - but 2020 fewer coverage so misleading

Plot stations not standard in 2018 but standard in 2020, and vice versa, using each years' lats and lons (to verify that they all still agree -- i.e., that station numbers have consistent lats and lons), and show 2019 data to check no 'usual' stations are non-standard in 2018 or 2020:

plot_BC()
points(lat~lon,
       data = filter(sets_2018,
                     station %in% not_std_in_2018_but_std_in_2020),
       col="red",
       pch = 19)

# Do the same but using 2020 station co-ordinates - should overlap:
points(lat~lon,
       data = filter(sets_2020,
                     station %in% not_std_in_2018_but_std_in_2020),
       col="blue",
       pch = 3)

# And show the single station std in 2018 but not 2020:
points(lat~lon,
       data = filter(sets_2018,
                     station %in% std_in_2018_but_not_std_in_2020),
       col="red",
       pch = 17)
points(lat~lon,
       data = filter(sets_2020,
                     station %in% std_in_2018_but_not_std_in_2020),
       col="blue",
       pch = 1,
       cex = 2)

# Now show all 2019 stations:
points(lat~lon,
       data = filter(hooks_with_bait_revert$set_counts,
                     year == 2019),
       col="darkgreen",
       pch = 0)

legend("bottomleft",
       legend = c("Not std 2018 but std 2020 (2018 co-ords)",
                  "Not std 2018 but std 2020 (2020 co-ords)",
                  "Std 2018 but not std 2020 (2018 co-ords)",
                  "Std 2018 but not std 2020 (2020 co-ords)",
                  "All 2019 stations"),
             pch = c(19, 3, 17, 1, 0),
             pt.cex = c(1, 1, 1, 2, 1),
             col = c("red", "blue", "red", "blue", "darkgreen"))

Check if the one standard station in 2018 but not in 2020 (not fished at all in 2019) appears in any earlier years:

filter(hooks_with_bait_revert$set_counts,
       station == std_in_2018_but_not_std_in_2020) %>%
  as.data.frame()

Was only fished in 2018 and 2020, so we should define it as non-standard.

So, the conclusions from this section are that we should:

  1. Retain the 2018 definitions of standard stations, not the new ones defined in 2020:
sets_simp_std_corrected <- sets_simp_std

sets_simp_std_corrected$standard[sets_simp_std_corrected$station %in%
                                 not_std_in_2018_but_std_in_2020] <- "N"
# cbind(sets_simp_std$standard, sets_simp_std_corrected$standard) # to check
  1. Define station 2343 as non-standard (over-riding original 2018 designation).

So check which functions that need changing, since they create a 'standard' column. These do not need changing: get_iphc_hooks() and get_iphc_skates_info.

Then get_iphc_sets_info() does return standard, but the standard designation is not saved in GFbio it is saved in setDataExpansion in gfiphc. So just need to add a line in IPHC-stations-expanded.R and then re-save all .rda files. Fixed that, now recreating all .rda files, as per the README.

Species counts

First, get the species counts into the desired format (to match countData2013 shown earlier):

counts_raw <- readr::read_csv("non-halibut-data-2020.csv") %>%
  dplyr::mutate_if(is.character, factor)

counts_raw
summary(counts_raw)
testthat::expect_equal(unique(counts_raw$Year), 2020)
testthat::expect_equal(unique(counts_raw$SampleType), as.factor("20Hook"))

# This mismatches for 2020:
# testthat::expect_equal(length(unique(counts_raw$Station)),
#                        length(sets_raw$Station))

unique(counts_raw$"Species Name")

Note that halibut are not included in these counts:

dplyr::filter(counts_raw, "Species Name" == "Pacific Halibut")

which I presume explains why total number of counts for a station does not add up to HooksObserved. See later for halibut calculations.

Need to remove the HAN records for the twice-fished station, which turns out to be set number 4 for station 2104:

dplyr::filter(counts_raw, Station == twice_fished) %>%
  dplyr::select(c("Station", "Setno", "Species Name",
                  "Number Observed")) %>%
    as.data.frame()

dplyr::filter(sets_raw, Station == twice_fished)

So use that here to remove the species counts for that vessel (note that vessel code is not in counts_raw)

dplyr::filter(counts_raw,
              Station == twice_fished & Setno == 4)

# So just keep these:
dplyr::filter(counts_raw,
              !(Station == twice_fished & Setno == 4))

countData2020_no_halibut <- dplyr::filter(counts_raw,
                               !(Station == twice_fished & Setno == 4)) %>%
  dplyr::select(year = Year,
                station = Station,
                spNameIPHC = "Species Name",
                specCount = "Number Observed") %>%
  arrange(station) %>%
  dplyr::mutate(year = as.integer(year),
                station = as.character(station),
                spNameIPHC = as.character(spNameIPHC),
                specCount = as.integer(specCount))

testthat::expect_equal(names(countData2013), names(countData2020_no_halibut))
countData2020_no_halibut
summary(countData2020_no_halibut)

Hooks observed and retrieved

Now, obtain the numbers of hooks observed and retrieved from counts_raw, to then merge into the set details:

hook_details <- dplyr::filter(counts_raw,
                              !(Station == twice_fished & Setno == 4)) %>%
  dplyr::group_by(Station) %>%
  dplyr::summarise(year = unique(Year),
                   hooksRetr = unique(HooksRetrieved),
                   hooksObs = unique(HooksObserved)) %>%
  dplyr::rename(station = Station) %>%
  dplyr::ungroup() %>%
  arrange(station) %>%
  dplyr::mutate(year = as.integer(year),
                station = as.character(station))

hook_details

expect_equal(sets_simp_std_corrected$station, hook_details$station)

So now need to get the hook details into the set details, and keep columns as for setData2013 but also with standard, and may as well keep hooksRetr and hooksObs:

setData2020 <- dplyr::left_join(sets_simp_std_corrected,
                                hook_details,
                                by = c("year", "station")) %>%
  dplyr::mutate(E_it20 = effSkateIPHC * hooksObs / hooksRetr) %>%
  dplyr::select(year,
                station,
                lat,
                lon,
                avgDepth,
                effSkateIPHC,
                E_it20,
                usable,
                standard,
                hooksRetr,
                hooksObs) %>%
  dplyr::mutate(year = as.integer(year),
                station = as.character(station),
                avgDepth = as.integer(avgDepth),
                usable = as.character(usable),
                standard = as.factor(standard))
setData2020
testthat::expect_equal(names(setData2013), names(setData2020)[1:ncol(setData2013)])
summary(setData2020)

Pacific Halibut counts

As noted above, the data extraction for the counts is for all non-halibut species. We still want the halibut counts for just the first 20 hooks -- the data_for_all_species vignette (for data up to 2019) shows that the 20-hook and full hook counts (Series A and B) are very similar when rescaled, and the rescaling is miniscule with $G_A / G_B = 1.005$. So this justifies sticking with 20-hook counts for halibut, even though the full data are available for all sets, given it is a halibut survey. (Using all hooks for all years could be done, but would be a lot of new code).

There are two options for getting halibut counts for the first 20 hooks (given we don't have hook-by-hook data, though it could probably be obtained just not from the IPHC website).

Option 1.

Take the halibut counts for all the hooks and create N_it20_halibut_est = E_it20 / E_it * N_it, or equivalently just N_it20_halibut_est = hooksObs / hooksRetr * N_it. Note that observed refers to observed for non-halibut species (presumably hooksRetr works for halibut). Not strictly the first 20 hooks, but is a rescaling. But will not guarantee integer values.

setData2020_and_halibut <-
  dplyr::left_join(setData2020,
                   dplyr::select(sets_simp_std_corrected,
                                 c(station,
                                   U32halibut,
                                   O32halibut)),
                   by = "station") %>%
  dplyr::mutate(N_it_halibut = U32halibut + O32halibut,
                N_it20_halibut_opt_1 = hooksObs / hooksRetr * N_it_halibut)
setData2020_and_halibut %>% dplyr::select(station,
                                          N_it_halibut,
                                          N_it20_halibut_opt_1)

Option 2.

Add the counts for each set (which include Hook with Skin etc.) and compare with hooksObs. I think the latter is higher, and the difference is halibut (as the only non non-halibut species). Compare with the results from option 1. If close then use option 2, since it will be just be halibut counts and gives an integer number.

Add counts for each set:

counts_20 <- countData2020_no_halibut %>%
  dplyr::group_by(station) %>%
  dplyr::summarise(non_halibut = sum(specCount)) %>%
  dplyr::ungroup()
counts_20

Now join the two options together to calculate N_it20_halibut_opt_2 and compare the two estimates of N_it20_halibut:

compare <-
  dplyr::left_join(setData2020_and_halibut,
                   counts_20,
                   by = "station") %>%
  dplyr::mutate(N_it20_halibut_opt_2 = hooksObs - non_halibut,
                N_it20_opt_1_over_opt_2 = N_it20_halibut_opt_1 / N_it20_halibut_opt_2) %>%
  dplyr::select(year,
                station,
                usable,
                N_it20_halibut_opt_1,
                N_it20_halibut_opt_2,
                N_it20_opt_1_over_opt_2)
compare$spNameIPHC <- "Pacific Halibut"
compare

plot(compare$N_it20_halibut_opt_1, compare$N_it20_halibut_opt_2)
abline(a = 0, b = 1)

So this is the right approach, though numbers not quite as close as may have thought. But these data are used for aggregating across all stations in a year (and any further analyses on halibut should be done using the raw data anyway). And the means aren't too bad:

mean(compare$N_it20_halibut_opt_1)
mean(compare$N_it20_halibut_opt_2)

So either of these would work. So use option 2 since gives an integer count:

compare$N_it20_halibut_opt_2
countData2020_halibut <- dplyr::select(compare,
                                       year,
                                       station,
                                       spNameIPHC,
                                       specCount = N_it20_halibut_opt_2) %>%
  dplyr::mutate(specCount = as.integer(specCount))
countData2020 <- rbind(countData2020_no_halibut,
                       countData2020_halibut) %>%
  dplyr::arrange(station)
# First time running, called the above countData2020_NEW to check remaining data didn't change:
#expect_equal(countData2020, filter(countData2020_NEW, spNameIPHC !=
#                                                      "Pacific Halibut"))

Check species names

The file inst/extdata/iphc-spp-names.csv contains species common names (as used for gfsynopsis, and a few extra like unidentified skate) and the IPHC common name.

These are IPHC names that are not given in iphc-spp-names.csv.

check_iphc_spp_name()

These are the ones just for the new 2020 data:

check_iphc_spp_name(countData2020)

Of these, only the Thornyhead and Blackspotted Rockfish are likely of interest (Issues #17 and #18). And maybe the sharks.

Save data sets

usethis::use_data(countData2020,
                  overwrite = TRUE)

usethis::use_data(setData2020,
                  overwrite = TRUE)


pbs-assess/gfiphc documentation built on July 4, 2023, 1:13 p.m.