knitr::opts_chunk$set( collapse = TRUE, comment = ">", fig.width = 10, fig.height = 8 )
In 2020 only the first 20 hooks were evaluated, so those data are not easily imported into GFbio. Going to incorporate into gfiphc here. Likely need this as a template for future years (resave this file with new year, and change all 2020's to the subsequent year -- and go through the code somewhat manually to check the output as you go along). This code includes some manual checks to make sure the data look okay.
For comparison first look at 2013 data included in gfiphc:
load_all()
setData2013
countData2013
We want to get the new data into the same format as those (columns with same names and classes, even though in retrospect some classes aren't ideally chosen, but also retaining retrieved and observed hooks for the set data). Two data sets are needed because later gfiphc code summarises catches of a particular species at the station level, and needs to create counts of zeros for the species of interest (and such zeros are not included in IPHC output).
Maria was sent the file 2020 IPHCtoDFO_dataExtraction-Maria.xls
for set
details, but this is multiple sheets and more complex than needed. Will try
extracting directly from IPHC website (which they want us to do in the future
anyway), using the following instructions:
Go to https://www.iphc.int/data/fiss-data-query and select the following options:
Year Range -- 2020 to 2020.
Area 2B
Purpose Codes -- All
IPHC Charter Regions -- All
Maps -- Nothing
Select non-Pacific halibut species -- deselect All.
Download tab on bottom right (see instructions above question 4), and select
CrossTab. Select "Set and Pacific Halibut data" and .xlsx format (I tried .csv
format but it didn't save with commas, strangely). Save in this
folder as set-and-halibut-data-2020.xlsx
. Open in Excel and Export as .csv, set-and-halibut-data-2020.csv
,
and when trying to quit Excel say no to save changes (not sure if that matters).
Repeat but with all non-halibut data (select All in number 6), and save as
non-halibut-data-2020.xlsx
and export as .csv in Excel,
non-halibut-data-2020.csv
. Importantly, this file
(but not the first one) contains the numbers of observed hooks, needed in our
calculations.
sets_raw <- readr::read_csv("set-and-halibut-data-2020.csv") %>% dplyr::mutate_if(is.character, factor) sets_raw summary(sets_raw) testthat::expect_equal(unique(sets_raw$"IPHC Reg Area"), as.factor("2B")) # Check just BC testthat::expect_equal(unique(sets_raw$Year), 2020) #testthat::expect_equal(length(unique(sets_raw$Station)), # length(sets_raw$Station))
Uncomment those three testthat
commands when looking at new data each year. If
any of fail then have to comment it out and figure out what it means here.
This is for 2020 (check for future years), to look for station(s) that was fished twice:
length(unique(sets_raw$Station)) length(sets_raw$Station) dplyr::count(sets_raw, Station) %>% dplyr::filter(n > 1) twice_fished <- dplyr::count(sets_raw, Station) %>% dplyr::filter(n > 1) %>% select(Station) %>% as.numeric() # If there's more than a single station then adapt later code as.data.frame(dplyr::filter(sets_raw, Station == twice_fished))
So Station r twice_fished
had two vessels fishing the same station (which the code below
originally caused a total of four rows for that station, explaining the 200 rows
I had in original setData2020
before fixing the issue). Interestingly the
halibut catches were almost double for one vessel than the other (but were 6
days apart):
Note that one of those entries has 'Vessel code' HAN, but HAN only appears once
in the whole data set (as seen in summary(sets_raw)
above:
summary(sets_raw$"Vessel code")
So given we want to exclude one of the duplicates, makes sense to exclude
HAN. (Also, Dana mentioned some gear comparison studies for 2020).
Do this and simplify down to what's needed and rename, based on iphc2013data.Rnw
(need
to include the 'purpose' column, unlike 2013):
sets_simp <- dplyr::filter(sets_raw, `Vessel code` != "HAN") %>% dplyr::select(year = Year, station = Station, lat = "MidLat fished", lon = "MidLon fished", avgDepth = "AvgDepth (fm)", skatesHauled = "No. skates hauled", effSkateIPHC = "Effective skates hauled", soakTimeMinutes = "Soak time (min.)", # Joe might want usable = Eff, purpose = Purpose, U32halibut = "U32 Pacific halibut count", O32halibut = "O32 Pacific halibut count") %>% arrange(station) %>% dplyr::mutate(year = as.integer(year), station = as.character(station), avgDepth = as.integer(avgDepth), usable = as.character(usable)) sets_simp
For future years check the HAN issue and remove that first line if necessary (especially if HAN is used in the survey).
Then change purpose to standard
(Y/N
) to match 2018 data (Y
for the
standard grid). Here purpose
takes three values, and we need to convert to
standard
:
summary(sets_simp$purpose) sets_simp_std <- dplyr::mutate(sets_simp, standard_tmp = (purpose == "Standard grid")) standard <- as.character(sets_simp_std$standard_tmp) # to get the right length standard[sets_simp_std$standard_tmp] = "Y" standard[!sets_simp_std$standard_tmp] = "N" length(standard) sets_simp_std <- cbind(sets_simp_std, standard) %>% as_tibble() %>% dplyr::select(-c("standard_tmp")) summary(sets_simp_std)
It seems that the definition of 'standard grid' has changed from 2018 (when first needed due to the expanded grid) to 2020. Simply equating them as above is not sufficient.
This section figures out the two problems and corrects them. However, to replicate the original
analysis we need to revert the second correction (the correction changes station
2343
to be non-standard, even though it originally was classed as standard --
see later):
hooks_with_bait_revert <- hooks_with_bait # This should be commented out for 2021 survey analysis in iphc-2021-data.Rmd, # since the problem is fixed: hooks_with_bait_revert$set_counts[hooks_with_bait_revert$set_counts$year == 2018 & hooks_with_bait_revert$set_counts$station == 2343, ]$standard = "Y" filter(hooks_with_bait$set_counts, year == 2018, station == 2343) %>% as.data.frame() # saved version filter(hooks_with_bait_revert$set_counts, year == 2018, station == 2343) %>% as.data.frame() # reverted version
Plotting four years, with crosses showing 'non-standard'. (2020 is coloured different since no hooks with bait data yet, but the important bit is the crosses).
sets_2020 <- dplyr::select(sets_simp_std, -c(U32halibut, O32halibut)) # not the same structure as sets_2018, below plot_iphc_map_panel(hooks_with_bait_revert$set_counts, sp = "Hooks with bait", years = 2017:2019, indicate_standard = TRUE) plot_iphc_map(sets_2020, sp = NULL, years = 2020, indicate_standard = TRUE)
Clearly 2020 has a few less stations just north of Vancouver Island, but not enough to worry about greatly. The curious ones are the ones way in in the inlets that are no longer flagged as non-standard.
Need to look and plot values:
sets_2018 <- filter(hooks_with_bait_revert$set_counts, year == 2018) not_std_2018 <- filter(sets_2018, standard == "N")$station not_std_2020 <- filter(sets_2020, standard == "N")$station # Not standard in both: not_std_2018_and_2020 <- intersect(not_std_2018, not_std_2020) not_std_2018_and_2020 length(not_std_2018) length(not_std_2020) length(not_std_2018_and_2020) # 2018 has some east of the map, all non-standard: filter(hooks_with_bait_revert$set_counts, year == 2018, lon > -124)$standard nrow(filter(hooks_with_bait_revert$set_counts, year == 2018, lon > -124)) std_in_2018_but_not_std_in_2020 <- intersect(filter(sets_2018, standard == "Y")$station, not_std_2020) std_in_2018_but_not_std_in_2020 not_std_in_2018_but_std_in_2020 <- intersect(not_std_2018, filter(sets_2020, standard == "Y")$station) not_std_in_2018_but_std_in_2020 # setdiff(x, y) - elements in x but not in y # setdiff(not_std_2018, not_std_2020) - but 2020 fewer coverage so misleading
Plot stations not standard in 2018 but standard in 2020, and vice versa, using each years' lats and lons (to verify that they all still agree -- i.e., that station numbers have consistent lats and lons), and show 2019 data to check no 'usual' stations are non-standard in 2018 or 2020:
plot_BC() points(lat~lon, data = filter(sets_2018, station %in% not_std_in_2018_but_std_in_2020), col="red", pch = 19) # Do the same but using 2020 station co-ordinates - should overlap: points(lat~lon, data = filter(sets_2020, station %in% not_std_in_2018_but_std_in_2020), col="blue", pch = 3) # And show the single station std in 2018 but not 2020: points(lat~lon, data = filter(sets_2018, station %in% std_in_2018_but_not_std_in_2020), col="red", pch = 17) points(lat~lon, data = filter(sets_2020, station %in% std_in_2018_but_not_std_in_2020), col="blue", pch = 1, cex = 2) # Now show all 2019 stations: points(lat~lon, data = filter(hooks_with_bait_revert$set_counts, year == 2019), col="darkgreen", pch = 0) legend("bottomleft", legend = c("Not std 2018 but std 2020 (2018 co-ords)", "Not std 2018 but std 2020 (2020 co-ords)", "Std 2018 but not std 2020 (2018 co-ords)", "Std 2018 but not std 2020 (2020 co-ords)", "All 2019 stations"), pch = c(19, 3, 17, 1, 0), pt.cex = c(1, 1, 1, 2, 1), col = c("red", "blue", "red", "blue", "darkgreen"))
Check if the one standard station in 2018 but not in 2020 (not fished at all in 2019) appears in any earlier years:
filter(hooks_with_bait_revert$set_counts, station == std_in_2018_but_not_std_in_2020) %>% as.data.frame()
Was only fished in 2018 and 2020, so we should define it as non-standard.
So, the conclusions from this section are that we should:
sets_simp_std_corrected <- sets_simp_std sets_simp_std_corrected$standard[sets_simp_std_corrected$station %in% not_std_in_2018_but_std_in_2020] <- "N" # cbind(sets_simp_std$standard, sets_simp_std_corrected$standard) # to check
So check which functions that need changing, since they create a 'standard'
column. These do not need changing: get_iphc_hooks()
and get_iphc_skates_info
.
Then get_iphc_sets_info()
does return standard
, but the standard
designation is not saved in GFbio it is saved in setDataExpansion
in gfiphc
.
So just need to add a line in IPHC-stations-expanded.R
and then re-save all
.rda files. Fixed that, now recreating all .rda files, as per the README.
First, get the species counts into the desired format (to match countData2013
shown earlier):
counts_raw <- readr::read_csv("non-halibut-data-2020.csv") %>% dplyr::mutate_if(is.character, factor) counts_raw summary(counts_raw) testthat::expect_equal(unique(counts_raw$Year), 2020) testthat::expect_equal(unique(counts_raw$SampleType), as.factor("20Hook")) # This mismatches for 2020: # testthat::expect_equal(length(unique(counts_raw$Station)), # length(sets_raw$Station)) unique(counts_raw$"Species Name")
Note that halibut are not included in these counts:
dplyr::filter(counts_raw, "Species Name" == "Pacific Halibut")
which I presume explains why
total number of counts for a station does not add up to HooksObserved
. See
later for halibut calculations.
Need to remove the HAN records for the twice-fished station, which turns out to be set number 4 for station 2104:
dplyr::filter(counts_raw, Station == twice_fished) %>% dplyr::select(c("Station", "Setno", "Species Name", "Number Observed")) %>% as.data.frame() dplyr::filter(sets_raw, Station == twice_fished)
So use that here to remove the species counts for that vessel (note that vessel
code is not in counts_raw
)
dplyr::filter(counts_raw, Station == twice_fished & Setno == 4) # So just keep these: dplyr::filter(counts_raw, !(Station == twice_fished & Setno == 4)) countData2020_no_halibut <- dplyr::filter(counts_raw, !(Station == twice_fished & Setno == 4)) %>% dplyr::select(year = Year, station = Station, spNameIPHC = "Species Name", specCount = "Number Observed") %>% arrange(station) %>% dplyr::mutate(year = as.integer(year), station = as.character(station), spNameIPHC = as.character(spNameIPHC), specCount = as.integer(specCount)) testthat::expect_equal(names(countData2013), names(countData2020_no_halibut)) countData2020_no_halibut summary(countData2020_no_halibut)
Now, obtain the numbers of hooks observed and retrieved from counts_raw
, to
then merge into the set details:
hook_details <- dplyr::filter(counts_raw, !(Station == twice_fished & Setno == 4)) %>% dplyr::group_by(Station) %>% dplyr::summarise(year = unique(Year), hooksRetr = unique(HooksRetrieved), hooksObs = unique(HooksObserved)) %>% dplyr::rename(station = Station) %>% dplyr::ungroup() %>% arrange(station) %>% dplyr::mutate(year = as.integer(year), station = as.character(station)) hook_details expect_equal(sets_simp_std_corrected$station, hook_details$station)
So now need to get the hook details into the set details, and keep columns as
for setData2013
but also with standard
, and may as well keep hooksRetr
and
hooksObs
:
setData2020 <- dplyr::left_join(sets_simp_std_corrected, hook_details, by = c("year", "station")) %>% dplyr::mutate(E_it20 = effSkateIPHC * hooksObs / hooksRetr) %>% dplyr::select(year, station, lat, lon, avgDepth, effSkateIPHC, E_it20, usable, standard, hooksRetr, hooksObs) %>% dplyr::mutate(year = as.integer(year), station = as.character(station), avgDepth = as.integer(avgDepth), usable = as.character(usable), standard = as.factor(standard)) setData2020 testthat::expect_equal(names(setData2013), names(setData2020)[1:ncol(setData2013)]) summary(setData2020)
As noted above, the data extraction for the counts is for all non-halibut
species. We still want the halibut counts for just the first 20 hooks -- the
data_for_all_species
vignette (for data up to 2019) shows that the 20-hook and full hook counts
(Series A and B) are very similar when rescaled, and the rescaling is miniscule
with $G_A / G_B = 1.005$. So this justifies sticking with 20-hook counts for
halibut, even though the full data are available for all sets, given it is a
halibut survey. (Using all hooks for all years could be done, but would be a lot
of new code).
There are two options for getting halibut counts for the first 20 hooks (given we don't have hook-by-hook data, though it could probably be obtained just not from the IPHC website).
Take the halibut counts for all the hooks and create N_it20_halibut_est = E_it20 / E_it *
N_it
, or equivalently just N_it20_halibut_est = hooksObs / hooksRetr *
N_it
. Note that observed refers to observed for non-halibut species
(presumably hooksRetr
works for halibut). Not
strictly the first 20 hooks, but is a rescaling. But will not guarantee integer values.
setData2020_and_halibut <- dplyr::left_join(setData2020, dplyr::select(sets_simp_std_corrected, c(station, U32halibut, O32halibut)), by = "station") %>% dplyr::mutate(N_it_halibut = U32halibut + O32halibut, N_it20_halibut_opt_1 = hooksObs / hooksRetr * N_it_halibut) setData2020_and_halibut %>% dplyr::select(station, N_it_halibut, N_it20_halibut_opt_1)
Add the counts for each set (which include Hook with Skin
etc.) and compare
with hooksObs
. I think the latter is higher, and the difference is halibut
(as the only non non-halibut
species). Compare with the results from
option 1. If close then use option 2, since it will be just be halibut counts
and gives an integer number.
Add counts for each set:
counts_20 <- countData2020_no_halibut %>% dplyr::group_by(station) %>% dplyr::summarise(non_halibut = sum(specCount)) %>% dplyr::ungroup() counts_20
Now join the two options together to calculate N_it20_halibut_opt_2
and
compare the two estimates of N_it20_halibut
:
compare <- dplyr::left_join(setData2020_and_halibut, counts_20, by = "station") %>% dplyr::mutate(N_it20_halibut_opt_2 = hooksObs - non_halibut, N_it20_opt_1_over_opt_2 = N_it20_halibut_opt_1 / N_it20_halibut_opt_2) %>% dplyr::select(year, station, usable, N_it20_halibut_opt_1, N_it20_halibut_opt_2, N_it20_opt_1_over_opt_2) compare$spNameIPHC <- "Pacific Halibut" compare plot(compare$N_it20_halibut_opt_1, compare$N_it20_halibut_opt_2) abline(a = 0, b = 1)
So this is the right approach, though numbers not quite as close as may have thought. But these data are used for aggregating across all stations in a year (and any further analyses on halibut should be done using the raw data anyway). And the means aren't too bad:
mean(compare$N_it20_halibut_opt_1) mean(compare$N_it20_halibut_opt_2)
So either of these would work. So use option 2 since gives an integer count:
compare$N_it20_halibut_opt_2 countData2020_halibut <- dplyr::select(compare, year, station, spNameIPHC, specCount = N_it20_halibut_opt_2) %>% dplyr::mutate(specCount = as.integer(specCount)) countData2020 <- rbind(countData2020_no_halibut, countData2020_halibut) %>% dplyr::arrange(station) # First time running, called the above countData2020_NEW to check remaining data didn't change: #expect_equal(countData2020, filter(countData2020_NEW, spNameIPHC != # "Pacific Halibut"))
The file inst/extdata/iphc-spp-names.csv
contains species common names (as
used for gfsynopsis, and a few extra like unidentified skate
) and the IPHC
common name.
These are IPHC names that are not given in iphc-spp-names.csv
.
check_iphc_spp_name()
These are the ones just for the new 2020 data:
check_iphc_spp_name(countData2020)
Of these, only the Thornyhead and Blackspotted Rockfish are likely of interest (Issues #17 and #18). And maybe the sharks.
usethis::use_data(countData2020, overwrite = TRUE) usethis::use_data(setData2020, overwrite = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.