Introduction

This R package FishFreqTree helps users to easily explore and quantitatively compare fishery definitions based on a distributional regression tree algorithm that is applied to age/length frequency data. The details regarding distributional regression tree algorithm can be found in @lennert-cody2010 and @lennert-cody2013. Maunder et al. 2022 provides an example that demonstrates how to use this package to make fishery definitions for stock assessments.

How to install

library(devtools)

install_github('HaikunXu/FishFreqTree',ref='main')

Input data

The input age/length frequency data should be a data frame including four columns named exactly as "lat", "lon", "year", and "quarter" and various length frequency columns corresponding to selected age/length bins. The columns "lat" and "lon" represent the latitudinal and longitudinal positions of grid centers, respectively. For those who don't consider quarter as a splitting dimension (e.g., your model has a time step of one year), please still add a column named "quarter" to the input data with value = 1. This regression tree package works with age/length frequency data so please make sure age/length frequency values sum to 1 across age/length bins. An example of the input data can be found [here]{.underline}.

Functions

Main functions

run_regression_tree: explore a user-specified fishery definition based on the regression tree algorithm

Users are required to specify the input data frame (LF), the first (fcol) and last (lcol) columns in the input data frame that have frequency data, the name of all age/length bins (bins) as a numeric vector, the number of splits (Nsplit; equals to the number of defined fisheries - 1), and a directory where results are saved (save_dir).

The function also provides some advanced options including manually building the regression tree (manual = TRUE) and specifying the minimal number of lat (lat.min), lon (lon.min), and year (year.min) allowed for a cell. Users can also turn on/off the year (year) and quarter (quarter) dimensions when building the regression tree.

This function provides a series of standardized outputs for users to understand the result:

The package provides a default fishery definition (run_regression_tree(..., manual = FALSE)) by selecting every split that corresponds to the highest percentage of variability explained (the first row of split.csv files). However, users can explore other definitions by using run_regression_tree(..., manual = TRUE, select = user_specified). The user-specified splits are numbered according to the rank in split.csv files). Please change one split at a time because the step-wise regression tree is hierarchical.

loop_regression_tree: compare differing fishery definitions according to the percentage of variance explained

Users are highly recommended to compare various fishery definitions, even for the same number of splits, because the definition is flexible and may need to be adjusted for practical reasons. Moreover, the tree is hierarchical and unstable, so comparing a variety of combinations with the default combination is highly valuable. In fact, the default one may not explain the highest percentage of variance in the input data.

Users are required to specify the input data frame (LF), the first (fcol) and last (lcol) columns in the input data frame that have frequency data, the name of all age/length bins (bins) as a numeric vector, the number of splits (Nsplit; equals to the number of defined fisheries - 1), a directory where results are saved (save_dir), and the selection matrix the user wants to explore (select_matrix).

The function also provides some advanced options including specifying the minimal number of lat (lat.min), lon (lon.min), and year (year.min) allowed for a cell. Users can also turn on/off the year (year) and quarter (quarter) dimensions when building the regression tree.

For example, there are three splits (Nsplit = 3) and you want to explore two competing splits for each of the three splits. First you need to generate the selection matrix by select_matrix <- expand.grid(split1 = 1:2, split2 = 1:2, split3 = 1:2). The number of combinations to be explored should be 2*2*2=8: dims(select_matrix).

The function provides to the screen the summary table from the loop function (also saved as loop.csv).

| select1 | select2 | select3 | \% var_explained | |---------|---------|---------|------------------| | 1 | 2 | 1 | 0.1642 | | 1 | 1 | 1 | 0.1617 | | 1 | 1 | 2 | 0.1556 | | 1 | 2 | 2 | 0.1499 | | 2 | 1 | 1 | 0.1453 | | 2 | 2 | 1 | 0.1402 | | 2 | 1 | 2 | 0.1394 | | 2 | 2 | 2 | 0.0953 |

evaluate_regression_tree: evaluate a user's pre-specified fishery definition according to the percentage of variance explained

Users can evaluate a pre-specified fishery definition using this function. Users are required to specify the input data frame (LF), the first (fcol) and last (lcol) columns in the input data frame that have frequency data, the name of the column where fishery definition is specified (Flagcol), the name of all age/length bins (bins) as a numeric vector.

This function provides a series of standardized outputs for users to understand the result:

Supporting functions

make.lf.map: make lat-lon gridded maps for length frequency

make_meanl.map: make spatial maps for mean length

lf.aggregate: aggregate length count data by length only or also by year and quarter

lf.demean: remove the mean of length frequency data

Package demonstration

Load the required packages:

# devtools::install_github('HaikunXu/RegressionTree',ref='main')
library(FishFreqTree)
require(tidyverse) # it is required for some supporting functions
# check the example LF data
head(LF)

Specify function inputs

fcol <- 5 # the first column with LF info
lcol <- 17 # the last column with LF info
bins <- seq(30,150,10) # length of bins as a numeric vector
Nsplit <- 3 # the number of splits (the number of cells - 1)
# the directory where results will be saved
save_dir <- "D:/OneDrive - IATTC/Git/FishFreqTree/manual/"

Plot the length frequency data as lat-lon grids and map of mean length

# plot lf data as maps
make.lf.map(LF, fcol, lcol, bins, save_dir)
# plot mean length as maps
make.meanl.map(LF, fcol, lcol, bins, save_dir)

Find the default 4-cell combination:

LF_Tree <- run_regression_tree(LF, fcol, lcol, bins, Nsplit, save_dir)

Check the default 4-cell combination results

head(LF_Tree$LF)
# a summary of the three splits
LF_Tree$Record
# Color-code the 4 fisheries in the map
make.split.map(LF_Tree$LF, Nsplit, save_dir)
# or text-code the 4 fisheries in the map (works better when quarterly splits exist)
make.split.map(LF_Tree$LF, Nsplit, save_dir, text = TRUE)

In addition to the default combination, you can also manually explore other 4-cell combinations. For example, in the figure above cell#2 only includes two 5*5 grids, so you want to explore some other 4-cell combinations. You can run the regression tree again with the second best 2th split:

LF_Tree2 <-
  run_regression_tree(LF,
                      fcol,
                      lcol,
                      bins,
                      Nsplit,
                      save_dir,
                      manual = TRUE,
                      select = c(1, 2, 1))
# a summary of the three splits
LF_Tree2$Record
# map the 4 cells
make.split.map(LF_Tree2$LF, Nsplit, save_dir)

This setting leads to a even higher proportion of variance explained (16.4% vs. 16.2%) than the default run. Also, the number of 5*5 grids within each cell is more reasonable. In fact, you can use the loop function to explore multiple 4-cell combinations. Those combinations are compared and sorted according to the proportion of variance explained.

# create a directory to save results from the loop function
loop_dir <- paste0(save_dir,"loop/")
dir.create(loop_dir)
# build the selection matrix
my_select_matrix <-
  data.matrix(expand.grid(
    split1 = 1:2,
    split2 = 1:2,
    split3 = 1:2
))

# check the selction matrix
my_select_matrix

LF_Tree_Loop <-
  loop_regression_tree(LF,
                       fcol,
                       lcol,
                       bins,
                       Nsplit,
                       save_dir = loop_dir,
                       select_matrix = my_select_matrix)

Users can also pre-specify a fishery definition and evaluate the performance of the pre-specified splits according to the percentage of variance explained:

# first specify the fishery definition using column Flag3
LF_Tree3 <- LF_Tree$LF
LF_Tree3$Flag3 <- ifelse(LF_Tree3$Flag3 > 2, LF_Tree3$Flag3, 1)
# check the fishery deifnition is specified as expected
make.split.map(LF_Tree3, 3, save_dir)
# evlaute the fishery definition
evaluate_regression_tree(LF_Tree3,fcol,lcol,"Flag3",bins,save_dir,folder_name="test")

References



HaikunXu/FishFreqTree documentation built on Jan. 25, 2025, 12:07 a.m.