This R package FishFreqTree helps users to easily explore and quantitatively compare fishery definitions based on a distributional regression tree algorithm that is applied to age/length frequency data. The details regarding distributional regression tree algorithm can be found in @lennert-cody2010 and @lennert-cody2013. Maunder et al. 2022 provides an example that demonstrates how to use this package to make fishery definitions for stock assessments.
library(devtools)
install_github('HaikunXu/FishFreqTree',ref='main')
The input age/length frequency data should be a data frame including four columns named exactly as "lat", "lon", "year", and "quarter" and various length frequency columns corresponding to selected age/length bins. The columns "lat" and "lon" represent the latitudinal and longitudinal positions of grid centers, respectively. For those who don't consider quarter as a splitting dimension (e.g., your model has a time step of one year), please still add a column named "quarter" to the input data with value = 1. This regression tree package works with age/length frequency data so please make sure age/length frequency values sum to 1 across age/length bins. An example of the input data can be found [here]{.underline}.
run_regression_tree
: explore a user-specified fishery definition based on the regression tree algorithmUsers are required to specify the input data frame (LF
), the first (fcol
) and last (lcol
) columns in the input data frame that have frequency data, the name of all age/length bins (bins
) as a numeric vector, the number of splits (Nsplit
; equals to the number of defined fisheries - 1), and a directory where results are saved (save_dir
).
The function also provides some advanced options including manually building the regression tree (manual = TRUE
) and specifying the minimal number of lat (lat.min
), lon (lon.min
), and year (year.min
) allowed for a cell. Users can also turn on/off the year (year
) and quarter (quarter
) dimensions when building the regression tree.
This function provides a series of standardized outputs for users to understand the result:
split.csv: all candidate splits are compared and sorted across existing cells based on the percentage of total variance explained. The last column of this table (Rank) is used to specify the step-wise selection decision.
improvement-split.csv: cell-specific improvement metric for all candidate splits; values are sorted for every cell (highest values are preferred for the selection). This table provides supplementary information only and is NOT used to specify the step-wise selection decision.
Record.csv: summarize step-wise split information including split number, key, value, cell, and the percentage of total variance explained.
split(annual maps).png: spatial distribution of cells across quarters
split(quarterly maps).png: spatial distribution of cells by quarter
split(latlon).png: cell-specific improvement profiles against lat and lon
split(year).png: cell-specific improvement profiles against year
split(lf).png: comparison of cell-specific length frequency
The package provides a default fishery definition (run_regression_tree(..., manual = FALSE)
) by selecting every split that corresponds to the highest percentage of variability explained (the first row of split.csv files). However, users can explore other definitions by using run_regression_tree(..., manual = TRUE, select = user_specified)
. The user-specified splits are numbered according to the rank in split.csv files). Please change one split at a time because the step-wise regression tree is hierarchical.
loop_regression_tree
: compare differing fishery definitions according to the percentage of variance explainedUsers are highly recommended to compare various fishery definitions, even for the same number of splits, because the definition is flexible and may need to be adjusted for practical reasons. Moreover, the tree is hierarchical and unstable, so comparing a variety of combinations with the default combination is highly valuable. In fact, the default one may not explain the highest percentage of variance in the input data.
Users are required to specify the input data frame (LF
), the first (fcol
) and last (lcol
) columns in the input data frame that have frequency data, the name of all age/length bins (bins
) as a numeric vector, the number of splits (Nsplit
; equals to the number of defined fisheries - 1), a directory where results are saved (save_dir
), and the selection matrix the user wants to explore (select_matrix
).
The function also provides some advanced options including specifying the minimal number of lat (lat.min
), lon (lon.min
), and year (year.min
) allowed for a cell. Users can also turn on/off the year (year
) and quarter (quarter
) dimensions when building the regression tree.
For example, there are three splits (Nsplit = 3
) and you want to explore two competing splits for each of the three splits. First you need to generate the selection matrix by select_matrix <- expand.grid(split1 = 1:2, split2 = 1:2, split3 = 1:2)
. The number of combinations to be explored should be 2*2*2=8: dims(select_matrix)
.
The function provides to the screen the summary table from the loop function (also saved as loop.csv).
| select1 | select2 | select3 | \% var_explained | |---------|---------|---------|------------------| | 1 | 2 | 1 | 0.1642 | | 1 | 1 | 1 | 0.1617 | | 1 | 1 | 2 | 0.1556 | | 1 | 2 | 2 | 0.1499 | | 2 | 1 | 1 | 0.1453 | | 2 | 2 | 1 | 0.1402 | | 2 | 1 | 2 | 0.1394 | | 2 | 2 | 2 | 0.0953 |
evaluate_regression_tree
: evaluate a user's pre-specified fishery definition according to the percentage of variance explainedUsers can evaluate a pre-specified fishery definition using this function. Users are required to specify the input data frame (LF
), the first (fcol
) and last (lcol
) columns in the input data frame that have frequency data, the name of the column where fishery definition is specified (Flagcol
), the name of all age/length bins (bins
) as a numeric vector.
This function provides a series of standardized outputs for users to understand the result:
split(annual maps).png: spatial distribution of cells across quarters
split(lf).png: comparison of cell-specific length frequency
make.lf.map
: make lat-lon gridded maps for length frequencymake_meanl.map
: make spatial maps for mean lengthlf.aggregate
: aggregate length count data by length only or also by year and quarterlf.demean
: remove the mean of length frequency dataLoad the required packages:
# devtools::install_github('HaikunXu/RegressionTree',ref='main') library(FishFreqTree) require(tidyverse) # it is required for some supporting functions # check the example LF data head(LF)
Specify function inputs
fcol <- 5 # the first column with LF info lcol <- 17 # the last column with LF info bins <- seq(30,150,10) # length of bins as a numeric vector Nsplit <- 3 # the number of splits (the number of cells - 1) # the directory where results will be saved save_dir <- "D:/OneDrive - IATTC/Git/FishFreqTree/manual/"
Plot the length frequency data as lat-lon grids and map of mean length
# plot lf data as maps make.lf.map(LF, fcol, lcol, bins, save_dir) # plot mean length as maps make.meanl.map(LF, fcol, lcol, bins, save_dir)
Find the default 4-cell combination:
LF_Tree <- run_regression_tree(LF, fcol, lcol, bins, Nsplit, save_dir)
Check the default 4-cell combination results
head(LF_Tree$LF) # a summary of the three splits LF_Tree$Record # Color-code the 4 fisheries in the map make.split.map(LF_Tree$LF, Nsplit, save_dir) # or text-code the 4 fisheries in the map (works better when quarterly splits exist) make.split.map(LF_Tree$LF, Nsplit, save_dir, text = TRUE)
In addition to the default combination, you can also manually explore other 4-cell combinations. For example, in the figure above cell#2 only includes two 5*5 grids, so you want to explore some other 4-cell combinations. You can run the regression tree again with the second best 2th split:
LF_Tree2 <- run_regression_tree(LF, fcol, lcol, bins, Nsplit, save_dir, manual = TRUE, select = c(1, 2, 1)) # a summary of the three splits LF_Tree2$Record # map the 4 cells make.split.map(LF_Tree2$LF, Nsplit, save_dir)
This setting leads to a even higher proportion of variance explained (16.4% vs. 16.2%) than the default run. Also, the number of 5*5 grids within each cell is more reasonable. In fact, you can use the loop function to explore multiple 4-cell combinations. Those combinations are compared and sorted according to the proportion of variance explained.
# create a directory to save results from the loop function loop_dir <- paste0(save_dir,"loop/") dir.create(loop_dir) # build the selection matrix my_select_matrix <- data.matrix(expand.grid( split1 = 1:2, split2 = 1:2, split3 = 1:2 )) # check the selction matrix my_select_matrix LF_Tree_Loop <- loop_regression_tree(LF, fcol, lcol, bins, Nsplit, save_dir = loop_dir, select_matrix = my_select_matrix)
Users can also pre-specify a fishery definition and evaluate the performance of the pre-specified splits according to the percentage of variance explained:
# first specify the fishery definition using column Flag3 LF_Tree3 <- LF_Tree$LF LF_Tree3$Flag3 <- ifelse(LF_Tree3$Flag3 > 2, LF_Tree3$Flag3, 1) # check the fishery deifnition is specified as expected make.split.map(LF_Tree3, 3, save_dir) # evlaute the fishery definition evaluate_regression_tree(LF_Tree3,fcol,lcol,"Flag3",bins,save_dir,folder_name="test")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.