knitr::opts_chunk$set(cache=TRUE, message=FALSE, warning=FALSE, fig.path='figs/', cache.path = '_cache/', fig.process = function(x) { x2 = sub('-\\d+([.][a-z]+)$', '\\1', x) if (file.rename(x, x2)) x2 else x }) options(kableExtra.latex.load_packages = FALSE)
library(tidyverse) library(FleetSegmentation)
The appropriate segmentation of fishing fleets is controversially discussed in fisheries research and management and a variety of approaches has been introduced. In the current fisheries data management of the STECF and ICES, two major approaches continue to exist side by side. In the economic data collection and the annual economic report, fishing vessels are classified by gear class and vessel length. This is problematic, not only because of the selection of length classes for vessels but also because no information on the fishing strategy is included. The second approach used mostly in biological research uses so-called fisheries metiers, which include information on the main gear, vessel length, mesh width, and target assemblage. This is again problematic because the target assemblage contains no information on the stocks used and how mixed or targeted a fishery is. The present approach, developed in a pilot study funded by the DCF, introduces a standardized multivariate approach for characterizing fisheries fleet segments by hierarchical agglomerative cluster analysis (HAC) of their catch composition.
The FS-package includes all the functions needed to perform the multivariate approach and analyze the results. It loads all packages necessary for its use automatically, but unfortunately, those packages can´t be installed automatically when the FS-package is installed from .zip. A code block for the installation of all necessary packages can be found in the appendix (see Appendix - Package installation). It is planned is to make this package accessible on GitHub and automate the process. Make sure to update your R and R Studio to the latest version and install/update all other necessary packages. Once you are ready, load the FS-package from the library.
library(FleetSegmentation)
The package uses functions from 33 other packages, so loading it might take a moment. Some warnings might be shown because of package conflicts about which you do not have to worry.
Now it is time to prepare and load your fleet data. Please keep in mind that clustering is a data-driven procedure, so the structure and size of the data frame determines the clustering result and especially the estimates of how many clusters are present in the data. Therefore, it is strongly advised to pre-separate your data by gear class into, for example, demersal trawlers, pelagic trawlers, and passive gearers. If a notable part of the fleet is using mixed gears, classify them as a mixed gear
fleet. Prepare the required data frames for all gear classes and run the segmentation procedure for all of them. This will most likely improve the quality of the clustering.
The functions of the FS-package are essentially based on two input data frames: One data frame containing the catch of all vessels of a certain gear class and one data frame containing the length of those vessels. First, we will load the example catch data included in the FS-package to illustrate the required data frame structure. We will store it in a data frame called catchdata
.
data <- example_catchdata glimpse(data)
As you see, this data frame contains four columns.
ship_[running nr.]
The dataframe is aggregated to the species- and FAO-area level, hence one row of the dataframe resembles the total catch of one vessel for one species in one FAO-area.
Next, we load the vessel length data.
shiplength <- example_lengthdata glimpse(shiplength)
This data frame contains two columns.
ship_[running nr.]
cm
or m
, dependent functions will auto-transform values.The clustering procedure's quality and identification of specific fisheries are considerably better when spatially defined stocks
instead of species are used as the clustering variable. Therefore, the package contains a build-in function to assign ICES and GFCM stocks based on the target species and the catch area. If the fisheries represented by your data are predominantly operating in the Northeast Atlantic or the Mediterranean, you can use this function to assign stocks to your catch data relatively quickly, as you can see in the example.
data_stocks <- assign_stocks(data = data) glimpse(data_stocks)
As you can see, the newly created data frame has only three columns: The vessel identifier, the stock identifier, and the total catch per stock and vessel. Yet, not all species fished in the Northeast Atlantic or the Mediterranean are managed under a specific stock. In such cases, the function will automatically assign stocks to species caught in a reasonable volume (more than 10t in total, adjustable by the argument threshold.auto.generate
) based on the species and the area identifiers. In the case of FAO areas, only the major area (e.g., 27.4 for the North Sea) will be considered. You can switch off this function setting the argument auto.generate
to FALSE
, though it is highly recommended to leave it activated. Suppose the fisheries you are analyzing with the package are not operating in the areas covered by the function but are managed by a stock system. In that case, you have to prepare your data structured analogously to the data frame created by the assign_stocks
function, i.e., with the three columns vessel identifier, stock identifier, and total catch (in kg) per stock and vessel. If you don´t have access to spatial information or if there is no stock-management system applicable to your data set, you can still use the fleet segmentation approach with the catch per species as an input variable, even though the result of the clustering will most likely be harder to interpret. Just make sure to use the necessary data frame structure: Three columns, first the vessel identifier, then the species identifier, and then the total catch per species and vessel.
In our case, the example data resembles a fishery operating in the area covered by ICES, and therefore, ICES stocks were assigned to the example data by the applied function. Next up, we transform the data to a format suitable for the clustering procedure.
Prior to the clustering, we have to transform the data into a suitable format for clustering and store the transformed data in a new data frame. Therefore, we use the catchdata_transformation()
function and call the transformed data frame catchdata
.
catchdata <- catchdata_transformation(data = data_stocks) glimpse(catchdata)
The resulting data frame is a matrix with each stock as a column and each vessel as a row. The catch values are transformed to their relative proportion of a vessel's total catch, scaling between 0 and 1.
This is one of the most important yet most challenging steps of a clustering procedure. There is not one definite index or method to examine the correct
number of clusters in a dataset. Instead, various indices and tests are used, their results compared and a heuristic decision on the number of clusters made. The multiple clustering procedures using a different number of clusters might have to be repeated to get a final clustering result.
First, five essential indices estimating the best number of clusters are computed. The result can be printed in a basic or an HTML version or stored in a data frame. The maximum number of clusters to be expected is automatically calculated from the data, up to a maximum of 15 clusters. If you expect your dataset to contain more than 15 clusters, use the max_clusternumber
argument in the function.
# print base format numberclust_table(catchdata = catchdata) # store base format as data frame index_table <- numberclust_table(catchdata = catchdata)
# print html-format numberclust_table(catchdata = catchdata,style = "html")
As you see, most indices suggest a number of 7 to 8 clusters to be present in the data, except for the Calinski-Harabasz-index, which suggests 15 clusters to be present. These indices use different methods for calculating the ideal number of clusters and are likely to produce quite heterogenic results. The average silhouette score and the mantel test have proven to be the most robust for the type of data used for the clustering. They can be compared graphically using the numberclust_plot
function of the FS-package, along with two other important diagnostic plots: The scree plot and a dendrogram of the clustering.
numberclust_plot(catchdata = catchdata)
The scree plot on the upper left depicts the total within sums of squares of the clustering data. It is checked for an "elbow" in the curve, i.e., the clustering step with the steepest decline. This position is usually less pronounced in the clustering approach used than in, e.g., k-means clustering of environmental data. Yet, the transition zone
of the curve, in which the slope decreases from steep to nearly zero, gives a good insight into how many clusters are to expect in the data set. In the case of the example data, seven to eight clusters seem to be ideal. This is supported by the average silhouette width, which is high for six to nine clusters and at a maximum at seven clusters and shows a steep decline for ten clusters and forth; and by the mantel test, which reveals a high Pearson correlation between the clustering and the original data for six to seven clusters, with a maximum at seven The visual examination of the dendrogram is another way to estimate the appropriate number of clusters in the data. The FS-package contains a function to plot only the dendrogram and manipulate the cutting height in different ways, one for cutting at a specific height and one for cutting at a range of heights. We will plot these options aligned with the ggarrange()-function of the ggpubr-package. To tidy up the grid, the y-axis label of the second and third dendrogram will be removed. These dendrograms are produced by the ggplot-grammar and can be manipulated with ggplot-functions (and all functions based on the ggplot-grammar).
library(ggpubr) # basic dendrogram dend1 <- numberclust_dendrogram(catchdata = catchdata) # dendrogram with specified cutting height at 0.75 dend2 <- numberclust_dendrogram(catchdata = catchdata, dend_method = "cut", dend_cut = .75) + theme(axis.title.y = element_blank()) # dendrogram with a range of cutting heights from 0.5 to 0.9 dend3 <- numberclust_dendrogram(catchdata = catchdata, dend_method = "range", range_min = .5, range_max = .8)+ theme(axis.title.y = element_blank()) # grid the dendrograms ggarrange(dend1,dend2,dend3, nrow = 1,align = "h", labels = c("A","B","C"), label.x = .9)
As you see, the dendrograms support the conclusion that five to seven distinct clusters are present in the data. Note that 0.5 to 0.9 is a very wide range of distances and in a more heterogenic dataset can return a very wide range of clusters. In this case, it can be helpful to narrow down the range of cutting heights using the range_min
and range_max
arguments. In general, the more heterogenic and overlapping a dataset is, the harder it is to interpret a dendrogram. In this case, rely on the clustering indices and other diagnostic plots to estimate the appropriate number of clusters.
Another helpful diagnostic plot for estimating the best number of clusters to use is a clustering tree. This tree plot visualizes the proceeding of the clustering splits and the resulting sizes of clusters. It can be used to determine the steps in which significant splits of clusters occur and differentiate between those steps and over-clustering
, where only small sub-clusters of single ships are produced. The function is a modification of the clustree()
function from the eponymous package. As the indices and dendrograms already showed that the data set is unlikely to contain more than 10 clusters, we will adjust this argument when plotting the clustering tree.
numberclust_clustree(catchdata = catchdata,max_clusternumber = 10)
The clustering tree makes it easy to retrace the changes happening when new clusters are introduced. When a sixth cluster is introduced, cluster number 1 is split into two new clusters, one containing 29 and the other 8 vessels. These might represent valid fleet segments. When an additional cluster is created, cluster number 4 is split into two clusters, one containing only one ship. This is rather undesirable for one-ship clusters are rarely a valid fleet segment, yet it is possible that one vessel is engaged in. Considering all the clustering diagnostics we examined, we have to decide to use six to seven clusters. A general strategy on how to proceed at a point like this is to stick with the upper end of the identified range. It is quite simple to re-join clusters in the final step of the segmentation but nearly impossible to separate them. Therefore, we will move forward with our example case and assume that 7 clusters are present in the dataset.
After deciding on the number of clusters in the dataset, the clustering can easily be re-computed and assigned in a single line of code. It is advised to store the clustering result in a new data frame, which we will simply call clustering
. The function to create this clustering data frame is called segmentation_clustering
, it requires the transformed catch data and the assumed number of clusters as input.
clustering <- segmentation_clustering(catchdata = catchdata,n_cluster = 7) glimpse(clustering)
The clustering result is a data frame with two columns.
This data frame will be the basis of the fleet segmentation. It is likely, that some of the clusters are not valid fleet segments but rather should be joined in a single segment. To characterize the clusters and decide whether or not to unite them to fleet segments, the FS-package contains a set of diagnostic methods.
The first step to characterize the clusters is to evaluate their catch. The average percentage each stock contributes to a cluster's catch can be tabelized and plotted using the following functions. Note that these functions require the basic (untransformed) data as an input argument.
# This will print a basic table clustering_stockshares_table(data = data_stocks, clustering = clustering) # This will store the basic table as a data frame stockshares_table <- clustering_stockshares_table(data = data_stocks, clustering = clustering) # This will produce an html-table displayed in the viewer # (disabled in markdown) # clustering_stockshares_table(data = data,clustering = clustering, # style = "html") # This produces a ggplot() of the average percentages and a # ranking of shares colour-indicated clustering_stockshares_plot(data = data_stocks,clustering = clustering)
The plot illustrates the differences in catch composition among the clusters. At this point, expert knowledge on stock distribution is very helpful. It can be used to identify clusters representing a target fishery and clusters with similar catch composition, which might be joined to fleet segments. In the example data, a few things become obvious with some background knowledge of the fished stocks. First, we can distinguish between clusters using coastal stocks (1,5 and 6) and clusters using subarctic and high seas stocks (2,3,4 and 7). This might be harder in your dataset (due to less distinct stock nomenclature or more widely distributed) but still worth the effort. In our example, the clusters fishing coastal stocks can be further divided into a very targeted herring fishery (cluster 6), a targeted saithe fishery (cluster 5), and a rather mixed coastal fishery (cluster 1). The clusters fishing high sea stocks comprise a cluster fishing exclusively shrimps (2), a cluster fishing mainly redfish (3), a cluster fishing redfish and grenadier in approximately equal quantities (4), and a cluster fishing mostly grenadier (7). In the next step, we will check how the catch composition of those clusters is interrelated with 2-dimensional and 3-dimensional ordination methods.
The FS-package includes functions for 2- and 3-dimensional ordination methods. They are very valuable tools for visualizing the compositional distances of clusters, identifying overlaps, and therefore clusters that might be suitable for uniting them to fleet segments. We will start with the 2-dimensional MDS. We can use a metric MDS because of the metric nature of the input data. It is to prefer over an nMDS because it reflects true distances, not ranked distances.
clustering_MDS(catchdata = catchdata,clustering = clustering)
This MDS reveals some of the relations in the catch composition and shows quite good goodness of fit. We will analyze it along the dimensions, starting with cluster 6. The cluster is distinctly different from the other clusters, which makes sense considering its catch composition. Vessels in cluster 6 caught near to exclusively herring, while no other clusters vessels caught considerable amounts of herring. Moving in the first dimension, cluster 1 opposes cluster 6. Moving down along the 2nd dimension, clusters 7, 4, and 3 are grouped close to each other by the ordination. This is in accordance with their catch composition, for they all fish on the same stocks (redfish, grenadier, and halibut) but in different quantities. Further down in the most negative area of the second dimension, cluster 2 and cluster 5 are ordinated very closely together. This is quite confusing at first, for cluster 2 is formed by vessels performing a targeted high seas shrimp fishery while cluster 5 vessels are mainly fishing coastal saithe and cod. It illustrates the limits of ordinating a catch composition matrix in a two-dimensional space. The clustering_MDS
-function therefore also includes the option of plotting a 3-dimensional ordination.
clustering_MDS(catchdata = catchdata,clustering = clustering, dim = 3)
This will produce an interactive, 3-Dimensional in the RStudio-Viewer looking like this:
knitr::include_graphics(path = "https://raw.githubusercontent.com/ESulanke/FleetSegmentation/main/vignettes/figs/FS_package_exampledata_3D_MDS.jpeg")
As you see in this snapshot, adding a third dimension to the visualization of the MDS shows a large distance between the second and fifth cluster along the third axis. Differences like this can only be revealed in a three-dimensional ordination, so I encourage you to always check it in your fleet segmentation progress, even if it might take a moment to fully interpret the ordination. In both ordinations, you might have noted that two clusters (2 and 6) are only represented by a single point. This is due to the catch composition of the clusters: They heavily fish on one species, and there is close to no inner-cluster variation. Because of cases like this (which are also common in real data), it is important to get an impression on the properties of the clusters and the vessels in them besides their catch composition two finally decide on which clusters can be used as fleet segments and which clusters should be joined. The FS-package contains functions to do this, which will be covered in the next section.
\pagebreak
The main diagnostic features for the characterization of fleet segments based on the available data besides the clusters catch composition are the number of vessels in the clusters, the length distribution of vessels in the cluster, the total catch of the individual vessels, and the total catch of each cluster. The FS-package contains a function to compute a plotgrid visualizing all these features.
clustering_plotgrid(data = data_stocks, clustering = clustering, shiplength = shiplength)
This plot grid contains a lot of information on the clusters and enables us to group them according to their structure. The numerically largest clusters are cluster 1 and cluster 6, both consisting of more than 20 vessels. They contain mostly small vessels of 15 to 20m length. Individual vessels have a medium annual catch but add up to the largest total cluster catches. The high seas clusters 2,3,4 and 7 contain rather few ships; 4 and 7 are actually single-ship clusters. The vessels are around a hundred meters long and have medium to high individual catches. The total cluster catch is consequently low since they are the only members of their clusters. This is just an extract of the information one can draw from this visualization. Even though the data used is artificial, this is a very realistic pattern to encounter in fisheries data. Especially high seas vessels are often operated by larger companies, and fishing quota is distributed among them not equally but region- or stock specifically. They, therefore, tend to form very small subclusters even though they are engaged in similar fishing operations. Cluster 5, which is the coastal fishery focussed on saithe, contains remarkably longer vessels than the other clusters of coastal fishery. They are likely to have different coast structures and other areas of operation than the other coastal clusters. Cluster 1, the highly mixes coastal cluster, contains relatively small vessels with an annual individual catch comparable to vessels from cluster 6. It is upon expert knowledge to decide whether or not these clusters should be joined as a fleet segment. For our example dataset, we assume that due to their differences in fishing strategies and techniques, area of operation, and cost structure, they will be treated as separate fleet segments.
All the plots in this grid can be produced as single plots using
cluster_size_plot(clustering = clustering) # For A) shiplength_plot(clustering = clustering, shiplength = shiplength) # For B) singleship_catch_plot(data = data_stocks, clustering = clustering) # For C) clustercatch_plot(data = data_stocks, clustering = clustering) # For D)
Note that there are a lot more kinds of fisheries data sources with the potential to characterize clusters and ultimately assign fleet segments, e.g., spatial data (haul positions and VMS-Data), temporal data (total time spent at sea and fishing trip duration, seasonality of fishing activity) as well as data on the cost structure (total costs as well as cost ratios). I encourage you to make use of every data source available and inspect your clusters from all possible angles and invite you to discuss this with me. It will make it easier to assign the fleet segments to your clusters. The actual process is fairly simple.
Taking the stock shares on the clusters catch, the result of the ordination, and the vessel and cluster characteristics into account, we can make a profound decision on the number of actual fleet segments in our example data.
The clustering result is stored as a factor variable in the data frame you created with the segmentation_clustering
-function and working with factors can be kind of tricky. There is no built-in function in the FS-package, but I will provide you with code to easily call the clusters by number and assign fleet segments to them. I recommend you create a new data frame in which you assign the fleet segments. We will call it final_segmentation
in our example and introduce a column called FS
for the fleet segments.
final_segmentation <- clustering # Create FS-column final_segmentation$FS <- NA # Call clusters by factor level # and assign fleet segments # 1) Coastal mixed demersal fishery final_segmentation$FS[final_segmentation$cluster %in% levels(final_segmentation$cluster)[c(1)]] <- "Coastal mixed demersal fishery" # 2) High seas shrimp fishery final_segmentation$FS[final_segmentation$cluster %in% levels(final_segmentation$cluster)[c(2)]] <- "High seas shrimp fishery" # 3) High seas mixed demersal fishery final_segmentation$FS[final_segmentation$cluster %in% levels(final_segmentation$cluster)[c(3,4,7)]] <- "High seas mixed demersal fishery" # 4) Coastal saithe fishery final_segmentation$FS[final_segmentation$cluster %in% levels(final_segmentation$cluster)[c(5)]] <- "Coastal saithe fishery" # 5) Coastal herring fishery final_segmentation$FS[final_segmentation$cluster %in% levels(final_segmentation$cluster)[c(6)]] <- "Coastal herring fishery" # check if we forgot to assign any cluster anyNA(final_segmentation$FS) # if FALSE, make fleet segment a factor variable # otherwise check assigning code final_segmentation$FS <- as.factor(final_segmentation$FS) # list the segments levels(final_segmentation$FS) # glimpse the data frame glimpse(final_segmentation)
And this is it! We have successfully assigned a fleet segment to each ship in our example data.
A final remark on assigning the fleet segments: The clustering process of the FS-package requires stock-based input data. Fishing vessels are typically designed to perform specific operations and, therefore, to harvest specific stocks. Consequently, the clustering result should be sufficient to assign fleet segments to your vessels. In case you want to take into account further variables in your segmentation, e.g., the ship length, this is, of course, possible since the clustering result is just a qualitative variable and can easily be combined with other variables. To illustrate such a case, we will assume that there is a large-scale (ships above 24m length) and a small-scale (ships below 24m length) coastal mixed demersal fishery.
# join the ship length information to the # segmentation data final_segmentation <- final_segmentation %>% left_join(example_lengthdata) # transform FS back to a character variable # this is important for manipulation, otherwise # an error will occur final_segmentation$FS <- as.character(final_segmentation$FS) # assign the small scale mixed demersal coastal fisheries final_segmentation$FS[final_segmentation$cluster %in% levels(final_segmentation$cluster)[c(1,8)] & final_segmentation$shiplength < 24] <- "Small-scale coastal mixed demersal fishery" # assign the large scale mixed demersal coastal fisheries final_segmentation$FS[final_segmentation$cluster %in% levels(final_segmentation$cluster)[c(1,8)] & final_segmentation$shiplength >= 24] <- "Large-scale coastal mixed demersal fishery" # check if we forgot to assign any cluster anyNA(final_segmentation$FS) # if FALSE, make fleet segment a factor variable # otherwise check assigning code final_segmentation$FS <- as.factor(final_segmentation$FS) # list the segments levels(final_segmentation$FS)
As you see, combining the clustering information with further information, such as vessel characteristics, is simple. I hope you found this manual helpful and enjoyable to read. Good luck with your fleet segmentation. If you have any questions, remarks, or critiques, I am looking forward to hearing from you.
All the best, Erik Sulanke
Final remark: The example dataset included in this package is purely fictional and for illustrative purposes only. Any resemblance of real fisheries is of coincidence.
This is the package installation code for all the packages you need to make the FS-package work. Simply start a new R-session outside any project and run this code to your console prior to installing the FS-package from the .zip-file.
install.packages(c("knitr", "plotly", "kableExtra", "RColorBrewer", "dplyr", "tidyr", "ggplot2", "scales", "ggpubr", "stringr", "lme4", "rpart", "data.table", "vegan", "factoextra", "FactoMineR", "dendextend", "cluster", "NbClust", "fpc", "ade4", "BBmisc", "clustree", "networkD3", "htmlwidgets", "labdsv", "ggfortify", "corrplot", "ggforce", "concaveman", "forcats", "ggrepel", "ggdendro"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.