knitr::opts_chunk$set(echo = TRUE)
In this walkthrough tutorial, I will guide you through a simple case of the "hespdiv" package application. The purpose of this walkthrough is to provide you with a general understanding of the HespDiv method, the capabilities of the "hespdiv" package, and how you can apply it for spatial analysis of fossil occurrence data.
If you come across any unfamiliar expressions in the tutorial or require clarification how the hespdiv algorithm works, you can refer to the "HespDiv: Concepts and Methodology" vignette that provides a glossary of key terms used in the HespDiv methodology and explains its concepts.
First of all, if you haven't already, install the "hespdiv" and its data package from GitHub repositories:
devtools::install_github("Liudas-Dau/hespdiv") devtools::install_github("Liudas-Dau/hespdiv_data")
In this walkthrough, we will be using the mio_mams dataset, which contains Miocene terrestrial mammal species' fossil occurrence data in the US, as well as the us dataset, which consists of a contiguous US polygon.
The mio_mams dataset was derived from tidying data downloaded from the Paleobiology database. The tidying process involved removing marine mammals, flying mammals, and occurrences with imprecise location information.
You can find the script that produced both datasets in the "data_prep.R" file. This file, along with the raw data and metadata files, is located in the GitHub repository of the "HHData" package, specifically in the raw-data folder. The "HDData" package provides exemplary data for use with this package.
Once you load the "HDData" package using library(HDData), you can use these datasets straight away. Don't forget to load the "hespdiv" package as well to access its functions.
library(HDData) mio_mams library(hespdiv)
Let's take a look at the mio_mams dataset. It consists of 5244 fossil occurrence observations and a considerable number of variables (132), which is typical for Paleodb datasets. However, for our hespdiv() analysis, we only require a few specific variables: the coordinates of the observations and the names or IDs of the taxa. The coordinates should be formatted as a data frame with "x" and "y" columns.
species <- mio_mams$accepted_name # Taxa names sp_coords <- data.frame(x = mio_mams$lng, y = mio_mams$lat) # Coordinates of observations
While it is not mandatory, I recommend providing a polygon representing the study area. This polygon serves as a geographical reference point when interpreting the results. The polygon should also be formatted as a data frame with "x" and "y" columns.
str(us)
Now, let's try the Morisita-Horn subdivision method using the default hespdiv() variables. This will allow us to perform hierarchical bioregionalization of the mio_mams dataset.
# The 'hd' object exists pre-calculated in HDData library, so don't need to run this code: hd <- hespdiv(data = species, xy.dat = sp_coords, study.pol = us)
Please note that running this call may take some time. The computation time is primarily influenced by the 'n.split.pts' argument, which controls the number of alternative subdivisions considered. The default value of 15 generates 120 split-lines (sum(1:15)) for each subdivision attempt. Increasing this value improves the fit of straight split-lines to the data but also increases computation time. In this tutorial, we will use lower values to avoid lengthy computation times.
After the computations are complete, you will see a table in the console containing information about the established split-lines:
hd
This table provides various information about the subdivisions and the performance of the split-lines. For instance, the "performance" column shows the Morisita-Horn values of the best split-lines, while the "mean" and "sd" columns indicate the average and standard deviation of the Morisita-Horn values for all tested split-lines in a single subdivision. The "z.score" column represents the outstandingness of the best split-line, calculated as (performance - mean) / sd. It is important to note that split-line performances are highly correlated in space, and extremely high or low z-scores (>|-+3|) should be uncommon, while medium z-score values (>|-+1.5|) suggest split-lines that are notably outstanding. Each split-line's performance is represented by its outstandingness (z-score), Morisita-Horn value (performance), and order in the hierarchy tree (rank).
The obtained result (hd) is a hespdiv class object, which is a list containing seven elements. These elements store the coordinates of the HespDiv polygons and split-lines, along with various information about all the subdivisions (both attempted and established), and call information.
One of the significant outcomes is the "poly.stats" data frame, which offers comprehensive information regarding the established HespDiv polygons. For more detailed information about the output and the printed table containing information about split-lines, please refer to the documentation of the hespdiv() function by executing ?hespdiv::hespdiv) in your R environment.
To gain a better understanding of the HespDiv results, it is recommended to visualize them. The "hespdiv" package offers several options for visualization. The main method, plot_hespdiv(), displays all established split-lines along with their performance. You can choose to visualize the performance using either line width or color. Additionally, there is an option to display the number of occurrences from the same location.
plot_hespdiv(hd, n.loc = TRUE) # Using color for split-line performance and displaying the number of fossil occurrences from each location plot_hespdiv(hd, type = "w") # Using line width for split-line performance
These plots provide a clear visualization of how the study area is subdivided and the performance of each split-line. The numbers next to each split-line indicate their order in the split-line table, corresponding to the order in which the algorithm established them. The rank of each split-line can also be checked in the table or distinguished directly from the plot. For example, the split-line with ID 1 was established first, dividing the study area and data into two parts. The second rank split-lines are with IDs 2 and 4, further subdividing these parts. The best split-line is of the 3rd rank (ID 6) and is located in the southeast, with a Morisita-Horn similarity value of 0.06. Split-lines established in the interior of the US appear to have lower performance, possibly due to more permeable communities at the center of the continent and the uniqueness of coastal communities.
However, plot_hespdiv() lacks the ability to convey information about the established HespDiv polygons and explicitly reveal the hierarchical structure of the results. For this purpose, the function blok3d() can be used. It opens an interactive 3D RGL device where the obtained polygons are displayed as 3D blocks, with their "height" corresponding to a selected variable from the hd$poly.stats data frame. By default, the height of the polygons corresponds to the mean performance of evaluated split-lines.
blok3d(hd)

In the 3D plot, you can observe that some polygons are tall while others are short. The tall polygons have higher Morisita-Horn similarities across all tested split-lines, indicating greater spatial homogeneity in taxa composition. Conversely, short polygons have spatially more heterogeneous taxa composition and probably more than one equally suitable subdivision. By selecting the standard deviation of Morisita-Horn values as the height of polygons, you can assess the variability of community composition heterogeneity across the territory.
If certain polygons obstruct the view, you can interactively remove them using the polypop() function.
polypop(hd, height = "mean") # Set the "height" to the same value as was set in the blok3D function

In the output of blok3d(), you can easily identify the position of each polygon in the spatial hierarchy tree, as higher-order polygons are always on top of lower-order ones. This also demonstrates that higher-order HespDiv polygons and clusters are strict subsets of lower-order ones. However, not all HespDiv polygons are displayed in the output of blok3d(hd) if not enough split-lines were evaluated to obtain a metric for height. By checking the hd$poly.stats[,c("plot.id","n.splits")] data frame, you can see how many split-lines were evaluated in each polygon.
To display the location of each polygon, you can use the polygon rank as their height. However, irregularities may appear in the polygon geometry obtained with nonlinear split-lines when split-lines cross polygon boundaries by any amount. In such cases, the blok3d() function produce an error. To resolve this, you would need to remove the points that make the polygon geometries irregular before running blok3d().
# The following code demonstrates a potential issue with the `blok3d()` function blok3d(hd, height = "rank") # This should result in an error, occurring before rendering the 14th polygon in the RGL device. Thus, in this case, the problem appears to be with the 14th polygon. Closer inspection reveals which polygon points are problematic. Removing them solves the problem. # Removing problematic points from the 14th polygon hd$polygons.xy$'14' <- hd$polygons.xy$'14'[-2:-4,] # Running `blok3d()` again after removing the problematic points blok3d(hd, height = "rank")

The plots generated by blok3d() may not be suitable for paper publications. Additionally, to make the location of each polygon clear, you may need to use the polypop() function. Furthermore, as previously mentioned, rendering issues can occur, especially with many polygons created using nonlinear split-lines. In such cases, the poly_scheme() function can be used. While it does not visualize polygon statistics like blok3d(), it helps pinpoint the location of each polygon and track its position in the spatial hierarchy tree.
poly_scheme(hd, seed = 4) # Different 'seed' values produce different sets of colors
In the poly_scheme() plot, polygons are indicated by three elements of the same color: their IDs, centroids, and dotted segments connecting centroids to the split-lines that produced the polygons. Polygon IDs can be used to look up polygon information in the hd$poly.stats data frame. The first HespDiv polygon corresponds to the undivided study area, and its centroid is connected to the study area's boundary rather than a split-line. All other HespDiv polygons are displayed regularly. By examining the plot, you can see the relationships between polygons and their subdivisions. For example, the poly_scheme(hd, seed = 4) plot shows that the 2nd and 7th brown polygons are produced by the first split-line which is also brown (note that a centroid of a polygon can fall outside a polygon if it has an irregular shape like the 2nd polygon). Imagining a 3D view of the polygons with an added height dimension can enhance understanding of the spatial hierarchy tree.
The cross_comp() function allows for the cross-comparison of the produced HespDiv clusters, corresponding to different HespDiv polygons. In the output of hespdiv(), comparison values (performance) are provided only for pairs of clusters obtained by the same split-line. On the other hand, the cross_comp() function takes each HespDiv cluster and compares its data to the data of every other HespDiv cluster using the same subdivision method approach, resulting in a matrix of cross-comparison results. This matrix can be further processed and used with other data exploration methods. For example, it can be transformed into a distance matrix and utilized as input for cluster or network analysis, effectively visualizing the interrelations between HespDiv clusters.
sim_m <- cross_comp(hd) # Obtain cross-comparison matrix cl <- hclust(as.dist(1-sim_m)) # Convert similarity to distance and perfomring cluster analysis plot(cl)
# install.packages("igprah") library(igraph) gr <- graph.adjacency( as.matrix(as.dist(sim_m)), # Zero the diagonal of the similarity matrix for the graph mode = "undirected", weighted = TRUE, diag = FALSE ) plot(gr)
gr <- igraph::graph.adjacency( as.matrix(as.dist(sim_m)), # Zero the diagonal of the similarity matrix for the graph mode="undirected", weighted=TRUE, diag=FALSE ) plot(gr)
The cluster analysis clearly reveals three main clusters of fossil taxa assemblages corresponding to the east, west, and central US HespDiv polygons. Additionally, the graph illustrates that fossil taxa assemblages from peripheral polygons are the most distinct, while the first HespDiv cluster, encompassing all study data from the entire US, occupies the center of the graph. These characteristics may reflect a fundamental aspect of how community change is distributed in space: it accelerates from the interior towards the periphery of the continent. This phenomenon might be explained by the spatial autocorrelation of fossil assemblages at a continental/regional scale. Consequently, regions with more neighboring regions exhibit less distinct fossil taxa assemblages compared to regions on the periphery of the continent, which have fewer neighboring regions.
The performance of subdivisions is demonstrated by the performance of split-lines, their z-scores, and ranks. However, there is a reasonable need to estimate the significance of the HespDiv output by other means, since the result might be sensitive to the subdivision criteria used, or similar performances can be obtained from totally randomized data.
If the spatial structure within the data is weak, very similar split-line performances can be obtained after randomly shuffling the data. The nulltest() function does exactly that: it shuffles the data n times and re-measures the performance of the established split-lines each time. It then compares the original performance with those obtained after data shuffling.
# The 'nl' object exists pre-calculated in HDData library, so don't need to run this code: set.seed(1) # The seed is used to obtain the same result of an experiment with random properties. nl <- nulltest(hd, n = 1000)
plot(nl)
nl
The boxplot diagram here reveals the distribution of 1000 re-measured performances of each split-line after data shuffling; the red dot indicates the original performances. Only the third split-line appears to be insignificant. The printed data frame also corroborates this fact, as the quantile (proportion of instances when split-line performed better after data shuffling) of the third split-line is 0.616. Results of other split-lines indicate a very strong spatial structure in the fossil assemblage data.
The hespdiv() function has a considerable number of arguments, and it would be helpful to understand how changing these arguments affects the result. For this purpose, the hsa() and hsa_detailed() functions are available, which allow you to specify alternative values for the arguments and re-run the hespdiv() function. However, the recommended function for performing hespdiv sensitivity analysis is hsa(), as it can more effectively sample a broader range of argument values.
When using hsa(), you can provide any number of alternative values for any number of arguments and specify the desired number of hespdiv() re-runs. Each re-run is performed with a random selection of values for each argument. The values for each argument are drawn from a pool that includes the value used in the original hespdiv() call and the alternative values provided.
Here is an example where changes are made to the algorithm type, study area, and the fit to data. The subdivision method, subdivision criteria, and study data remain unchanged.
# The 'hsa_rez' object exists pre-calculated in HDData library, so there's no need to run this code: set.seed(2) # The seed is used to obtain the same result for an experiment with random properties. hsa_rez <- hsa(obj = hd, n.runs = 100, # 100 alternative hespdiv re-runs n.split.pts = 8:30, # The number of split-points determines the fit to data of straight split-lines same.n.split = FALSE, # The regularity of split-point placement determines whether the scale of analysis changes depending on the order of subdivision c.splits = FALSE, # This argument controls whether curves are generated in an attempt to improve the performance of the best linear split-lines c.X.knots = 3:8, # Controls the number of wiggles in generated curves, determining their fit to the data c.Y.knots = 5:15, # Controls the number of different shapes each curve wiggle can achieve, also determining their fit to the data c.fast.optim = TRUE, # Determines the optimization algorithm for non-linear split-lines use.chull = FALSE) # Determines whether the convex polygon of occurrences is used as the study area polygon. If not, the study area polygon becomes the provided US polygon.
The obtained 'hsa_rez' object contains the results of alternative subdivisions. These results can be investigated to check the stability of the HespDiv polygons and the HespDiv clusters obtained with the original hespdiv() call.
The hsa_plot() function can be used to visualize all alternative subdivisions in the same plot along with the basal (original) subdivision and assess the stability of HespDiv polygons.
plot_hsa(hsa_rez, type = 1) # Using the default plot option - everything in one window
In the resulting plot, a high degree of variation in the boundaries of the produced HespDiv polygons is evident. However, there are also some convergence zones where the boundaries come closer together.
The stability of HespDiv polygons is expected to be low due to changes in the study area polygon and the types of allowed split-lines. Setting 'c.splits = FALSE' in the hsa() call results in subdivisions with only straight split-lines. Similarly, setting 'use.chull = FALSE' leads to a different initial polygon compared to the one obtained using the convex hull. Changes in the study area also indirectly affect the absolute subdivision criterion for area size, as the same proportion value corresponds to different area size.
Now let’s see how the picture changes if we consider split-line ranks:
plot_hsa(hsa_rez, type = 2, # Displays ranks with different colors alpha = 0.3, # Adjusts transparency basal.col = 1, # Color of the basal subdivision split.col.seed = 1) # Seed to generate random colors for each split-line rank

The plot obtained using 'type = 2' uses colors and width to distinguish split-lines of different ranks. It can be observed that split-lines of the same rank tend to be placed in similar zones that are relatively wide. These zones could represent the fuzzy boundaries that exist between different fossil assemblages.
The picture created with 'type = 2' provides a lot of information and may appear crowded with split-lines. However, you can use the 'type = 4' setting to plot split-lines of each rank in a separate device. The 'type = 3' setting would be intermediate between 2 and 4 since it cumulatively adds higher-ranked split-lines in each device.
plot_hsa(hsa_rez, type = 4, # Displays different ranks in different plot devices alpha = 0.3, # Adjusts transparency basal.col = 1, # Color of the basal subdivision split.col.seed = 1 # Seed to generate random colors for each split-line # rank )
The above code opens seven graphical devices, and in each one, split-lines of different ranks are plotted.







The obtained figures provide an even clearer picture. For example, the split-lines of the 1st rank concentrate in the west of the US and are mostly oriented longitudinally. However, the 1st rank split-line in the basal subdivision ('hd' object) appears to be slightly further west than the major convergence zone of split-lines. The split-lines of the 2nd rank exhibit higher variation. There are two clusters of split-lines, one located more to the east and the other to the west. The western cluster has two alternative dominant orientations, while the eastern split-line cluster has a dominant orientation and a point of convergence around which the split-lines seem to be slightly spun or rotated. The 2nd rank split-lines in the basal subdivision correspond quite well to both of these split-line clusters. The 3rd rank split-lines show less pronounced, but still visible clusters. The clusters of split-lines become even less pronounced in ranks four to five. There are very few split-lines of rank 6, and they are located in the eastern part of the US. Finally, only one split-line of rank 7 was obtained, also in the east.
If desired, you can change the basal subdivision with an alternative subdivision using the change_base() function.
new_hsa <- change_base(hsa_rez, id = 95) # Choosing the alternative subdivision that is the 95th in hsa_rez$Alternatives to become the new basal subdivision. The old basal subdivision will become the 95th alternative subdivision in new_hsa
plot_hespdiv(new_hsa$Basis)
new_hsa$Basis$call.info$Call_ARGS$study.pol <- us plot_hespdiv(new_hsa$Basis)
Next, you can use the plot_hsa() function to visualize the split-lines of different ranks in separate graphical devices after changing the basal subdivision.
plot_hsa(new_hsa, type = 4, # Displays different ranks in different plot devices alpha = 0.3, # Adjusts transparency basal.col = 1, # Color of the basal subdivision split.col.seed = 1 # Seed to generate random colors for each split-line rank )



by using the plot_hsa() function, you can identify where the split-lines converge and choose an alternative subdivision that better aligns with these convergence zones using the change_base() function.
When the distribution of fossil occurrence data is irregular, clustered, and contains wide spatial gaps, the boundaries of HespDiv polygons can vary significantly. However, different subdivisions can still result in HespDiv clusters that are quite similar, as differently shaped and positioned HespDiv polygons may filter similar subsets of study data.
The hsa_quant() function assesses the similarity between basal HespDiv clusters and alternative HespDiv clusters. It does this by calculating the Jaccard similarity between the basal and alternative HespDiv clusters. For each basal HespDiv cluster, the function identifies an "analog HespDiv cluster" from each alternative subdivision. This analog cluster has the highest Jaccard similarity with a given basal HespDiv cluster. The function provides the IDs of these analog clusters, along with their Jaccard similarity values and quantiles. Using plot_hsa_q(), you can visualize the distribution of Jaccard similarity values for the analog clusters associated with each basal HespDiv cluster. These distributions help assess the stability of each basal HespDiv cluster. For instance, if there is a single peak at high values (> ~0.8), it indicates that very similar HespDiv clusters consistently appear in alternative subdivisions, indicating a stable basal cluster. A unimodal distribution with a peak at medium values (0.4-0.6) and a tail towards higher values could also suggest a more persistent spatial structure. On the other hand, a single peak at low values (<0.4) indicates low cluster stability (e.g., the bioregion does not exist). Finally, uniform, bimodal, or other complex distributions may indicate that the stability and existence of the corresponding basal cluster depend on the parameters used in alternative hespdiv calls.
According to the provided guide, the plots below indicate the relative stability of the 2nd, 3rd, 6th, 7th, 8th, 9th, 10th, 12th, 13th, 14th, and 16th clusters. The 5th and 15th clusters show less stability. The 4th cluster is truly unstable or "non-existent," while there is high uncertainty surrounding the existence of the 11th and 17th clusters.
# The 'clst' object exists pre-calculated in HDData library, so don't need to run this code: clst <- hsa_quant(hsa_rez) plot_hsa_q(clst, hist = TRUE)
plot_hsa_q(clst, hist = TRUE)
I invite you to delve deeper into the world of HespDiv using the functions and techniques presented in this vignette. Experiment with different arguments, datasets, and scenarios to uncover new insights and applications. Remember, this vignette only scratches the surface, and there is much more to explore.
If you encounter any unfamiliar terms or concepts along the way, I encourage you to refer to the glossary vignette. It provides a comprehensive list of definitions and explanations for the key terminology used in this package.
I look forward to seeing how you apply these tools in your own analyses and investigations. Happy exploring!
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.