hespdiv: Hierarchically subdivide spatial data

View source: R/hespdiv.R

hespdivR Documentation

Hierarchically subdivide spatial data

Description

This function is an implementation of spatial data analysis method HespDiv. It performs hierarchical spatial data subdivision by recursively dividing the data using random split-lines, evaluating their comparison values (how well they separate data), and using the best to perform subdivisions.

Usage

hespdiv(
  data,
  xy.dat = NULL,
  n.split.pts = 15,
  same.n.split = TRUE,
  method = "horn.morisita",
  generalize.f = NULL,
  compare.f = NULL,
  maximize = NULL,
  N.crit = 1,
  N.rel.crit = 0.2,
  N.loc.crit = 1,
  N.loc.rel.crit = 0.2,
  S.crit = 0.05,
  S.rel.crit = 0.2,
  Q.crit = NULL,
  c.splits = TRUE,
  c.Q.crit = NULL,
  c.crit.improv = 0,
  c.X.knots = 5,
  c.Y.knots = 10,
  c.max.iter.no = +Inf,
  c.fast.optim = FALSE,
  c.corr.term = 0.05,
  study.pol = NULL,
  use.chull = TRUE,
  tracing = NULL,
  pnts.col = 1,
  display = FALSE,
  pacific.region = FALSE,
  .do_recurse = TRUE
)

Arguments

data

An R object that contains the data to be analyzed. The required data structure depends on the selected method (e.g., character vector of taxa names).

xy.dat

A data.frame containing the coordinates for observations in the data. This parameter can be ignored if data is a data frame or matrix with columns x or y.

n.split.pts

integer. The number of split-points - 1. These points are used in creating straight split-lines (see details). The total number of straight split-lines generated can be obtained by sum(1:n.split.pts). Increasing the value of n.split.pts leads to an increase in both the computation time and the fit to the data.

same.n.split

logical. Should the number of split-points (n.split.pts) remain the same within each lower-ranked polygon? Choosing TRUE would result in a higher density of straight split-lines within lower-ranked polygons, whereas FALSE would preserve the same split-line density. Thus, essentially you are choosing between cross-scale and fixed-scale analysis.

method

Character. Name or abbreviation of a preset method: "sorensen", "pielou", "morisita", or "horn.morisita". Internally determines values for compare.f, generalize.f, and maximize.

generalize.f

Function. Optional function used in custom methods to prepare input for the compare.f function; see Details. It must have a single argument, for example x, to which a spatially filtered subset of data can be assigned; see Note for exceptions.

compare.f

function. Only required in custom methods. Employed to quantify the comparison value of a split-line. For this purpose, the compare.f function requires two arguments where the outputs of the generalize.f function can be assigned (see notes for exceptions).

maximize

logical. Only required in custom methods. Determines whether the split-line comparison value should be maximized or minimized during the optimization process.

N.crit

number. Minimum required number of observations that should be present in each polygon obtained by a split-line in order for it to meet this criterion.

N.rel.crit

number from 0 to 0.5. Each polygon obtained with a split-line must have at least such proportion of observations to pass this criterion. Equation of the proportion: (Number of observations in 1st/2nd resulting polygon) / (Number of observations in the polygon being subdivided)

N.loc.crit

number. Minimum required number of different locations that should be present in each polygon obtained by a split-line in order for it to meet this criterion.

N.loc.rel.crit

number from 0 to 0.5. Each polygon obtained with a split-line must have at least such proportion of different locations to pass this criterion. Equation of the proportion: (Number of different locations in 1st/2nd resulting polygon) / (Number of different locations in the polygon being subdivided)

S.crit

number from 0 to 1. Each polygon obtained with a split-line must have at least such area proportion to pass this criterion. Equation of the proportion: (Area of 1st/2nd resulting polygon) / (Area of the first polygon). The first polygon is the provided study area or convex hull of observation locations.

S.rel.crit

number from 0 to 0.5. Each polygon obtained with a split-line must have at least such area proportion to pass this criterion. Equation of the proportion: (Area of 1st/2nd resulting polygon) / (Area of the polygon being subdivided).

Q.crit

number. The threshold for a split-line comparison value to be considered acceptable for a subdivision. When maximize = TRUE, higher values of the comparison value indicate a better subdivision. Conversely, when maximize = FALSE, lower values of the comparison value indicate a better subdivision.

c.splits

logical. When set to TRUE, the algorithm will explore nonlinear split-lines in addition to straight split-lines in order to find the optimal subdivision.

c.Q.crit

number. The threshold for a split-line comparison value to be considered acceptable for generating nonlinear split-lines. It is recommended to use the default value, which does not impose a performance requirement, unless you have a clear understanding of the potential improvements that nonlinear split-lines can achieve over straight split-lines. If maximize = TRUE, a suggested value for c.Q.crit is Q.crit minus the maximum potential improvement.

c.crit.improv

number. The threshold for the improvement in a split-line comparison value required for a nonlinear split-line to be selected instead of a straight split-line for subdivision. The default value of 0 means that even if a nonlinear split-line performs equally to a straight split-line, the straight split-line will still be chosen.

c.X.knots

integer. Specifies the number of columns in a network of spline knots used to generate nonlinear split-lines. These knots are evenly distributed along the straight split-line. Adjusting the value of c.X.knots controls the degree of wiggliness (number of turns) in the resulting nonlinear split-lines.

c.Y.knots

integer. specifies the number of rows in a network of spline knots used to generate nonlinear split-lines. These knots are distributed regularly along lines orthogonal to the straight split-line. Adjusting the value of c.Y.knots controls the number of amplitudes tested in each wiggle of a spline, influencing the shape of nonlinear split-lines.

c.max.iter.no

integer. The maximum number of iterations allowed through the network of spline knots when searching for the optimal shape of a nonlinear split-line. Setting a higher value, such as +Inf, increases the chances of converging to the best possible curve. It is recommended to use higher values when c.fast.optim = FALSE, as in this case, only a single spline knot can be selected per iteration.

c.fast.optim

logical. Determines when spline knots are selected. If TRUE, the algorithm selects the first knot that generates a curve with a better comparison value, ensuring faster convergence. If set to FALSE, the algorithm completes a full iteration through the network of spline knots, which may result in slower convergence but potentially better overall performance.

c.corr.term

number from 0.01 to 0.2. A correction term for nonlinear split-lines that intersect the boundary of the polygon. Smaller values (default is 0.05) are recommended, as they determine the extent to which the outlying interval of the generated spline, which crosses the polygon boundary, should be shifted away from the boundary and inside the polygon in a direction orthogonal to the straight split-line. This shift is specified as a proportion of the polygon height where the spline intersects the polygon boundary.

study.pol

A data frame with two columns, x and y. Rows should correspond to vertices of a polygon that defines the study area encompassing all observation locations in xy.dat. If not provided (default is NULL), the convex hull of xy.dat is used as the study area polygon.

use.chull

logical. If study.pol is provided, you can use use.chull = TRUE (which is default) to use it only for visualization.

tracing

Optional character vector of length two controlling diagnostic tracing output. The first element specifies the tracing level and must be one of "best", "main", or "all". The second element specifies the traced object and must be one of "curves", "straight", or "both". The default, NULL, produces no tracing output.

pnts.col

character or numeric. Specifies the color of observations in a plot. The argument is used when tracing is not NULL. If pnts.col = NULL, observations will not be displayed.

display

logical. Display a simple plot of results at the end of computations?

pacific.region

logical (default is FALSE). When set to TRUE, indicates that the study area is crossed by the 180th meridian, such as being within the Pacific Ocean. In this case, the coordinates of xy.dat and study.pol are transformed to eliminate the artificial abrupt change in x-coordinate values at the 180th meridian.

.do_recurse

Logical. Controls recursion (for internal use only). Default is TRUE. Normal users should not modify this parameter.

Details

The Algorithm

1) Split-point Placement: The function places a predetermined number of split-points (n.split.pts + 1) along the perimeter of the study area (study.pol) or the convex hull of observation locations (xy.dat) if a study area polygon is not provided. These split-points are evenly spaced, resulting in a distance between points equal to 1/(n.split.pts + 1) fraction (1/16 by default) of a polygon circumference.

2) Straight Split-lines: Straight split-lines are generated by connecting the split-points. The total number of straight split-lines generated is equal to the value of (n.split.pts + 1) * n.split.pts / 2 or to choose(n.split.pts + 1, 2). This holds true only in the first iteration when same.n.split is set to FALSE or in all iterations when same.n.split is set to TRUE. Note that the total number of split-lines generated will not be equal to the number of split-lines evaluated since split-lines that cross polygon boundary or do not pass sample size, area or location number subdivision criteria are not evaluated.

3) Subdivisions: Each split-line spatially divides the data and study area polygon into two subsets.

4) Criteria: Both subsets are then checked to see if they meet sample size, area and location number criteria.

5) Obtaining comparison values: Subsets that meet the criteria are compared using the generalize.f and compare.f functions to obtain a comparison value. First, each subset is passed to generalize.f to obtain a generalization value, such as the Pielou evenness index for the "pielou" method; see the Note section for clarification. These values are then used by compare.f to compare the subsets. In this way, each split-line that passed all subdivision criteria is assigned a comparison value.

6) Best Straight Split-line Selection: The best performing straight split-line is determined based on whether the maximize argument is set to TRUE or FALSE. If maximize is TRUE, the best split-line is the one with the highest comparison value; if maximize is FALSE, the best split-line has the lowest comparison value.

The combination of the generalize.f, compare.f, and maximize arguments can be provided, enabling the creation of custom methods, or it can be determined internally based on the chosen preset subdivision methods (see below).

7) Nonlinear Split-lines: If the best straight split-line meets the quality criteria specified by the c.Q.crit, it serves as the basis for generating variously shaped curves (nonlinear split-lines). These curves are produced using splines, which are mathematical functions that can create smooth and flexible curves.

To generate the curves, a number of knots (control points) for the splines are distributed evenly along a set of lines orthogonal to the best straight split-line. These orthogonal lines are also evenly distributed along the straight split-line itself. The number of knots and lines used can be adjusted through the parameters c.Y.knots and c.X.knots, respectively.

The algorithm then iterates through this network of knots, considering different combinations of knots, to produce curves. By varying the selection and arrangement of knots, different shapes of curves are generated.

8) Nonlinear Split-line Evaluation and Selection: Curves are then processed in the same manner as straight split-lines in steps 3 to 6.

9) Final Split-line Selection and Establishment of Subdivision: If the best curve outperforms the best straight split-line by a margin of c.crit.improv, the best split-line becomes nonlinear. Otherwise, it remains straight. The split-line must also satisfy the criteria established by the Q.crit argument to be used for subdivision.

10) Recursive Iteration: The process described above is iteratively applied, resulting in a collection of selected split-lines with their performance values. These split-lines hierarchically subdivide space and data, forming polygons of various shapes.

Preset Methods

Preset methods and their combinations of compare.f, generalize.f, and maximize:

"sorensen"

Functions calculate the Sorensen similarity index (Sorensen 1948). maximize = FALSE. Values vary from 0 to 1.

"pielou"

Functions calculate the mean proportional decrease in Pielou evenness (Pielou 1966) that occurs after polygon subdivision. maximize = TRUE. Values vary from 0 to 1.

"morisita"

Functions calculate the Morisita overlap index (Morisita 1959). maximize = FALSE. Values vary from 0 to 1.

"horn.morisita"

Functions calculate the Morisita overlap index, as modified by Horn (1966). maximize = FALSE. Values vary from 0 to 1.

All preset methods currently available are specifically designed for bioregionalization purposes. These methods require two key inputs: the coordinates of fossil taxon occurrences (xy.dat) and the names or IDs of taxa (data). These names or IDs should be structured as character or numeric vectors, with each element corresponding to a row in the xy.dat data frame. Each method compares fossil communities on opposite sides of a split-line, aiming to minimize similarity or maximize difference. The output yields biogeographical provinces with a hierarchical structure.

Value

A hespdiv object, which is a list with seven elements:

poly.stats

A data frame containing information about polygons established by the selected split-lines. Its columns are:

  • rank: The rank of a polygon. It corresponds to the rank of the split-line that produced the polygon and to the polygon's position in the hierarchical structure of the subdivision.

  • plot.id: The ID assigned to the polygon. It corresponds to the order in which the polygon was processed during the hespdiv analysis.

  • root.id: The ID of the parent polygon whose subdivision resulted in the current polygon.

  • n.splits: The number of straight split-lines evaluated in an attempt to subdivide the polygon. This count excludes split-lines that crossed the polygon boundary or did not meet area, sample size, or location-number criteria.

  • n.obs: The number of observations inside the polygon.

  • mean: The average comparison value of the straight split-lines used in the attempted subdivision of the polygon. This value reflects the general spatial heterogeneity of the data.

  • sd: The standard deviation of the comparison values of the straight split-lines used in the attempted subdivision. It indicates the extent of anisotropy or variation in spatial heterogeneity within the polygon.

  • str.best: The comparison value of the best straight split-line produced within the polygon.

  • str.z.score: The z-score of the comparison value of the best straight split-line within the polygon. It indicates how outstanding the best straight split-line is relative to other evaluated straight subdivisions.

  • has.split: Logical. Indicates whether a subdivision was established in the polygon.

  • is.curve: Logical. Indicates whether the established subdivision was obtained using a curve. If has.split is FALSE, this value is NA.

  • crv.best: The same as str.best, but for nonlinear split-lines.

  • crv.z.score: The same as str.z.score, but for nonlinear split-lines.

  • c.improv: The improvement in comparison value achieved by using nonlinear split-lines instead of straight split-lines.

split.stats

A data frame containing information about established split-lines. Its columns can be interpreted similarly to those in poly.stats, but from the perspective of split-lines rather than polygons. For example, rank is the rank of the split-line, not of the polygon. The performance column contains the comparison value of the split-line and corresponds to either str.best or crv.best in poly.stats, depending on the value of is.curve. The same applies to the z.score column.

split.lines

A list of data frames containing coordinates of established split-lines. The order of split-lines in this list corresponds to their order in split.stats.

polygons.xy

A list of data frames containing coordinates of polygons established by the split-lines. The order of polygons in this list corresponds to their order in poly.stats.

poly.obj

A list of polygon objects, that is, outputs of generalize.f for each polygon. The order of elements in this list corresponds to the row order of poly.stats and the polygon order in polygons.xy.

call.info

Information about the hespdiv call, including the method and arguments used.

str.difs

A list containing comparison values of evaluated straight split-lines for each polygon. The elements of this list correspond to the rows of poly.stats.

Note

Please note that if you use the method argument, the arguments generalize.f, compare.f, and maximize are determined internally and should not be provided. Therefore, you should only assign values to these arguments when using a custom method, not a predefined one. Additionally, you can ignore the generalize.f argument even when applying custom methods. If generalize.f is set to NULL (default), the data remains unchanged, as generalize.f acts as an identity function. Hence, generalize.f is only an optional argument that allows to omit the transformation or generalization step in compare.f function, simplifying it.

Both generalize.f and compare.f inherit environments from parent functions: hespdiv() and .spatial_div(). This allows them to use additional variables from those environments, such as xy.dat, samp.xy, id1, and id2.

There is a small possibility that a nonlinear split-line may cross a polygon boundary. If the result contains such a split-line, or if an error related to this issue occurs, slightly change one of the arguments, such as n.split.pts, and re-run hespdiv().

Author(s)

Liudas Daumantas

References

Horn, H. S. (1966). Measurement of" overlap" in comparative ecological studies. The American Naturalist, 100(914), 419-424.

Morisita, M. (1959). Measuring of interspecific association and similarity between assemblages. Mem Fac Sci Kyushu Univ Ser E Biol, 3, 65-80.

Pielou, E. C. (1966). The measurement of diversity in different types of biological collections. Journal of theoretical biology, 13, 131-144.

Sorensen, T. A. (1948). A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Biol. Skar., 5, 1-34.

Examples

# Simulated data with a known boundary at x = 0.45 to illustrate boundary detection

study.area <- data.frame(
  x = c(0, 0.8, 1, 0.6, 0, 0),
  y = c(0, 0, 0.4, 1.1, 1.1, 0)
)

set.seed(852)
# Simulate 100 occurrence coordinates
N <- 100
xy.dat_arg <- data.frame(
  x = rpois(N, 4) / 10,
  y = rpois(N, 4) / 10
)

xy.dat_arg <- xy.dat_arg[order(xy.dat_arg$x), ]

# Simulate a boundary at x = 0.45
n_left <- sum(xy.dat_arg$x < 0.45)

set.seed(1)
common_data <- letters[1:5]

left_data <- sample(
  c(common_data, LETTERS[1:10]),
  size = n_left,
  replace = TRUE,
  # common-endemic probability ratio 3:4
  prob = c(rep(3, 5), rep(4, 10))
)

right_data <- sample(
  c(common_data, LETTERS[11:20]),
  size = N - n_left,
  replace = TRUE,
  # common-endemic probability ratio 3:4
  prob = c(rep(3, 5), rep(4, 10))
)

data_arg <- c(left_data, right_data)

# Apply hespdiv
r <- hespdiv(
  data = data_arg,
  xy.dat = xy.dat_arg,
  n.split.pts = 6,  # small value used here for illustration
  method = "sor",   # subdivision minimizing Sorensen-Dice similarity
  S.crit = 0.3,     # minimum area size is 30% of study area
  study.pol = study.area,
  use.chull = FALSE
)

plot_hespdiv(r, n.loc = TRUE) + ggplot2::geom_vline(xintercept = 0.45)

# Detected split-line performance
r$split.stats$performance

# Sorensen-Dice similarity across the true simulated boundary
2 * length(intersect(left_data, right_data)) /
  (length(unique(left_data)) + length(unique(right_data)))

hespdiv documentation built on May 21, 2026, 5:09 p.m.