geoFold: Assign geographically-distinct k-folds


geoFold    R Documentation

Assign geographically-distinct k-folds

Description

This function assigns geographically divided k-folds ("g-folds") using the partitioning around medoids (PAM) algorithm. The user can specify the number of folds to create and, optionally, the minimum number of sites in any fold plus the minimum number of sites NOT in any given fold (useful for ensuring each fold has enough sites for both testing and training).

Usage

geoFold(
  x,
  k,
  minIn = NULL,
  minOut = NULL,
  longLat = NULL,
  distFunct = NULL,
  swaps = NULL,
  ...
)

Arguments

x

A data frame, matrix, SpatialPoints, or SpatialPointsDataFrame object. If a data frame or matrix then the coordinate reference system is assumed to be unprojected (WGS84).

k

Positive integer. Number of k-folds to create.

minIn

Positive integer or NULL. Minimum number of sites required to be in a fold. If left NULL (default), it is possible to have just one site in a fold.

minOut

Positive integer or NULL. Minimum number of sites required to be outside of a fold (i.e., if there are 5 folds, then for fold #1 this is the number of sites in folds 2 through 5). Leave as NULL to ignore.
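
As a rough illustration of what minIn and minOut constrain, the sketch below checks a fold assignment against both. The helper checkFoldSizes() is hypothetical and not part of enmSdm; it only assumes that folds is an integer vector like the one geoFold() returns.

```r
# Hypothetical helper (not part of enmSdm): does a fold assignment satisfy
# the minIn and minOut constraints?
checkFoldSizes <- function(folds, minIn = NULL, minOut = NULL) {
  nIn <- table(folds)            # number of sites inside each fold
  nOut <- length(folds) - nIn    # number of sites outside each fold
  okIn <- is.null(minIn) || all(nIn >= minIn)
  okOut <- is.null(minOut) || all(nOut >= minOut)
  okIn && okOut
}

# made-up assignment with 2, 20, and 20 sites in folds 1, 2, and 3
folds <- rep(1:3, times = c(2, 20, 20))
checkFoldSizes(folds)                           # no constraints imposed
checkFoldSizes(folds, minIn = 5)                # fold 1 holds only 2 sites
checkFoldSizes(folds, minIn = 2, minOut = 22)   # 40, 22, and 22 sites lie outside
```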

longLat

Two-element character vector or two-element integer vector. If x is a data frame then this should be either a character vector naming the fields in x that hold longitude and latitude (in that order) or an integer vector giving their column numbers. For example, c('long', 'lat') or c(1, 2). If x is a matrix then this is a two-element integer vector indicating the column numbers in x that represent longitude and latitude. For example, c(1, 2). If x is a SpatialPoints object then this is ignored.

distFunct

Either a function or NULL. If NULL then distCosine() in the geosphere package is used to calculate distances. More accurate distances can be obtained by using other functions (see distCosine() and the related "dist" functions in geosphere). Alternatively, a custom function can be used so long as its first argument is a two-column numeric matrix with one row holding the x- and y-coordinates of a single point and its second argument is a two-column numeric matrix with one or more rows of other points.
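
A minimal sketch of a custom distance function that follows the argument contract described above. The name haversineDist and its r argument are assumptions made for illustration; the function is written in base R and is not part of enmSdm or geosphere.

```r
# Hypothetical custom distance function (not part of enmSdm or geosphere).
# p1: two-column numeric matrix with one row (longitude, latitude), degrees
# p2: two-column numeric matrix with one or more rows (longitude, latitude)
# r:  sphere radius in meters (assumed mean Earth radius)
# Returns great-circle distances in meters from p1 to each row of p2.
haversineDist <- function(p1, p2, r = 6371000) {
  p1 <- matrix(as.matrix(p1), ncol = 2)
  p2 <- matrix(as.matrix(p2), ncol = 2)
  toRad <- pi / 180
  lon1 <- p1[1, 1] * toRad
  lat1 <- p1[1, 2] * toRad
  lon2 <- p2[, 1] * toRad
  lat2 <- p2[, 2] * toRad
  a <- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))   # pmin() guards against rounding above 1
}

origin <- cbind(-90, 40)
others <- rbind(c(-90, 41), c(-80, 37), c(-100, 37))
d <- haversineDist(origin, others)
d
```

Assuming geoFold() calls distFunct as described, a function like this could be supplied as distFunct=haversineDist, with extra arguments such as r passed through the ... argument.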

swaps

Positive integer or NULL. Sometimes the routine generates folds that aren't minimally compact; i.e., points from some folds are spatially inside other folds. To correct this, a random swap procedure is performed at the end in which pairs of points from different folds have their fold assignments swapped. If a swap decreases the mean distance to the (new) centroid of each fold then it is kept. Otherwise it is discarded. This procedure is performed swaps number of times. The default (set by swaps=NULL) is max(100, n^2) times, where n is the total number of points.
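
The swap idea can be sketched roughly as follows. This is an illustration only, not the enmSdm source: the function names are hypothetical, and for simplicity it assumes planar Euclidean distance and coordinate means as centroids rather than whatever distFunct computes.

```r
# Illustrative sketch only (not the enmSdm implementation).

# Mean, across folds, of the mean distance from each site to its fold centroid
meanDistToCentroid <- function(xy, folds) {
  perFold <- sapply(split(as.data.frame(xy), folds), function(pts) {
    ctr <- colMeans(pts)
    mean(sqrt((pts[[1]] - ctr[1])^2 + (pts[[2]] - ctr[2])^2))
  })
  mean(perFold)
}

randomSwaps <- function(xy, folds, swaps = 100) {
  best <- meanDistToCentroid(xy, folds)
  for (s in seq_len(swaps)) {
    # pick one site at random, then a second site from a different fold
    i <- sample(length(folds), 1)
    others <- which(folds != folds[i])
    if (length(others) == 0) break
    j <- others[sample(length(others), 1)]
    # exchange the two fold assignments (fold sizes stay the same)
    trial <- folds
    trial[c(i, j)] <- folds[c(j, i)]
    score <- meanDistToCentroid(xy, trial)
    if (score < best) {   # keep the swap only if the folds become more compact
      folds <- trial
      best <- score
    }
  }
  folds
}

set.seed(1)
xy <- cbind(x = c(rnorm(10, -5), rnorm(10, 5)), y = rnorm(20))
start <- rep(1:2, each = 10)
start[c(1, 11)] <- c(2, 1)   # deliberately misplace one site in each fold
fixed <- randomSwaps(xy, start, swaps = 200)
table(start, fixed)
```

Because each step exchanges assignments between two sites, fold sizes are unchanged, so a minIn or minOut constraint that held before the swaps would still hold afterward under this sketch.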

...

Arguments to pass to distFunct.

Value

An integer vector, one element for each site (row) of x, with values 1 through k indicating which fold a site is located in.
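
One way the returned vector might be used is to hold out each fold in turn for testing. The sketch below uses made-up coordinates and a hand-written folds vector so that it runs without enmSdm; in practice folds would come from geoFold().

```r
# Made-up data for illustration; folds stands in for the output of geoFold().
sites <- data.frame(
  long = c(-90, -90, -80, -81, -100, -99),
  lat = c(40, 41, 37, 38, 37, 36)
)
folds <- c(1L, 1L, 2L, 2L, 3L, 3L)

# hold out each fold in turn: sites in fold i test, all others train
splits <- lapply(sort(unique(folds)), function(i) {
  list(
    fold = i,
    test = sites[folds == i, , drop = FALSE],
    train = sites[folds != i, , drop = FALSE]
  )
})

# number of training and testing sites for each held-out fold
sapply(splits, function(s) {
  c(fold = s$fold, nTrain = nrow(s$train), nTest = nrow(s$test))
})
```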

See Also

distCosine, pam

Examples

# Make three groups, one with two points and two with 20 points apiece.
# Naturally these should group into 3 groups with 2, 20, and 20 points apiece.
# By setting minIn and minOut to non-NULL values, we can increase/decrease
# the size of the groups.
# define plot function
pointPlot <- function(x, folds, ...) {
   plot(x, pch=16, cex=2, col='white', ...)
   # draw each fold with its own symbol and fill color
   for (i in sort(unique(folds))) {
      points(x[folds==i, ], bg=i + 1, pch=20 + i, cex=2)
   }
   legend('bottomright', legend=paste('fold', sort(unique(folds))),
      pt.bg=sort(unique(folds)) + 1, pch=20 + sort(unique(folds)), cex=1.4)
}

set.seed(17)
group1 <- data.frame(x=c(-90, -90), y=c(40, 41))
group2 <- data.frame(x=rep(-80, 20), y=rep(37, 20))
group3 <- data.frame(x=rep(-100, 20), y=rep(37, 20))

group2 <- group2 + cbind(rnorm(20), rnorm(20))
group3 <- group3 + cbind(rnorm(20), rnorm(20))

sites <- rbind(group1, group2, group3)

# simple g-folds
folds <- geoFold(sites, k=3)
pointPlot(sites, folds, main='Simple G-folds')

# g-folds with >= 5 sites per fold
folds <- geoFold(sites, k=3, minIn=5)
pointPlot(sites, folds, main='G-folds with >=5 sites in each')

# g-folds with >= 10 sites in and out of each fold
folds <- geoFold(sites, k=3, minIn=10, minOut=10)
pointPlot(sites, folds, main='G-folds with >=10\nsites in/outside each')

# g-folds with >=14 sites in and >= 20 sites out of each fold
folds <- geoFold(sites, k=3, minIn=14, minOut=20)
pointPlot(sites, folds, main='G-folds >=14 in\nand >=20 outside')

adamlilith/enmSdm documentation built on Jan. 6, 2023, 11 a.m.