cv_nndm: Use the Nearest Neighbour Distance Matching (NNDM) to...
In blockCV: Spatial and Environmental Blocking for K-Fold and LOO Cross-Validation

cv_nndm

R Documentation

Use the Nearest Neighbour Distance Matching (NNDM) to separate train and test folds

Description

A fast implementation of the Nearest Neighbour Distance Matching (NNDM) algorithm (Milà et al., 2022) in C++. Similar to cv_buffer, this is a variation of leave-one-out (LOO) cross-validation. It tries to match the nearest neighbour distance distribution function between the test and training data to the nearest neighbour distance distribution function between the target prediction and training points (Milà et al., 2022).
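The two distributions being matched can be illustrated with a small base R sketch. This is not the package's C++ implementation: the toy coordinates, the hypothetical helper `cross_dist`, and the use of planar Euclidean distance are all assumptions made for illustration. It only computes the two empirical nearest neighbour distance distribution functions (G functions) that NNDM compares, for a clustered sample and a much larger prediction area.

```r
# Hypothetical toy data: samples clustered in one corner of the prediction area
set.seed(42)
train_xy <- cbind(x = runif(30, 0, 20),   y = runif(30, 0, 20))
pred_xy  <- cbind(x = runif(500, 0, 100), y = runif(500, 0, 100))

# Euclidean distance from every row of `a` to every row of `b` (illustrative helper)
cross_dist <- function(a, b) {
  dx <- outer(a[, 1], b[, 1], "-")
  dy <- outer(a[, 2], b[, 2], "-")
  sqrt(dx^2 + dy^2)
}

# Nearest neighbour distance from each prediction point to the sample points
nn_pred <- apply(cross_dist(pred_xy, train_xy), 1, min)

# Nearest neighbour distance of each sample point to the remaining samples,
# i.e. the test-to-train distance under plain leave-one-out
d_ss <- cross_dist(train_xy, train_xy)
diag(d_ss) <- Inf
nn_loo <- apply(d_ss, 1, min)

# Empirical distribution functions of the two sets of distances
G_pred <- ecdf(nn_pred)
G_loo  <- ecdf(nn_loo)

# Compare the two curves over a grid of distances
r_seq <- seq(0, 100, by = 5)
round(cbind(r = r_seq, G_loo = G_loo(r_seq), G_pred = G_pred(r_seq)), 2)
```

With clustered samples, the leave-one-out curve rises much faster than the prediction curve. NNDM removes nearby training points from each fold so that the test-to-train curve moves towards the prediction curve.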

Usage

cv_nndm(
  x,
  column = NULL,
  r,
  size,
  num_sample = 10000,
  sampling = "random",
  min_train = 0.05,
  presence_bg = FALSE,
  add_bg = FALSE,
  plot = TRUE,
  report = TRUE
)

Arguments

`x`	a simple features (sf) or SpatialPoints object of spatial sample data (e.g., species data or ground truth sample for image classification).
`column`	character; the name of the column in which the response variable (e.g. species data as a binary response, i.e. 0s and 1s) is stored. This is required when `presence_bg = TRUE`; otherwise it is optional.
`r`	a terra SpatRaster object of a predictor variable. This defines the area that the model is going to predict.
`size`	numeric value of the range of spatial autocorrelation (the `phi` parameter). This distance should be in metres. The range could be explored by `cv_spatial_autocor`.
`num_sample`	integer; the number of sample points from predictor (`r`) to be used for calculating the G function of prediction points.
`sampling`	either `"random"` or `"regular"` for sampling prediction points. When `sampling = "regular"`, the actual number of samples might be less than `num_sample` for non-rectangular rasters (points falling on no-value areas are removed).
`min_train`	numeric; between 0 and 1. A constraint on the minimum proportion of train points in each fold.
`presence_bg`	logical; whether to treat data as species presence-background data. For all other data types (presence-absence, continuous, count or multi-class responses), this option should be `FALSE`.
`add_bg`	logical; add background points to the test set when `presence_bg = TRUE`. We do not recommend this according to Radosavljevic & Anderson (2014). Keep it `FALSE`, unless you mean to add the background points to the testing points.
`plot`	logical; whether to plot the G functions.
`report`	logical; whether to print a summary of the records in each fold; for very big datasets, set to `FALSE` for slightly faster calculation.

Details

When working with presence-background (presence and pseudo-absence) species distribution data (specified by setting presence_bg = TRUE), only presence records are used for specifying the folds (recommended). The testing fold comprises only the target presence point; optionally, when add_bg = TRUE, all background points within the distance are also included. This is the distance at which the nearest neighbour distance distribution function between testing and training presences matches that between prediction points and training presences; it is often lower than size. Any non-target presence points inside the distance are excluded. All points (presence and background) outside the distance are used for the training set. The method cycles through all the presence data, so the number of folds is equal to the number of presence points in the dataset.
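The fold rule described above can be sketched in base R. This is a simplified illustration and not the package's code. The toy data, the hypothetical helper `make_fold`, and the single fixed exclusion distance `dist` are assumptions; in cv_nndm the distance comes from the matching procedure rather than being supplied directly.

```r
# Hypothetical toy data: 6 presences (occ = 1) followed by 6 background points (occ = 0)
set.seed(1)
n   <- 12
xy  <- cbind(x = runif(n, 0, 100), y = runif(n, 0, 100))
occ <- rep(c(1, 0), each = 6)

# Build one fold around a target presence point (illustrative helper, not a blockCV function)
make_fold <- function(target, xy, occ, dist, add_bg = FALSE) {
  d <- sqrt((xy[, 1] - xy[target, 1])^2 + (xy[, 2] - xy[target, 2])^2)
  inside <- d <= dist
  # all points (presence and background) outside the distance form the training set
  train <- which(!inside)
  # the test set is the target presence only ...
  test <- target
  # ... plus background points inside the distance when add_bg = TRUE
  if (add_bg) test <- c(test, which(inside & occ == 0))
  list(train = train, test = test)
}

# One fold per presence point
presences <- which(occ == 1)
folds <- lapply(presences, make_fold, xy = xy, occ = occ, dist = 30)
length(folds)  # equals the number of presence points
```

Non-target presences inside the distance appear in neither set, which is what keeps the testing point spatially separated from its training data.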

For all other types of data (including presence-absence, count, continuous, and multi-class), set presence_bg = FALSE, and the function behaves similarly to the method explained by Milà and colleagues (2022).

Value

An S3 object. A list of objects including:

folds_list - a list containing the folds. Each fold has two vectors with the training (first) and testing (second) indices
k - number of the folds
size - the distance band used to separate training and testing folds
column - the name of the column if provided
presence_bg - whether this was treated as presence-background data
records - a table with the number of points in each category of training and testing
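A minimal sketch of how the folds_list element might be consumed in a leave-one-out evaluation loop follows. To keep it self-contained it does not call cv_nndm: the data frame `dat`, the linear model, and the mock `folds_list` (plain LOO with the same two-vector layout, training indices first and testing indices second) are assumptions for illustration. With a real cv_nndm result, the training vectors would typically be shorter because nearby points are excluded.

```r
# Hypothetical toy data with one predictor and a continuous response
set.seed(7)
dat <- data.frame(x1 = rnorm(20))
dat$y <- 2 * dat$x1 + rnorm(20, sd = 0.5)

# Mock folds in the documented layout: each fold holds the training indices
# (first vector) and the testing indices (second vector)
folds_list <- lapply(seq_len(nrow(dat)), function(i) {
  list(setdiff(seq_len(nrow(dat)), i), i)
})

# Fit on the training indices of each fold and predict its testing indices
pred <- rep(NA_real_, nrow(dat))
for (k in seq_along(folds_list)) {
  train_idx <- folds_list[[k]][[1]]
  test_idx  <- folds_list[[k]][[2]]
  fit <- lm(y ~ x1, data = dat[train_idx, ])
  pred[test_idx] <- predict(fit, newdata = dat[test_idx, , drop = FALSE])
}

# Pool the held-out predictions into a single error estimate
rmse <- sqrt(mean((dat$y - pred)^2))
rmse
```

Pooling predictions across folds before computing the error is the usual choice for LOO-type schemes, since each fold holds only one testing point.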

References

C. Milà, J. Mateu, E. Pebesma, and H. Meyer, Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation, Methods in Ecology and Evolution (2022).

Examples


library(blockCV)

# import presence-absence species data
points <- read.csv(system.file("extdata", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# load raster data
path <- system.file("extdata/au/bio_5.tif", package = "blockCV")
covar <- terra::rast(path)

nndm <- cv_nndm(x = pa_data,
                column = "occ", # optional
                r = covar,
                size = 350000, # size in metres no matter the CRS
                num_sample = 10000,
                sampling = "regular",
                min_train = 0.1)

blockCV documentation built on Nov. 1, 2024, 9:09 a.m.