selectAbsences: Select (spatially biased) absence rows.
In fuzzySim: Fuzzy Similarity in Species Distributions

selectAbsences

R Documentation

Select (spatially biased) absence rows.

Description

This function takes a matrix or (optionally spatial) data frame containing species presence (1) and (pseudo)absence (0) data (e.g., the output of [gridRecords), and it selects among the absence rows to stay within a given number or ratio of absences, and/or within and/or beyond a given distance to the presences. Optionally, to incorporate survey bias, absences can be selected with higher probability towards the vicinity of presences; or with probability weights according to a user-provided bias raster.

Usage

selectAbsences(data, sp.cols, coord.cols = NULL, CRS = NULL, min.dist = NULL,
max.dist = NULL, n = NULL, mult.p = NULL, bias = FALSE, bunch = FALSE,
dist.mat = NULL, seed = NULL, plot = !is.null(coord.cols), df = TRUE,
verbosity = 2)

Arguments

`data`	a 'data.frame' or an object that can be coerced to such (e.g. a 'matrix', 'tibble', 'SpatVector' or 'sf' object) containing a column with the species' presence (1) and absence (0) records, with localities in rows; and (if distance or spatial bias are required) two columns with the spatial coordinates, x and y.
`sp.cols`	names or index numbers of the columns containing the species presences (1) and absences (0) in 'data'.
`coord.cols`	names or index numbers of the columns containing the spatial coordinates in 'data' (x and y, or longitude and latitude, in this order!). Needed if distance or spatial bias are required.
`CRS`	coordinate reference system of the 'coord.cols' in 'data', in one of the following formats: WKT/WKT2, <authority>:<code>, or PROJ-string notation (see `terra::crs()`). Ignored if 'dist.mat' is provided. Otherwise, if 'CRS' is provided and the 'terra' package is installed, distances are computed with `terra::distance()`, thus accounting for the curvature of the Earth.
`min.dist`	(optional) numeric value specifying the minimum distance (in the same units as 'coord.cols') at which selected absences should be from the presences.
`max.dist`	(optional) numeric value specifying the maximum distance (in the same units as 'coord.cols') at which selected absences should be from the presences.
`n`	(optional) integer value specifying the number of absence rows to select. Can also be specified as a ratio – see 'mult.p' below.
`mult.p`	(optional) numeric value specifying how many times the number of presences to use as 'n' (e.g. 10 times as many absences as presences). Ignored if 'n' is not NULL.
`bias`	either a logical value TRUE to make the selection of absences biased towards the vicinity of presences (which requires specifying 'coord.cols'; may take time and memory for large datasets if 'dist.mat' is not provided); or a SpatRaster layer (quantifying e.g. survey effort, accessibility or human presence; see Details and `biasLayer`) with higher pixel values where the selection of absences should be more likely. Note that this layer should have (approximately) the same spatial resolution as 'data'. The default is FALSE, for no bias.
`dist.mat`	optional argument to pass to `distPres`. If not provided, a distance matrix will be computed with `stats::dist()`) or with `terra::distance()`).
`bunch`	[PENDING IMPLEMENTATION] logical value specifying if the selected absences should concentrate around presences in proportion to their local density, as in Vollering et al. (2019). The default is FALSE.
`seed`	(optional) integer value to pass to `set.seed` specifying the random seed to use for sampling among the absences.
`plot`	logical value specifying whether to plot the result. The default is TRUE if 'coord.cols' are provided.
`df`	logical value specifying whether to return a dataframe with the input 'data' after removal of the non-selected absences. The default is TRUE. If set to FALSE, the output is a logical vector specifying if each row of 'data' was selected or not.
`verbosity`	numeric value indicating the amount of messages to display. Choose 0 for no messages.

Details

Species occurrence data typically incorporate two probability distributions: the actual probability of the species being present, and the probability of it being recorded if it was present (Merow et al. 2013). Thus, any covariation between recording probability and the predictor variables can bias the predictions of species distribution models (Yackulic et al. 2013).

Methods to correct for this bias include the selection of (pseudo)absences preferably towards the vicinity of presence records, in order to reproduce the survey bias. This function implements this strategy in several (alternative or complementary) ways: 1) selecting absences within and/or beyond a given distance from presences; 2) biasing the random selection of absences, making it more likely towards the vicinity of presences (providing the 'prob' argument in sample with the result of distPres); or [PENDING IMPLEMENTATION!] 3) bunching up the absences preferably around the areas with higher densities of presences (Vollering et al. 2019).

More recent versions allow using instead a raster map of weights (bias layer), with higher pixel values where absence selection should be proportionaly more likely, and zero or NA where pseudoabsence points should not be placed. This layer should normally reflect (a proxy for) likely survey effort, such as proximity to roads or populated areas, human footprint (see e.g. geodata::footprint()) or travel time (e.g. geodata::travel_time()). You can also create your own e.g. with biasLayer. Users should provide the bias layer at aproximately the same spatial resolution as the 'coord.cols' in the input 'data'.

Value

This function returns the 'data' input after removal of the non-selected absences, or (if df=FALSE) a logical vector specifying if each row of 'data' was selected (TRUE) or not (FALSE). If plot=TRUE and provided 'coord.cols', it also plots the presences (blue "plus" signs), the selected absences (red "minus" signs) and the excluded absences (orange dots).

Author(s)

A. Marcia Barbosa

References

Vollering J., Halvorsen R., Auestad I. & Rydgren K. (2019) Bunching up the background betters bias in species distribution models. Ecography, 42: 1717-1727

Examples

data(rotif.env)

head(rotif.env)

names(rotif.env)

table(rotif.env$Burceo)

# select among the absences using different criteria:

burceo_select <- selectAbsences(data = rotif.env, sp.cols = "Burceo",
coord.cols = c("Longitude", "Latitude"), n = 150, seed = 123)

burceo_select <- selectAbsences(data = rotif.env, sp.cols = "Burceo",
coord.cols = c("Longitude", "Latitude"), mult.p = 1.5, seed = 123)

burceo_select <- selectAbsences(data = rotif.env, sp.cols = "Burceo",
coord.cols = c("Longitude", "Latitude"), max.dist = 18)

burceo_select <- selectAbsences(data = rotif.env, sp.cols = "Burceo",
coord.cols = c("Longitude", "Latitude"), max.dist = 18, min.dist = 5,
n = sum(rotif.env$Burceo), bias = TRUE)


## Not run: 
# you can run the next example if you have the 'geodata' package installed:

bias_layer <- geodata::footprint(year = 2009, path = tempfile())

# see also the biasLayer() function for creating your own bias layer


burceo_select <- fuzzySim::selectAbsences(data = rotif.env, sp.cols = "Burceo",
coord.cols = c("Longitude", "Latitude"), n = sum(rotif.env$Burceo),
bias = bias_layer)

## End(Not run)

fuzzySim documentation built on March 22, 2025, 3 a.m.