selectAbsences: Select (spatially biased) absence rows.

View source: R/selectAbsences.R

selectAbsencesR Documentation

Select (spatially biased) absence rows.

Description

This function takes a matrix or (optionally spatial) data frame containing species presence (1) and (pseudo)absence (0) data (e.g., the output of [gridRecords), and it selects among the absence rows to stay within a given number or ratio of absences, and/or within and/or beyond a given distance to the presences. Optionally, to incorporate survey bias, absences can be selected with higher probability towards the vicinity of presences; or with probability weights according to a user-provided bias raster.

Usage

selectAbsences(data, sp.cols, coord.cols = NULL, CRS = NULL, min.dist = NULL,
max.dist = NULL, n = NULL, mult.p = NULL, bias = FALSE, bunch = FALSE,
dist.mat = NULL, seed = NULL, plot = !is.null(coord.cols), df = TRUE,
verbosity = 2)

Arguments

data

a 'data.frame' or an object that can be coerced to such (e.g. a 'matrix', 'tibble', 'SpatVector' or 'sf' object) containing a column with the species' presence (1) and absence (0) records, with localities in rows; and (if distance or spatial bias are required) two columns with the spatial coordinates, x and y.

sp.cols

names or index numbers of the columns containing the species presences (1) and absences (0) in 'data'.

coord.cols

names or index numbers of the columns containing the spatial coordinates in 'data' (x and y, or longitude and latitude, in this order!). Needed if distance or spatial bias are required.

CRS

coordinate reference system of the 'coord.cols' in 'data', in one of the following formats: WKT/WKT2, <authority>:<code>, or PROJ-string notation (see terra::crs()). Ignored if 'dist.mat' is provided. Otherwise, if 'CRS' is provided and the 'terra' package is installed, distances are computed with terra::distance(), thus accounting for the curvature of the Earth.

min.dist

(optional) numeric value specifying the minimum distance (in the same units as 'coord.cols') at which selected absences should be from the presences.

max.dist

(optional) numeric value specifying the maximum distance (in the same units as 'coord.cols') at which selected absences should be from the presences.

n

(optional) integer value specifying the number of absence rows to select. Can also be specified as a ratio – see 'mult.p' below.

mult.p

(optional) numeric value specifying how many times the number of presences to use as 'n' (e.g. 10 times as many absences as presences). Ignored if 'n' is not NULL.

bias

either a logical value TRUE to make the selection of absences biased towards the vicinity of presences (which requires specifying 'coord.cols'; may take time and memory for large datasets if 'dist.mat' is not provided); or a SpatRaster layer (quantifying e.g. survey effort, accessibility or human presence; see Details and biasLayer) with higher pixel values where the selection of absences should be more likely. Note that this layer should have (approximately) the same spatial resolution as 'data'. The default is FALSE, for no bias.

dist.mat

optional argument to pass to distPres. If not provided, a distance matrix will be computed with stats::dist()) or with terra::distance()).

bunch

[PENDING IMPLEMENTATION] logical value specifying if the selected absences should concentrate around presences in proportion to their local density, as in Vollering et al. (2019). The default is FALSE.

seed

(optional) integer value to pass to set.seed specifying the random seed to use for sampling among the absences.

plot

logical value specifying whether to plot the result. The default is TRUE if 'coord.cols' are provided.

df

logical value specifying whether to return a dataframe with the input 'data' after removal of the non-selected absences. The default is TRUE. If set to FALSE, the output is a logical vector specifying if each row of 'data' was selected or not.

verbosity

numeric value indicating the amount of messages to display. Choose 0 for no messages.

Details

Species occurrence data typically incorporate two probability distributions: the actual probability of the species being present, and the probability of it being recorded if it was present (Merow et al. 2013). Thus, any covariation between recording probability and the predictor variables can bias the predictions of species distribution models (Yackulic et al. 2013).

Methods to correct for this bias include the selection of (pseudo)absences preferably towards the vicinity of presence records, in order to reproduce the survey bias. This function implements this strategy in several (alternative or complementary) ways: 1) selecting absences within and/or beyond a given distance from presences; 2) biasing the random selection of absences, making it more likely towards the vicinity of presences (providing the 'prob' argument in sample with the result of distPres); or [PENDING IMPLEMENTATION!] 3) bunching up the absences preferably around the areas with higher densities of presences (Vollering et al. 2019).

More recent versions allow using instead a raster map of weights (bias layer), with higher pixel values where absence selection should be proportionaly more likely, and zero or NA where pseudoabsence points should not be placed. This layer should normally reflect (a proxy for) likely survey effort, such as proximity to roads or populated areas, human footprint (see e.g. geodata::footprint()) or travel time (e.g. geodata::travel_time()). You can also create your own e.g. with biasLayer. Users should provide the bias layer at aproximately the same spatial resolution as the 'coord.cols' in the input 'data'.

Value

This function returns the 'data' input after removal of the non-selected absences, or (if df=FALSE) a logical vector specifying if each row of 'data' was selected (TRUE) or not (FALSE). If plot=TRUE and provided 'coord.cols', it also plots the presences (blue "plus" signs), the selected absences (red "minus" signs) and the excluded absences (orange dots).

Author(s)

A. Marcia Barbosa

References

Vollering J., Halvorsen R., Auestad I. & Rydgren K. (2019) Bunching up the background betters bias in species distribution models. Ecography, 42: 1717-1727

See Also

gridRecords, sample, biasLayer

Examples

data(rotif.env)

head(rotif.env)

names(rotif.env)

table(rotif.env$Burceo)

# select among the absences using different criteria:

burceo_select <- selectAbsences(data = rotif.env, sp.cols = "Burceo",
coord.cols = c("Longitude", "Latitude"), n = 150, seed = 123)

burceo_select <- selectAbsences(data = rotif.env, sp.cols = "Burceo",
coord.cols = c("Longitude", "Latitude"), mult.p = 1.5, seed = 123)

burceo_select <- selectAbsences(data = rotif.env, sp.cols = "Burceo",
coord.cols = c("Longitude", "Latitude"), max.dist = 18)

burceo_select <- selectAbsences(data = rotif.env, sp.cols = "Burceo",
coord.cols = c("Longitude", "Latitude"), max.dist = 18, min.dist = 5,
n = sum(rotif.env$Burceo), bias = TRUE)


## Not run: 
# you can run the next example if you have the 'geodata' package installed:

bias_layer <- geodata::footprint(year = 2009, path = tempfile())

# see also the biasLayer() function for creating your own bias layer


burceo_select <- fuzzySim::selectAbsences(data = rotif.env, sp.cols = "Burceo",
coord.cols = c("Longitude", "Latitude"), n = sum(rotif.env$Burceo),
bias = bias_layer)

## End(Not run)

fuzzySim documentation built on March 22, 2025, 3 a.m.