outliers.detect.mass: Detect outliers for a multi-species set of geographical...

View source: R/outlier_fn.R

outliers.detect.mass    R Documentation

Detect outliers for a multi-species set of geographical coordinates

Description

This function runs the outlier detection methods described in gecko::outliers.detect() on multi-species datasets, automatically adjusting to the amount of data available and the strategy chosen. A species must have at least 3 data points to be processed. Supplying a training dataset additionally makes the function use method "svm", which requires at least 5 training points. For now, species with insufficient data are accepted by default; future updates will allow users to choose a "lack of data" strategy.
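The minimum record counts above can be checked before calling the function. The following is a minimal sketch, not part of gecko: the helper name check_sufficiency and its arguments min_test and min_train are hypothetical, and it assumes both data frames carry a column named species.

```r
# Hypothetical helper (not a gecko function): count records per species and
# compare them with the minimums stated in the Description.
check_sufficiency <- function(test, train = NULL, min_test = 3, min_train = 5) {
  n_test <- table(test$species)
  out <- data.frame(
    species = names(n_test),
    n_test = as.integer(n_test),
    stringsAsFactors = FALSE
  )
  out$enough_test <- out$n_test >= min_test
  if (!is.null(train)) {
    n_train <- table(train$species)
    idx <- match(out$species, names(n_train))  # NA when a species has no training data
    out$n_train <- ifelse(is.na(idx), 0L, as.integer(n_train)[idx])
    out$enough_train <- out$n_train >= min_train
  }
  out
}

# Toy data: species "A" has enough records in both sets, species "B" in neither.
toy_test <- data.frame(species = c("A", "A", "A", "A", "B", "B"),
                       long = 1:6, lat = 1:6)
toy_train <- data.frame(species = c(rep("A", 6), "B"),
                        long = 1:7, lat = 1:7)
sufficiency <- check_sufficiency(toy_test, toy_train)
```

How outliers.detect.mass() itself handles each case is described above; this helper only reports the counts.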

Usage

outliers.detect.mass(
  test,
  train = NULL,
  path = NULL,
  strategy = "majority",
  hi_res = FALSE,
  crop = FALSE,
  threshold = 0.05
)

Arguments

test

data.frame. With three columns containing species, latitude and longitude, describing the locations of a species, which may contain outliers.

train

data.frame. With the same formatting as test, containing only known locations where each target species occurs. Used exclusively as training data for method 'svm'. For outlier detection to work, the training data supplied must have valid environmental data; if you suspect this might not be the case, run terra::extract() using your downloaded WorldClim data.

path

character. Path to a folder where plots scrutinizing decision making per species should be saved.

strategy

character. Strategy used to combine the decisions of the outlier detection methods. Either "permissive", "majority" or "conservative". In "permissive", only points marked as potential outliers by all selected methods are rejected. In "majority", a decision is made by popular vote; if a popular vote cannot be reached, the point is rejected by default. In "conservative", all points marked as outliers by any number of methods are rejected.
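The three strategies can be illustrated with a small voting sketch. This is not gecko's internal code: the function combine_votes and the logical matrix votes are hypothetical, and the handling of ties under "majority" (a tie counts as a rejection) is one reading of the description above.

```r
# votes: logical matrix, one row per point and one column per method;
# TRUE means the method marked the point as a potential outlier.
# Returns TRUE for points that would be rejected under the chosen strategy.
combine_votes <- function(votes, strategy = "majority") {
  n_flag <- rowSums(votes)      # number of methods flagging each point
  n_methods <- ncol(votes)
  switch(strategy,
    permissive = n_flag == n_methods,     # rejected only if all methods agree
    majority = n_flag >= n_methods / 2,   # assumption: ties are rejected
    conservative = n_flag > 0,            # rejected if any method flags it
    stop("unknown strategy: ", strategy)
  )
}

# Three points judged by two methods.
votes <- matrix(c(TRUE,  TRUE,
                  TRUE,  FALSE,
                  FALSE, FALSE), ncol = 2, byrow = TRUE)
rejected_permissive <- combine_votes(votes, "permissive")
rejected_majority <- combine_votes(votes, "majority")
rejected_conservative <- combine_votes(votes, "conservative")
```

With two methods, the second point is a tie, which is why it is rejected under "majority" in this sketch but kept under "permissive".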

hi_res

logical. Specifies whether 1 km resolution environmental data should be used. If FALSE, 10 km resolution data is used instead.

crop

logical. Indicates whether environmental data should be cropped to an extent similar to that covered by test and train. Useful to avoid the long processing times of higher resolutions.

threshold

numeric. Value indicating the threshold for classifying outliers in methods "geo" and "env". E.g., under the default of 0.05, points whose average distance is greater than the 95% quantile of the average distances of all points will be classified as outliers.
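The quantile rule can be sketched in a few lines of base R. This is an illustration only: the function flag_by_mean_distance is hypothetical, and it uses plain Euclidean distance on the raw coordinates, whereas the distance measures actually used by methods "geo" and "env" are documented in gecko::outliers.detect() and may differ.

```r
# Hypothetical illustration of the threshold rule: flag points whose average
# distance to all other points exceeds the (1 - threshold) quantile.
flag_by_mean_distance <- function(coords, threshold = 0.05) {
  d <- as.matrix(dist(coords))            # pairwise Euclidean distances
  mean_d <- rowSums(d) / (nrow(d) - 1)    # average distance to the other points
  cutoff <- quantile(mean_d, probs = 1 - threshold)
  mean_d > cutoff
}

# Ten points on a small grid plus one distant point.
coords <- data.frame(
  long = c(0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 100),
  lat  = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 100)
)
flags <- flag_by_mean_distance(coords, threshold = 0.05)
```

Raising threshold lowers the cutoff quantile, so more points are flagged.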

Details

The environmental data used is WorldClim, which requires a long download; see gecko::gecko.setDir(). This function is a version of gecko::outliers.detect() tailored for ease of handling datasets with multiple species. For details on the methodology used to detect outliers, please consult the documentation for that function.

Value

list. The first element is a dataset containing all elements of the original test set except those rejected. The second element is a table summarizing how many data points belonged to species not_in_common, how many received no decision due to insufficient_data, and how many were accepted and rejected, with the latter accompanied by how often each group of methods was used as the basis, e.g. env;geo.

Examples

## Not run: 
# Known occurrences shipped with gecko, used here as training data.
old_occurrences = gecko.data("records")
colnames(old_occurrences) = c("species", "long", "lat")
# Simulated new records for three species, 50 points each.
new_occurrences = data.frame(
  species = rep(c("Hogna maderiana", "Malthonica oceanica", "Agroeca inopina"), each = 50),
  long = c(runif(50, -17.1, -16.09), runif(50, -8.8, -7), runif(50, -6, -2)),
  lat = c(runif(50, 32.73, 32.76), runif(50, 39.5, 40), runif(50, 40, 42))
)
# Folder for the per-species plots (a temporary directory is assumed here).
path = tempdir()
outliers.detect.mass(new_occurrences, train = old_occurrences, path = path)

## End(Not run)

gecko documentation built on Sept. 9, 2025, 5:58 p.m.