checkOut: Search For Spatial Outliers

View source: R/checkOut.R

checkOut R Documentation

Search For Spatial Outliers

Description

This function searches for spatial outliers, i.e. records that lie too far from the core distribution of a species, based on Mahalanobis distances. Spatial outliers can indicate misidentifications or records obtained from cultivated individuals, although not all cultivated individuals are necessarily spatial outliers (see also the plantR function getCult()).

Usage

checkOut(
  x,
  lon = "decimalLongitude.new",
  lat = "decimalLatitude.new",
  tax.name = "scientificName.new",
  geo.name = "geo.check",
  cult.name = "cult.check",
  n.min = 6,
  center = "median",
  geo.patt = "ok_",
  cult.patt = NA,
  clas.cut = 3,
  rob.cut = 16
)

Arguments

x

a data frame with the species records.

lon

character. Name of the column with the record longitude in decimal degrees. Defaults to "decimalLongitude.new".

lat

character. Name of the column with the record latitude in decimal degrees. Defaults to "decimalLatitude.new".

tax.name

character. Name of the column containing the species name. Defaults to "scientificName.new".

geo.name

character. Name of the column containing the validation of the geographical coordinates. Defaults to "geo.check".

cult.name

character. Name of the column containing the validation of records from cultivated individuals. Defaults to "cult.check".

n.min

numerical. Minimum number of unique coordinates to be used in the calculations. Defaults to 6.

center

character. Which metric should be used to obtain the center of the distribution of coordinates: "mean" or "median"? Defaults to "median".

geo.patt

character. The pattern used to search for the classes of geographical validation to be included in the analyses. Defaults to "ok_".

cult.patt

character. The pattern used to search for the classes of validation of cultivated specimens to be included in the analyses. Defaults to NA.

clas.cut

numerical. The threshold distance for outlier detection, using classic Mahalanobis distances. Defaults to 3.

rob.cut

numerical. The threshold distance for outlier detection, using robust Mahalanobis distances. Defaults to 16.

Details

The function searches for spatial outliers using two methods (Liu et al., 2018): the classic and the robust squared Mahalanobis distances (see the help of mahalanobisDist() for details). The two methods can be used separately or combined (see Examples).
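As a rough illustration of the classic method, the sketch below computes squared Mahalanobis distances with base R only. It is not the code used by checkOut() or mahalanobisDist(): the coordinates are hypothetical, the center is taken as the column-wise median (mirroring center = "median"), and the covariance is the ordinary sample covariance. The robust variant would replace that covariance with a robust estimate, which is not shown here.

```r
# Hypothetical coordinates: five clustered records plus one distant record
xy <- cbind(
  lon = c(-42.2, -42.9, -43.4, -44.1, -41.6, -12.2),
  lat = c(-44.3, -43.1, -41.8, -42.6, -44.9, -18.2)
)

# Center of the distribution: column-wise median (cf. center = "median")
ctr <- apply(xy, 2, median)

# Classic (non-robust) sample covariance matrix of the coordinates
sigma <- stats::cov(xy)

# Squared Mahalanobis distance of each record to the chosen center
d2 <- stats::mahalanobis(xy, center = ctr, cov = sigma)

# Index of the record farthest from the center
far <- which.max(d2)
far
```

Note that the distant record also inflates the classic covariance estimate, which is one reason a robust estimate is often used alongside the classic one.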

To detect outliers, thresholds must be applied to the Mahalanobis distances obtained for each species (arguments clas.cut and rob.cut). Ideally these thresholds should be species-specific, but this is not always possible. Based on the empirical distribution of some Atlantic Forest species with very different numbers of occurrences and spatial distribution patterns, Lima et al. (2020) noted that occurrences outside the species ranges often had classic and robust Mahalanobis distances above 3 and 16, respectively (used here as defaults). For cultivated species, they used more restrictive thresholds of 2.5 and 12, respectively. They also mentioned that these thresholds are very conservative (i.e. only the more extreme outliers are removed), so some outliers may remain undetected.
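The way thresholds turn distances into outlier flags can be sketched as follows. The distance values are made up for illustration, and the rule used to combine the two methods (a record is flagged only when it exceeds both thresholds) is an assumption of this sketch, not a statement of how checkOut() combines them internally.

```r
# Hypothetical distances for five records (illustrative values only)
maha.classic <- c(0.8, 1.2, 2.7, 3.4, 5.1)
maha.robust  <- c(1.5, 4.0, 17.2, 9.8, 40.3)

# Default thresholds, as listed in the Usage section
clas.cut <- 3
rob.cut  <- 16

# Flag records exceeding each threshold separately
out.classic <- maha.classic > clas.cut
out.robust  <- maha.robust > rob.cut

# Combined flag: records exceeding both thresholds (assumed rule)
out.both <- out.classic & out.robust
out.both
```

Setting clas.cut <- 2.5 and rob.cut <- 12 in the sketch would apply the more restrictive thresholds reported for cultivated species.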

The detection of outliers may depend on the number of unique coordinates available. Therefore, the detection of spatial outliers is safer when many unique coordinates are available. As a rule of thumb, ten unique coordinates per taxon should avoid possible problems (undetected true outliers or detection of false outliers). See Examples.
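Before running the outlier search, it can help to check how many unique coordinates each taxon has. The sketch below does this in base R. The data frame and its column names (taxon, lon, lat) are hypothetical, and the value of n.min is chosen only for illustration.

```r
# Hypothetical records for two taxa; column names are illustrative
df <- data.frame(
  taxon = c(rep("Species A", 5), rep("Species B", 3)),
  lon = c(-42.2, -42.3, -42.4, -42.3, -42.3, -45.0, -45.1, -45.0),
  lat = c(-44.3, -44.2, -44.2, -42.2, -42.2, -23.0, -23.1, -23.0)
)

# Keep one row per unique taxon/coordinate combination
uniq <- df[!duplicated(df[, c("taxon", "lon", "lat")]), ]

# Number of unique coordinates per taxon
n.uniq <- table(uniq$taxon)

# Taxa with at least n.min unique coordinates (4 here, for illustration)
n.min <- 4
enough <- names(n.uniq)[as.vector(n.uniq) >= n.min]
enough
```

Taxa that do not appear in enough have too few unique coordinates for a reliable search at the chosen n.min.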

Value

The input data frame with a new column containing the indication of spatial outliers.

Author(s)

Renato A. F. de Lima

References

Lima, R.A.F. et al. 2020. Defining endemism levels for biodiversity conservation: Tree species in the Atlantic Forest hotspot. Biological Conservation, 252: 108825.

Liu, C., White, M., and Newell, G. 2018. Detecting outliers in species distribution data. Journal of Biogeography, 45(1): 164-176.

See Also

checkCoord, getCult, mahalanobisDist

Examples


# few data and close coordinates (no outliers)
lon <- c(-42.2,-42.3,-42.4,-42.3,-42.3)
lat <- c(-44.3,-44.2,-44.2,-42.2,-42.2)
df <- data.frame(lon = lon, lat = lat)
checkOut(df, lon = "lon", lat = "lat", n.min = 4)
checkOut(df, lon = "lon", lat = "lat", clas.cut = NULL, n.min = 4)

# some data and one outlier
set.seed(42) # fixed seed (added here) so the random coordinates are reproducible
lon <- c(runif(5, -45, -41), -12.2)
lat <- c(runif(5, -45, -41), -18.2)
df <- data.frame(lon = lon, lat = lat)

checkOut(df, lon = "lon", lat = "lat")
checkOut(df, lon = "lon", lat = "lat", clas.cut = NULL)
checkOut(df, lon = "lon", lat = "lat", rob.cut = NULL)

# more data and one outlier
lon <- c(runif(9, -45, -41), -12.2)
lat <- c(runif(9, -45, -41), -18.2)
df <- data.frame(lon = lon, lat = lat)

checkOut(df, lon = "lon", lat = "lat")
checkOut(df, lon = "lon", lat = "lat", clas.cut = NULL)
checkOut(df, lon = "lon", lat = "lat", rob.cut = NULL)


LimaRAF/plantR documentation built on Jan. 1, 2023, 10:18 a.m.