Environmental outlier detection with bootstrapping and principal component analysis."
In specleanr: Detecting Environmental Outliers in Data Analysis Pipelines

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"#, error = TRUE
)

library(specleanr)

Environmental outlier check for fish species from the Danube River Basin

The workflow for environmental outlier detection and removal is similar across taxa, regions, or ecological realms. However, we included the check_names() function to cater for fish species names exhaustively. In this worked example, we tried the functionalities on the fish species from the Danube River Basin, with extracts of species records from Joint Danube Survey (JDS) and EFI+ data archived in the package. We complimented the data with fish species occurrences from online sources including Global Biodiversity Information Facility (GBIF), iNaturalist, and VertNET using the getdata() function. They are basically five steps, including: 1) Data acquisition and harmonization; 2) Precleaning and predictor extraction 3) outlier detection 4) identification of clean data and suitable method 5) developing species distribution models (optional).

Scope of application

In this workflow, we provide three approaches that can be used to handle outlier detection, namely 1) the default approach (no bootstrapping and principal component analysis applied); 2) bootstrapping applied during outlier detection mostly for fewer records (user based to set the records) and 3) combining principal component analysis and bootstrapping. Because each approach will significantly affect how records are flagged as outliers, its upon the user to select an approach to use. However, we advise users to apply bootstrapping and PCA if the particular suspicious records are still not handled in the first approach.

1.Data acquisition: a) Collate species species records: offline and online data

The species records were obtained from the archived datasets extracted from the Joint Danube Survey (https://www.danubesurvey.org/jds4/) and EFIPLUS (Logez et al., 2012). To compliment species records, we used the getdata() function to retrieve data from the GBIF (https://www.gbif.org/), VertNet (http://www.vertnet.org/) and iNaturalist ( https://www.inaturalist.org/) for Squalius cephalus, Salmo trutta, Thymallus thymallus, and Anguilla anguilla. For online data, we limited data to 50 records from each data source to reduce on the execution time.

NOTE

This workflow may fail if the particular settings such as blocking of IP address to GIBF, iNaturalist, FishBase, and VertNet may prevent user from accessing the online data. Since, this may differ at user-to-user basis, it is beyond the scope of this package to address such limitations.

#==========================
#Step 1ai. Obtain Local data sources (archived in this package)
#=========================

data(efidata) #Data extract from EFIPLUS data

data(jdsdata) #Data extract from JDS4 data 

#===================================
#Step 1aii: Retrieve online data for the species: polygon to limit the extent to get records.
#=====================================
danube <- sf::st_read(system.file('extdata', "danube.shp.zip",
                                  package = 'specleanr'), quiet=TRUE)


df_online <- getdata(data = c("Squalius cephalus", 'Salmo trutta', 
                              "Thymallus thymallus","Anguilla anguilla"), 
                                extent = danube,
                                gbiflim = 50, 
                                inatlim = 50, 
                                vertlim = 50, 
                     verbose = FALSE)

dim(df_online)

Merging and harmonizing species records from different sources

The online data sources from getdata() and local files are merged using the match_datasets() function. Five columns are harmonized while combining data from different sources: the country, species names, latitude/longitude columns, and dates. The Darwin Core standard names are country, species, decimalLatitude, decimalLongitude, and dates (Wieczorek et al., 2012). So, if the local dataset has a different name for standard names, the user should indicate it. For example, in JDS data, the species column is labeled speciesname, shown in the species parameter for automatic renaming and merging with other datasets. Note: The user should indicate all dataset names in the list. check_names() is used to clean species names in terms of synonyms or spellings, based on FishBase (https://www.fishbase.se/). This function generates another column speciescheck that contain the clean names.

mergealldfs <- match_datasets(datasets = list(efi= efidata, jds = jdsdata, 
                                        onlinedata = df_online),
                country = c('JDS4_sampling_ID'),
                lats = 'lat', lons = 'lon',
                species = c('speciesname', 'scientificName'))

#Species names are re-cleaned since the species names from vertnet are changed.

cleannames_df <- check_names(data = mergealldfs, colsp = 'species', pct = 90, 
                             merge = TRUE, verbose = TRUE)

#Filter out species from clean names df where the species names such as synonyms like Salmo trutta fario chnaged to Slamo trutta

speciesfiltered <- cleannames_df[cleannames_df$speciescheck %in%
                                   c("Squalius cephalus", 'Salmo trutta', 
                                     "Thymallus thymallus","Anguilla anguilla"),]

1. Data acquisition: b) Environmental predictors from worldclim

We used WORLDCLIM data archived in the package to enable users to test the package functions seamlessly. For direct interaction with the WORDCLIM data, please visit (https://www.worldclim.org/) and the geodata package for download in user-customized workflows. WORLDCLIM data has 19 bioclimatic variables (https://www.worldclim.org/data/bioclim.html), including;

BIO1 = Annual Mean Temperature
BIO2 = Mean Diurnal Range (Mean of monthly (max temp - min temp))
BIO3 = Isothermality (BIO2/BIO7) (×100)
BIO4 = Temperature Seasonality (standard deviation ×100)
BIO5 = Max Temperature of Warmest Month
BIO6 = Min Temperature of Coldest Month
BIO7 = Temperature Annual Range (BIO5-BIO6)
BIO8 = Mean Temperature of Wettest Quarter
BIO9 = Mean Temperature of Driest Quarter
BIO10 = Mean Temperature of Warmest Quarter
BIO11 = Mean Temperature of Coldest Quarter
BIO12 = Annual Precipitation
BIO13 = Precipitation of Wettest Month
BIO14 = Precipitation of Driest Month
BIO15 = Precipitation Seasonality (Coefficient of Variation)
BIO16 = Precipitation of Wettest Quarter
BIO17 = Precipitation of Driest Quarter
BIO18 = Precipitation of Warmest Quarter
BIO19 = Precipitation of Coldest Quarter

#Get climatic variables from the package folder

worldclim <- terra::rast(system.file('extdata/worldclim.tiff', package = 'specleanr'))

2. Precleaning and environmental data extraction

Here,

The duplicate records are removed if points they are obtained from the same location for the same species.
The missing values coordinates are removed.
The environmental predictors are extracted from the raster layers (WORLDCLIM).
The user can set the minimum point for the species to be retianed for further analyis.
The bounding box can be set to limit the extent of data extraction. For this case, we used the basin layer for the Danube Basin was obtained from Hydrography90m (https://hydrography.org/hydrography90m/hydrography90m_layers).
The user can either return a dataframe or list of the cleaned data. Important in the next steps.

#Get basin shapefile to delineate the study region: optional

danube <- sf::st_read(system.file('extdata', 'danube.shp.zip', 
                                  package = 'specleanr'), quiet=TRUE)

#For multiple species indicate multiple TRUE
multipreclened <-  pred_extract(data= speciesfiltered, 
                             raster= worldclim, 
                             lat = 'decimalLatitude',
                             lon = 'decimalLongitude',
                             colsp = 'speciescheck',
                             bbox  = danube,  
                             list= TRUE, 
                             minpts = 10, merge = FALSE)
names(multipreclened)


thymallusdata <- speciesfiltered[speciesfiltered[,'speciescheck'] %in%c("Thymallus thymallus"),]

dim(thymallusdata)

thymallus_referencedata <-  pred_extract(data= thymallusdata, raster= worldclim, 
                             lat = 'decimalLatitude',
                             lon = 'decimalLongitude',
                             colsp = 'speciescheck',
                             bbox  = danube,
                             list= TRUE, 
                             minpts = 10)
dim(thymallus_referencedata)

3. Outlier detection for both single and multiple species (No PCA or bootstrapping)

Multiple outlier detection are set. This package contains 20 outlier detection methods and the user can run extractMethods() to get the allowed methods. They are categorized into univariate, multivariate and species ecological ranges. var is the predictor to be used univariate methods. exclude allows to remove predictors that user deems unnecessary. For example, the coordinates, since the multivariate methods consider the whole dataset.

#For multiple species: default settings

multiple_spp_out_detection <- multidetect(data = multipreclened,
                      multiple = TRUE,
                      var = 'bio6',
                     exclude = c('x','y'),
                      methods = c('zscore', 'adjbox',
                                                'logboxplot', 'distboxplot',
                                                'iqr', 'semiqr',
                                                'hampel','kmeans',
                                                'jknife', 'onesvm',
                                                'iforest'))
#single species:default settings

thymallus_outlier_detection <- multidetect(data = thymallus_referencedata,
                      multiple = FALSE,
                      var = 'bio6',
                      output = 'outlier',
                      exclude = c('x','y'),
                      methods = c('zscore', 'adjbox',
                                  'logboxplot', 'distboxplot',
                                  'iqr', 'semiqr',
                                  'hampel','kmeans',
                                  'jknife', 'onesvm',
                                  'iforest'))

4. Outlier visualization for single and multiple species

ggoutliers are based in ggplot2, so it can be modified based on user needs. x: is the output for outlier detection, y is the species name or index for multiple species, and raw = TRUE if the number of outliers are the displayed, otherwise the proportion of outliers to the total number of records will be plotted.

#for multiple species
ggoutliers(multiple_spp_out_detection)

#for single species
ggoutliers(thymallus_outlier_detection)

Identify the best threshold using loess model.

The local regression is used to optimize and identify the best threshold for denoting the point as an absolute outlier. We fit the local region model between the data retained at every threshold, and we identify a maxima when the number of records retain are number of records retained does not significantly vary with an increased increase in the threshold.

thymallus_opt_threshold <- optimal_threshold(refdata = thymallus_referencedata, 
                               outliers = thymallus_outlier_detection, plot = list(plot = TRUE, group = "Thymallus thymallus"))

#obtain the optimal thresholds for multiple species 

multspp_opt_threshold <- optimal_threshold(refdata = multipreclened, 
                                           outliers = multiple_spp_out_detection)

5. Extracting clean data from the reference data (precleaned data in step 2).

The user sets a threshold ranging from 0.1 to 1 but its advisable to set a value above 0.5 to include above 50% of the methods. threshold is the value indicating the proportion of methods to be used to classify a record as a true outlier. For example, a threshold of 0.6 means that at least in the 4 of the 6 methods noted during outlier detection in step 3. We used the loess method for identifying the optimal threshold.

multspecies_clean <- extract_clean_data(refdata = multipreclened, 
                                        outliers = multiple_spp_out_detection, 
                                        loess =  TRUE)
head(multspecies_clean)

thymallus_qcdata <- extract_clean_data(refdata = thymallus_referencedata, 
                             outliers = thymallus_outlier_detection, 
                             loess = TRUE)


multiple_spp_qcdata <- classify_data(refdata = multipreclened, 
                                outliers = multiple_spp_out_detection, 
                                EIF = TRUE)
head(multiple_spp_qcdata)


thymallus_qc_labelled <- classify_data(refdata = thymallus_referencedata, 
                              outliers = thymallus_outlier_detection, 
                              EIF = TRUE)
head(thymallus_qc_labelled)

6. Visualize labeled data in environmental space.

#multiple species 
ggenvironmentalspace(qcdata = multiple_spp_qcdata, 
                     xvar = 'bio1', 
                     yvar = "bio18", 
                     xlab = "Annual mean temperature",
                     ylab = "Precipitation of Warmest Quarter",
                     scalecolor = 'viridis',
                     ncol = 2, 
                     nrow = 2,
                     pointsize = 2)

#for single species
ggenvironmentalspace(qcdata = thymallus_qc_labelled,
                     xvar = 'bio1',
                     yvar = "bio18",
                     xlab = "Annual mean temperature",
                     ylab = "Precipitation of Warmest Quarter",
                     scalecolor = 'viridis',
                     pointsize = 2)

Using bootstrapping during outlier detection

Bootstrapping is a robust approach where the records are randomly sampled with replacement. In this approach, outlier detection is iteratively conducted on bootstrap samples and each record flagged as outlier is weighted based on the total number of bootstraps used. The higher the record is flagged in several across multiple tests, the higher the likelihood of record being an absolute outlier. Although the default number of records at bootstrapping is 30, the maximum number of records can be adjusted by the user as demonstrated below.

Note

Bootstrapping is not implemented by the defualt, so the user has to set it run during outlier detection.

The number of records maxrecords in reference dataset for Thymallus thymallus is 99. Therefore, to implement bootstrapping, indicate the maximum number of records higher than the nrows in reference dataset otherwise bootstrap will not be implemented.
The number of bootstraps, nb are user-defined.
Bootstrapping was conducted on Thymallus thymallus data because there was no proper separation between the moderate and very strong outliers.

thymallus_outlier_boot <- multidetect(data = thymallus_referencedata,
                      multiple = FALSE,
                      var = 'bio6',
                      exclude = c('x','y'),
                      methods = c('zscore', 'adjbox',
                                  'logboxplot', 'distboxplot',
                                  'iqr', 'semiqr',
                                  'hampel','kmeans',
                                  'jknife', 'onesvm',
                                  'iforest'),
                      bootSettings = list(run = TRUE, maxrecords = 100, nb = 10))

Visulise outliers after bootstrapping

ggoutliers(thymallus_outlier_boot)

Classify data to obtain labels

thymallus_qc_label_boot <- classify_data(refdata = thymallus_referencedata, 
                                outliers = thymallus_outlier_boot)

Visualise after bootstrapping

ggenvironmentalspace(qcdata = thymallus_qc_label_boot, 
                     xvar = 'bio1', 
                     yvar = "bio18",
                     xlab = "Annual mean temperature",
                     ylab = "Precipitation of Warmest Quarter",
                     scalecolor = 'viridis',
                     pointsize = 2)

Note

When bootstrapping is applied, the very strong outlier turned into moderate outlier.

Apply principal component analysis and bootstrapping on Thymallus thymallus data.

Principal component analyis is a dimension reduction approach vital for highly multidimensional datasets. The user can decide to apply either PCA and bootstrapping or only one of them.
The number of principal components to be returned are changed using npc argument.
The visualise the cummulation variance captured in the principal components, the argument q is used.

thymallus_outlier_boot_pca <- multidetect(data = thymallus_referencedata,
                      multiple = FALSE,
                      var = 'bio6',
                      exclude = c('x','y'),
                      methods = c('zscore', 'adjbox',
                                  'logboxplot', 'distboxplot',
                                  'iqr', 'semiqr',
                                  'hampel','kmeans',
                                  'jknife', 'onesvm',
                                  'iforest'),
                      bootSettings = list(run = TRUE, maxrecords = 100, nb = 10),
                      pc = list(exec = TRUE, npc = 6, q = FALSE))

Visulise outliers after bootstrapping and bootstrapping

ggoutliers(thymallus_outlier_boot_pca)

Classify data to obtain labels

thymallus_qc_label_boot_pca <- classify_data(refdata = thymallus_referencedata, 
                                outliers = thymallus_outlier_boot_pca)

Visualise after bootstrapping

ggenvironmentalspace(qcdata = thymallus_qc_label_boot_pca, 
                     xvar = 'bio1', 
                     yvar = "bio18",
                     xlab = "Annual mean temperature",
                     ylab = "Precipitation of Warmest Quarter",
                     scalecolor = 'viridis',
                     pointsize = 2)

Notes

Coupling PCA and bootstrapping are robust approaches to handle outlier detection. In this example, moderate outlier turned into poor outliers.

References

Wieczorek, J., Bloom, D., Guralnick, R., Blum, S., Döring, M., Giovanni, R., Robertson, T., & Vieglais, D. (2012). Darwin core: An evolving community-developed biodiversity data standard. PLoS ONE, 7(1). https://doi.org/10.1371/journal.pone.0029715

Any scripts or data that you put into this service are public.

specleanr documentation built on Nov. 26, 2025, 1:07 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

specleanr
Detecting Environmental Outliers in Data Analysis Pipelines

Environmental outlier detection with bootstrapping and principal component analysis."
In specleanr: Detecting Environmental Outliers in Data Analysis Pipelines

Environmental outlier check for fish species from the Danube River Basin

Scope of application

1.Data acquisition: a) Collate species species records: offline and online data

Merging and harmonizing species records from different sources

1. Data acquisition: b) Environmental predictors from worldclim

2. Precleaning and environmental data extraction

3. Outlier detection for both single and multiple species (No PCA or bootstrapping)

4. Outlier visualization for single and multiple species

5. Extracting clean data from the reference data (precleaned data in step 2).

6. Visualize labeled data in environmental space.

Using bootstrapping during outlier detection

Visulise outliers after bootstrapping

Classify data to obtain labels

Visualise after bootstrapping

Apply principal component analysis and bootstrapping on Thymallus thymallus data.

Visulise outliers after bootstrapping and bootstrapping

Classify data to obtain labels

Visualise after bootstrapping

Try the specleanr package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

specleanr Detecting Environmental Outliers in Data Analysis Pipelines

Environmental outlier detection with bootstrapping and principal component analysis." In specleanr: Detecting Environmental Outliers in Data Analysis Pipelines

Environmental outlier check for fish species from the Danube River Basin

Scope of application

1.Data acquisition: a) Collate species species records: offline and online data

Merging and harmonizing species records from different sources

1. Data acquisition: b) Environmental predictors from worldclim

2. Precleaning and environmental data extraction

3. Outlier detection for both single and multiple species (No PCA or bootstrapping)

4. Outlier visualization for single and multiple species

5. Extracting clean data from the reference data (precleaned data in step 2).

6. Visualize labeled data in environmental space.

Using bootstrapping during outlier detection

Visulise outliers after bootstrapping

Classify data to obtain labels

Visualise after bootstrapping

Apply principal component analysis and bootstrapping on Thymallus thymallus data.

Visulise outliers after bootstrapping and bootstrapping

Classify data to obtain labels

Visualise after bootstrapping

Try the specleanr package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

specleanr
Detecting Environmental Outliers in Data Analysis Pipelines

Environmental outlier detection with bootstrapping and principal component analysis."
In specleanr: Detecting Environmental Outliers in Data Analysis Pipelines