| multidetect | R Documentation |
The function combines multiple outlier detection methods in an ensemble so that the outliers flagged by each method can be compared.
multidetect(
data,
var,
select = NULL,
output = "outlier",
exclude = NULL,
multiple,
var_col = NULL,
optpar = list(optdf = NULL, ecoparam = NULL, optspcol = NULL, direction = NULL, maxcol
= NULL, mincol = NULL, maxval = NULL, minval = NULL, checkfishbase = FALSE, mode =
NULL, lat = NULL, lon = NULL, pct = 80, warn = FALSE),
kmpar = list(k = 6, method = "silhouette", mode = "soft"),
ifpar = list(cutoff = 0.5, size = 0.7),
mahalpar = list(mode = "soft"),
jkpar = list(mode = "soft"),
zpar = list(type = "mild", mode = "soft"),
gloshpar = list(k = 3, metric = "manhattan", mode = "soft"),
knnpar = list(metric = "manhattan", mode = "soft"),
lofpar = list(metric = "manhattan", mode = "soft", minPts = 10),
methods,
bootSettings = list(run = FALSE, nb = 5, maxrecords = 30, seed = 1135, th = 0.6),
pc = list(exec = FALSE, npc = 2, q = TRUE, pcvar = "PC1"),
verbose = FALSE,
spname = NULL,
warn = FALSE,
missingness = 0.1,
silence_true_errors = TRUE,
sdm = TRUE,
na.inform = FALSE
)
data |
|
var |
|
select |
|
output |
|
exclude |
|
multiple |
|
var_col |
|
optpar |
|
kmpar |
|
ifpar |
|
mahalpar |
|
jkpar |
|
zpar |
|
gloshpar |
|
knnpar |
|
lofpar |
|
methods |
|
bootSettings |
|
pc |
|
verbose |
|
spname |
|
warn |
|
missingness |
|
silence_true_errors |
|
sdm |
logical If the user sets |
na.inform |
|
This function runs several outlier detection methods, including univariate methods, multivariate methods, and species
ecological ranges, so that the outliers flagged by each method can be compared and their overlap assessed.
It can be applied to a single species or to multiple species, supplied as a dataframe or as a list of dataframes.
The outliers can then be extracted using the extract_clean_data function.
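The idea of comparing flags across methods can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper names (flag_iqr, flag_zscore, flag_hampel), the cutoffs, and the agreement score are assumptions made for this example, and only three simple univariate rules are used.

```r
# Illustrative only: base-R stand-ins, not specleanr's internal code.
# Setosa sepal widths plus three introduced outliers (34, 45, 544).
x <- c(iris$Sepal.Width[iris$Species == "setosa"], 34, 45, 544)

# Each helper returns a logical vector: TRUE where a value is flagged.
flag_iqr <- function(v, k = 1.5) {
  q <- unname(stats::quantile(v, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  v < q[1] - fence | v > q[2] + fence
}
flag_zscore <- function(v, cutoff = 3) {
  abs((v - mean(v, na.rm = TRUE)) / stats::sd(v, na.rm = TRUE)) > cutoff
}
flag_hampel <- function(v, cutoff = 3) {
  abs(v - stats::median(v, na.rm = TRUE)) > cutoff * stats::mad(v, na.rm = TRUE)
}

# One column of flags per method, one row per record.
flags <- cbind(iqr = flag_iqr(x), zscore = flag_zscore(x), hampel = flag_hampel(x))

# Share of methods that flag each record (hypothetical agreement score).
agreement <- rowMeans(flags)

# Records flagged by at least one method.
data.frame(value = x, flags, agreement)[agreement > 0, ]
```

In this sketch the z-score rule flags only 544, because that value inflates the standard deviation and masks 34 and 45, while the IQR and Hampel rules flag all three introduced values. Disagreements of this kind are the reason for comparing several methods rather than relying on one.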
A list of outliers or a clean dataset, returned as an object of class datacleaner. The multidetect function
sets the following attributes on the datacleaner object.
result: dataframe or list of dataframes. The outliers flagged by each method.
mode: logical. Whether multiple was set to TRUE or FALSE.
varused: character. The variable used for the univariate outlier detection methods.
out: character. Whether the user requested the outliers or the data without outliers.
methodsused: vector. The methods used in the outlier detection process.
dfname: character. The name of the dataset holding the species records.
exclude: vector. The columns that were excluded during outlier detection, if any.
IUCN Standards and Petitions Committee. (2022). The IUCN Red List of Threatened Species™: Guidelines for Using the IUCN Red List Categories and Criteria. Prepared by the Standards and Petitions Committee of the IUCN Species Survival Commission. https://www.iucnredlist.org/documents/RedListGuidelines.pdf.
Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008, December). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining (pp. 413-422). IEEE.
#====
#1. multidetect for general data analysis using iris data
#===
# the outliers are introduced for testing purposes
irisdata1 <- iris
#introduce outlier data and NAs
rowsOutNA1 <- data.frame(x= c(344, NA,NA, NA),
x2 = c(34, 45, 544, NA),
x3= c(584, 5, 554, NA),
x4 = c(575, 4554,474, NA),
x5 =c('setosa', 'setosa', 'setosa', "setosa"))
colnames(rowsOutNA1) <- colnames(irisdata1)
dfinal <- rbind(irisdata1, rowsOutNA1)
#===========
setosadf <- dfinal[dfinal$Species%in%"setosa",c("Sepal.Width", 'Species')]
setosa_outlier_detection <- multidetect(data = setosadf,
var = 'Sepal.Width',
multiple = FALSE, # one species
methods = c("adjbox", "iqr", "hampel","jknife",
"seqfences", "mixediqr",
"distboxplot", "semiqr",
"zscore", "logboxplot", "medianrule"),
silence_true_errors = FALSE,
missingness = 0.1,
sdm = FALSE,
na.inform = TRUE)
#======
#2.all species
#=====
multspp_outlier_detection <- multidetect(data = dfinal,
var = 'Sepal.Width',
multiple = TRUE, # for multiple species or groups
var_col = "Species",
methods = c("adjbox", "iqr", "hampel","jknife",
"seqfences", "mixediqr",
"distboxplot", "semiqr",
"zscore", "logboxplot", "medianrule"),
silence_true_errors = FALSE,
missingness = 0.1,
sdm = FALSE,
na.inform = TRUE)
ggoutliers(multspp_outlier_detection)
#======
#3. multidetect for environmental data
#======
# species data
data("abdata")
#area of interest
danube <- system.file('extdata/danube.shp.zip', package='specleanr')
db <- sf::st_read(danube, quiet=TRUE)
worldclim <- terra::rast(system.file('extdata/worldclim.tiff', package='specleanr'))
abpred <- pred_extract(data = abdata,
raster= worldclim ,
lat = 'decimalLatitude',
lon= 'decimalLongitude',
colsp = 'species',
bbox = db,
minpts = 10,
list=TRUE,
merge=FALSE)
about_df <- multidetect(data = abpred, multiple = FALSE,
var = 'bio6',
output = 'outlier',
exclude = c('x','y'),
methods = c('zscore', 'adjbox','iqr', 'semiqr','hampel', 'kmeans',
'logboxplot', 'lof','iforest', 'mahal', 'seqfences'))
ggoutliers(about_df)
#==========
#4. For multiple species in species distribution models
#======
data("efidata")
data("jdsdata")
matchdata <- match_datasets(datasets = list(jds = jdsdata, efi=efidata),
lats = 'lat',
lons = 'lon',
species = c('speciesname','scientificName'),
date = c('Date', 'sampling_date'),
country = c('JDS4_site_ID'))
#extract data
rdata <- pred_extract(data = matchdata,
raster= worldclim ,
lat = 'decimalLatitude',
lon= 'decimalLongitude',
colsp = 'species',
bbox = db,
minpts = 10,
list=TRUE,
merge=FALSE)
#detect outliers for each species in the extracted list
multspout_df <- multidetect(data = rdata, multiple = TRUE,
var = 'bio6',
output = 'outlier',
exclude = c('x','y'),
methods = c('zscore', 'adjbox','iqr', 'semiqr','hampel', 'kmeans',
'logboxplot', 'lof','iforest', 'mahal', 'seqfences'))
ggoutliers(multspout_df, "Anguilla anguilla")
#====================================
#5. use optimal ranges as a method
#create species ranges
#===================================
#max temperature of "Thymallus thymallus" is made up to make it appear in outliers
optdata <- data.frame(species= c("Phoxinus phoxinus", "Thymallus thymallus"),
mintemp = c(6, 1.6),maxtemp = c(20, 8.6),
meantemp = c(8.69, 8.4), # ecoparam
direction = c('greater', 'greater'))
ttdata <- rdata["Thymallus thymallus"]
#even for a single species, set multiple to TRUE because the data come from the list returned by pred_extract
thymallus_out_ranges <- multidetect(data = ttdata, multiple = TRUE,
var = 'bio1',
output = 'outlier',
exclude = c('x','y'),
methods = c('zscore', 'adjbox','iqr', 'semiqr','hampel', 'kmeans',
'logboxplot', 'lof','iforest', 'mahal', 'seqfences', 'optimal'),
optpar = list(optdf=optdata, optspcol = 'species',
mincol = "mintemp", maxcol = "maxtemp"))
ggoutliers(thymallus_out_ranges)