#' @title Outlier detection (check for influential observations)
#' @name check_outliers
#'
#' @description Checks for and locates influential observations (i.e.,
#' "outliers") via several distance and/or clustering methods. If several
#' methods are selected, the returned "Outlier" vector will be a composite
#' outlier score, obtained by averaging the binary (0 or 1) results of each
#' method. It represents the probability that each observation is classified
#' as an outlier by at least one method. The default decision rule is to
#' classify as outliers observations whose composite outlier score is greater
#' than or equal to 0.5 (i.e., that were classified as outliers by at least
#' half of the methods). See the **Details** section below for a description
#' of the methods.
#'
#' @param x A model, a data.frame, a `performance_simres` object (as returned
#' by [`simulate_residuals()`]), or a `DHARMa` object.
#' @param method The outlier detection method(s). Can be `"all"` or some of
#' `"cook"`, `"pareto"`, `"zscore"`, `"zscore_robust"`, `"iqr"`, `"ci"`, `"eti"`,
#' `"hdi"`, `"bci"`, `"mahalanobis"`, `"mahalanobis_robust"`, `"mcd"`, `"ics"`,
#' `"optics"` or `"lof"`.
#' @param threshold A list containing the threshold values for each method (e.g.
#' `list('mahalanobis' = 7, 'cook' = 1)`), above which an observation is
#' considered an outlier. If `NULL`, default values are used (see
#' 'Details'). If a single numeric value is given, it is used as the threshold
#' for all methods run.
#' @param ID Optional, to report an ID column along with the row number.
#' @param type Type of method to test for outliers. Can be one of `"default"`,
#' `"binomial"` or `"bootstrap"`. Only applies when `x` is an object returned
#' by `simulate_residuals()` or of class `DHARMa`. See 'Details' in
#' `?DHARMa::testOutliers` for a detailed description of the types.
#' @param verbose Toggle warnings.
#' @param ... When `method = "ics"`, further arguments in `...` are passed
#' down to [ICSOutlier::ics.outlier()]. When `method = "mahalanobis"`,
#' they are passed down to [stats::mahalanobis()]. `percentage_central` can
#' be specified when `method = "mcd"`. For objects of class `performance_simres`
#' or `DHARMa`, further arguments are passed down to `DHARMa::testOutliers()`.
#'
#' @inheritParams check_zeroinflation
#' @inheritParams simulate_residuals
#'
#' @return A logical vector of the detected outliers with a nice printing
#' method: a check (message) on whether outliers were detected or not. The
#' information on the distance measure and whether or not an observation is
#' considered an outlier can be recovered with the [as.data.frame]
#' function. Note that the function will (silently) return a vector of `FALSE`
#' for non-supported data types such as character strings.
#'
#' @family functions to check model assumptions and assess model quality
#'
#' @seealso [`see::plot.see_check_outliers()`] for options to customize the plot.
#'
#' @note There is also a
#' [`plot()`-method](https://easystats.github.io/see/articles/performance.html)
#' implemented in the
#' \href{https://easystats.github.io/see/}{\pkg{see}-package}. **Please
#' note** that the range of the distance-values along the y-axis is re-scaled
#' to range from 0 to 1.
#'
#' @details Outliers can be defined as particularly influential observations.
#' Most methods rely on the computation of some distance metric, and
#' observations greater than a certain threshold are considered outliers.
#' Importantly, outlier detection methods are meant to provide information for
#' the researcher to consider, rather than to be an automated procedure whose
#' mindless application is a substitute for thinking.
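#'
#' As a minimal conceptual sketch of the composite score described above (an
#' illustration, not the package's internal implementation):
#'
#' ```
#' # Hypothetical binary outlier flags (0/1), one column per method:
#' flags <- cbind(zscore = c(0, 1, 1), iqr = c(0, 1, 0), mahalanobis = c(0, 1, 0))
#' composite <- rowMeans(flags) # proportion of methods flagging each observation
#' composite >= 0.5             # default decision rule
#' ```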
#'
#' An **example sentence** for reporting the usage of the composite method
#' could be:
#'
#' *"Based on a composite outlier score (see the 'check_outliers' function
#' in the 'performance' R package; Lüdecke et al., 2021) obtained via the joint
#' application of multiple outlier detection algorithms (Z-scores, Iglewicz,
#' 1993; Interquartile range (IQR); Mahalanobis distance, Cabana, 2019; Robust
#' Mahalanobis distance, Gnanadesikan and Kettenring, 1972; Minimum Covariance
#' Determinant, Leys et al., 2018; Invariant Coordinate Selection, Archimbaud et
#' al., 2018; OPTICS, Ankerst et al., 1999; Isolation Forest, Liu et al. 2008;
#' and Local Outlier Factor, Breunig et al., 2000), we excluded n participants
#' that were classified as outliers by at least half of the methods used."*
#'
#' @section Model-specific methods:
#'
#' - **Cook's Distance**:
#' Among outlier detection methods, Cook's distance and leverage are less
#' common than the basic Mahalanobis distance, but still used. Cook's distance
#' estimates the variations in regression coefficients after removing each
#' observation, one by one (Cook, 1977). Since Cook's distance is in the metric
#' of an F distribution with p and n-p degrees of freedom, the median point of
#' the quantile distribution can be used as a cut-off (Bollen, 1985). A common
#' approximation or heuristic is to use 4 divided by the number of
#' observations, which usually corresponds to a lower threshold (i.e., more
#' outliers are detected). This only works for frequentist models. For Bayesian
#' models, see `pareto`. A code sketch of these cut-offs is given at the end of
#' this section.
#'
#' - **Pareto**:
#' The reliability and approximate convergence of Bayesian models can be
#' assessed using the estimates for the shape parameter k of the generalized
#' Pareto distribution. If the estimated tail shape parameter k exceeds 0.5, the
#' user should be warned, although in practice the authors of the **loo**
#' package observed good performance for values of k up to 0.7 (the default
#' threshold used by `performance`).
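#'
#' As a hedged sketch of the two Cook's distance cut-offs mentioned above (an
#' illustration using base R, not the package's internal code):
#'
#' ```
#' model <- lm(mpg ~ disp + hp, data = mtcars)
#' d <- cooks.distance(model)
#' p <- length(coef(model))     # number of model parameters
#' n <- nobs(model)
#' which(d > qf(0.5, p, n - p)) # median of the F(p, n - p) distribution
#' which(d > 4 / n)             # the more liberal 4/n heuristic
#' ```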
#'
#' @section Univariate methods:
#'
#' - **Z-scores** `("zscore", "zscore_robust")`:
#' The Z-score, or standard score, describes a data point by its deviation
#' from a central value, either in terms of standard deviations from the mean
#' (`"zscore"`) or, as is the case here by default (`"zscore_robust"`;
#' Iglewicz, 1993), in terms of Median Absolute Deviations (MAD) from
#' the median (which are robust measures of dispersion and centrality). The
#' default threshold to classify outliers is 1.959 (`threshold = list("zscore" = 1.959)`),
#' corresponding to the 2.5% (`qnorm(0.975)`) most extreme observations
#' (assuming the data is normally distributed). Importantly, the Z-score
#' method is univariate: it is computed column by column. If a data frame is
#' passed, the Z-score is calculated for each variable separately, and the
#' maximum (absolute) Z-score is kept for each observation. Thus, all
#' observations that are extreme on at least one variable might be detected
#' as outliers. This method is therefore not suited for high-dimensional data
#' (with many columns), returning too liberal results (detecting many outliers).
#' A code sketch of the robust variant is given at the end of this section.
#'
#' - **IQR** `("iqr")`:
#' Using the IQR (interquartile range) is a robust method developed by John
#' Tukey, which often appears in box-and-whisker plots (e.g., in
#' [ggplot2::geom_boxplot]). The interquartile range is the range between the first
#' and the third quartiles. Tukey considered as outliers any data point falling
#' more than 1.5 times the IQR (the default threshold used here is 1.7) below
#' the first quartile or above the third quartile. Similar to the Z-score
#' method, this is a univariate method for outlier detection, returning outliers
#' detected for at least one column, and might thus not be suited to
#' high-dimensional data.
#' The distance score for the IQR is the absolute deviation from the median of
#' the upper and lower IQR thresholds. Then, this value is divided by the IQR
#' threshold, to “standardize” it and facilitate interpretation.
#'
#' - **CI** `("ci", "eti", "hdi", "bci")`:
#' Another univariate method is to compute, for each variable, some sort of
#' "confidence" interval and consider as outliers values lying beyond the edges
#' of that interval. By default, `"ci"` computes the Equal-Tailed Interval
#' (`"eti"`), but other types of intervals are available, such as Highest
#' Density Interval (`"hdi"`) or the Bias Corrected and Accelerated
#' Interval (`"bci"`). The default threshold is `0.95`, considering
#' as outliers all observations that are outside the 95% CI on any of the
#' variables. See [bayestestR::ci()] for more details
#' about the intervals. The distance score for the CI methods is the absolute
#' deviation from the median of the upper and lower CI thresholds. Then, this
#' value is divided by the difference between the upper and lower CI bounds
#' divided by two, to “standardize” it and facilitate interpretation.
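#'
#' As a hedged illustration of the robust Z-score logic described above (a
#' sketch using base R, not the package's internal code):
#'
#' ```
#' v <- mtcars$mpg
#' z_robust <- abs(v - median(v)) / mad(v) # MAD-based standardization
#' which(z_robust > 1.959)                 # observations beyond the default threshold
#' ```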
#'
#' @section Multivariate methods:
#'
#' - **Mahalanobis Distance**:
#' Mahalanobis distance (Mahalanobis, 1930) is often used for multivariate
#' outlier detection, as this distance takes into account the shape of the
#' observations. The default `threshold` is often arbitrarily set to some
#' deviation (in terms of SD or MAD) from the mean (or median) of the
#' Mahalanobis distance. However, as the Mahalanobis distance can be
#' approximated by a chi-squared distribution (Rousseeuw and Van Zomeren, 1990),
#' we can use the alpha quantile of the chi-squared distribution with k degrees
#' of freedom (k being the number of columns). By default, the alpha threshold
#' is set to 0.025 (corresponding to the 2.5\% most extreme observations;
#' Cabana, 2019). This criterion is a natural extension of the median plus or
#' minus a coefficient times the MAD method (Leys et al., 2013). A code sketch
#' is given at the end of this section.
#'
#' - **Robust Mahalanobis Distance**:
#' A robust version of Mahalanobis distance using an Orthogonalized
#' Gnanadesikan-Kettenring pairwise estimator (Gnanadesikan and Kettenring,
#' 1972). Requires the **bigutilsr** package. See the [bigutilsr::dist_ogk()]
#' function.
#'
#' - **Minimum Covariance Determinant (MCD)**:
#' Another robust version of Mahalanobis. Leys et al. (2018) argue that
#' Mahalanobis Distance is not a robust way to determine outliers, as it uses
#' the means and covariances of all the data - including the outliers - to
#' determine individual difference scores. Minimum Covariance Determinant
#' calculates the mean and covariance matrix based on the most central subset of
#' the data (by default, 75\%), before computing the Mahalanobis Distance. This
#' is deemed to be a more robust method of identifying and removing outliers
#' than regular Mahalanobis distance.
#' This method has a `percentage_central` argument that allows specifying
#' the breakdown point (0.75, the default, is recommended by Leys et al. 2018,
#' but a commonly used alternative is 0.50).
#'
#' - **Invariant Coordinate Selection (ICS)**:
#' Outliers are detected using ICS, which by default uses an alpha threshold
#' of 0.025 (corresponding to the 2.5\% most extreme observations) as a cut-off
#' value for outliers classification. Refer to the help-file of
#' [ICSOutlier::ics.outlier()] to get more details about this procedure.
#' Note that `method = "ics"` requires both **ICS** and **ICSOutlier**
#' to be installed, and that it takes some time to compute the results. You
#' can speed up computation time using parallel computing. Set the number of
#' cores to use with `options(mc.cores = 4)` (for example).
#'
#' - **OPTICS**:
#' The Ordering Points To Identify the Clustering Structure (OPTICS) algorithm
#' (Ankerst et al., 1999) uses concepts similar to DBSCAN (an unsupervised
#' clustering technique that can be used for outlier detection). The threshold
#' argument is passed as `minPts`, which corresponds to the minimum size
#' of a cluster. By default, this size is set at 2 times the number of columns
#' (Sander et al., 1998). Compared to the other techniques, which will always
#' detect several outliers (as these are usually defined as a percentage of
#' extreme values), this algorithm functions in a different manner and won't
#' always detect outliers. Note that `method = "optics"` requires the
#' **dbscan** package to be installed, and that it takes some time to compute
#' the results. Additionally, the `optics_xi` argument (which defaults to 0.05)
#' is passed to the [dbscan::extractXi()] function to further refine the
#' cluster selection.
#'
#' - **Local Outlier Factor**:
#' Based on a K nearest neighbors algorithm, LOF compares the local density of
#' a point to the local densities of its neighbors instead of computing a
#' distance from the center (Breunig et al., 2000). Points that have a
#' substantially lower density than their neighbors are considered outliers. A
#' LOF score of approximately 1 indicates that density around the point is
#' comparable to that of its neighbors. Scores significantly larger than 1
#' indicate outliers. The default threshold of 0.025 will classify as outliers
#' the observations located beyond `qnorm(1 - 0.025) * SD` of the
#' log-transformed LOF distance. Requires the **dbscan** package.
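#'
#' As a hedged illustration of the (non-robust) Mahalanobis method described
#' above (a sketch using base R, not the package's internal code):
#'
#' ```
#' d <- mtcars[, c("mpg", "disp", "hp")]
#' md <- mahalanobis(d, center = colMeans(d), cov = cov(d))
#' which(md > qchisq(p = 1 - 0.025, df = ncol(d))) # alpha = 0.025
#' ```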
#'
#' @section Methods for simulated residuals:
#'
#' The approach for detecting outliers based on simulated residuals differs
#' from the traditional methods and may not detect outliers as expected.
#' Essentially, this approach compares observed to simulated values. However, we
#' do not know the deviation of the observed data from the model expectation, and
#' thus, the term "outlier" should be taken with a grain of salt. It refers to
#' "simulation outliers". Basically, the comparison tests whether an observed
#' data point is outside the simulated range. It is strongly recommended to read
#' the related documentation in the **DHARMa** package, e.g. `?DHARMa::testOutliers`.
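#'
#' A minimal usage sketch (assuming a model class that is supported by
#' `simulate_residuals()`):
#'
#' ```
#' model <- glm(vs ~ mpg, data = mtcars, family = "binomial")
#' simres <- simulate_residuals(model)
#' check_outliers(simres, type = "binomial")
#' ```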
#'
#' @section Threshold specification:
#'
#' Default thresholds are currently specified as follows:
#'
#' ```
#' list(
#' zscore = stats::qnorm(p = 1 - 0.001 / 2),
#' zscore_robust = stats::qnorm(p = 1 - 0.001 / 2),
#' iqr = 1.7,
#' ci = 1 - 0.001,
#' eti = 1 - 0.001,
#' hdi = 1 - 0.001,
#' bci = 1 - 0.001,
#' cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
#' pareto = 0.7,
#' mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
#' mahalanobis_robust = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
#' mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
#' ics = 0.001,
#' optics = 2 * ncol(x),
#' optics_xi = 0.05,
#' lof = 0.001
#' )
#' ```
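#'
#' Individual defaults can be overridden by passing a named list; methods not
#' listed keep their default threshold. For example:
#'
#' ```
#' check_outliers(mtcars[, 1:4], method = c("zscore_robust", "iqr"),
#'                threshold = list(zscore_robust = 2, iqr = 2))
#' ```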
#'
#' @section Meta-analysis models:
#' For meta-analysis models (e.g. objects of class `rma` from the *metafor*
#' package or `metagen` from package *meta*), studies are defined as outliers
#' when their confidence interval lies outside the confidence interval of the
#' pooled effect.
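#'
#' A minimal usage sketch for a meta-analysis model (assuming the **metafor**
#' package and its `dat.bcg` example data are available):
#'
#' ```
#' dat <- metafor::escalc(measure = "RR", ai = tpos, bi = tneg,
#'                        ci = cpos, di = cneg, data = metafor::dat.bcg)
#' model <- metafor::rma(yi, vi, data = dat)
#' check_outliers(model)
#' ```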
#'
#' @references
#' - Archimbaud, A., Nordhausen, K., and Ruiz-Gazen, A. (2018). ICS for
#' multivariate outlier detection with application to quality control.
#' *Computational Statistics and Data Analysis*, *128*, 184-199.
#' \doi{10.1016/j.csda.2018.06.011}
#'
#' - Gnanadesikan, R., and Kettenring, J. R. (1972). Robust estimates, residuals,
#' and outlier detection with multiresponse data. *Biometrics*, 81-124.
#'
#' - Bollen, K. A., and Jackman, R. W. (1985). Regression diagnostics: An
#' expository treatment of outliers and influential cases. *Sociological Methods
#' and Research*, *13*(4), 510-542.
#'
#' - Cabana, E., Lillo, R. E., and Laniado, H. (2019). Multivariate outlier
#' detection based on a robust Mahalanobis distance with shrinkage estimators.
#' arXiv preprint arXiv:1904.02596.
#'
#' - Cook, R. D. (1977). Detection of influential observation in linear
#' regression. *Technometrics*, *19*(1), 15-18.
#'
#' - Iglewicz, B., and Hoaglin, D. C. (1993). How to detect and handle outliers
#' (Vol. 16). Asq Press.
#'
#' - Leys, C., Klein, O., Dominicy, Y., and Ley, C. (2018). Detecting
#' multivariate outliers: Use a robust variant of Mahalanobis distance. *Journal
#' of Experimental Social Psychology*, 74, 150-156.
#'
#' - Liu, F. T., Ting, K. M., and Zhou, Z. H. (2008, December). Isolation forest.
#' In 2008 Eighth IEEE International Conference on Data Mining (pp. 413-422).
#' IEEE.
#'
#' - Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., and Makowski, D.
#' (2021). performance: An R package for assessment, comparison and testing of
#' statistical models. *Journal of Open Source Software*, *6*(60), 3139.
#' \doi{10.21105/joss.03139}
#'
#' - Thériault, R., Ben-Shachar, M. S., Patil, I., Lüdecke, D., Wiernik, B. M.,
#' and Makowski, D. (2023). Check your outliers! An introduction to identifying
#' statistical outliers in R with easystats. *Behavior Research Methods*, 1-11.
#' \doi{10.3758/s13428-024-02356-w}
#'
#' - Rousseeuw, P. J., and Van Zomeren, B. C. (1990). Unmasking multivariate
#' outliers and leverage points. *Journal of the American Statistical
#' association*, *85*(411), 633-639.
#'
#' @examples
#' data <- mtcars # Size nrow(data) = 32
#'
#' # For single variables ------------------------------------------------------
#' outliers_list <- check_outliers(data$mpg) # Find outliers
#' outliers_list # Show the row index of the outliers
#' as.numeric(outliers_list) # The object is a binary vector...
#' filtered_data <- data[!outliers_list, ] # And can be used to filter a data frame
#' nrow(filtered_data) # New size, 28 (4 outliers removed)
#'
#' # Find all observations beyond +/- 2 SD
#' check_outliers(data$mpg, method = "zscore", threshold = 2)
#'
#' # For dataframes ------------------------------------------------------
#' check_outliers(data) # It works the same way on data frames
#'
#' # You can also use multiple methods at once
#' outliers_list <- check_outliers(data, method = c(
#' "mahalanobis",
#' "iqr",
#' "zscore"
#' ))
#' outliers_list
#'
#' # Using `as.data.frame()`, we can access more details!
#' outliers_info <- as.data.frame(outliers_list)
#' head(outliers_info)
#' outliers_info$Outlier # Including the probability of being an outlier
#'
#' # And we can be more stringent in our outliers removal process
#' filtered_data <- data[outliers_info$Outlier < 0.1, ]
#'
#' # We can run the function stratified by groups using `{datawizard}` package:
#' group_iris <- datawizard::data_group(iris, "Species")
#' check_outliers(group_iris)
#' # nolint start
#' @examplesIf all(insight::check_if_installed(c("bigutilsr", "MASS", "ICSOutlier", "ICS", "dbscan", "loo", "see"), quietly = TRUE))
#' # nolint end
#' \donttest{
#' # You can also run all the methods
#' check_outliers(data, method = "all", verbose = FALSE)
#'
#' # For statistical models ---------------------------------------------
#' # select only mpg and disp (continuous)
#' mt1 <- mtcars[, c(1, 3, 4)]
#' # create some fake outliers and attach outliers to main df
#' mt2 <- rbind(mt1, data.frame(
#' mpg = c(37, 40), disp = c(300, 400),
#' hp = c(110, 120)
#' ))
#' # fit model with outliers
#' model <- lm(disp ~ mpg + hp, data = mt2)
#'
#' outliers_list <- check_outliers(model)
#' plot(outliers_list)
#'
#' insight::get_data(model)[outliers_list, ] # Show outliers data
#' }
#' @export
check_outliers <- function(x, ...) {
UseMethod("check_outliers")
}
#' @export
check_outliers.character <- function(x, ...) {
rep(0, length(x))
}
# default ---------------------
#' @rdname check_outliers
#' @export
check_outliers.default <- function(x,
method = c("cook", "pareto"),
threshold = NULL,
ID = NULL,
verbose = TRUE,
...) {
# Check args
if (all(method == "all")) {
method <- c(
"zscore_robust", "iqr", "ci", "cook", "pareto", "mahalanobis",
"mahalanobis_robust", "mcd", "ics", "optics", "lof"
)
}
method <- match.arg(
method,
c(
"zscore", "zscore_robust", "iqr", "ci", "hdi", "eti", "bci", "cook",
"pareto", "mahalanobis", "mahalanobis_robust", "mcd", "ics", "optics", "lof"
),
several.ok = TRUE
)
# Get model information
m_info <- insight::model_info(x)
# Get data
my_data <- insight::get_data(x, verbose = FALSE)
# sanity check for date, POSIXt and difftime variables
if (any(vapply(my_data, inherits, FUN.VALUE = logical(1), what = c("Date", "POSIXt", "difftime"))) && verbose) {
insight::format_alert(
paste(
"Date variables are not supported for outliers detection. These will be ignored.",
"Make sure any date variables are converted to numeric or factor {.b before} fitting the model."
)
)
}
# Remove non-numerics, but in case of binomial, only check predictors
if (m_info$is_binomial) {
model_predictors <- unique(insight::find_predictors(x, flatten = TRUE))
} else {
model_predictors <- colnames(my_data)
}
my_data <- datawizard::data_select(
my_data[model_predictors],
select = is.numeric,
verbose = FALSE
)
# check if any data left
if (is.null(my_data) || ncol(my_data) == 0) {
insight::format_error("No numeric variables found. No data to check for outliers.")
}
# Thresholds
if (is.null(threshold)) {
thresholds <- .check_outliers_thresholds(my_data)
} else if (is.list(threshold)) {
thresholds <- .check_outliers_thresholds(my_data)
thresholds[names(threshold)] <- threshold[names(threshold)]
} else {
insight::format_error(
paste(
"The `threshold` argument must be NULL (for default values) or a list containing",
"threshold values for desired methods (e.g., `list('mahalanobis' = 7)`)."
)
)
}
if (!missing(ID) && verbose) {
insight::format_warning(paste0("ID argument not supported for model objects of class `", class(x)[1], "`."))
}
# Others
if (all(method %in% c("cook", "pareto"))) {
my_df <- data.frame(Row = seq_len(nrow(as.data.frame(my_data))))
outlier_count <- list()
outlier_var <- list()
} else {
out <- check_outliers(my_data, method, threshold)
outlier_var <- attributes(out)$outlier_var
outlier_count <- attributes(out)$outlier_count
my_df <- attributes(out)$data
my_df <- my_df[names(my_df) != "Outlier"]
}
# Cook
if ("cook" %in% method && !m_info$is_bayesian && !inherits(x, "bife")) {
data_cook <- .check_outliers_cook(
x,
threshold = thresholds$cook
)$data_cook
my_df <- datawizard::data_merge(list(my_df, data_cook),
join = "full",
by = "Row"
)
count.table <- datawizard::data_filter(
data_cook, "Outlier_Cook > 0.5"
)
count.table <- datawizard::data_remove(
count.table, "Cook",
regex = TRUE, as_data_frame = TRUE
)
if (nrow(count.table) >= 1) {
count.table$n_Cook <- "(Multivariate)"
}
outlier_count$cook <- count.table
if (all(method %in% c("cook", "pareto"))) {
outlier_count$all <- count.table
} else {
outlier_count$all <- datawizard::data_merge(
list(outlier_count$all, count.table),
join = "full",
by = "Row"
)
}
} else {
method <- method[method != "cook"]
}
# Pareto
if ("pareto" %in% method && m_info$is_bayesian) {
data_pareto <- .check_outliers_pareto(
x,
threshold = thresholds$pareto
)$data_pareto
my_df <- datawizard::data_merge(list(my_df, data_pareto),
join = "full",
by = "Row"
)
count.table <- datawizard::data_filter(
data_pareto, "Outlier_Pareto > 0.5"
)
count.table <- datawizard::data_remove(
count.table, "Pareto",
regex = TRUE, as_data_frame = TRUE
)
if (nrow(count.table) >= 1) {
count.table$n_Pareto <- "(Multivariate)"
}
outlier_count$pareto <- count.table
if (all(method %in% c("cook", "pareto"))) {
outlier_count$all <- count.table
} else {
outlier_count$all <- datawizard::data_merge(
list(outlier_count$all, count.table),
join = "full",
by = "Row"
)
}
} else {
method <- method[method != "pareto"]
}
outlier_count$all <- datawizard::convert_na_to(outlier_count$all,
replace_num = 0,
replace_char = "0",
replace_fac = 0
)
num.df <- outlier_count$all[!names(outlier_count$all) %in% c("Row", ID)]
if (isTRUE(nrow(num.df) > 0)) {
num.df <- datawizard::recode_values(
num.df,
recode = list(`2` = "(Multivariate)")
)
num.df <- as.data.frame(lapply(num.df, as.numeric))
outlier_count$all$max <- apply(num.df, 1, max)
outlier_count$all <- datawizard::data_filter(
outlier_count$all,
max >= 2
)
outlier_count$all <- datawizard::data_remove(
outlier_count$all,
"max"
)
}
row.names(outlier_count$all) <- NULL
thresholds <- thresholds[names(thresholds) %in% method]
# Composite outlier score
my_df$Outlier <- rowMeans(my_df[grepl("Outlier_", names(my_df), fixed = TRUE)])
my_df <- my_df[c(names(my_df)[names(my_df) != "Outlier"], "Outlier")]
# Out
outlier <- my_df$Outlier > 0.5
# Attributes
class(outlier) <- c("check_outliers", "see_check_outliers", class(outlier))
attr(outlier, "data") <- my_df
attr(outlier, "threshold") <- thresholds
attr(outlier, "method") <- method
attr(outlier, "text_size") <- 3
attr(outlier, "influential_obs") <- .safe(.model_diagnostic_outlier(x, threshold = unlist(thresholds))) # nolint
attr(outlier, "variables") <- "(Whole model)"
attr(outlier, "raw_data") <- my_data
attr(outlier, "outlier_var") <- outlier_var
attr(outlier, "outlier_count") <- outlier_count
outlier
}
# Methods -----------------------------------------------------------------
#' @export
as.data.frame.check_outliers <- function(x, ...) {
attributes(x)$data
}
#' @export
as.double.check_outliers <- function(x, ...) {
attributes(x)$data$Outlier
}
#' @export
print.check_outliers <- function(x, ...) {
outliers <- which(x)
method <- attr(x, "method")
round_to_last_digit <- function(x, n = 3) {
max(abs(round(x, n)), abs(signif(x, 1))) * sign(x)
}
thresholds <- lapply(attr(x, "threshold"), round_to_last_digit, 3)
method.thresholds <- data.frame(
method = method,
thresholds = unlist(thresholds)
)
method.thresholds <- paste0(method.thresholds$method, " (",
method.thresholds$thresholds, ")",
collapse = ", "
)
method.univariate <- c(
"zscore", "zscore_robust", "iqr", "ci",
"eti", "hdi", "bci"
)
vars <- toString(attr(x, "variables"))
vars.outliers <- attr(x, "outlier_var")
var.plural <- ifelse(length(attr(x, "variables")) > 1,
"variables", "variable"
)
method.plural <- ifelse(length(thresholds) > 1,
"methods and thresholds",
"method and threshold"
)
long_dash <- paste0("\n", strrep("-", 77), "\n")
if (length(outliers) > 1) {
outlier.plural <- "outliers"
case.plural <- "cases"
} else {
outlier.plural <- "outlier"
case.plural <- "case"
}
if (length(outliers) >= 1) {
outlier.count <- attr(x, "outlier_count")
o <- toString(outliers)
insight::print_color(insight::format_message(
sprintf(
"%i %s detected: %s %s.", length(outliers),
outlier.plural, case.plural, o
),
sprintf(
"- Based on the following %s: %s.",
method.plural, method.thresholds
),
sprintf("- For %s: %s.\n", var.plural, vars),
indent = ""
), color = "yellow")
if (length(method) > 1) {
insight::print_color(
c(
"\nNote: Outliers were classified as such by",
"at least half of the selected methods. \n"
), "yellow"
)
}
if ((isTRUE(nrow(outlier.count$all) > 0) || isTRUE(attributes(x)$grouped)) &&
(length(method) > 1 || all(method %in% method.univariate))) {
cat(long_dash, insight::format_message(
"\nThe following observations were considered outliers for two or more",
"variables by at least one of the selected methods:\n\n"
))
ifelse(isTRUE(attributes(x)$grouped),
print(lapply(outlier.count, function(x) x$all)),
print(outlier.count$all)
)
}
if (length(method) == 1 && all(method %in% method.univariate)) {
cat(long_dash, "Outliers per variable (", method,
"): \n\n",
sep = ""
)
ifelse(isTRUE(attributes(x)$grouped),
print(vars.outliers),
print(vars.outliers[[1]])
)
}
} else {
insight::print_color(
sprintf("OK: No outliers detected.
- Based on the following %s: %s.
- For %s: %s\n\n", method.plural, method.thresholds, var.plural, vars),
"green"
)
}
invisible(x)
}
#' @export
print.check_outliers_metafor <- function(x, ...) {
outliers <- which(x)
round_to_last_digit <- function(x, n = 3) {
max(abs(round(x, n)), abs(signif(x, 1))) * sign(x)
}
thresholds <- lapply(attr(x, "threshold"), round_to_last_digit, 3)
studies <- attr(x, "outlier_var")
if (length(outliers) > 1) {
outlier.plural <- "outliers"
case.plural <- "studies"
} else {
outlier.plural <- "outlier"
case.plural <- "study"
}
if (length(outliers) >= 1) {
if (all(as.character(studies) == as.character(outliers))) {
o <- datawizard::text_concatenate(outliers)
} else {
o <- datawizard::text_concatenate(paste0(outliers, " (", studies, ")"))
}
insight::print_color(insight::format_message(
sprintf("%i %s detected: %s %s.\n", length(outliers), outlier.plural, case.plural, o)
), "yellow")
} else {
insight::print_color("OK: No outliers detected.\n", "green")
}
invisible(x)
}
#' @export
print.check_outliers_metagen <- function(x, ...) {
outliers_fixed <- which(x$fixed)
outliers_random <- which(x$random)
studies <- attr(x, "studies")
if (length(outliers_fixed) > 1) {
outlier.plural <- "outliers"
case.plural <- "studies"
} else {
outlier.plural <- "outlier"
case.plural <- "study"
}
if (length(outliers_fixed) >= 1) {
if (all(as.character(studies[outliers_fixed]) == as.character(outliers_fixed))) {
o <- datawizard::text_concatenate(outliers_fixed)
} else {
o <- datawizard::text_concatenate(paste0(outliers_fixed, " (", studies[outliers_fixed], ")"))
}
insight::print_color(insight::format_message(
sprintf("- %i %s in fixed effects detected: %s %s.\n", length(outliers_fixed), outlier.plural, case.plural, o)
), "yellow")
}
if (length(outliers_random) > 1) {
outlier.plural <- "outliers"
case.plural <- "cases"
} else {
outlier.plural <- "outlier"
case.plural <- "case"
}
if (length(outliers_random) >= 1) {
if (all(as.character(studies[outliers_random]) == as.character(outliers_random))) {
o <- datawizard::text_concatenate(outliers_random)
} else {
o <- datawizard::text_concatenate(paste0(outliers_random, " (", studies[outliers_random], ")"))
}
if (length(outliers_fixed) >= 1) {
cat("\n")
}
insight::print_color(insight::format_message(
sprintf("- %i %s in random effects detected: %s %s.\n", length(outliers_random), outlier.plural, case.plural, o)
), "yellow")
}
if (!length(outliers_random) && !length(outliers_fixed)) {
insight::print_color("OK: No outliers detected.\n", "green")
}
invisible(x)
}
#' @export
plot.check_outliers <- function(x, ...) {
insight::check_if_installed("see", "to plot outliers")
NextMethod()
}
#' @export
print.check_outliers_simres <- function(x, digits = 2, ...) {
result <- paste0(
insight::format_value(100 * x$Expected, digits = digits, ...),
"%, ",
insight::format_ci(100 * x$CI_low, 100 * x$CI_high, digits = digits, ...)
)
insight::print_color("# Outliers detection\n\n", "blue")
cat(sprintf(" Proportion of observed outliers: %.*f%%\n", digits, 100 * x$Coefficient))
cat(sprintf(" Proportion of expected outliers: %s\n\n", result))
p_string <- paste0(" (", insight::format_p(x$p_value), ")")
if (x$p_value < 0.05) {
message("Outliers were detected", p_string, ".")
} else {
message("No outliers were detected", p_string, ".")
}
invisible(x)
}
# other classes -------------------------
#' @rdname check_outliers
#' @export
check_outliers.numeric <- function(x,
method = "zscore_robust",
threshold = NULL,
...) {
x <- as.data.frame(x)
names(x) <- datawizard::text_remove(sys.call()[2], "()")
check_outliers(x,
method = method,
threshold = threshold,
...
)
}
#' @rdname check_outliers
#' @export
check_outliers.data.frame <- function(x,
method = "mahalanobis",
threshold = NULL,
ID = NULL,
...) {
# Preserve ID column if desired
ID.names <- switch(!is.null(ID),
x[ID]
)
# Remove non-numerics
my_data <- x
x <- x[, vapply(x, is.numeric, logical(1)), drop = FALSE]
# Check args
if (all(method == "all")) {
method <- c(
"zscore_robust", "iqr", "ci", "cook", "pareto", "mahalanobis",
"mahalanobis_robust", "mcd", "ics", "optics", "lof"
)
}
method <- match.arg(method, c(
"zscore", "zscore_robust", "iqr", "ci", "hdi",
"eti", "bci", "cook", "pareto", "mahalanobis",
"mahalanobis_robust", "mcd", "ics", "optics",
"lof"
), several.ok = TRUE)
# Thresholds
if (is.null(threshold)) {
thresholds <- .check_outliers_thresholds(x)
} else if (is.list(threshold)) {
thresholds <- .check_outliers_thresholds(x)
thresholds[names(threshold)] <- threshold[names(threshold)]
} else if (is.numeric(threshold)) {
thresholds <- .check_outliers_thresholds(x)
thresholds <- lapply(thresholds, function(x) threshold)
# need to fix this manually - "optics" automatically includes method
# "optics_xi", which is allowed to range between 0 and 1 - since values
# for "optics" can be > 1, it might overwrite "optics_xi" with an invalid
# value...
if (thresholds$optics_xi > 1) {
thresholds$optics_xi <- 0.05
}
} else {
insight::format_error(
paste(
"The `threshold` argument must be NULL (for default values) or a list containing",
"threshold values for desired methods (e.g., `list('mahalanobis' = 7)`)."
)
)
}
# Keep only relevant threshold
valid <- method
if ("optics" %in% valid) {
valid <- c(valid, "optics_xi")
method <- c(method, "optics_xi")
}
thresholds <- thresholds[names(thresholds) %in% valid]
out.meta <- .check_outliers.data.frame_method(x, method, thresholds, ID, ID.names, ...)
out <- out.meta$out
outlier_count <- out.meta$outlier_count
outlier_var <- out.meta$outlier_var
# Combine outlier data
my_df <- out[vapply(out, is.data.frame, logical(1))]
if (length(my_df) > 1 && !is.null(ID)) {
my_df <- datawizard::data_merge(my_df, by = c("Row", ID))
} else if (length(my_df) > 1) {
my_df <- datawizard::data_merge(my_df, by = "Row")
} else {
my_df <- my_df[[1]]
}
# Composite outlier score
my_df$Outlier <- rowMeans(my_df[grepl("Outlier_", names(my_df), fixed = TRUE)])
# Out
outlier <- my_df$Outlier > 0.5
# Combine outlier frequency table
if (length(outlier_count) > 1 && !is.null(ID)) {
outlier_count$all <- datawizard::data_merge(outlier_count,
join = "full",
by = c("Row", ID)
)
} else if (length(outlier_count) > 1) {
outlier_count$all <- datawizard::data_merge(outlier_count,
join = "full",
by = "Row"
)
} else if (length(outlier_count) == 1) {
outlier_count$all <- outlier_count[[1]]
} else {
outlier_count$all <- data.frame()
}
outlier_count$all <- datawizard::convert_na_to(outlier_count$all,
replace_num = 0,
replace_char = "0",
replace_fac = 0
)
outlier_count <- lapply(outlier_count, function(x) {
num.df <- x[!names(x) %in% c("Row", ID)]
if (isTRUE(nrow(num.df) >= 1)) {
num.df <- datawizard::recode_values(
num.df,
recode = list(`2` = "(Multivariate)")
)
num.df <- as.data.frame(lapply(num.df, as.numeric))
x$max <- apply(num.df, 1, max)
x <- datawizard::data_filter(x, max >= 2)
x <- datawizard::data_remove(x, "max")
}
x
})
row.names(outlier_count$all) <- NULL
# Attributes
class(outlier) <- c("check_outliers", "see_check_outliers", class(outlier))
attr(outlier, "data") <- my_df
attr(outlier, "threshold") <- thresholds
attr(outlier, "method") <- method
attr(outlier, "text_size") <- 3
attr(outlier, "variables") <- names(x)
attr(outlier, "raw_data") <- my_data
attr(outlier, "outlier_var") <- outlier_var
attr(outlier, "outlier_count") <- outlier_count
outlier
}
.check_outliers.data.frame_method <- function(x, method, thresholds, ID, ID.names, ...) {
# Clean up per-variable list of outliers
process_outlier_list <- function(outlier.list, Outlier_method) {
outlier.list <- lapply(outlier.list, "[[", 1)
outlier.list <- lapply(outlier.list, function(x) {
x[x[[Outlier_method]] >= 0.5, ]
})
outlier.list <- outlier.list[vapply(outlier.list, nrow, numeric(1)) > 0]
outlier.list <- lapply(outlier.list, datawizard::data_remove,
Outlier_method,
as_data_frame = TRUE
)
outlier.list
}
# Count table of repeated outliers (for several variables)
count_outlier_table <- function(outlier.list) {
count.table <- do.call(rbind, outlier.list)
name.method <- grep("Distance_", names(count.table), value = TRUE, fixed = TRUE)
name.method <- paste0("n_", gsub("Distance_", "", name.method, fixed = TRUE))
if (isTRUE(nrow(count.table) > 0)) {
count.values <- rle(sort(count.table$Row))
count.table <- data.frame(Row = count.values$values)
if (!is.null(ID)) {
count.table[ID] <- ID.names[count.table$Row, ]
}
count.table <- cbind(count.table, val = count.values$lengths)
names(count.table)[names(count.table) == "val"] <- name.method
count.table <- count.table[order(-count.table[[name.method]]), ]
}
count.table
}
# Preparation
out <- list()
outlier_var <- list()
outlier_count <- list()
# Z-score
if ("zscore" %in% method) {
out <- c(out, .check_outliers_zscore(
x,
threshold = thresholds$zscore,
robust = FALSE,
method = "max",
ID.names = ID.names
))
# Outliers per variable
zscore.var <- lapply(
x,
.check_outliers_zscore,
threshold = thresholds$zscore,
robust = FALSE,
method = "max",
ID.names = ID.names
)
outlier_var$zscore <- process_outlier_list(zscore.var, "Outlier_Zscore")
outlier_count$zscore <- count_outlier_table(outlier_var$zscore)
}
if ("zscore_robust" %in% method) {
out <- c(out, .check_outliers_zscore(
x,
threshold = thresholds$zscore_robust,
robust = TRUE,
method = "max",
ID.names = ID.names
))
# Outliers per variable
zscore_robust.var <- lapply(x, .check_outliers_zscore,
threshold = thresholds$zscore_robust,
robust = TRUE, method = "max", ID.names = ID.names
)
outlier_var$zscore_robust <- process_outlier_list(
zscore_robust.var, "Outlier_Zscore_robust"
)
outlier_count$zscore_robust <- count_outlier_table(
outlier_var$zscore_robust
)
}
# IQR
if ("iqr" %in% method) {
out <- c(out, .check_outliers_iqr(
x,
threshold = thresholds$iqr,
method = "tukey",
ID.names = ID.names
))
# Outliers per variable
iqr.var <- lapply(x, function(x) {
y <- as.data.frame(x)
.check_outliers_iqr(
y,
threshold = thresholds$iqr,
method = "tukey",
ID.names = ID.names
)
})
outlier_var$iqr <- process_outlier_list(iqr.var, "Outlier_IQR")
outlier_count$iqr <- count_outlier_table(outlier_var$iqr)
}
# CI
if (any(c("ci", "hdi", "eti", "bci") %in% method)) {
for (i in method[method %in% c("ci", "hdi", "eti", "bci")]) {
out <- c(out, .check_outliers_ci(
x,
threshold = thresholds[i],
method = i,
ID.names = ID.names
))
# Outliers per variable
loop.var <- lapply(x, function(x) {
y <- as.data.frame(x)
.check_outliers_ci(
y,
threshold = thresholds[i],
method = i,
ID.names = ID.names
)
})
outlier_var[[i]] <- process_outlier_list(
loop.var, paste0("Outlier_", i)
)
outlier_count[[i]] <- count_outlier_table(outlier_var[[i]])
}
}
# Mahalanobis
if ("mahalanobis" %in% method) {
out <- c(out, .check_outliers_mahalanobis(
x,
threshold = thresholds$mahalanobis,
ID.names = ID.names,
...
))
count.table <- datawizard::data_filter(
out$data_mahalanobis, "Outlier_Mahalanobis > 0.5"
)
count.table <- datawizard::data_remove(
count.table, "Mahalanobis",
regex = TRUE, as_data_frame = TRUE
)
if (nrow(count.table) >= 1) {
count.table$n_Mahalanobis <- "(Multivariate)"
}
outlier_count$mahalanobis <- count.table
}
# Robust Mahalanobis
if ("mahalanobis_robust" %in% method) {
out <- c(out, .check_outliers_mahalanobis_robust(
x,
threshold = thresholds$mahalanobis_robust,
ID.names = ID.names
))
count.table <- datawizard::data_filter(
out$data_mahalanobis_robust, "Outlier_Mahalanobis_robust > 0.5"
)
count.table <- datawizard::data_remove(
count.table, "Mahalanobis",
regex = TRUE, as_data_frame = TRUE
)
if (nrow(count.table) >= 1) {
count.table$n_Mahalanobis_robust <- "(Multivariate)"
}
outlier_count$mahalanobis_robust <- count.table
}
# MCD
if ("mcd" %in% method) {
out <- c(out, .check_outliers_mcd(
x,
threshold = thresholds$mcd,
ID.names = ID.names,
...
))
count.table <- datawizard::data_filter(
out$data_mcd, "Outlier_MCD > 0.5"
)
count.table <- datawizard::data_remove(
count.table, "MCD",
regex = TRUE, as_data_frame = TRUE
)
if (nrow(count.table) >= 1) {
count.table$n_MCD <- "(Multivariate)"
}
outlier_count$mcd <- count.table
}
# ICS
if ("ics" %in% method) {
out <- c(out, .check_outliers_ics(
x,
threshold = thresholds$ics,
ID.names = ID.names
))
# make sure we have valid results
if (!is.null(out)) {
count.table <- datawizard::data_filter(
out$data_ics, "Outlier_ICS > 0.5"
)
count.table <- datawizard::data_remove(
count.table, "ICS",
regex = TRUE, as_data_frame = TRUE
)
if (nrow(count.table) >= 1) {
count.table$n_ICS <- "(Multivariate)"
}
outlier_count$ics <- count.table
}
}
# OPTICS
if ("optics" %in% method) {
out <- c(out, .check_outliers_optics(
x,
threshold = thresholds$optics,
ID.names = ID.names,
xi = thresholds$optics_xi
))
count.table <- datawizard::data_filter(
out$data_optics, "Outlier_OPTICS > 0.5"
)
count.table <- datawizard::data_remove(
count.table, "OPTICS",
regex = TRUE, as_data_frame = TRUE
)
if (nrow(count.table) >= 1) {
count.table$n_OPTICS <- "(Multivariate)"
}
outlier_count$optics <- count.table
}
# Isolation Forest
# if ("iforest" %in% method) {
# out <- c(out, .check_outliers_iforest(x, threshold = thresholds$iforest))
# }
# Local Outlier Factor
if ("lof" %in% method) {
out <- c(out, .check_outliers_lof(
x,
threshold = thresholds$lof,
ID.names = ID.names
))
count.table <- datawizard::data_filter(
out$data_lof, "Outlier_LOF > 0.5"
)
count.table <- datawizard::data_remove(
count.table, "LOF",
regex = TRUE, as_data_frame = TRUE
)
if (nrow(count.table) >= 1) {
count.table$n_LOF <- "(Multivariate)"
}
outlier_count$lof <- count.table
}
out.meta <- list(out = out, outlier_var = outlier_var, outlier_count = outlier_count)
out.meta
}
#' @export
check_outliers.grouped_df <- function(x,
method = "mahalanobis",
threshold = NULL,
ID = NULL,
...) {
info <- attributes(x)
# poorman < 0.8.0?
if ("indices" %in% names(info)) {
grps <- lapply(attr(x, "indices", exact = TRUE), function(x) x + 1)
} else {
grps <- attr(x, "groups", exact = TRUE)[[".rows"]]
}
# Initialize elements
my_data <- data.frame()
out <- NULL
thresholds <- list()
outlier_var <- list()
outlier_count <- list()
# Loop through groups
for (i in seq_along(grps)) {
rows <- grps[[i]]
outliers_subset <- check_outliers(
as.data.frame(x[rows, ]),
method = method,
threshold = threshold,
ID = ID,
...
)
my_data <- rbind(my_data, as.data.frame(outliers_subset))
out <- c(out, outliers_subset)
thresholds[[paste0("group_", i)]] <- attributes(outliers_subset)$threshold
outlier_var[[i]] <- lapply(
attributes(outliers_subset)$outlier_var, lapply, function(y) {
y$Row <- rows[which(seq_along(rows) %in% y$Row)]
y
}
)
outlier_count[[i]] <- lapply(
attributes(outliers_subset)$outlier_count, function(y) {
y$Row <- rows[which(seq_along(rows) %in% y$Row)]
y
}
)
}
# Add compatibility between dplyr and poorman
info$groups$.rows <- lapply(info$groups$.rows, as.numeric)
outlier_var <- stats::setNames(outlier_var, info$groups[[1]])
outlier_count <- stats::setNames(outlier_count, info$groups[[1]])
groups <- lapply(seq_along(info$groups$.rows), function(x) {
info$groups$.rows[[x]] <- rep(
info$groups[[1]][x],
length(info$groups$.rows[[x]])
)
info$groups$.rows[[x]] <- as.data.frame(info$groups$.rows[[x]])
})
my_data[names(info$groups)[1]] <- do.call(rbind, groups)
my_data <- datawizard::data_relocate(
my_data,
select = names(info$groups)[1],
after = "Row"
)
my_data$Row <- seq_len(nrow(my_data))
class(out) <- c("check_outliers", "see_check_outliers", class(out))
attr(out, "data") <- my_data
attr(out, "method") <- method
attr(out, "threshold") <- thresholds[[1]]
attr(out, "text_size") <- 3
attr(out, "variables") <- names(x[, vapply(x, is.numeric, logical(1)), drop = FALSE])
attr(out, "raw_data") <- x
attr(out, "outlier_var") <- outlier_var
attr(out, "outlier_count") <- outlier_count
attr(out, "grouped") <- TRUE
out
}
#' @export
check_outliers.BFBayesFactor <- function(x,
ID = NULL,
...) {
if (!insight::is_model(x)) {
insight::format_error("Collinearity only applicable to regression models.")
}
if (!missing(ID)) {
insight::format_warning(paste0("ID argument not supported for objects of class `", class(x)[1], "`."))
}
d <- insight::get_predictors(x)
d[[insight::find_response(x)]] <- insight::get_response(x)
check_outliers(d, ID = ID, ...)
}
#' @export
check_outliers.gls <- function(x,
method = "pareto",
threshold = NULL,
ID = NULL,
...) {
if (!missing(ID)) {
insight::format_warning(
paste0("ID argument not supported for objects of class `", class(x)[1], "`.")
)
}
valid_methods <- c("zscore_robust", "iqr", "ci", "pareto", "optics")
if (all(method == "all")) {
method <- valid_methods
}
if (!all(method %in% valid_methods)) {
method <- "pareto"
}
check_outliers.default(x, method = method, threshold = threshold, ...)
}
#' @export
check_outliers.lme <- check_outliers.gls
#' @export
check_outliers.fixest <- check_outliers.gls
#' @export
check_outliers.fixest_multi <- function(x,
method = "pareto",
threshold = NULL,
ID = NULL,
...) {
lapply(x, check_outliers.fixest)
}
#' @export
check_outliers.geeglm <- check_outliers.gls
#' @export
check_outliers.rma <- function(x, ...) {
## TODO Check whether we can enable CI argument
# but we need to find out at which CI-level the overall effect interval was estimated
ci <- 0.95
thresholds <- c(x$ci.lb, x$ci.ub)
lower_bounds <- as.numeric(x$yi - stats::qnorm((1 + ci) / 2) * sqrt(x$vi))
upper_bounds <- as.numeric(x$yi + stats::qnorm((1 + ci) / 2) * sqrt(x$vi))
# which study's CI-range is not covered by/does not overlap with overall CI?
outlier <- upper_bounds < thresholds[1] | lower_bounds > thresholds[2]
d <- data.frame(
Row = seq_along(x$yi),
Outlier = as.numeric(outlier)
)
# Attributes
class(outlier) <- c("check_outliers_metafor", class(outlier))
attr(outlier, "data") <- d
attr(outlier, "threshold") <- thresholds
attr(outlier, "text_size") <- 3
attr(outlier, "outlier_var") <- x$slab[outlier]
attr(outlier, "outlier_count") <- sum(outlier)
outlier
}
#' @export
check_outliers.rma.uni <- check_outliers.rma
#' @export
check_outliers.metagen <- function(x, ...) {
ci <- 0.95
thresholds_fixed <- c(x$lower.fixed, x$upper.fixed)
thresholds_random <- c(x$lower.random, x$upper.random)
lower_bounds <- as.numeric(x$TE - stats::qnorm((1 + ci) / 2) * x$seTE)
upper_bounds <- as.numeric(x$TE + stats::qnorm((1 + ci) / 2) * x$seTE)
# which study's CI-range is not covered by/does not overlap with overall CI?
outlier <- list(
fixed = upper_bounds < thresholds_fixed[1] | lower_bounds > thresholds_fixed[2],
random = upper_bounds < thresholds_random[1] | lower_bounds > thresholds_random[2]
)
d <- data.frame(
Row = seq_along(x$TE),
Outlier_fixed = as.numeric(outlier$fixed),
Outlier_random = as.numeric(outlier$random)
)
# Attributes
class(outlier) <- c("check_outliers_metagen", class(outlier))
attr(outlier, "data") <- d
attr(outlier, "text_size") <- 3
attr(outlier, "studies") <- x$studlab
attr(outlier, "outlier_count") <- sum(c(outlier$fixed, outlier$random))
outlier
}
#' @export
check_outliers.meta <- check_outliers.metagen
#' @export
check_outliers.metabin <- check_outliers.metagen
#' @rdname check_outliers
#' @export
check_outliers.performance_simres <- function(x, type = "default", iterations = 100, alternative = "two.sided", ...) {
type <- match.arg(type, c("default", "binomial", "bootstrap"))
alternative <- match.arg(alternative, c("two.sided", "greater", "less"))
insight::check_if_installed("DHARMa")
result <- DHARMa::testOutliers(x, type = type, nBoot = iterations, alternative = alternative, plot = FALSE, ...)
outlier <- list(
Coefficient = as.vector(result$estimate),
Expected = as.numeric(gsub("(.*)\\(expected: (\\d.*)\\)", "\\2", names(result$estimate))),
CI_low = result$conf.int[1],
CI_high = result$conf.int[2],
p_value = result$p.value
)
class(outlier) <- c("check_outliers_simres", class(outlier))
outlier
}
#' @export
check_outliers.DHARMa <- check_outliers.performance_simres
# Thresholds --------------------------------------------------------------
.check_outliers_thresholds <- function(x) {
suppressWarnings(.check_outliers_thresholds_nowarn(x))
}
.check_outliers_thresholds_nowarn <- function(x) {
list(
zscore = stats::qnorm(p = 1 - 0.001 / 2),
zscore_robust = stats::qnorm(p = 1 - 0.001 / 2),
iqr = 1.7,
ci = 1 - 0.001,
hdi = 1 - 0.001,
eti = 1 - 0.001,
bci = 1 - 0.001,
cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
pareto = 0.7,
mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
mahalanobis_robust = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
ics = 0.001,
optics = 2 * ncol(x),
optics_xi = 0.05,
lof = 0.001
)
}
# utilities --------------------
.check_outliers_zscore <- function(x,
threshold = stats::qnorm(p = 1 - 0.001 / 2),
robust = TRUE,
method = "max",
ID.names = NULL) {
if (threshold < 1) {
insight::format_error(
"The `threshold` argument must be one or greater for method `zscore`."
)
}
x <- as.data.frame(x)
# Standardize
if (robust) {
d <- abs(as.data.frame(lapply(
x,
function(x) (x - stats::median(x, na.rm = TRUE)) / stats::mad(x, na.rm = TRUE)
)))
} else {
d <- abs(as.data.frame(lapply(
x,
function(x) (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE)
)))
}
out <- data.frame(Row = seq_len(nrow(as.data.frame(d))))
if (!is.null(ID.names)) {
out <- cbind(out, ID.names)
}
out$Distance_Zscore <- apply(d, 1, function(x) {
ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE))
})
# Filter
out$Outlier_Zscore <- as.numeric(out$Distance_Zscore > threshold)
output <- list(
data_zscore = out,
threshold_zscore = threshold
)
if (isTRUE(robust)) {
names(output) <- paste0(names(output), "_robust")
output$data_zscore_robust <- datawizard::data_addsuffix(
output$data_zscore_robust, "_robust",
select = "Zscore$", regex = TRUE
)
}
output
}
.check_outliers_iqr <- function(x,
threshold = 1.7,
method = "tukey",
ID.names = NULL) {
d <- data.frame(Row = seq_len(nrow(as.data.frame(x))))
Distance_IQR <- d
for (col in seq_len(ncol(as.data.frame(x)))) {
v <- x[, col]
if (method == "tukey") {
iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE)
} else {
iqr <- stats::IQR(v, na.rm = TRUE)
}
lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold)
upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold)
m.int <- stats::median(c(lower, upper), na.rm = TRUE)
d2 <- abs(v - m.int)
Distance_IQR[names(as.data.frame(x))[col]] <- d2 / (iqr * threshold)
d[names(as.data.frame(x))[col]] <- ifelse(v > upper, 1, ifelse(v < lower, 1, 0)) # nolint
}
out <- data.frame(Row = d$Row)
d$Row <- NULL
Distance_IQR$Row <- NULL
if (!is.null(ID.names)) {
out <- cbind(out, ID.names)
}
# out$Distance_IQR <- Distance_IQR
out$Distance_IQR <- vapply(as.data.frame(t(Distance_IQR)), function(x) {
ifelse(all(is.na(x)), NA_real_, max(x, na.rm = TRUE))
}, numeric(1))
out$Outlier_IQR <- vapply(as.data.frame(t(d)), function(x) {
ifelse(all(is.na(x)), NA_real_, max(x, na.rm = TRUE))
}, numeric(1))
list(
data_iqr = out,
threshold_iqr = threshold
)
}
.check_outliers_ci <- function(x,
threshold = 1 - 0.001,
method = "ci",
ID.names = NULL) {
# Run through columns
d <- data.frame(Row = seq_len(nrow(x)))
Distance_CI <- d
for (col in names(x)) {
v <- x[, col]
ci <- bayestestR::ci(v, ci = threshold, method = method)
d[col] <- ifelse(x[[col]] > ci$CI_high | x[[col]] < ci$CI_low, 1, 0) # nolint
m.int <- stats::median(c(ci$CI_low, ci$CI_high), na.rm = TRUE)
d2 <- abs(v - m.int)
ci.range <- (ci$CI_high - ci$CI_low) / 2
Distance_CI[col] <- d2 / ci.range
}
out.0 <- data.frame(Row = d$Row)
d$Row <- NULL
Distance_CI$Row <- NULL
if (!is.null(ID.names)) {
out.0 <- cbind(out.0, ID.names)
}
# Take the max
out <- as.data.frame(apply(Distance_CI, 1, max, na.rm = TRUE))
names(out) <- paste0("Distance_", method)
# Filter
out[paste0("Outlier_", method)] <- vapply(
as.data.frame(t(d)),
function(x) ifelse(all(is.na(x)), NA_real_, max(x, na.rm = TRUE)),
numeric(1)
)
out <- cbind(out.0, out)
output <- list(
data_ = out,
threshold_ = threshold
)
names(output) <- paste0(names(output), method)
output
}
.check_outliers_cook <- function(x,
threshold = NULL,
ID.names = NULL) {
# Compute
d <- unname(stats::cooks.distance(x))
out <- data.frame(Row = seq_along(d))
out$Distance_Cook <- d
# Filter
out$Outlier_Cook <- as.numeric(out$Distance_Cook > threshold)
list(
data_cook = out,
threshold_cook = threshold
)
}
.check_outliers_pareto <- function(x, threshold = 0.7) {
insight::check_if_installed("loo")
# Compute
d <- suppressWarnings(loo::pareto_k_values(loo::loo(x)))
out <- data.frame(Row = seq_along(d))
out$Distance_Pareto <- d
# Filter
out$Outlier_Pareto <- as.numeric(out$Distance_Pareto > threshold)
list(
data_pareto = out,
threshold_pareto = threshold
)
}
.check_outliers_mahalanobis <- function(x,
threshold = stats::qchisq(
p = 1 - 0.001, df = ncol(x)
),
ID.names = NULL,
...) {
if (anyNA(x) || any(with(x, x == Inf))) {
insight::format_error("Missing or infinite values are not allowed.")
}
out <- data.frame(Row = seq_len(nrow(x)))
if (!is.null(ID.names)) {
out <- cbind(out, ID.names)
}
# Compute
out$Distance_Mahalanobis <- stats::mahalanobis(x, center = colMeans(x), cov = stats::cov(x), ...)
# Filter
out$Outlier_Mahalanobis <- as.numeric(out$Distance_Mahalanobis > threshold)
list(
data_mahalanobis = out,
threshold_mahalanobis = threshold
)
}
# Bigutils not yet fully available on CRAN
.check_outliers_mahalanobis_robust <- function(x,
threshold = stats::qchisq(
p = 1 - 0.001, df = ncol(x)
),
ID.names = NULL) {
out <- data.frame(Row = seq_len(nrow(x)))
if (!is.null(ID.names)) {
out <- cbind(out, ID.names)
}
insight::check_if_installed("bigutilsr")
# Compute
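# Left singular vectors of the scaled data; robust OGK-based distances are
# then computed in this orthonormal basis (see bigutilsr::dist_ogk)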
U <- svd(scale(x))$u
out$Distance_Mahalanobis_robust <- bigutilsr::dist_ogk(U)
# Filter
out$Outlier_Mahalanobis_robust <- as.numeric(
out$Distance_Mahalanobis_robust > threshold
)
list(
data_mahalanobis_robust = out,
threshold_mahalanobis_robust = threshold
)
}
.check_outliers_mcd <- function(x,
threshold = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
percentage_central = 0.75,
ID.names = NULL,
verbose = TRUE,
...) {
out <- data.frame(Row = seq_len(nrow(x)))
if (!is.null(ID.names)) {
out <- cbind(out, ID.names)
}
# check whether N to p ratio is not too large, else MCD flags too many outliers
# See #672: This does seem to be a function of the N/p (N = sample size; p =
# number of parameters) ratio. When it is larger than 10, the % of outliers
# flagged is okay (in well behaved data). This makes sense: the MCD looks at
# the cov matrix of subsamples of the data - with high dimensional data, small
# samples sizes will give highly variable cov matrices, as so the "smallest"
# one will probably miss-represent the data.
if ((nrow(x) / ncol(x)) <= 10 && isTRUE(verbose)) {
insight::format_warning(
"The sample size is too small in your data, relative to the number of variables, for MCD to be reliable.",
"You may try to increase the `percentage_central` argument (must be between 0 and 1), or choose another method."
)
}
insight::check_if_installed("MASS")
# Compute
mcd <- MASS::cov.mcd(x, quantile.used = percentage_central * nrow(x))
out$Distance_MCD <- stats::mahalanobis(x, center = mcd$center, cov = mcd$cov)
# Filter
out$Outlier_MCD <- as.numeric(out$Distance_MCD > threshold)
list(
data_mcd = out,
threshold_mcd = threshold
)
}
.check_outliers_ics <- function(x,
threshold = 0.001,
ID.names = NULL,
...) {
out <- data.frame(Row = seq_len(nrow(x)))
if (!is.null(ID.names)) {
out <- cbind(out, ID.names)
}
insight::check_if_installed("ICS")
insight::check_if_installed("ICSOutlier")
# Get n cores
n_cores <- if (requireNamespace("parallel", quietly = TRUE)) {
getOption("mc.cores", 1L)
} else {
NULL
}
# tell user about n-cores option
if (is.null(n_cores)) {
insight::format_alert(
"Package `parallel` is not installed. `check_outliers()` will run on a single core.",
"Install package `parallel` and set, for example, `options(mc.cores = 4)` to run on multiple cores."
)
} else if (n_cores == 1) {
insight::format_alert(
"Package `parallel` is installed, but `check_outliers()` will run on a single core.",
"To use multiple cores, set `options(mc.cores = 4)` (for example)."
)
}
# Run algorithm
# Try
outliers <- tryCatch(
{
ics <- ICS::ics2(x)
ICSOutlier::ics.outlier(
object = ics,
ncores = n_cores,
level.dist = threshold,
...
)
},
error = function(e) {
NULL
}
)
if (is.null(outliers)) {
if (ncol(x) == 1) {
insight::print_color("At least two numeric predictors are required to detect outliers.\n", "red")
} else {
insight::print_color(sprintf("`check_outliers()` does not support models of class `%s`.\n", class(x)[1]), "red")
}
return(NULL)
}
# Get results
cutoff <- .safe(outliers@ics.dist.cutoff)
# validation check
if (is.null(cutoff)) {
insight::print_color("Could not detect cut-off for outliers.\n", "red")
return(NULL)
}
out$Distance_ICS <- outliers@ics.distances
out$Outlier_ICS <- as.numeric(out$Distance_ICS > cutoff)
# Out
list(
data_ics = out,
threshold_ics = threshold
)
}
.check_outliers_optics <- function(x,
threshold = NULL,
ID.names = NULL,
xi = 0.05) {
out <- data.frame(Row = seq_len(nrow(x)))
if (!is.null(ID.names)) {
out <- cbind(out, ID.names)
}
insight::check_if_installed("dbscan")
# Compute
rez <- dbscan::optics(x, minPts = threshold)
rez <- dbscan::extractXi(rez, xi = xi) # TODO: find automatic way of setting xi
out$Distance_OPTICS <- rez$coredist
# Filter
if (is.null(rez$cluster)) {
out$Outlier_OPTICS <- 0
} else {
out$Outlier_OPTICS <- as.numeric(rez$cluster == 0)
}
list(
data_optics = out,
threshold_optics = threshold
)
}
# .check_outliers_iforest <- function(x, threshold = 0.025) {
# out <- data.frame(Row = seq_len(nrow(x)))
#
# # Install packages
# insight::check_if_installed("solitude")
#
# # Compute
# if (utils::packageVersion("solitude") < "0.2.0") {
# iforest <- solitude::isolationForest(x)
# out$Distance_iforest <- stats::predict(iforest, x, type = "anomaly_score")
# } else if (utils::packageVersion("solitude") == "0.2.0") {
# stop(paste("Must update package `solitude` (above version 0.2.0).",
# "Please run `install.packages('solitude')`."), call. = FALSE)
# } else {
# iforest <- solitude::isolationForest$new(sample_size = nrow(x))
# suppressMessages(iforest$fit(x))
# out$Distance_iforest <- iforest$scores$anomaly_score
# }
#
#
# # Threshold
# cutoff <- stats::median(out$Distance_iforest) + stats::qnorm(1 - threshold) * stats::mad(out$Distance_iforest)
# # Filter
# out$Outlier_iforest <- as.numeric(out$Distance_iforest >= cutoff)
#
# out$Row <- NULL
# list(
# "data_iforest" = out,
# "threshold_iforest" = threshold
# )
# }
.check_outliers_lof <- function(x,
threshold = 0.001,
ID.names = NULL) {
if (threshold < 0 || threshold > 1) {
insight::format_error(
"The `threshold` argument must be between 0 and 1 for method `lof`."
)
}
out <- data.frame(Row = seq_len(nrow(x)))
if (!is.null(ID.names)) {
out <- cbind(out, ID.names)
}
insight::check_if_installed("dbscan")
# Compute
out$Distance_LOF <- log(dbscan::lof(x, minPts = ncol(x)))
# Threshold
cutoff <- stats::qnorm(1 - threshold) * stats::sd(out$Distance_LOF)
# Filter
out$Outlier_LOF <- as.numeric(out$Distance_LOF > cutoff)
list(
data_lof = out,
threshold_lof = threshold
)
}
# Non-supported model classes ---------------------------------------
#' @export
check_outliers.glmmTMB <- function(x, ...) {
insight::format_alert(paste0("`check_outliers()` does not yet support models of class `", class(x)[1], "`."))
NULL
}
#' @export
check_outliers.lmrob <- check_outliers.glmmTMB
#' @export
check_outliers.glmrob <- check_outliers.glmmTMB
#' @export
check_outliers.rq <- check_outliers.glmmTMB
#' @export
check_outliers.rqs <- check_outliers.glmmTMB
#' @export
check_outliers.rqss <- check_outliers.glmmTMB