detect_outlier | R Documentation |
This function identifies outliers in a numeric vector using either the interquartile range (IQR) method or the z-score method. The IQR method defines outliers as values below Q1 - multiplier * IQR or above Q3 + multiplier * IQR, where Q1 and Q3 are the first and third quartiles. The z-score method identifies outliers as values with an absolute z-score exceeding a specified threshold.
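The two rules described above can be sketched in a few lines of base R. This is a minimal illustration, not the package's implementation: the helper names `iqr_flags` and `zscore_flags` are hypothetical, and the sketch assumes the default quantile algorithm of `quantile()` and the sample standard deviation from `sd()`.

```r
# Hypothetical helpers for illustration only; not the package's internals.
iqr_flags <- function(x, multiplier = 1.5) {
  # Q1 and Q3 with quantile()'s default algorithm (type 7), an assumption here
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - multiplier * iqr
  upper <- q[2] + multiplier * iqr
  x < lower | x > upper
}

zscore_flags <- function(x, z_threshold = 3) {
  # sd() uses the n - 1 denominator
  z <- (x - mean(x)) / sd(x)
  abs(z) > z_threshold
}

x <- c(1, 2, 3, 4, 100)
x[iqr_flags(x)]                        # 100 lies above Q3 + 1.5 * IQR = 7

y <- c(-10, 1, 2, 3, 4, 5, 20)
y[zscore_flags(y, z_threshold = 1.5)]  # -10 and 20
```

Both helpers return a logical vector the same length as the input, which mirrors the 'is_outlier' component described for the return value.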
detect_outlier(
x,
method = "iqr",
multiplier = 1.5,
z_threshold = 3,
na.rm = TRUE,
groups = NULL,
summary = FALSE
)
iqr_outlier(x, multiplier)
zscore_outlier(x, z_threshold)
zscore_outlier2(x, z_threshold)
x | A numeric vector in which to detect outliers. |
method | A character string specifying the outlier detection method. Options are '"iqr"' (default) for the interquartile range method or '"zscore"' for the z-score method. |
multiplier | A positive numeric value specifying the multiplier for the IQR method. Default is '1.5', typically used for moderate outliers; '3' is common for extreme outliers. Ignored if 'method = "zscore"'. |
z_threshold | A positive numeric value specifying the z-score threshold for the 'method = "zscore"' option. Default is '3', meaning values with an absolute z-score greater than 3 are flagged as outliers. Ignored if 'method = "iqr"'. |
na.rm | A logical value indicating whether to remove 'NA' values before computation. Default is 'TRUE'. If 'FALSE' and 'NA' values are present, the function stops with an error. |
groups | An optional vector of group names or labels corresponding to each value in 'x'. If provided, must be the same length as 'x'. Outlier detection is performed separately for each unique group, and results are returned as a nested list. Default is 'NULL' (no grouping). |
summary | A logical value indicating whether to include a summary in the output. Default is 'FALSE'. If 'TRUE', the output list includes a 'summary' element with descriptive statistics and outlier counts, either overall or by group if 'groups' is provided. |
The function returns a list containing the outliers, their indices, detection bounds or thresholds, and a logical vector indicating which elements are outliers. If a grouping vector is provided via 'groups', outlier detection is performed separately for each group, and results are returned as a nested list by group. If 'na.rm = TRUE' (default), missing values ('NA') are removed before computation. If 'na.rm = FALSE' and 'NA' values are present, the function stops with an error. The function also stops for non-numeric input, insufficient valid data, or mismatched group lengths.
The function requires at least two non-'NA' values, per group if 'groups' is provided or overall if 'groups = NULL', once missing values have been removed under 'na.rm = TRUE'. If all values in a group are identical or there are too few data points, an error is thrown for that group. The IQR method is robust to non-normal data, while the z-score method assumes approximate normality and is sensitive to extreme values, because the mean and standard deviation it relies on are themselves pulled toward the outliers.
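The sensitivity of the z-score rule is easiest to see on a small sample. The sketch below uses only base R and assumes the sample standard deviation from `sd()`; it does not call the package. A single extreme value inflates both the mean and the standard deviation, so its own z-score stays small. With the n - 1 denominator, no value in a sample of size n can have an absolute z-score above (n - 1) / sqrt(n).

```r
x <- c(1, 2, 3, 4, 100)

# z-score rule: the extreme value inflates both mean(x) and sd(x)
z <- (x - mean(x)) / sd(x)
max_abs_z <- max(abs(z))                     # about 1.79, below the default threshold of 3
bound <- (length(x) - 1) / sqrt(length(x))   # largest |z| possible for n = 5

# IQR rule on the same data (default quantile type, multiplier 1.5)
q <- quantile(x, c(0.25, 0.75), names = FALSE)
fences <- c(lower = q[1] - 1.5 * (q[2] - q[1]),
            upper = q[2] + 1.5 * (q[2] - q[1]))
flagged <- x[x < fences[["lower"]] | x > fences[["upper"]]]
flagged                                      # 100
```

Under this assumption, a z-score threshold of 3 can only flag a value once a vector or group has at least 11 observations, so the IQR method is usually the safer choice for short vectors or small groups.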
If 'groups = NULL' (default), a list with the following components:
- 'outliers': A numeric vector of the outlier values.
- 'indices': An integer vector of the indices where outliers occur in the input vector.
- 'bounds' (if 'method = "iqr"'): A named numeric vector with the 'lower' and 'upper' bounds for outliers.
- 'threshold' (if 'method = "zscore"'): A named numeric vector with the 'lower' and 'upper' z-score thresholds.
- 'is_outlier': A logical vector of the same length as 'x', where 'TRUE' indicates an outlier.
- 'summary' (if 'summary = TRUE'): A list with summary statistics including the mean, median, standard deviation (for z-score), quartiles (for IQR), and number of outliers.
If 'groups' is provided, a named list where each element corresponds to a unique group, containing the same components as above but computed separately for that group’s values.
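The nested, per-group shape of the result can be approximated with `split()` and `lapply()`. The following is a sketch under stated assumptions rather than the package's code: `flag_iqr` is a hypothetical helper, quantiles use the default algorithm of `quantile()`, and 'indices' are positions within each group's subvector, which the documentation above does not specify.

```r
# Hypothetical per-group sketch; not the package's actual implementation.
flag_iqr <- function(v, multiplier = 1.5) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  bounds <- c(lower = q[1] - multiplier * (q[2] - q[1]),
              upper = q[2] + multiplier * (q[2] - q[1]))
  is_outlier <- v < bounds[["lower"]] | v > bounds[["upper"]]
  list(
    outliers   = v[is_outlier],
    indices    = which(is_outlier),  # assumption: positions within the group
    bounds     = bounds,
    is_outlier = is_outlier
  )
}

x2 <- c(1, 2, 3, 100, 5, 6, 7, 200)
groups2 <- c("A", "A", "A", "A", "B", "B", "B", "B")

# split() partitions x2 by label; lapply() applies the rule to each part
by_group <- lapply(split(x2, groups2), flag_iqr)

names(by_group)       # "A" "B"
by_group$A$outliers   # 100
by_group$B$outliers   # 200
```

Because each group gets its own quartiles, 100 is flagged in group A even though it would sit well inside the bounds computed for group B.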
# Example 1: Basic IQR method without groups
x <- c(1, 2, 3, 4, 100)
detect_outlier(x)
# IQR method with summary
detect_outlier(x, summary = TRUE)
# Z-score method with custom threshold
y <- c(-10, 1, 2, 3, 4, 5, 20)
detect_outlier(y, method = "zscore", z_threshold = 2.5)
# Handling missing values
z <- c(1, 2, NA, 4, 5, 100)
detect_outlier(z, method = "iqr", multiplier = 3)
# Example 2: IQR method with groups
x2 <- c(1, 2, 3, 100, 5, 6, 7, 200)
groups2 <- c("A", "A", "A", "A", "B", "B", "B", "B")
detect_outlier(x2, groups = groups2)
# Example 3: Z-score method with groups and summary
x3 <- c(-10, 1, 2, 20, 3, 4, 5, 50)
groups3 <- c("X", "X", "X", "X", "Y", "Y", "Y", "Y")
detect_outlier(x3, method = "zscore", z_threshold = 2, groups = groups3, summary = TRUE)
# Example 4: IQR method with groups and NA values
x4 <- c(1, 2, NA, 100, 5, 6, 7, 200, 1000)
groups4 <- c("G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G1")
detect_outlier(x4, groups = groups4)
# Error cases
## Not run:
detect_outlier(c("a", "b")) # Non-numeric input
detect_outlier(c(1), groups = c("A")) # Insufficient data
detect_outlier(c(1, 2), groups = c("A")) # Mismatched group length
detect_outlier(c(1, NA, 3), na.rm = FALSE) # NA with na.rm = FALSE
## End(Not run)