outliers: Identify Univariate Outliers Using Boxplot Methods

identify_outliers    R Documentation

Identify Univariate Outliers Using Boxplot Methods

Description

Detect outliers using boxplot methods. Boxplots are a popular and easy way to identify outliers. There are two categories of outlier: (1) outliers and (2) extreme points.

Values above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR are considered outliers. Values above Q3 + 3 × IQR or below Q1 - 3 × IQR are considered extreme points (or extreme outliers).

Q1 and Q3 are the first and third quartiles, respectively. IQR is the interquartile range (IQR = Q3 - Q1).

Generally speaking, data points labelled as outliers in boxplots are considered less troublesome than extreme points and might even be ignored. Note that any NA and NaN values are automatically removed before the quantiles are computed.
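The fence rule described above can be sketched in base R. This is only an illustration of the formula, not the rstatix source: the helper name boxplot_flag is hypothetical, and it assumes R's default quantile definition (type 7).

```r
# Hypothetical helper (not part of rstatix): flags values outside the boxplot fences.
boxplot_flag <- function(x, coef = 1.5) {
  # Quartiles are computed after dropping NA/NaN values
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]                 # IQR = Q3 - Q1
  lower <- q[1] - coef * iqr         # lower fence
  upper <- q[2] + coef * iqr         # upper fence
  x < lower | x > upper              # TRUE for values beyond either fence
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 20, 100)
outlier <- boxplot_flag(x, coef = 1.5)  # beyond 1.5 x IQR
extreme <- boxplot_flag(x, coef = 3)    # beyond 3 x IQR
data.frame(x, outlier, extreme)
```

For this vector, Q1 = 3.25, Q3 = 7.75 and IQR = 4.5, so the upper fences are 14.5 (outliers) and 21.25 (extreme points). The value 20 is flagged as an outlier only, while 100 is flagged as both.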

Usage

identify_outliers(data, ..., variable = NULL)

is_outlier(x, coef = 1.5)

is_extreme(x)

Arguments

data

a data frame

...

One unquoted expression (or variable name). Used to select the variable of interest. Alternative to the argument variable.

variable

variable name for detecting outliers

x

a numeric vector

coef

coefficient specifying how far, in multiples of the IQR, a value must lie from the edge of the box to be flagged. Possible values are 1.5 (for outliers) and 3 (for extreme points only). Default is 1.5.

Value

  • identify_outliers(). Returns the rows of the input data frame that are flagged as outliers, with two additional columns: "is.outlier" and "is.extreme", which hold logical values.

  • is_outlier() and is_extreme(). Return logical vectors.

Functions

  • identify_outliers(): takes a data frame and extracts the rows suspected to be outliers according to a numeric column. The following columns are added: "is.outlier" and "is.extreme".

  • is_outlier(): detects outliers in a numeric vector. Returns a logical vector.

  • is_extreme(): detects extreme points in a numeric vector. An alias of is_outlier(), where coef = 3. Returns a logical vector.
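As a rough illustration of how the data frame function relates to the two vector functions, the following base R sketch adds the two logical columns and keeps only the flagged rows. The names flag and find_outlier_rows are hypothetical stand-ins and the code is not taken from rstatix.

```r
# Hypothetical stand-in for the vector-level check (coef = 1.5 or 3)
flag <- function(x, coef) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

# Hypothetical stand-in for the data frame workflow
find_outlier_rows <- function(data, variable) {
  values <- data[[variable]]
  data$is.outlier <- flag(values, 1.5)   # outliers
  data$is.extreme <- flag(values, 3)     # extreme points
  # Keep only rows flagged as outliers; which() drops NA flags
  data[which(data$is.outlier), , drop = FALSE]
}

df <- data.frame(id = 1:10, score = c(1, 2, 3, 4, 5, 6, 7, 8, 20, 100))
res <- find_outlier_rows(df, "score")
res
```

Here only the rows with scores 20 and 100 are kept, and is.extreme is TRUE for the second of them only.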

Examples

# Load the packages used below
library(rstatix)
library(dplyr)  # for %>% and group_by()

# Generate demo data
set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50),
  gender = rep(c("Male", "Female"), each = 10)
)

# Identify outliers according to the variable score
demo.data %>%
  identify_outliers(score)

# Identify outliers by groups
demo.data %>%
  group_by(gender) %>%
  identify_outliers("score")
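Grouping matters because the quartiles, and therefore the fences, are then computed within each group rather than on the pooled data. The base R sketch below illustrates this with a small made-up data set. The helper flag is hypothetical and the code only mimics the per-group idea; it is not how rstatix implements it.

```r
# Hypothetical helper: TRUE for values beyond the 1.5 x IQR fences
flag <- function(x, coef = 1.5) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

df <- data.frame(
  group = rep(c("a", "b"), each = 5),
  score = c(1, 2, 3, 4, 15, 50, 51, 52, 53, 54)
)

# Fences computed once on all scores
pooled <- flag(df$score)

# Fences computed separately within each group, then put back in row order
by_group <- unsplit(lapply(split(df$score, df$group), flag), df$group)

data.frame(df, pooled, by_group)
```

On the pooled scores nothing is flagged, because the two groups together produce a wide IQR. Within group "a", however, the score 15 lies above that group's upper fence of 7 and is flagged.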

kassambara/rstatix documentation built on Feb. 6, 2023, 3:36 a.m.