ez.outlier: univariate outlier cleanup
In jerryzhujian9/zmisc: ez Zhu's miscellaneous functions

Description

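Detects univariate outliers in a vector or in the columns of a data frame using z scores, the median absolute deviation (MAD), or the interquartile range (IQR), and then removes them or fills them with NA, the mean, or the median.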

Usage

ez.outlier(
  x,
  col = NULL,
  method = c("z", "mad", "iqr"),
  cutoff = NA,
  fillout = c("null", "na", "mean", "median"),
  hack = FALSE,
  plot = FALSE,
  na.rm = TRUE,
  print2scr = TRUE
)

Arguments

`x`	a data frame or a vector
`col`	passed to `ez.selcol`; may name multiple columns. Processing is column-wise. If x is a data frame and col is specified, only those columns are processed. If x is a data frame and col is unspecified (i.e., the NULL default), all columns are processed. If x is not a data frame, col is ignored.
`method`	z score ("z"), median absolute deviation ("mad"), or interquartile range ("iqr", John Tukey)
`cutoff`	values with abs(x) > cutoff are treated as outliers. Default/auto values (i.e., if NA): z 95; mad 2.5 (the standard recommendation) or 5.2; iqr 1.5. If multiple values are specified, the first one is used, except when hack=TRUE, in which case method and cutoff must have the same length or be scalar.
`fillout`	how to process outliers: fill with "na", "mean", or "median" (column-wise for a data frame), or "null" to remove the outliers. "null" applies only to a vector or to a data frame with a single column specified; otherwise it switches automatically to "na".
`hack`	call mapply to try all combinations of method and cutoff supplied. method and cutoff must have the same length or be scalar, i.e., different methods with corresponding cutoffs, or the same method with different cutoffs.
`plot`	draw a boxplot and a histogram before and after outlier processing.
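
The `fillout` options can be illustrated with a small base-R sketch. The helper `fill_outliers` below is hypothetical and is not part of the package; in particular, computing the mean or median from the non-outlying values only is an assumption, since the documentation does not say which values the package uses.

```r
# Hypothetical helper (not the package's code): apply one fill strategy
# to the positions of a numeric vector that were flagged as outliers.
fill_outliers <- function(x, is_out, fillout = c("null", "na", "mean", "median")) {
  fillout <- match.arg(fillout)
  is_out <- is_out & !is.na(is_out)   # an NA flag is never treated as an outlier
  keep <- x[!is_out]                  # values regarded as clean
  switch(fillout,
    null   = keep,                                       # drop outliers; result is shorter
    na     = replace(x, is_out, NA),                     # keep length, mark as missing
    mean   = replace(x, is_out, mean(keep, na.rm = TRUE)),   # assumption: mean of clean values
    median = replace(x, is_out, median(keep, na.rm = TRUE))  # assumption: median of clean values
  )
}

x <- c(1, 2, 3, 4, 100)
flag <- c(FALSE, FALSE, FALSE, FALSE, TRUE)
fill_outliers(x, flag, "null")    # 1 2 3 4
fill_outliers(x, flag, "na")      # 1 2 3 4 NA
fill_outliers(x, flag, "median")  # 1 2 3 4 2.5
```

Dropping elements changes the length of the vector, which is presumably why "null" is restricted to a vector or a single column: columns of one data frame cannot end up with different lengths.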

Value

Returns a new data frame or vector. If hack=TRUE, returns nothing.

Note

Univariate outlier approaches. The Z-score method relies on the mean and standard deviation of a group of data to measure central tendency and dispersion. This is troublesome because the mean and standard deviation are themselves highly affected by outliers: they are not robust. In fact, the skew that outliers introduce is one of the main reasons for finding and removing them from a dataset. Another drawback of the Z-score method is that it behaves strangely in small datasets. With the sample standard deviation, the largest possible absolute z score in a dataset of n items is (n - 1)/sqrt(n), so with a cutoff of 3 the method can never flag an outlier in a dataset of 10 or fewer items.
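
The small-sample limitation of the Z-score method can be checked numerically. The sketch below assumes z scores are computed with the sample standard deviation, as R's `sd()` does; the exact sample size at which detection becomes possible depends on that choice and on the cutoff. `max_abs_z` is a hypothetical helper name.

```r
# Largest |z| attainable in a sample of size n when z uses the sample SD:
# (n - 1) / sqrt(n). Hypothetical helper, for illustration only.
max_abs_z <- function(n) (n - 1) / sqrt(n)

# One extreme value among otherwise identical values reaches that bound,
# no matter how extreme the value is.
x <- c(rep(0, 9), 1e6)        # n = 10
z <- (x - mean(x)) / sd(x)
max(abs(z))                   # about 2.85, the same as max_abs_z(10)

max_abs_z(10)                 # about 2.85, below a cutoff of 3
max_abs_z(11)                 # about 3.02, just above a cutoff of 3
```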

The median absolute deviation (MAD) method, also called the modified z-score method, replaces the mean and standard deviation with the median and the MAD, which are robust measures of central tendency and dispersion, respectively.
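
A minimal base-R sketch of the modified z score follows. It assumes the score is `(x - median(x)) / mad(x)` with the default scaling constant of `stats::mad()` (1.4826); the package's exact formula is not stated here, so this is an illustration rather than its implementation.

```r
x <- c(2.1, 2.3, 1.9, 2.0, 2.2, 9.5)

# Modified z score: centre on the median, scale by the MAD.
# stats::mad() multiplies by 1.4826 by default so that, for normal data,
# the result is comparable to a standard deviation.
mod_z <- (x - median(x)) / mad(x)
is_out <- abs(mod_z) > 2.5    # 2.5 is the default mad cutoff quoted above
x[is_out]                     # 9.5

# For comparison, the ordinary z scores of the same data all stay below 3,
# because the outlier inflates both the mean and the standard deviation.
z <- (x - mean(x)) / sd(x)
max(abs(z))
```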

The advantage of the interquartile range (IQR) method is that, like the modified Z-score method, it uses a robust measure of dispersion.
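
The IQR rule can be sketched in base R as Tukey's fences. This assumes the usual form of the rule (values more than 1.5 IQRs below the first quartile or above the third quartile are outliers) and R's default quantile type; whether the package computes the fences in exactly this way is not confirmed by this page.

```r
x <- c(2.1, 2.3, 1.9, 2.0, 2.2, 9.5)

# First and third quartiles (default quantile type 7)
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Tukey's fences with the conventional multiplier of 1.5
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

x[x < lower | x > upper]      # 9.5
```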

Examples

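# assumes the package providing ez.outlier and %>% is already attached
set.seed(1234)
x = rnorm(10)
# hack=TRUE: try the same method (mad) with three different cutoffs
iris %>% ez.outlier(1,fillout='na',plot=TRUE,hack=TRUE,method=c('mad'),cutoff=c(1,3,2))
# hack=TRUE: pair each method with its own cutoff
iris %>% ez.outlier(1,fillout='null',plot=TRUE,hack=TRUE,method=c('z','mad','iqr'),cutoff=c(3,5,1.5))
# hack=TRUE with cutoff=NA: each method falls back to its default cutoff
iris %>% ez.outlier(1,fillout='null',plot=TRUE,hack=TRUE,method=c('z','mad','iqr'),cutoff=NA)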

jerryzhujian9/zmisc documentation built on June 13, 2025, 11:17 p.m.