ez.outlier: univariate outlier cleanup

View source: R/stats.R

ez.outlier    R Documentation

univariate outlier cleanup

Description

univariate outlier cleanup

Usage

ez.outlier(
  x,
  col = NULL,
  method = c("z", "mad", "iqr"),
  cutoff = NA,
  fillout = c("null", "na", "mean", "median"),
  hack = FALSE,
  plot = FALSE,
  na.rm = TRUE,
  print2scr = TRUE
)

Arguments

x

a data frame or a vector

col

Columns to process, passed to ez.selcol. May name more than one column. Processing is column-wise:
if x is a data frame and col is specified, only those columns are processed
if x is a data frame and col is unspecified (the default, NULL), all columns are processed
if x is not a data frame, col is ignored
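The column-wise behavior described for col can be sketched in base R as follows. This is only an illustration under assumptions: `process_cols` and `neg_to_na` are hypothetical names, and the code does not reproduce how ez.outlier or ez.selcol are actually implemented.

```r
# Hypothetical helper: apply f column-wise, mimicking the col semantics above.
process_cols <- function(x, col = NULL, f) {
  if (!is.data.frame(x)) return(f(x))        # vector: col is ignored
  if (is.null(col)) col <- names(x)          # NULL: process all columns
  if (is.numeric(col)) col <- names(x)[col]  # allow column positions
  x[col] <- lapply(x[col], f)                # may be several columns
  x
}

# Toy stand-in for an outlier rule: replace negative values with NA
neg_to_na <- function(v) { v[v < 0] <- NA; v }

df <- data.frame(a = c(1, -2, 3), b = c(-1, 5, 6))
res_one <- process_cols(df, "a", neg_to_na)          # only column a changes
res_all <- process_cols(df, NULL, neg_to_na)         # both columns change
res_vec <- process_cols(c(-1, 2), "ignored", neg_to_na)  # col has no effect
```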

method

Outlier detection method: z score, MAD (median absolute deviation), or IQR (interquartile range, John Tukey)

cutoff

abs(x) > cutoff will be treated as outliers. Default/auto values (i.e. if NA):
z 95
mad 2.5, which is the standard recommendation, or 5.2
iqr 1.5
If multiple values are specified, only the first one is used. The exception is hack=TRUE, in which case method and cutoff must be the same length, or one of them a scalar.
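To make the role of cutoff concrete, here is a minimal base R sketch of the three scoring rules. It is an assumption about how such rules are commonly written, not ez.outlier's own code: `flag_outliers` is a hypothetical helper, its default cutoff of 3 is arbitrary, and the IQR score (distance outside the quartiles in units of the IQR) is one way to express Tukey's rule as a single threshold.

```r
# Hypothetical sketch of the three scoring rules; not ez.outlier's own code.
flag_outliers <- function(x, method = c("z", "mad", "iqr"), cutoff = 3) {
  method <- match.arg(method)
  score <- switch(method,
    z   = (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE),
    mad = (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE),
    iqr = {
      q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
      # distance outside the quartile box, in units of the IQR
      pmax(q[1] - x, x - q[2], 0) / (q[2] - q[1])
    }
  )
  !is.na(score) & abs(score) > cutoff
}

x <- c(10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 10, 50)
flag_z   <- flag_outliers(x, "z", 3)
flag_mad <- flag_outliers(x, "mad", 2.5)
flag_iqr <- flag_outliers(x, "iqr", 1.5)
which(flag_z); which(flag_mad); which(flag_iqr)
```

With this toy vector all three rules flag only the last value, but the scores differ widely in size, which is why each method needs its own cutoff scale.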

fillout

How to handle outliers: fill with na, mean, or median (computed column-wise for a data frame), or null -> remove the outliers. Removal is only available for a vector or a data frame with a single column specified; otherwise it switches automatically to na.
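The four fill strategies can be illustrated on a plain vector. The sketch below is hypothetical: `fill_outliers` is not part of the package, and it assumes the replacement mean or median is computed from the non-outlying values, which the documentation does not state. It also shows why removal cannot be applied to several columns at once: the result is shorter than the input, so rows would no longer line up.

```r
# Hypothetical helper illustrating the four fill strategies on a vector.
fill_outliers <- function(x, is_out,
                          fillout = c("null", "na", "mean", "median")) {
  fillout <- match.arg(fillout)
  keep <- x[!is_out]                      # values not flagged as outliers
  switch(fillout,
    null   = keep,                        # drop outliers (length changes)
    na     = replace(x, is_out, NA),
    mean   = replace(x, is_out, mean(keep, na.rm = TRUE)),    # assumption
    median = replace(x, is_out, median(keep, na.rm = TRUE))   # assumption
  )
}

x <- c(1, 2, 3, 4, 100)
is_out <- c(FALSE, FALSE, FALSE, FALSE, TRUE)
r_null   <- fill_outliers(x, is_out, "null")
r_na     <- fill_outliers(x, is_out, "na")
r_mean   <- fill_outliers(x, is_out, "mean")
r_median <- fill_outliers(x, is_out, "median")
```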

hack

Calls mapply to try each method with its corresponding cutoff. method and cutoff must be the same length, or one of them a scalar, i.e., different methods each with its own cutoff, or the same method with different cutoffs.
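The pairing rule for method and cutoff follows from how mapply walks its arguments in parallel and recycles scalars. A small sketch, where `count_flagged` is a hypothetical stand-in for a single outlier pass and not the function ez.outlier calls internally:

```r
# Hypothetical stand-in for one outlier pass: counts values beyond the cutoff.
count_flagged <- function(method, cutoff, x) {
  score <- switch(method,
    z   = (x - mean(x)) / sd(x),
    mad = (x - median(x)) / mad(x)
  )
  sum(abs(score) > cutoff)
}

x <- c(10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 10, 15, 50)

# Different methods, each paired with its own cutoff (same length)
by_method <- mapply(count_flagged, c("z", "mad"), c(3, 2.5),
                    MoreArgs = list(x = x))

# One method (a scalar, recycled) tried against several cutoffs
by_cutoff <- mapply(count_flagged, "mad", c(1, 3, 2),
                    MoreArgs = list(x = x), USE.NAMES = FALSE)
```

Note that mapply pairs elements by position; it does not form every method-by-cutoff combination.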

plot

Draw a boxplot and histogram before and after outlier processing.

Value

Returns a new data frame or vector. If hack=TRUE, returns nothing.

Note

Univariate outlier approaches. The Z-score method relies on the mean and standard deviation of a group of data to measure central tendency and dispersion. This is troublesome, because the mean and standard deviation are highly affected by outliers: they are not robust. In fact, the skewing that outliers bring is one of the biggest reasons for finding and removing outliers from a dataset! Another drawback of the Z-score method is that it behaves strangely in small datasets. The largest possible |z| in a sample of n values is (n - 1) / sqrt(n), so with a cutoff of 3 the Z-score method can never detect an outlier if the dataset has 10 or fewer items in it.
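The small-sample limitation of the Z-score can be checked numerically. When the sample standard deviation is used, no value in a sample of size n can have |z| larger than (n - 1) / sqrt(n). The sketch below tabulates that bound and confirms it is attained by a single extreme point, so the sample size at which a given cutoff becomes reachable can be read off. The function names are illustrative only.

```r
# Largest |z| attainable in a sample of size n (using the sample sd)
max_abs_z <- function(n) (n - 1) / sqrt(n)

# Empirical check: one extreme point among otherwise identical values
# attains the bound
extreme_z <- function(n) {
  x <- c(rep(0, n - 1), 1e6)
  max(abs((x - mean(x)) / sd(x)))
}

n <- 5:15
bound_table <- data.frame(
  n        = n,
  bound    = round(max_abs_z(n), 3),
  attained = round(sapply(n, extreme_z), 3)
)
print(bound_table)
```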

The median absolute deviation (MAD) method, also called the modified z-score, replaces the mean and standard deviation with the median and MAD, which are robust measures of central tendency and dispersion, respectively.
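A short comparison shows what "robust" means in practice. Adding one gross outlier to a small sample shifts the mean and standard deviation dramatically, while the median and MAD stay put. The data are made up for illustration; mad() is the base R function, which by default scales the raw MAD by 1.4826 so that it is comparable to the standard deviation for normal data.

```r
clean <- c(9, 10, 10, 11, 11, 12, 12, 13)
dirty <- c(clean, 1000)   # one gross outlier added

# Non-robust measures: both move a lot
c(mean(clean), mean(dirty))
c(sd(clean), sd(dirty))

# Robust measures: unchanged for this sample
c(median(clean), median(dirty))
c(mad(clean), mad(dirty))

# Modified z-score of the extreme point
mod_z <- (1000 - median(dirty)) / mad(dirty)
mod_z
```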

The advantage of the interquartile range (IQR) method is that, like the modified Z-score method, it uses a robust measure of dispersion.
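As a worked example of Tukey's rule with the conventional multiplier 1.5, the fences can be computed directly from the quartiles. The data are invented, and the quartiles use R's default quantile definition (type 7), which is an assumption here since the documentation does not say which definition ez.outlier uses.

```r
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

q   <- quantile(x, c(0.25, 0.75), names = FALSE)  # first and third quartiles
iqr <- q[2] - q[1]                                # same value as IQR(x)
k   <- 1.5                                        # Tukey's multiplier

fences   <- c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
outliers <- x[x < fences["lower"] | x > fences["upper"]]
```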

Examples

set.seed(1234)
x = rnorm(10)
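# hack=TRUE, col=1: try one method (mad) with several cutoffs
iris %>% ez.outlier(1,fillout='na',plot=TRUE,hack=TRUE,method=c('mad'),cutoff=c(1,3,2))
# hack=TRUE: try each method with its own cutoff
iris %>% ez.outlier(1,fillout='null',plot=TRUE,hack=TRUE,method=c('z','mad','iqr'),cutoff=c(3,5,1.5))
# hack=TRUE: try each method with its default (auto) cutoff
iris %>% ez.outlier(1,fillout='null',plot=TRUE,hack=TRUE,method=c('z','mad','iqr'),cutoff=NA)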

jerryzhujian9/zmisc documentation built on March 9, 2024, 12:49 a.m.