ratioSize: Identifies outliers on ratios and filter them by a size... In univOutl: Detection of Univariate Outliers

Description

Identifies outliers on transformed ratios (centered with respect to their median) using the adjusted boxplot for skewed distributions. Outliers can be sorted and filtered according to a size measure.

Usage

```
ratioSize(numerator, denominator, id=NULL, size=NULL,
          U=1, size.th=NULL, return.dataframe=FALSE)
```

Arguments

`numerator` Numeric vector with the values that go in the numerator of the ratio.

`denominator` Numeric vector with the values that go in the denominator of the ratio.

`id` Optional numeric or character vector with the identifiers of the units. If `id=NULL`, the identifiers are set equal to the units' positions in the input vectors.

`size` Optional numeric vector providing a measure of the importance of a ratio. If `size = NULL`, the size measure is the maximum of the numerator and the denominator of each ratio (this makes sense only if both variables are measured in the same unit). The importance of an observation is also controlled by the argument `U`.

`U` Numeric, constant with 0
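The default size measure described for `size = NULL` can be reproduced by hand in base R. The snippet below is a minimal sketch on made-up toy values; the value chosen for `U` is only illustrative, and the code is not taken from the package source.

```r
# Toy inputs (hypothetical values, for illustration only)
numerator   <- c(12, 55, 48, 130)
denominator <- c(10, 50, 60, 65)
U <- 0.5   # illustrative choice; U = 1 is the default shown in Usage

# Default size when size = NULL: element-wise maximum of the two variables
size <- pmax(numerator, denominator)

# Importance measure, reported as 'sizeU' in the returned data frame
sizeU <- size^U
sizeU
```

With `U = 1` the importance measure coincides with `size` itself; smaller values of `U` compress the differences between large and small units.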

Details

This function searches for outliers starting from the ratios `r=numerator/denominator`. First, the ratios are centered around their median, as in the Hidiroglou and Berthelot (1986) procedure (see `HBmethod`); then outliers are identified using the adjusted boxplot for skewed distributions (Hubert and Vandervieren 2008) (see `adjboxStats`). The subset of outliers is sorted in decreasing order according to the size measure. If a size threshold is provided, only outliers with (size^U) > (size.th^U) are returned.
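The centering step can be illustrated with a few lines of base R. This is a sketch that assumes the usual Hidiroglou-Berthelot transformation (ratios below the median are mapped to `1 - median/r`, the others to `r/median - 1`); `center_ratios` is a hypothetical helper, not a function of univOutl, and the adjusted-boxplot step is not reproduced here.

```r
# Hypothetical helper (not part of univOutl): HB-style centering of ratios
center_ratios <- function(r) {
  med <- median(r)
  # Ratios below the median become negative, the others non-negative
  ifelse(r < med, 1 - med / r, r / med - 1)
}

# Toy data (made-up values)
numerator   <- c(12, 55, 48, 130, 33)
denominator <- c(10, 50, 60, 65, 30)

r <- numerator / denominator   # 1.2, 1.1, 0.8, 2.0, 1.1
s <- center_ratios(r)          # centered ratios; 0 for ratios equal to the median
round(s, 3)
```

The transformation makes departures below and above the median comparable, so a ratio of half the median and a ratio of twice the median sit at the same distance from zero, on opposite sides.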

Value

A list whose components depend on the `return.dataframe` argument. When `return.dataframe = FALSE` just the following components are returned:

`median.r` The median of the ratios.

`bounds` The bounds of the interval for centered ratios.

`excluded` The positions or the identifiers of the units excluded from the computations because of 0s or NAs.

`outliers` The positions or the identifiers of the units detected as outliers. Remember that when `size.th` is set, only outliers with (size^U) > (size.th^U) are returned.

When `return.dataframe=TRUE` the latter two components are substituted with two dataframes:

`excluded` A dataframe with the subset of excluded observations.

`data` A dataframe with the observations that were not excluded, with the following columns: ‘id’ (units' identifiers), ‘numerator’, ‘denominator’, ‘ratio’ (= numerator/denominator), ‘c.ratio’ (centered ratios, see Details), ‘sizeU’ (size^U values) and finally ‘outliers’, where the value 1 indicates an observation detected as an outlier and 0 otherwise. The data frame is sorted in decreasing order of size^U. Note that when a size threshold is provided, ONLY outliers with (size^U) > (size.th^U) will be returned.
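Once a data frame with these columns is available, the flagged units can be pulled out with ordinary subsetting. The sketch below uses a mock data frame whose values are made up; only the column names follow the description above, and `size.th` and `U` are illustrative values rather than recommended settings.

```r
# Mock data frame mimicking the documented columns of `data`
# (made-up values; a real one would come from
#  ratioSize(..., return.dataframe = TRUE)$data)
dd <- data.frame(
  id          = c("a", "b", "c", "d"),
  numerator   = c(130, 48, 55, 12),
  denominator = c(65, 60, 50, 10),
  sizeU       = c(130, 60, 55, 12),
  outliers    = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)
dd$ratio <- dd$numerator / dd$denominator

# Keep only the rows flagged as outliers
flagged <- dd[dd$outliers == 1, ]

# Apply a size threshold after the fact (hypothetical values)
size.th <- 50
U <- 1
big_outliers <- flagged[flagged$sizeU > size.th^U, ]
big_outliers$id
```

Filtering after the call, rather than through `size.th`, keeps the small outliers available for inspection if they turn out to matter.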

Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

References

Hidiroglou, M.A. and Berthelot, J.-M. (1986) ‘Statistical editing and Imputation for Periodic Business Surveys’. Survey Methodology, Vol 12, pp. 73-83.

Hubert, M., and Vandervieren, E. (2008) ‘An Adjusted Boxplot for Skewed Distributions’, Computational Statistics and Data Analysis, 52, pp. 5186-5201.

See Also

`HBmethod`, `plot4ratios`, `boxB`, `adjboxStats`

Examples

```
library(univOutl)

# simulate 30 ratios between 0.9 and 1.2, then plant one extreme ratio
set.seed(444)
x1 <- rnorm(30, 50, 5)
set.seed(555)
rr <- runif(30, 0.9, 1.2)
rr[10] <- 2
x2 <- x1 * rr

# default output: list with positions of excluded units and outliers
out <- ratioSize(numerator = x2, denominator = x1)
out

# return data frames instead of positions
out <- ratioSize(numerator = x2, denominator = x1,
                 return.dataframe = TRUE)
head(out$data)

# additionally filter the outliers with a size threshold
out <- ratioSize(numerator = x2, denominator = x1,
                 size.th = 65, return.dataframe = TRUE)
head(out$data)
```