ratioSize | R Documentation |
Identifies outliers on transformed ratios (centered with respect to their median) using the adjusted boxplot for skewed distributions. Outliers can be sorted and filtered according to a size measure.
ratioSize(numerator, denominator, id=NULL, size=NULL, U=1, size.th=NULL, return.dataframe=FALSE)
numerator |
Numeric vector with the values forming the numerator of the ratio |
denominator |
Numeric vector with the values forming the denominator of the ratio |
id |
Optional numeric or character vector, with identifiers of units. If |
size |
Optional numeric vector providing a measure of the importance of a ratio. If |
U |
Numeric constant with 0<U<=1 controlling the importance of each unit; in practice the final size measure is derived as (size^U). Commonly used values are 0.4, 0.5 or 1 (default). |
size.th |
Numeric, size threshold. Can be specified when a size measure is used. In such a case just outliers with a size greater than the threshold will be returned. Note that when argument |
return.dataframe |
Logical, if |
This function searches for outliers starting from the ratios r=numerator/denominator. First the ratios are centered around their median, as in the Hidiroglou and Berthelot (1986) procedure (see HBmethod); then outlier identification is based on the adjusted boxplot for skewed distributions (Hubert and Vandervieren 2008) (see adjboxStats).
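The centering step can be illustrated with a minimal base R sketch. It assumes the usual Hidiroglou-Berthelot transformation (ratios below the median are mapped to 1 - median/r, the others to r/median - 1) and strictly positive, finite ratios; center_ratios is a hypothetical helper written for illustration, not a function of the package, and the package's own handling of 0s and NAs is not reproduced here.

```r
# Hypothetical helper (not part of the package): centers ratios around
# their median, assuming the Hidiroglou-Berthelot transformation.
center_ratios <- function(r) {
  # Assumption: all ratios are finite and strictly positive
  stopifnot(is.numeric(r), all(is.finite(r)), all(r > 0))
  med <- median(r)
  # Ratios below the median become negative, the others non-negative
  ifelse(r < med, 1 - med / r, r / med - 1)
}

r <- c(0.5, 1, 2)
center_ratios(r)
```

With this transformation a ratio equal to half the median and a ratio equal to twice the median end up at the same distance from 0, which is what makes the centered ratios suitable for a boxplot-type rule.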
The subset of outliers is sorted in decreasing order according to the size measure. If a size threshold is provided, only outliers with (size^U) > (size.th^U) are returned.
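The sorting and thresholding rule can be sketched in base R as follows. This only illustrates the rule stated above, not the package's internal code; filter_by_size and its arguments are hypothetical names chosen for the example.

```r
# Hypothetical helper (not part of the package): keeps the units whose
# size^U exceeds size.th^U and sorts them by decreasing size^U.
filter_by_size <- function(ids, size, U = 1, size.th = NULL) {
  stopifnot(length(ids) == length(size), U > 0, U <= 1)
  sizeU <- size^U
  # Without a threshold every unit is kept
  keep <- if (is.null(size.th)) rep(TRUE, length(ids)) else sizeU > size.th^U
  ord <- order(sizeU[keep], decreasing = TRUE)
  ids[keep][ord]
}

# Illustrative values only: three flagged units with their sizes
filter_by_size(c("a", "b", "c"), size = c(100, 400, 25), U = 0.5, size.th = 36)
```

Because size^U is increasing in size for U > 0, comparing (size^U) with (size.th^U) selects the same units as comparing size with size.th when sizes are non-negative; U matters mainly for the magnitude of the size measure reported alongside each unit.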
A list whose components depend on the return.dataframe argument. When return.dataframe = FALSE just the following components are returned:
median.r |
the median of the ratios |
bounds |
The bounds of the interval for centered ratios |
excluded |
The position or the identifiers of the units with values excluded by the computations because of 0s or NAs. |
outliers |
The position or the identifiers of the units detected as outliers. Remember that when |
When return.dataframe = TRUE the latter two components are replaced by two data frames:
excluded |
A data frame with the subset of observations excluded |
data |
A data frame with the observations that were not excluded, with the following columns: ‘id’ (units' identifiers), ‘numerator’, ‘denominator’, ‘ratio’ (= numerator/denominator), ‘c.ratio’ (centered ratios, see Details), ‘sizeU’ (size^U values) and finally ‘outliers’, where the value 1 indicates an observation detected as an outlier and 0 otherwise. The data frame is sorted in decreasing order of size^U. Note that when a size threshold is provided, ONLY outliers with (size^U) > (size.th^U) are returned. |
Marcello D'Orazio mdo.statmatch@gmail.com
Hidiroglou, M.A. and Berthelot, J.-M. (1986) ‘Statistical editing and Imputation for Periodic Business Surveys’. Survey Methodology, Vol 12, pp. 73-83.
Hubert, M., and Vandervieren, E. (2008) ‘An Adjusted Boxplot for Skewed Distributions’, Computational Statistics and Data Analysis, 52, pp. 5186-5201.
HBmethod, plot4ratios, boxB, adjboxStats
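# simulate a denominator and a numerator with one anomalous ratio
set.seed(444)
x1 <- rnorm(30, 50, 5)
set.seed(555)
rr <- runif(30, 0.9, 1.2)
rr[10] <- 2
x2 <- x1 * rr

# default output: a list with positions of excluded units and outliers
out <- ratioSize(numerator = x2, denominator = x1)
out

# output with data frames
out <- ratioSize(numerator = x2, denominator = x1, return.dataframe = TRUE)
head(out$data)

# same call with a size threshold
out <- ratioSize(numerator = x2, denominator = x1, size.th = 65, return.dataframe = TRUE)
head(out$data)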