HBmethod | R Documentation |
This function implements the method proposed by Hidiroglou and Berthelot (1986) to identify outliers in periodic data, i.e. when the same variable is measured at two time points.
HBmethod(yt1, yt2, U=0.5, A=0.05, C=4, pct=0.25, id=NULL, std.score=FALSE, return.dataframe=FALSE, adjboxE=FALSE)
yt1 |
Numeric vector providing the values observed at time t1. |
yt2 |
Numeric vector providing the values observed at time t2 (t2 > t1). |
U |
Numeric, parameter needed to determine the ‘importance’ of a ratio. The value should lie in [0, 1] interval; commonly used values are 0.3, 0.4, or 0.5 (default) (see Details for further information). |
A |
Numeric, parameter needed when computing the scale measure used to derive the bounds. Hidiroglou and Berthelot (1986) suggest setting A = 0.05 (default) (see Details for further information). |
C |
Numeric, parameter determining the extension of the interval; greater values will provide larger intervals, i.e. fewer expected outliers. Values commonly used are 4 (default) or 7, but also values close or grater than 40 can be used in some particular cases. Note that two C values can be provided instead of one, the first one will be used to determine the left tail bound, while the second determines the right tail bound; this setting can help in improving outlier detection in skewed distributions (see Details for further information). |
pct |
Numeric, the percentage point of the scores that will be used to calculate the lower and upper bounds. By default, |
id |
Optional numeric or character vector, with identifiers of units. If |
std.score |
Logical, if |
return.dataframe |
Logical, if |
adjboxE |
Logical (default |
The method proposed by Hidiroglou and Berthelot (1986) to identify outliers in periodic data consists in deriving a score variable based on the ratios r_i = y_{i,t2}/y_{i,t1} (yt2/yt1
) with i=1,2,...,n being n the number of observations after discarding NAs and 0s in both yt1
and yt2
.
At first the ratios are centered around their median r_M:
s_i = 1 - r_M/r_i if 0 < r_i < r_M
s_i = r_i/r_M - 1 if r_i >= r_M
Then, in order to account for the magnitude of data, the following score is derived:
E_i = s_i [max(y_{i,t1}, y_{i,t2})]^U
Finally, the interval is calculated as:
(E_M - C*d_{Q1}, E_M + C*d_{Q3})
where
max( E_M - E_{Q1}, |A*E_M| ) and max( E_{Q3} - E_M, |A*E_M| )
being E_Q1, E_M and E_Q3 the quartiles of the E scores (when pct = 0.25
, default)).
Recently Hidiroglou and Emond (2018) suggest using percentiles P10 and P90 of the E scores in replacement of respectively Q1 and Q3 to avoid the drawback of many units identified as outliers; this is likely to occur when a large proportion of units (>1/4) has the same ratio. P10 and P90 are achieved by setting pct = 0.10
when running the function.
In practice, all the units with an E score outside the interval are considered as outliers. Notice that when two C
values are provided, then the first is used to derive the left bound while the second determines the right bound.
When std.score=TRUE
a standardized score is derived in the following manner:
z_E,i = g*(E_i - E_M)/d_{Q1} if E_i < E_M
z_E,i = g*(E_i - E_M)/d_{Q3} if E_i >= E_M
The constant g is set equal to qnorm(1-pct)
and makes d_{Q1} and d_{Q3} approximately unbiased estimators when the E scores follow the normal distribution.
When adjboxE = TRUE
outliers on the E scores will all be searched using the boxplot adjusted for skewness as implemented in the function boxB
when run with with argument method = "adjbox"
.
A list whose components depend on the return.dataframe
argument. When return.dataframe=FALSE
just the following components are provided:
median.r |
the median of the ratios |
quartiles.E |
Quartiles of the E score |
bounds.E |
Bounds of the interval of the E score, values outside are considered outliers. |
excluded |
The identifiers or positions (when |
outliers |
The identifiers or positions (when |
outliersBB |
The identifiers or positions (when |
When return.dataframe=TRUE
, the first three components remain the same with, in addition, two dataframes:
excluded |
A dataframe with the subset of observations excluded. The data frame has the following columns: 'id' (units' identifiers), 'yt1' columns 'yt2' |
data |
A dataframe with the the not excluded observations and the following columns: ‘id’ (units' identifiers), ‘yt1’, ‘yt2’, ‘ratio’ (= yt1/yt2), ‘sizeU’ (=max(yt1, yt2)^U),‘Escore’ (the E scores, see Details), ‘std.Escore’ (the standardized E scores when |
Marcello D'Orazio mdo.statmatch@gmail.com
Hidiroglou, M.A. and Berthelot, J.-M. (1986) ‘Statistical editing and Imputation for Periodic Business Surveys’. Survey Methodology, Vol 12, pp. 73-83.
Hidiroglou, M.A. and Emond, N. (2018) ‘Modifying the Hidiroglou-Berthelot (HB) method’. Unpublished note, Business Survey Methods Division, Statistics Canada, May 18 2018.
plot4ratios
, ratioSize
set.seed(222) x0 <- rnorm(30, 50, 5) x0[1] <- NA set.seed(333) rr <- runif(30, 0.9, 1.2) rr[10] <- 2 x1 <- x0 * rr x1[20] <- 0 out <- HBmethod(yt1 = x0, yt2 = x1) out$excluded out$median.r out$bounds.E out$outliers cbind(x0[out$outliers], x1[out$outliers]) out <- HBmethod(yt1 = x0, yt2 = x1, return.dataframe = TRUE) out$excluded head(out$data)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.