Vignette: Weighted BACON algorithms
In wbacon: Weighted BACON Algorithms

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "",
    prompt = TRUE
)

```{css, echo = FALSE} .my-callout { padding: 0.25rem; padding-left: 1rem; margin-top: 0.25rem; margin-bottom: 0.25rem; border: 1px solid #eee; border-left-width: 0.75rem; border-left-color: #df536b; border-radius: .25rem }

## 1 Introduction

The package `wbacon` implements a weighted variant of the BACON (blocked
adaptive computationally-efficient outlier nominators) algorithms [Billor et
al.](#biblio) (2000) for multivariate outlier detection and robust linear
regression. The extension of the BACON algorithm for outlier detection to allow
for weighting is due to [Béguin and Hulliger](#biblio) (2008).

The details of the package are discussed in the accompanying paper; see
[Schoch](#biblio) (2021)

First, we attach the package to the search space.

```r
library("wbacon")

1.1 Available methods

wBACON() is for multivariate outlier nomination and robust estimation of location/ center and covariance matrix
wBACON_reg() is for robust linear regression (the method is robust against outliers in the response variable and the model's design matrix)

1.2 Assumptions

The BACON algorithms assume that the underlying model is an appropriate description of the non-outlying observations; Billor et al. (2000). More precisely,

the outlier nomination method assumes that the "good" data have (roughly) an elliptically contoured distribution (this includes the Gaussian distribution as a special case);
the regression method assumes that the non-outlying ("good") data are described by a linear (homoscedastic) regression model and that the independent variables (having removed the regression intercept/constant, if there is a constant) follow (roughly) an elliptically contoured distribution.

"Although the algorithms will often do something reasonable even when these assumptions are violated, it is hard to say what the results mean." [Billor et al.](#biblio) (2000, p. 290)

It is strongly recommended that the structure of the data be examined and whether the assumptions made about the "good" observations are reasonable.

1.3 The role of the data analyst

In line with Billor et al. (2000, p. 290), we use the term outlier "nomination" rather than "detection" to highlight that algorithms should not go beyond nominating observations as potential outliers; see also Béguin and Hulliger (2008). It is left to the analyst to finally label outlying observations as such.

The software provides the analyst with tools and measures to study potentially outlying observations. It is strongly recommended to use the tools.

1.4 Additional information

Additional information on the BACON algorithms and the implementation can be found in the documents:

methods.pdf: A mathematical description of the algorithms and their implementation;
doc_c_functions.pdf: A documentation of the C functions.

Both documents can be found in the package folder doc.

2 Multivariate outlier detection

In this section, we study multivariate outlier detection for the two datasets

bushfire data (with sampling weights),
philips data (without sampling weights).

2.1 Bushfire data

The bushfire dataset is on satellite remote sensing. These data were used by Campbell (1984) to locate bushfire scars. The data are radiometer readings from polar-orbiting satellites of the National Oceanic and Atmospheric Administration (NOAA) which have been collected continuously since 1981. The measurements are taken on five frequency bands or channels. In the near infrared band, it is possible to distinguish vegetation types from burned surface. At visible wavelengths, the vegetation spectra are similar to burned surface. The spatial resolution is rather low (1.1 km per pixel).

2.1.1 Data preparation

The bushfire data contain radiometer readings for 38 pixels and have been studied in Maronna and Yohai (1995), Béguin and Hulliger (2002), Béguin and Hulliger (2008), and Hulliger and Schoch (2009). The data can be obtained from the R package modi(Hulliger, 2023).¹

data(bushfire, package = "modi")

The first 6 readings on the five frequency bands (variables) are

head(bushfire)

Béguin and Hulliger (2008) generated a set of sampling weights. The weights can be attached to the current session by

data(bushfire.weights, package = "modi")

2.1.2 Outlier detection

fit <- wBACON(bushfire, w = bushfire.weights, alpha = 0.05)
fit

The argument alpha determines the $(1-\alpha)$-quantile $\chi_{\alpha,d}^2$ of the chi-square distribution with $d$ degrees of freedom.² All observations whose squared Mahalanobis distances are smaller than the quantile (times a correction factor) are selected into the subset of outlier-free data. It is recommended to choose alpha on grounds of an educated guess of the share of "good" observations in the data. Here, we suppose that 95\% of the observations are not outliers.

By default, the initial subset is determined by the Euclidean norm (initialization method: version = "V2").

This initialization method is robust because it is based on the coordinate-wise (weighted) median. The resulting estimators of center and scatter are not affine equivariant. Let $T(\cdot)$ denote an estimator of a parameter of interest (e.g., covariance matrix) and let $X$ denote the $(n \times p)$ data matrix. An estimator $T$ is affine equivariant if and only if $T(A X + b) = A T(X) + b$, for any nonsingular $(m \times n)$ matrix $A$ and any $n$-vector $b$. Although version "V2" of the BACON method yields an estimator that is not affine equivariant in the above sense, Billor et al. (2000) point out that the method is nearly affine equivariant.
There exists an alternative initialization method ("version = V1") which is based on the coordinate-wise (weighted) means; therefore, it is affine equivariant but not robust.

From the above output, we see that the algorithm converged in three iterations. In case the algorithm does not converge, we may increase the maximum number of iterations (default: maxiter = 50) and toggle verbose = TRUE to (hopefully) learn more why the method did not converge.

In the next step, we want to study the result in more detail. In particular, we are interested in the estimated center and scatter (or covariance) matrix. To this end, we can call the summary() method on the object fit.

summary(fit)

2.1.3 Diagnostics

The method has detected r sum(is_outlier(fit)) potential outliers. It is important to study the diagnostic plot to learn more about the potential outliers. The robust (Mahalanobis) distances vs. the index of the observations (1:n) can be plotted as follows.

plot(fit, 1)

The dashed horizontal line shows the cutoff threshold on the robust distances. Observations above the line are nominated as potential outliers by the BACON algorithm. It is left to the analyst to finally label outlying observations as such. In the next section, we introduce an alternative plotting method (see below).

The method is_outlier() returns a vector of logicals whether an observation has been flagged as a potential outlier.

which(is_outlier(fit))

The (robust) center and covariance (scatter) matrix can be extracted with the auxiliary functions, respectively, center() and cov().

center(fit)

The robust Mahalanobis distances can be extracted with the distance() method.

2.2 Philips data

Old television sets had a cathode ray tube with an electron gun. The emitted beam runs through a diaphragm that lets pass only a partial beam to the screen. The diaphragm consists of 9 components. The Philips data set contains $n = 667$ measurements on the $p = 9$ components (variables); see Rousseeuw and van Driessen (1999).³ These data do not have sampling weights.

data(philips)
head(philips)

We compute the BACON algorithm but this time with the initialization method version = "V1".

fit <- wBACON(philips, alpha = 0.05, version = "V1")
fit

The BACON algorithm detected r sum(is_outlier(fit)) potential outliers. The robust (Mahalanobis) distances can be plotted against the univariate projection of the data, which maximizes the separation criterion of Qiu and Joe (2006). This kind of diagnostic graph attempts to separate outlying from non-outlying observations as much as possible; see Willems et al. (2009). It is helpful if the outliers are clustered. The graph is generated as follows.

plot(fit, which = 2)

From the visual display, we see a cluster of potential outliers in the top right corner. The dashed horizontal line indicates the cutoff threshold on the distances as imposed by the BACON algorithm.

For very large datasets, the plot method can be called with the (additional) argument hex = TRUE to show a hexagonally binned scatter plot; see below. This plot method uses the functionality of the R package hexbin (Carr et al., 2023).

plot(fit, which = 2, hex = TRUE)

3 Robust linear regression

The education data is on education expenditures in 50 US states in 1975 (Chatterjee and Hadi, 2012, Chap. 5.7). The data can be loaded from the robustbase package.

data(education, package = "robustbase")

It is convenient to rename the variables.

names(education)[3:6] <- c("RES", "INC", "YOUNG", "EXP")
head(education)

The measured variables for the 50 states are:

State: State
Region: group variable with outcomes: 1=Northeastern, 2=North central, 3=Southern, and 4=Western
RES: Number of residents per thousand residing in urban areas in 1970
INC: Per capita personal income in 1973 (\$US)
YOUNG: Number of residents per thousand under 18 years of age in 1974
EXP: Per capita expenditure on public education in a state (\$US), projected for 1975

3.1 Model fit

We want to regress education expenditures (EXP) on the variables RES, INC, and YOUNG by the BACON algorithm, and obtain

reg <- wBACON_reg(EXP ~ RES + INC + YOUNG, data = education)
reg

The instance reg is an object of the class wbaconlm. The printed output of wBACON_reg is identical with the one of the lm function. In addition, we are told the size of the subset on which the regression has been computed. The observations not in the subset are considered outliers (here 1 out of 50 observations).

The summary() method can be used to obtain a summary of the estimated model.

summary(reg)

The summary output of wBACON_reg is identical with the output of the lm estimate on the subset of outlier-free data,

summary(lm(EXP ~ RES + INC + YOUNG, data = education[!is_outlier(reg), ]))

where we have used is_outlier() to extract the set of declared outliers from reg (the summary output of the lm estimate is not shown).

3.2 Tuning

By default, wBACON_reg uses the parametrization $\alpha = 0.05$, collect = 4, and version = "V2". These parameters are used to call the wBACON algorithm on the design matrix. Then, the same parameters are used to compute the robust regression.

To ensure a high breakdown point, version = "V2" should not be changed to version = "V1" unless you have good reasons. The main "turning knob" to tune the algorithm is alpha, which defines the $(1-$alpha$)$ quantile of the Student $t$-distribution. All observations whose distances/discrepancies [See document methods.pdf in the folder doc of the package.] are smaller (in absolute value) than the quantile are selected into the subset of "good" data. By choosing smaller values for alpha (e.g., 0.2), more observations are selected (ceteris paribus) into the subset of "good" data (and vice versa).

The parameter collect specifies the initial subset size, which is defined as $m = p \cdot collect$. It can be modified but should be chosen such that $m$ is considerably smaller than the number of observations $n$. Otherwise there is a high risk of selecting too many "bad" observations into the initial subset, which will eventually bias the regression estimates.

In case the algorithm does not converge, we may increase the maximum number of iterations (default: maxiter = 50) and toggle verbose = TRUE to (hopefully) learn more why the method did not converge.

3.3 Model diagnostics

The methods coef(), vcov(), and predict() work exactly the same as their lm counterparts. This is also true for the first three plot types, that is

which = 1: Residuals vs Fitted,
which = 2: Normal Q-Q,
which = 3: Scale-Location

The plot types 4:6 of plot.lm are not implemented for objects of the class wbaconlm because it is not sensible to study the standard regression influence diagnostics in the presence of outliers in the model's design space. Instead, type four (which = 4) plots the robust Mahalanobis distances with respect to the non-constant design variables against the standardized residual. This plot has been proposed by Rousseeuw and van Zomeren (1990).

plot(reg, 4)

The filled circle(s) represent the outliers nominated by the BACON algorithm. The outlier in the top right corner is both a residual outlier and an outlier in the model's design space.

Observation with robust Mahalanobis distances larger than 4.57 (see abscissae) are flagged as outliers in the model's design space (leverage observations).
Observations whose standardized residual falls outside the interval spanned by $\pm \, t_{\alpha/(2m+2), m - p}$, where $t_{\alpha, m - p}$ is the $(1-\alpha)$ quantile of the Student $t$-distribution with $m-p$ degrees of freedom, $m$ denoting the size of the final subset of outlier-free data. Here, we have $m=49$, $\alpha = 0.05$ (see argument alpha of wBACON_reg), thus the interval is $[-3.52, \; 3.52]$.

References {#biblio}

Béguin, C. and B. Hulliger (2002). Robust Multivariate Outlier Detection and Imputation with Incomplete Survey Data, Deliverable D4/5.2.1/2 Part C: EUREDIT project, https://www.cs.york.ac.uk/euredit/euredit-main.html, research project funded by the European Commission, IST-1999-10226.

Béguin, C. and B. Hulliger (2008). The BACON-EEM Algorithm for Multivariate Outlier Detection in Incomplete Survey Data, Survey Methodology, 34, 91--103.

Billor, N., A. S. Hadi, and P. F. Vellemann (2000). BACON: Blocked Adaptive Computationally-efficient Outlier Nominators, Computational Statistics and Data Analysis, 34, 279--298. DOI 10.1016/S0167-9473(99)00101-2

Campbell, N. A. (1989). Bushfire Mapping using NOAA AVHRR Data. Technical Report. Commonwealth Scientific and Industrial Research Organisation, North Ryde.

Carr, D., N. Lewin-Koh, and M. Maechler (2023). hexbin: Hexagonal Binning Routines. R package version 1.28.3. (The package contains copies of lattice functions written by Deepayan Sarkar). URL https://CRAN.R-project.org/package=hexbin

Chatterjee, S. and A. H. Hadi (2012). Regression Analysis by Example, 5th ed., Hoboken (NJ): John Wiley \& Sons.

Hulliger, B. and T. Schoch (2009). Robust multivariate imputation with survey data, in Proceedings of the 57th Session of the International Statistical Institute, Durban.

Hulliger, B. (2023). modi: Multivariate Outlier Detection and Imputation for Incomplete Survey Data, R package version 0.1-2. URL https://CRAN.R-project.org/package=modi

Maechler, M., P. Rousseeuw, C. Croux, V. Todorov, A. Ruckstuhl, M. Salibian-Barrera, T. Verbeke, M. Koller, E. L. T. Conceicao, and M. Anna di Palma (2024). robustbase: Basic Robust Statistics, R package version 0.99-2. URL https://CRAN.R-project.org/package=robustbase

Maronna, R. A. and V. J. Yohai (1995). The Behavior of the Stahel-Donoho Robust Multivariate Estimator, Journal of the American Statistical Association, 90 330--341. DOI 10.2307/2291158

Qiu, W. and H. Joe (2006). Separation index and partial membership for clustering, Computational Statistics and Data Analysis, 50, 585--603. DOI 10.1016/j.csda.2004.09.009

Raymaekers, J. and P. Rousseeuw (2023). cellWise: Analyzing Data with Cellwise Outliers, R package version 2.5.3. URL https://CRAN.R-project.org/package=cellWise

Rousseeuw, P. J. and K. van Driessen (1999). A fast algorithm for the Minimum Covariance Determinant estimator, Technometrics, 41, 212--223. DOI 10.2307/1270566

Rousseeuw, P. J. and K. van Zomeren (1990). Unmasking Multivariate Outliers and Leverage Points, Journal of the American Statistical Association, 411, 633--639. DOI 10.2307/2289995

Schoch, T. (2021) wbacon: Weighted BACON algorithms for multivariate outlier nomination (detection) and robust linear regression, Journal of Open Source Software, 6, 3238. DOI 10.21105/joss.03238

Willems, G., H. Joe, and R. Zamar (2009). Diagnosing Multivariate Outliers Detected by Robust Estimators, Journal of Computational and Graphical Statistics, 18, 73--91. DOI 10.1198/jcgs.2009.0005

Notes {#notes}

¹ The data are also distributed with the R package robustbase (Maechler et al., 2023).

² The degrees of freedom $d$ is a function of the number of variables $p$, the number of observations $n$, and the size of the current subset $m$; see methods.pdf in the inst/doc folder of the package.

³ The philips data has been published in the R package cellWise (Raymaekers and Rousseeuw, 2023).

Any scripts or data that you put into this service are public.

wbacon documentation built on June 8, 2025, 10:07 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

wbacon
Weighted BACON Algorithms

Vignette: Weighted BACON algorithms
In wbacon: Weighted BACON Algorithms

1.1 Available methods

1.2 Assumptions

1.3 The role of the data analyst

1.4 Additional information

2 Multivariate outlier detection

2.1 Bushfire data

2.1.1 Data preparation

2.1.2 Outlier detection

2.1.3 Diagnostics

2.2 Philips data

3 Robust linear regression

3.1 Model fit

3.2 Tuning

3.3 Model diagnostics

References {#biblio}

Notes {#notes}

Try the wbacon package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

wbacon Weighted BACON Algorithms

Vignette: Weighted BACON algorithms In wbacon: Weighted BACON Algorithms

1.1 Available methods

1.2 Assumptions

1.3 The role of the data analyst

1.4 Additional information

2 Multivariate outlier detection

2.1 Bushfire data

2.1.1 Data preparation

2.1.2 Outlier detection

2.1.3 Diagnostics

2.2 Philips data

3 Robust linear regression

3.1 Model fit

3.2 Tuning

3.3 Model diagnostics

References {#biblio}

Notes {#notes}

Try the wbacon package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

wbacon
Weighted BACON Algorithms

Vignette: Weighted BACON algorithms
In wbacon: Weighted BACON Algorithms