| deadwood | R Documentation |
Deadwood is an anomaly detection algorithm based on Mutual Reachability Minimum Spanning Trees. It trims protruding tree segments and marks small debris as outliers.
More precisely, the use of a mutual reachability distance pulls peripheral points farther away from each other. Tree edges with weights beyond the detected elbow point are removed. All the resulting connected components whose sizes are smaller than a given threshold are deemed anomalous.
deadwood(d, ...)
## Default S3 method:
deadwood(
d,
M = 5L,
contamination = NA_real_,
max_debris_size = NA_real_,
max_contamination = 0.5,
ema_dt = 0.01,
distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"),
verbose = FALSE,
...
)
## S3 method for class 'dist'
deadwood(
d,
M = 5L,
contamination = NA_real_,
max_debris_size = NA_real_,
max_contamination = 0.5,
ema_dt = 0.01,
verbose = FALSE,
...
)
## S3 method for class 'mstclust'
deadwood(
d,
contamination = NA_real_,
max_debris_size = NA_real_,
max_contamination = 0.5,
ema_dt = 0.01,
verbose = FALSE,
...
)
## S3 method for class 'mst'
deadwood(
d,
contamination = NA_real_,
max_debris_size = NA_real_,
max_contamination = 0.5,
ema_dt = 0.01,
cut_edges = NULL,
verbose = FALSE,
...
)
d |
a numeric matrix with |
... |
further arguments passed to |
M |
smoothing factor; |
contamination |
single numeric value or |
max_debris_size |
single integer value or |
max_contamination |
single numeric value;
maximal contamination level assumed when |
ema_dt |
single numeric value;
controls the smoothing parameter |
distance |
metric used in the case where |
verbose |
logical; whether to print diagnostic messages and progress information |
cut_edges |
numeric vector or |
As with all distance-based methods (this includes k-means and DBSCAN as well), applying data preprocessing and feature engineering techniques (e.g., feature scaling, feature selection, dimensionality reduction) might lead to more meaningful results.
If d is a numeric matrix or an object of class dist,
mst will be called to compute an MST, which
generally takes at most O(n^2) time. However, by default,
for low-dimensional Euclidean spaces, a faster algorithm based on K-d trees
is selected automatically; see mst_euclid from
the quitefastmst package.
Once the spanning tree is determined (\Omega(n \log n)-O(n^2)),
the Deadwood algorithm runs in O(n) time.
Memory use is also O(n).
A logical vector y of length n, where y[i] == TRUE
means that the i-th observation is deemed to be an outlier.
The mst attribute gives the computed minimum
spanning tree which can be reused in further calls to the functions
from genieclust, lumbermark, and deadwood.
cut_edges gives the cut_edges passed as argument.
contamination gives the detected contamination levels
in each cluster (which can be different from the observed proportion
of outliers detected).
M. Gagolewski, deadwood, in preparation, 2026, TODO
V. Satopaa, J. Albrecht, D. Irwin, B. Raghavan, Finding a "Kneedle" in a haystack: Detecting knee points in system behavior, In: 31st Intl. Conf. Distributed Computing Systems Workshops, 2011, 166-171, \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1109/ICDCSW.2011.20")}
R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1007/978-3-642-37456-2_14")}
The official online manual of deadwood at https://deadwood.gagolewski.com/
library("datasets")
data("iris")
X <- jitter(as.matrix(iris[1:2])) # some data
is_outlier <- deadwood(X, M=5)
plot(X, col=c("#ff000066", "#55555555")[is_outlier+1],
pch=c(16, 1)[is_outlier+1], asp=1, las=1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.