dbcv | R Documentation |
Calculate the Density-Based Clustering Validation Index (DBCV) for a clustering.
dbcv(x, cl, d, metric = "euclidean", sample = NULL)
x |
a data matrix or a dist object. |
cl |
a clustering (e.g., an integer vector) |
d |
dimensionality of the original data if a dist object is provided. |
metric |
distance metric used. The available metrics are the methods
implemented by |
sample |
sample size used for large datasets. |
DBCV (Moulavi et al., 2014) computes a score based on the density sparseness of each cluster and the density separation of each pair of clusters.
The density sparseness of a cluster (DSC) is defined as the maximum edge weight of a minimum spanning tree for the internal points of the cluster using the mutual reachability distance based on the all-points-core-distance. Internal points are connected to more than one other point in the cluster. Since clusters of size less than 3 cannot have internal points, they are ignored (considered noise) in this implementation.
The density separation of a pair of clusters (DSPC) is defined as the minimum reachability distance between the internal nodes of the spanning trees of the two clusters.
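For reference, the two distances that DSC and DSPC build on can be written as below. This is a transcription of the definitions in Moulavi et al. (2014) using the paper's notation, not a description of the internals of this implementation. The symbol names (KNN, n_i) are the paper's and are assumptions here; d is the dimensionality of the data, which is why the d argument is needed when only a dist object is supplied.

```latex
% n_i       : number of points in the cluster that contains object o
% KNN(o, i) : distance from o to its i-th nearest neighbor within that cluster
% d         : dimensionality of the original data
\[
  a_{pts}\mathrm{coredist}(o) =
  \left(
    \frac{\sum_{i=2}^{n_i} \left( \frac{1}{\mathrm{KNN}(o, i)} \right)^{d}}{n_i - 1}
  \right)^{-\frac{1}{d}}
\]

% Mutual reachability distance between two objects o_i and o_j,
% where dist(o_i, o_j) is the distance under the chosen metric
\[
  d_{mreach}(o_i, o_j) =
  \max \left\{
    a_{pts}\mathrm{coredist}(o_i),\;
    a_{pts}\mathrm{coredist}(o_j),\;
    \mathrm{dist}(o_i, o_j)
  \right\}
\]
```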
The validity index for a cluster is calculated using these measures and aggregated to a validity index for the whole clustering using a weighted average.
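Following Moulavi et al. (2014), the per-cluster validity index and its aggregation can be sketched as below. The notation is the paper's (l clusters C_1, ..., C_l, and O the set of all objects including noise) and is an assumption with respect to this implementation.

```latex
% Validity index of cluster C_i: compares its separation from the
% closest other cluster with its own sparseness
\[
  V_C(C_i) =
  \frac{\min_{j \neq i} \mathrm{DSPC}(C_i, C_j) - \mathrm{DSC}(C_i)}
       {\max \left( \min_{j \neq i} \mathrm{DSPC}(C_i, C_j),\; \mathrm{DSC}(C_i) \right)}
\]

% Validity index of the whole clustering: size-weighted average,
% where |O| counts all objects, including noise
\[
  \mathrm{DBCV}(C) = \sum_{i=1}^{l} \frac{|C_i|}{|O|} \, V_C(C_i)
\]
```

Because the numerator is bounded in absolute value by the denominator, each V_C(C_i) lies in [-1, 1], and so does the weighted average.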
The index is in the range [-1, 1]. If the cluster density compactness is better than the density separation, a positive value is returned. The actual value depends on the separability of the data. In general, greater values of the measure indicate a better density-based clustering solution.
Noise points are included in the calculation only in the weighted average; therefore, a clustering with more noise points will get a lower index.
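The effect of noise on the weighted average can be illustrated with a minimal base R sketch. The helper dbcv_weighted() and the numbers below are hypothetical and for illustration only; they are not part of the package, and the sketch assumes that the weights are cluster sizes divided by the total number of points, including noise.

```r
# Hypothetical helper (not part of dbscan): size-weighted average of
# per-cluster validity indices, with noise points counted only in the
# denominator.
dbcv_weighted <- function(v_c, sizes, n_noise = 0) {
  n_total <- sum(sizes) + n_noise
  sum(sizes / n_total * v_c)
}

# Illustrative values only: two clusters with made-up validity indices
v_c   <- c(0.8, 0.6)
sizes <- c(40, 50)

score_clean <- dbcv_weighted(v_c, sizes, n_noise = 0)
score_noisy <- dbcv_weighted(v_c, sizes, n_noise = 10)

# The same clusters score lower once noise points are added
score_noisy < score_clean
```

With these made-up values, adding 10 noise points to 90 clustered points lowers the weighted score even though the per-cluster indices are unchanged.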
Performance note: This implementation calculates a distance matrix and thus can only be used for small or sampled datasets.
A list with the DBCV score for the clustering, the density sparseness of cluster (dsc) values, the density separation of pairs of clusters (dspc) distances, and the validity indices of clusters (c_c).
Matt Piekenbrock and Michael Hahsler
Davoud Moulavi and Pablo A. Jaskowiak and Ricardo J. G. B. Campello and Arthur Zimek and Jörg Sander (2014). Density-Based Clustering Validation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 839-847. doi:10.1137/1.9781611973440.96
Pablo A. Jaskowiak (2022). MATLAB implementation of DBCV. https://github.com/pajaskowiak/dbcv
library(dbscan)

# Load a test dataset
data(Dataset_1)
x <- Dataset_1[, c("x", "y")]
class <- Dataset_1$class
clplot(x, class)
# We use MinPts 3 and use the knee at eps = .1 for dbscan
kNNdistplot(x, minPts = 3)
cl <- dbscan(x, eps = .1, minPts = 3)
clplot(x, cl)
dbcv(x, cl)
# compare to the DBCV index on the original class labels and
# with a random partitioning
dbcv(x, class)
dbcv(x, sample(1:4, replace = TRUE, size = nrow(x)))
# find the best eps using dbcv
eps_grid <- seq(.05, .2, by = .01)
cls <- lapply(eps_grid, FUN = function(e) dbscan(x, eps = e, minPts = 3))
dbcvs <- sapply(cls, FUN = function(cl) dbcv(x, cl)$score)
plot(eps_grid, dbcvs, type = "l")
eps_opt <- eps_grid[which.max(dbcvs)]
eps_opt
cl <- dbscan(x, eps = eps_opt, minPts = 3)
clplot(x, cl)