Fast reimplementation of the DBSCAN (Density-based spatial clustering of applications with noise) clustering algorithm using a kd-tree. The implementation is significantly faster and can work with larger data sets then dbscan in fpc.

1 2 3 4 |

`x` |
a data matrix or a dist object. Alternatively, a frNN object with fixed-radius nearest neighbors can also be specified. In this case eps can be missing and will be taken form the frNN object. |

`eps` |
size of the epsilon neighborhood. |

`minPts` |
number of minimum points in the eps region (for core points). Default is 5 points. |

`weights` |
numeric; weights for the data points. Only needed to perform weighted clustering. |

`borderPoints` |
logical; should border points be assigned. The default
is |

`object` |
a DBSCAN clustering object. |

`data` |
the data set used to create the DBSCAN clustering object. |

`newdata` |
new data set for which cluster membership should be predicted. |

`...` |
additional arguments are passed on to fixed-radius
nearest neighbor search algorithm. See |

*Note:* use `dbscan::dbscan`

to call this
implementation when you also use package fpc.

This implementation of DBSCAN implements the original algorithm as described by
Ester et al (1996). DBSCAN estimates the density around each data point by counting the number of points in a user-specified eps-neighborhood and applies a used-specified minPts thresholds to identify core, border and noise points. In a second step, core points are joined into a cluster if they are density-reachable (i.e., there is a chain of core points where one falls inside the eps-neighborhood of the next). Finally, border points are assigned to clusters. The algorithm only needs
parameters `eps`

and `minPts`

.

Border points are arbitrarily assigned to clusters in the original algorithm.
DBSCAN* (see Campello et al 2013) treats all border points as noise points. This is implemented with `borderPoints = FALSE`

.

Setting parameters for DBSCAN:
`minPts`

is often set to be dimensionality of the data plus one or higher.
The knee in `kNNdistplot`

can be used to find suitable values for `eps`

.

See `frNN`

for more information on the parameters related to nearest neighbor search.

A precomputed frNN object can be supplied as `x`

. In this case
`eps`

does not need to be specified. This option us useful
for large data sets, where a sparse distance matrix is available.
See `frNN`

how to create frNN objects.

`predict`

can be used to predict cluster memberships for new data points.
A point is considered a member of a cluster if it is within the eps
neighborhood of a member of the cluster. Points which cannot be assigned to a
cluster will be reported as members of the noise cluster 0.

A object of class 'dbscan_fast' with the following components:

`eps ` |
value of the eps parameter. |

`minPts ` |
value of the minPts parameter. |

`cluster ` |
A integer vector with cluster assignments. Zero indicates noise points. |

Michael Hahsler

Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. *Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).*

Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013). Density-Based Clustering
Based on Hierarchical Density Estimates. *Proceedings of the 17th
Pacific-Asia Conference on Knowledge Discovery
in Databases, PAKDD 2013,* Lecture Notes in Computer Science 7819, p. 160.

`kNNdistplot`

,
`frNN`

,
`dbscan`

in fpc.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 | ```
data(iris)
iris <- as.matrix(iris[,1:4])
## find suitable eps parameter using a k-NN plot for k = dim + 1
## Look for the knee!
kNNdistplot(iris, k = 5)
abline(h=.4, col = "red", lty=2)
res <- dbscan(iris, eps = .4, minPts = 5)
res
pairs(iris, col = res$cluster + 1L)
## use precomputed frNN
fr <- frNN(iris, eps = .4)
dbscan(fr, minPts = 5)
## example data from fpc
set.seed(665544)
n <- 100
x <- cbind(
x = runif(10, 0, 10) + rnorm(n, sd = 0.2),
y = runif(10, 0, 10) + rnorm(n, sd = 0.2)
)
res <- dbscan(x, eps = .3, minPts = 3)
res
## plot clusters and add noise (cluster 0) as crosses.
plot(x, col=res$cluster)
points(x[res$cluster==0,], pch = 3, col = "grey")
hullplot(x, res)
## predict cluster membership for new data points
## (Note: 0 means it is predicted as noise)
newdata <- x[1:5,] + rnorm(10, 0, .2)
predict(res, x, newdata)
## compare speed against fpc version (if microbenchmark is installed)
## Note: we use dbscan::dbscan to make sure that we do now run the
## implementation in fpc.
## Not run:
if (requireNamespace("fpc", quietly = TRUE) &&
requireNamespace("microbenchmark", quietly = TRUE)) {
t_dbscan <- microbenchmark::microbenchmark(
dbscan::dbscan(x, .3, 3), times = 10, unit = "ms")
t_dbscan_linear <- microbenchmark::microbenchmark(
dbscan::dbscan(x, .3, 3, search = "linear"), times = 10, unit = "ms")
t_dbscan_dist <- microbenchmark::microbenchmark(
dbscan::dbscan(x, .3, 3, search = "dist"), times = 10, unit = "ms")
t_fpc <- microbenchmark::microbenchmark(
fpc::dbscan(x, .3, 3), times = 10, unit = "ms")
r <- rbind(t_fpc, t_dbscan_dist, t_dbscan_linear, t_dbscan)
r
boxplot(r,
names = c('fpc', 'dbscan (dist)', 'dbscan (linear)', 'dbscan (kdtree)'),
main = "Runtime comparison in ms")
## speedup of the kd-tree-based version compared to the fpc implementation
median(t_fpc$time) / median(t_dbscan$time)
}
## End(Not run)
``` |

Questions? Problems? Suggestions? Tweet to @rdrrHQ or email at ian@mutexlabs.com.

Please suggest features or report bugs with the GitHub issue tracker.

All documentation is copyright its authors; we didn't write any of that.