```{r}
pkg <- 'dbscan'

source("https://raw.githubusercontent.com/mhahsler/pkg_helpers/main/pkg_helpers.R")
pkg_title(pkg, anaconda = "r-dbscan", stackoverflow = "dbscan%2br")
```

Introduction

This R package [@hahsler2019dbscan] provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes:

- Clustering
- Outlier Detection
- Cluster Evaluation
- Fast Nearest-Neighbor Search (using kd-trees)

The implementations use the kd-tree data structure (from the ANN library) for faster k-nearest neighbor search. For Euclidean distance, they are typically faster than the native R implementations (e.g., dbscan in package fpc) and the implementations in WEKA, ELKI, and Python's scikit-learn.
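To illustrate why a kd-tree can speed up neighbor search, here is a minimal pure-Python sketch of a kd-tree with a single nearest-neighbor query. It is not the ANN library's code or the package's C++ implementation: the names `build_kdtree` and `nearest` are hypothetical, and the sketch leaves out refinements such as bucketing and tuned splitting rules.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split the points at the median of one coordinate per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(points)
dist, point = nearest(tree, (9.0, 2.0))
print(point, round(dist, 3))  # prints (8.0, 1.0) 1.414
```

The pruning test in `nearest` is what saves work: whole subtrees are skipped when they cannot contain a closer point, whereas a brute-force search computes the distance to every point.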

```{r}
pkg_usage(pkg)
pkg_citation(pkg, 2)
pkg_install(pkg)
```

Usage

Load the package and use the numeric variables in the iris dataset.

```{r}
library("dbscan")

data("iris")
x <- as.matrix(iris[, 1:4])
```

DBSCAN

```{r}
db <- dbscan(x, eps = .42, minPts = 5)
db
```

Visualize the resulting clustering (noise points are shown in black).

```{r}
pairs(x, col = db$cluster + 1L)
```
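To make the roles of `eps` and `minPts` concrete, here is a minimal pure-Python sketch of the DBSCAN idea. It is an illustration only, not the package's C++ implementation: the function `dbscan_sketch` and the toy points are hypothetical, neighborhoods are found by brute force, and the sketch assumes that a point counts as its own neighbor. Noise is labeled 0, which is why `db$cluster + 1L` above maps noise to color 1 (black).

```python
import math

def dbscan_sketch(points, eps, min_pts):
    """Label each point with a cluster id (1, 2, ...) or 0 for noise."""
    n = len(points)
    # eps-neighborhoods by brute force; each point is its own neighbor.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [0] * n           # 0 = noise or not yet assigned
    visited = [False] * n
    cluster_id = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue           # already handled, or not a core point
        cluster_id += 1
        visited[i] = True
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == 0:
                labels[j] = cluster_id        # first cluster to reach a point keeps it
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbors[j])
    return labels

# Two tight groups and one isolated point (hypothetical toy data).
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1),
          (5.0, 5.0), (5.0, 5.1), (5.1, 5.0), (5.1, 5.1),
          (10.0, 0.0)]
labels = dbscan_sketch(points, eps=0.3, min_pts=3)
print(labels)  # prints [1, 1, 1, 1, 2, 2, 2, 2, 0]
```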

OPTICS

```{r}
opt <- optics(x, eps = 1, minPts = 4)
opt
```

Extract a DBSCAN-like clustering from OPTICS and create a reachability plot (the DBSCAN clusters extracted at eps_cl = .4 are colored).

```{r}
opt <- extractDBSCAN(opt, eps_cl = .4)
plot(opt)
```
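The idea behind cutting a reachability plot at a threshold can be sketched in a few lines of Python. This follows the extraction procedure described with the original OPTICS algorithm, and the package's own implementation may differ in details. The function name `extract_dbscan_sketch` and all input values are hypothetical; they are not output of `optics()`. Both lists are assumed to be in OPTICS order, with `math.inf` standing for an undefined value.

```python
import math

def extract_dbscan_sketch(reachability, core_dist, eps_cl):
    """Assign cluster ids along an OPTICS ordering; 0 marks noise."""
    labels = []
    cluster_id = 0
    for reach, core in zip(reachability, core_dist):
        if reach > eps_cl:
            # Not reachable from earlier points at this threshold.
            if core <= eps_cl:
                cluster_id += 1          # a core point starts a new cluster
                labels.append(cluster_id)
            else:
                labels.append(0)         # neither reachable nor core: noise
        else:
            labels.append(cluster_id)    # continues the current cluster
    return labels

# Hypothetical values for seven points in OPTICS order.
reachability = [math.inf, 0.2, 0.3, 0.9, 0.25, 0.2, math.inf]
core_dist = [0.2, 0.2, 0.3, 0.25, 0.2, 0.25, math.inf]
labels = extract_dbscan_sketch(reachability, core_dist, eps_cl=0.4)
print(labels)  # prints [1, 1, 1, 2, 2, 2, 0]
```

In the reachability plot, each new cluster corresponds to a bar that rises above the `eps_cl` line, followed by a valley of bars below it.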

HDBSCAN

```{r}
hdb <- hdbscan(x, minPts = 4)
hdb
```

Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.

```{r}
plot(hdb, show_flat = TRUE)
```

Using dbscan with tidyverse

dbscan provides tidy(), augment(), and glance() methods for all clustering algorithms, so they can be used easily with tidyverse, ggplot2, and tidymodels.

```{r}
library(tidyverse)
db <- x %>% dbscan(eps = .42, minPts = 5)
```

Get cluster statistics as a tibble.

```{r}
tidy(db)
```

Visualize the clustering with ggplot2 (noise points are drawn as an x).

```{r}
augment(db, x) %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +
    geom_point(aes(color = .cluster, shape = noise)) +
    scale_shape_manual(values = c(19, 4))
```

Using dbscan from Python

R, the R package dbscan, and the Python package rpy2 need to be installed.

```{python, eval = FALSE}
import pandas as pd
import numpy as np

# prepare data
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                   header = None,
                   names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'])
iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]

# get R dbscan package
from rpy2.robjects import packages
dbscan = packages.importr('dbscan')

# enable automatic conversion of pandas dataframes to R dataframes
from rpy2.robjects import pandas2ri
pandas2ri.activate()

# cluster with the same argument names as in R (eps, minPts)
db = dbscan.dbscan(iris_numeric, eps = 0.5, minPts = 5)
print(db)
```

```
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.5, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 2 cluster(s) and 17 noise points.
##
##  0  1  2
## 17 49 84
##
## Available fields: cluster, eps, minPts, dist, borderPoints
```

```{python, eval = FALSE}
# get the cluster assignment vector
labels = np.array(db.rx('cluster'))
labels
## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
##         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
##         1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
##         2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
##         2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
##       dtype=int32)
```
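Once the labels are available in Python, cluster sizes and the number of noise points can be tallied with the standard library. The sketch below uses a short hypothetical label list in the same encoding as the `cluster` field (0 for noise, positive integers for clusters) so that it runs without R or rpy2.

```python
from collections import Counter

# Hypothetical label vector: 0 marks noise, positive integers are cluster ids.
labels = [1, 1, 1, 0, 2, 2, 0, 2, 2, 2]

sizes = Counter(labels)
n_noise = sizes.pop(0, 0)   # separate the noise count from the cluster sizes
print("noise points:", n_noise)
for cluster_id in sorted(sizes):
    print("cluster", cluster_id, "size", sizes[cluster_id])
```

The NumPy array shown earlier is two-dimensional, so it would first need to be flattened, for example with `labels.ravel().tolist()`, before being passed to `Counter`.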

License

The dbscan package is licensed under the GNU General Public License (GPL) Version 3. The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3), with permission of the original author, Erich Schubert.

References



mhahsler/dbscan documentation built on June 15, 2025, 9:42 a.m.