pkg <- 'dbscan'

source("https://raw.githubusercontent.com/mhahsler/pkg_helpers/main/pkg_helpers.R")
pkg_title(pkg)

Anaconda.org

Introduction

This R package provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes:

Clustering

Outlier Detection

Fast Nearest-Neighbor Search (using kd-trees)

The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search, and are typically faster than the native R implementations (e.g., dbscan in package fpc), or the implementations in WEKA, ELKI and Python's scikit-learn.

pkg_usage(pkg)
pkg_citation(pkg, 2)
pkg_install(pkg)

Usage

Load the package and use the numeric variables in the iris dataset

library("dbscan")

data("iris")
x <- as.matrix(iris[, 1:4])

DBSCAN

db <- dbscan(x, eps = .4, minPts = 4)
db

Visualize the resulting clustering (noise points are shown in black).

pairs(x, col = db$cluster + 1L)

OPTICS

opt <- optics(x, eps = 1, minPts = 4)
opt

Extract DBSCAN-like clustering from OPTICS and create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored)

opt <- extractDBSCAN(opt, eps_cl = .4)
plot(opt)

HDBSCAN

hdb <- hdbscan(x, minPts = 4)
hdb

Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.

plot(hdb, show_flat = TRUE)

Using dbscan from Python

R, R package dbscan, and Python package rpy2 need to be installed.

```{python, eval = FALSE} import pandas as pd import numpy as np

prepare data

iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None, names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']) iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]

get R dbscan package

from rpy2.robjects import packages dbscan = packages.importr('dbscan')

enable automatic conversion of pandas dataframes to R dataframes

from rpy2.robjects import pandas2ri pandas2ri.activate()

db = dbscan.dbscan(iris_numeric, eps = 0.5, MinPts = 5) print(db)


DBSCAN clustering for 150 objects.

Parameters: eps = 0.5, minPts = 5

Using euclidean distances and borderpoints = TRUE

The clustering contains 2 cluster(s) and 17 noise points.

0 1 2

17 49 84

Available fields: cluster, eps, minPts, dist, borderPoints

```{python, eval = FALSE}
# get the cluster assignment vector
labels = np.array(db.rx('cluster'))
labels
## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
##         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
##         1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
##         2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
##         2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
##       dtype=int32)

License

The dbscan package is licensed under the GNU General Public License (GPL) Version 3. The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3), with permission by the original author, Erich Schubert.

Changes

References



mhahsler/dbscan documentation built on March 4, 2024, 7:42 a.m.