Cluster sharpening using `ksharp`

suppressMessages(library(Rcssplot))
library(cluster)
library(dbscan)
library(ksharp)
set.seed(300299)
RcssDefaultStyle <- Rcss("ksharp.Rcss")
dbscan.named = function(data, eps) {
  result = dbscan(data, eps=eps)
  names(result$cluster) = rownames(data)
  result$data = data
  result
}
plotClusters <- function(data, clusters=NULL, main="", Rcssclass="sharp") {

  # when clusters not specified, assume all belong to class "plain"
  if (is.null(clusters)) {
    clusters = setNames(rep("plain", nrow(data)), rownames(data))
  }
  if (is.null(names(clusters))) {
    names(clusters) = rownames(data)
  }

  RcssCompulsoryClass = RcssGetCompulsoryClass(Rcssclass)

  titlepos = RcssValue("ksharp", "title.pos", default=c(0, 1))
  padding = RcssValue("ksharp", "padding", default=0.1)

  xlim = range(data[,1])
  ylim = range(data[,2])
  if (xlim[2]-xlim[1] > ylim[2]-ylim[1]) {
    ylim = xlim
  } else {
    xlim = ylim
  }
  xysize = xlim[2]-xlim[1]
  xylim = xlim+(c(-1,1)*xysize*padding)
  xysize = xylim[2]-xylim[1]
  rm(xlim, ylim)    

  plot(xylim, xylim)
  rect(xylim[1], xylim[1], xylim[2], xylim[2])

  # create cluster names like C0, C1, etc
  cnames = paste0("C", clusters)
  clusters = split(names(clusters), cnames)

  # add points to the chart
  for (iname in names(clusters)) {
    idata = data[clusters[[iname]], , drop=FALSE]
    points(idata[,1], idata[,2], Rcssclass=iname)
  }

  text(xylim[1] + titlepos[1]*xysize,
       xylim[1] + titlepos[2]*xysize,
       main, Rcssclass="main")
}
emptyplot = function(n) {
  for (i in 1:n) {
    plot(c(0, 1), c(0, 1), type="n")
  }
}
# override table to remove spurious empty line
table = function(x) {
  x = base::table(x)
  width = 5
  padspaces = function(y) {
    if (length(y)>1) {
      return(sapply(y, padspaces))
    }
    y = as.character(y)
    paste(c(rep(" ", width-nchar(y)), y), collapse="")
  }

  line1 = paste(sapply(names(x), padspaces), collapse="")
  line2 = paste(sapply(x, padspaces), collapse="")
  cat(line1, "\n")
  cat(line2, "\n")
}

 

Introduction

Clustering assigns data points to groups, i.e clusters, so that each group is internally coherent and also different from the others. In practice, however, clusters often turn out to be almost merging with one another. Sharpening in this context refers to increasing contrasts between cluster groups.

The concept of cluster sharpening goes back at least to the 1980s and the works of Tukey [@cleveland1988collected]. Methods that specifically mention cluster sharpening include dendrogram sharpening [@stanberry2003cluster] and variations on the k-means algorithm [@garcia2008general]. Some R packages that produce clusterings and provide controls on the level of sharpening include tclust [@tclust] and trimcluster [@trimcluster].

In contrast to methods that produce de-novo clusterings with specific sharpness properties, the ksharp package aims to adjust existing clusterings. The package provides a general interface for this purpose along with several distinct algorithms. All are compatible with clusterings produced using the stats package, the cluster package [@cluster], the dbscan package [@dbscan], as well as custom-made assignments.

This vignette first explains cluster sharpening in the methods section. It then demonstrates the algorithms implemented in the package on a range of examples. Throughout, most code is hidden for readability, but is available in the package source.

Methods

Toy example

As a form of motivation and illustration, let's look at the toy dataset below.

K1 = kdata.1
km1 = kmeans(K1, 2)
oldpar = par(mfrow=c(1,4))
plotClusters(K1, main="raw data", Rcssclass="unsharp")
plotClusters(K1, km1$cluster, main="kmeans (k=2)", Rcssclass="unsharp")
emptyplot(2)
# this vignette uses Rcssplot, which changes the behavior of par a little
# resetting par is achieved either via graphics:: or dev.off()
graphics::par(oldpar) 

The raw data shows two circular blobs that are nearby on the plane (above, left). k-means clustering using two colors partitions the points into groups that appear symmetrical, except on the boundary between the groups (above, right). The distorted shapes suggest that the true groups may be in fact overlapping and, therefore, certain items assigned to one group may should belong to the other.

The ambiguous points can be marked by demanding increasing levels of sharpness.

sharp1 = ksharp(km1, data=K1)
oldpar = par(mfrow=c(1,4))
thresholds = c(0.05, 0.1, 0.2, 0.4)
for (t in thresholds) {
  plotClusters(K1, ksharp(sharp1, t, method="sil")$cluster, main=paste0("k=2, threshold=", t))
}
graphics::par(oldpar)

With a mild level of sharpening (above, left), a few of the points on the boundary fade into a third group, displayed in gray. Increasing the sharpening threshold makes this group larger. At the cost of discarding some of the data, the leftover groups become more distinct and well-separated (above, right).

Using ksharp

To achieve this effect using the ksharp package, first we cluster the dataset kdata.1 using k-means.

library(ksharp)
km1 = kmeans(kdata.1, centers=2)
table(km1$cluster)

The algorithm partitions the data into groups of roughly equal size. Next, we sharpen the clustering using command ksharp.

ks1 = ksharp(km1, threshold=0.05, data=kdata.1)
table(ks1$cluster)

The sharpening function creates an additional small group with cluster index 0 containing 5\% of the data.

On the first run, the sharpening command requires access to the original data and can take a moment to compute various sharpening-related metrics. But once computed, it is straightforward to re-use those metrics to adjust the level of sharpening. We can adjust the size of the left-out group by varying the sharpness threshold.

ks1.10 = ksharp(ks1, threshold=0.1)
table(ks1.10$cluster)
ks1.20 = ksharp(ks1, threshold=0.2)
table(ks1.20$cluster)

By increasing the threshold, we delegate a larger proportion of the points to the group with index 0, reproducing the effect in the figure above.

Sharpening heuristics

Just like there are many ways to define a clustering algorithm, there are also many conceivable procedures to mask out ambiguous points in a clustering. ksharp implements three, which can be toggled via an argument called method.

Each of these methods has characterisitic properties that may be suitable/unsuitable for certain datasets.

Examples

Two overlapping clusters

Before moving to more intricate examples, let's revisit the dataset with the two overlapping blobs using the three sharpening methods.

oldpar = par(mfrow=c(3,4))
methods = c("silhouette", "medoid", "neighbor")
for (m in methods) {
  for (t in thresholds) {
    plotClusters(K1, ksharp(sharp1, t, method=m)$cluster, main=paste0(m, ", ", t))
  }
}
graphics::par(oldpar)

Each row represents output from a sharpening method. From left to right, increasing the threshold masks out more points. Although the results differ slightly across methods, the changes here appear to be minor.

Non-overlapping, non-spherical clusters

Another dataset with two groups is kdata.2.

K2 = kdata.2
K2dist = dist(K2)
km2 = kmeans(K2, 2)
db2 = dbscan.named(K2, 0.3)
sharp2 = ksharp(db2, method="neighbor")
oldpar = par(mfrow=c(1,4))
plotClusters(K2, cluster=NULL, main="raw", Rcssclass="unsharp")
plotClusters(K2, km2$cluster, main="kmeans, k=2", Rcssclass="unsharp")
plotClusters(K2, db2$cluster, main="dbscan, k=2", Rcssclass="unsharp")
emptyplot(1)
graphics::par(oldpar)

Here, points are arranged in non-circular shapes (above, left) and this can confuse the k-means algorithm (above, center). To avoid that, we can create the initial clustering using a density-based algorithm instead, dbscan (above, right).

Sharpening of a dbscan clustering proceeds in the same way as before. (A caveat is that dbscan provides cluster assignments in unnamed lists. ksharp requires a named list, so the initial clustering must be adjusted manually before sharpening.)

oldpar = par(mfrow=c(3,4))
for (m in methods) {
  for (t in thresholds) {
    plotClusters(K2, ksharp(sharp2, t, method=m)$cluster, main=paste0(m, ", ", t))
  }
}
graphics::par(oldpar)

The silhouette and medoid methods (top and middle) tend to remove points from the left and right sides of the rectangles. The neighbor-based method (bottom) instead subtracts from the upper and lower sides.

Non-convex groups

Differences between sharpening methods can also be observed when data form groups with non-convex shapes. As an example, let's use dataset kdata.3 with a clustering by dbscan.

K3 = kdata.3
db3 = dbscan.named(K3, 0.25)
oldpar = par(mfrow=c(1,4))
plotClusters(K3, cluster=NULL, main="raw", Rcssclass="unsharp")
plotClusters(K3, db3$cluster, main="dbscan", Rcssclass="unsharp")
emptyplot(2)
graphics::par(oldpar)

The result from dbscan contains three main groups. It also already assigns some points to a noise group, which has a cluster index of 0. But we can sharpen the result further.

sharp3 = ksharp(db3, data=K3, method="neighbor")
oldpar = par(mfrow=c(3,4))
for (m in methods) {
  for (t in thresholds) {
    plotClusters(K3, ksharp(sharp3, t, method=m)$cluster, main=paste0(m, ", ", t))
  }
}
graphics::par(oldpar)

An interesting feature that appears here is that the presence of a noise group in the original clustering also removes points in the real clusters that might overlap with the noise.

Noise points

The last example consists of groups that are well defined, but are surrounded by noise, kdata.4.

K4 = kdata.4
K4.centers = matrix(c(-3,-3,-3,3,3,-3,3,3), ncol=2, byrow=T)
km4 = kmeans(K4, K4.centers)
sharp4 = ksharp(km4, data=K4)
oldpar = par(mfrow=c(3,4))
thresholds2 = c(0.05, 0.1, 0.2, 0.4)
for (m in methods) {
  for (t in thresholds2) {
    plotClusters(K4, ksharp(sharp4, t, method=m)$cluster, main=paste0(m, ", ", t))
  }
}
graphics::par(oldpar)

Here, the initial clustering is generated by k-means with a manual group seeding. As k-means partitions all points to the specified number of clusters, the surrounding noise points are assigned to groups along the four dense regions near the center. Sharpening can in this case remove some of the noise points.

Interestingly, an already-sharp cluster assignment can be produced in this case by dbscan, which can produce a noise group in its raw output. The above result thus reproduces a similar effect using a different algorithm.

Discussion

Sharpening is a procedure that adjusts an existing clustering in order to better bring out the distinctions between cluster groups. Package ksharp implements a specific form of cluster sharpening called 'sharpening by excision' (Cleveland, 1988). This works by removing some boundary points from a dataset to increase contrast between the remaining groups.

There are several possible algorithms for cluster sharpening. The ksharp command delivers three different ones. The object output by function ksharp contains details relevant to all of them. This makes it possible to adjust the sharpening level of already-sharpened clusterings in time O(N), where N is the number of data points. However, the initial creation of these details requires recalculation and sorting of the distance matrix and can thus be expensive on large datasets. Speeding up these calculations using alternative implementations is a possible avenue for development.

References

Appendix

Implementation notes

Function ksharp is the main API function of the package. This function is designed to be compatible with clusterings produced by k-means and by functions from the cluster package, e.g. pam. It is also compatible with output from dbscan after an adjustment of that clustering output to include data-point names.

The function can also sharpen self-made clusterings, as long as they are prepared as a list with a component cluster consisting of a named vector associating each data point to a numeric cluster id.

Beside the ksharp function, the package also exports other functions tat compute auxiliary information such as silhouette widths. These implementations differ slightly from existing ones from package cluster; see the documentation, e.g. help(silinfo) for details.

Session info

sessionInfo()


Try the ksharp package in your browser

Any scripts or data that you put into this service are public.

ksharp documentation built on Jan. 26, 2020, 5:06 p.m.