
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over a dense representation of the data, subject to cluster size (weight) constraints.

Usage

```
KMconstrained(X, centroid, Xw, clusterWeightUB, minkP,
              convergenceTail, tailConvergedRelaErr,
              maxIter = 100, maxCore = 7,
              paraSortInplaceMerge, verbose)
```

Arguments

`X`
A d x N numeric matrix where d is the dimensionality and N the number of observations. Each column is an observation.

`centroid`
A d x K numeric matrix where K is the number of clusters. Each column is an initial centroid.

`Xw`
A numeric vector of size N, the weights on the observations.

`clusterWeightUB`
An integer vector of size K. `clusterWeightUB[k]` is the upper bound on the total weight (the size, if all weights are 1) of cluster k.

`minkP`
A numeric value or a character string. If numeric, `minkP` is the power p in the Minkowski distance; the character string `"cosine"` selects the cosine dissimilarity used by spherical K-means.

`convergenceTail`
An integer. The algorithm may end up with "cyclical convergence" due to the size / weight constraints, that is, every few iterations produce the same clustering. If the cost (total in-cluster distance) of each of the last `convergenceTail` iterations stays within a relative error of `tailConvergedRelaErr` of their mean, the algorithm is deemed converged.

`tailConvergedRelaErr`
A numeric value, explained in `convergenceTail`.

`maxIter`
An integer. The maximal number of iterations. Default 100.

`maxCore`
An integer. The maximal number of threads to invoke. It should be no more than the total number of logical processors on the machine. Default 7.

`paraSortInplaceMerge`
A boolean value. `TRUE` makes the parallel merge sort merge in place, trading some speed for lower memory consumption.

`verbose`
A boolean value. `TRUE` prints iteration progress.
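The tail-convergence criterion for `convergenceTail` and `tailConvergedRelaErr` might be sketched as follows. This is plain R for illustration only; the package's exact internal criterion may differ, and `tailConverged` is a hypothetical helper name, not part of the API.

```r
# Sketch of a tail-convergence test: the run is deemed converged when
# every one of the last `tail` costs stays within relative error
# `relaErr` of their mean.
tailConverged <- function(costs, tail, relaErr) {
  if (length(costs) < tail) return(FALSE)
  recent <- costs[(length(costs) - tail + 1L):length(costs)]
  all(abs(recent - mean(recent)) / mean(recent) < relaErr)
}
```

For example, `tailConverged(c(9.1, 5.2, 3.3, 3.3, 3.3), 3, 1e-6)` returns `TRUE` because the last three costs are (numerically) identical, while a still-decreasing cost sequence returns `FALSE`.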

Details

See details in `KM()` for common implementation highlights. The weight upper bounds are implemented as follows:

In each iteration, all the (observation ID, centroid ID, distance) tuples are sorted by distance. From the first to the last tuple, the algorithm puts the observation in the cluster labeled by the centroid ID if (i) the observation has not already been assigned and (ii) the cluster's weight has not exceeded its upper bound. The actual implementation differs slightly; a parallel merge sort is crafted for computing speed.
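The greedy assignment step above can be sketched in plain R for the unweighted, single-threaded case. This is illustrative code, not the package's multithreaded implementation; `assignConstrained` is a hypothetical helper name.

```r
# Greedy size-constrained assignment: sort all (observation, centroid,
# distance) tuples by distance, then assign each observation to the
# first feasible centroid.
assignConstrained <- function(D, sizeUB) {
  # D: K x N matrix; D[k, i] is the distance from observation i to centroid k.
  K <- nrow(D); N <- ncol(D)
  tuples <- data.frame(
    obs  = rep(seq_len(N), each = K),
    ctr  = rep(seq_len(K), times = N),
    dist = as.vector(D))
  tuples <- tuples[order(tuples$dist), ]  # all tuples, sorted by distance
  label <- integer(N)                     # 0 = not yet assigned
  size  <- integer(K)
  for (i in seq_len(nrow(tuples))) {
    o <- tuples$obs[i]; k <- tuples$ctr[i]
    if (label[o] == 0L && size[k] < sizeUB[k]) {  # rules (i) and (ii)
      label[o] <- k
      size[k]  <- size[k] + 1L
    }
  }
  label
}
```

Observations squeezed out of a full nearest cluster fall through to their next-nearest feasible centroid, which is how the size bounds reshape the clustering.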

Value

A list of size `K`, the number of clusters. Each element is a list of 3 vectors:

`centroid`
a numeric vector of size d, the centroid of the cluster.

`clusterMember`
an integer vector of the indexes of observations (columns of `X`) grouped to the cluster.

`member2centroidDistance`
a numeric vector of the same size as `clusterMember`, the distances from the members to the centroid.

An empty `clusterMember` implies an empty cluster.
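For downstream use, a flat label vector can be reassembled from this list structure. This is a small helper sketch; `labelsFromResult` is not part of the package.

```r
# Rebuild a length-N vector of cluster labels from the returned list;
# observations in empty clusters (if any) stay labeled 0.
labelsFromResult <- function(rst, N) {
  label <- integer(N)
  for (k in seq_along(rst))
    label[rst[[k]]$clusterMember] <- k
  label
}
```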

Note

Although it rarely happens, divergence of K-means under a non-Euclidean distance measure (`minkP != 2`) is still a theoretical possibility. Bounding the cluster weights / sizes increases the chance of divergence.

Examples

```
N = 3000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.
dat = matrix(rnorm(N * d) + runif(N * d), nrow = d)
# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
X = dat, K, firstSelection = 1L, minkP = 2, stochastic = FALSE,
seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)
centroid = dat[, centroidInd]
# Each cluster size should not be greater than N / K * 2.
sizeConstraints = as.integer(rep(N / K * 2, K))
system.time({rst = GMKMcharlie::KMconstrained(
X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})
# Size upper bounds vary in [N / K * 1.5, N / K * 2]
sizeConstraints = as.integer(round(runif(K, N / K * 1.5, N / K * 2)))
system.time({rst = GMKMcharlie::KMconstrained(
X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})
```
