# Ckmeans.1d.dp: Optimal (Weighted) Univariate Clustering In Ckmeans.1d.dp: Optimal, Fast, and Reproducible Univariate Clustering

## Description

Perform optimal univariate k-means or k-median clustering in linear (fastest), loglinear, or quadratic (slowest) time.

## Usage

 ```1 2 3 4 5 6 7``` ```Ckmeans.1d.dp(x, k=c(1,9), y=1, method=c("linear", "loglinear", "quadratic"), estimate.k=c("BIC", "BIC 3.4.12")) Ckmedian.1d.dp(x, k=c(1,9), y=1, method=c("linear", "loglinear", "quadratic"), estimate.k=c("BIC", "BIC 3.4.12")) ```

## Arguments

 `x` a numeric vector of data to be clustered. All `NA` elements must be removed from `x` before calling this function. The function will run faster on sorted `x` (in non-decreasing order) than an unsorted input. `k` either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined. The default is `c(1,9)`. When `k` is a range, the actual number of clusters is determined by Bayesian information criterion. `y` a value of 1 (default) to specify equal weights of 1 for each element in `x`, or a numeric vector of unequal non-negative weights for each element in `x`. It is highly recommended to use positive (instead of zero) weights to account for the influence of every element. The weights have a strong impact on the clustering result. When the number of clusters `k` is given as a range, the weights should be linearly scaled to sum up to the observed sample size. Currently, `Ckmedian.1d.dp` only works with an equal weight of 1. `method` a character string to specify the speedup method to the original cubic runtime dynamic programming. The default is `"linear"`. All methods generate the same optimal results but differ in runtime or memory usage. See Details. `estimate.k` a character string to specify the method to estimate optimal `k`. This argument is effective only when a range for `k` is provided. The default is `"BIC"`. See Details.

## Details

`Ckmean.1d.dp` minimizes unweighted or weighted within-cluster sum of squared distance (L2).

`Ckmedian.1d.dp` minimizes within-cluster sum of distance (L1). Only unweighted solution is implemented and guarantees optimality.

In contrast to the heuristic k-means algorithms implemented in function `kmeans`, this function optimally assigns elements in numeric vector `x` into `k` clusters by dynamic programming \insertCiteWang2011Ckmeans,song2020wucCkmeans.1d.dp. It minimizes the total of within-cluster sums of squared distances (withinss) between each element and its corresponding cluster mean. When a range is provided for `k`, the exact number of clusters is determined by Bayesian information criterion \insertCitesong2020wucCkmeans.1d.dp. Different from the heuristic k-means algorithms whose results may be non-optimal or change from run to run, the result of Ckmeans.1d.dp is guaranteed to be optimal and reproducible, and its advantage in efficiency and accuracy over heuristic k-means methods is most pronounced at large k.

The `estimate.k` argument specifies the method to select optimal `k` based on the Gaussian mixture model using the Bayesian information criterion (BIC). When `estimate.k="BIC"`, it effectively deals with variance estimation for a cluster with identical values. When `estimate.k="BIC 3.4.12"`, it uses the code in version 3.4.12 and earlier to estimate `k`.

The `method` argument specifies one of three options to speed up the original dynamic programming taking a runtime cubic in sample size n. The default `"linear"` option, giving a total runtime of O(n lg n + kn) or O(kn) (if `x` is already sorted in ascending order) is the fastest option but uses the most memory (still O(kn)) \insertCitesong2020wucCkmeans.1d.dp; the `"loglinear"` option, with a runtime of O(kn lg n), is slightly slower but uses the least memory \insertCitesong2020wucCkmeans.1d.dp; the slowest `"quadratic"` option \insertCiteWang2011CkmeansCkmeans.1d.dp, with a runtime of O(kn^2), is provided for the purpose of testing on small data sets.

When the sample size n is too large to create two k x n dynamic programming matrices in memory, we recommend the heuristic solutions implemented in the `kmeans` function in package stats.

## Value

An object of class "`Ckmeans.1d.dp`" or "`Ckmedian.1d.dp`". It is a list containing the following components:

 `cluster` a vector of clusters assigned to each element in `x`. Each cluster is indexed by an integer from 1 to `k`. `centers` a numeric vector of the (weighted) means for each cluster. `withinss` a numeric vector of the (weighted) within-cluster sum of squares for each cluster. `size` a vector of the (weighted) number of elements in each cluster. `totss` total sum of (weighted) squared distances between each element and the sample mean. This statistic is not dependent on the clustering result. `tot.withinss` total sum of (weighted) within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters. `betweenss` sum of (weighted) squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters. `xname` a character string. The actual name of the `x` argument. `yname` a character string. The actual name of the `y` argument.

Each class has a print and a plot method, which are described along with `print.Ckmeans.1d.dp` and `plot.Ckmeans.1d.dp`.

## Author(s)

Joe Song and Haizhou Wang

## References

\insertAllCited

`ahist`, `plot.Ckmeans.1d.dp`, `print.Ckmeans.1d.dp` in this package.

`kmeans` in package stats that implements several heuristic k-means algorithms.

## Examples

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65``` ```# Ex. 1 The number of clusters is provided. # Generate data from a Gaussian mixture model of three components x <- c(rnorm(50, sd=0.2), rnorm(50, mean=1, sd=0.3), rnorm(100, mean=-1, sd=0.25)) # Divide x into 3 clusters k <- 3 result <- Ckmedian.1d.dp(x, k) plot(result, main="Optimal univariate k-median given k") result <- Ckmeans.1d.dp(x, k) plot(result, main="Optimal univariate k-means given k") plot(x, col=result\$cluster, pch=result\$cluster, cex=1.5, main="Optimal univariate k-means clustering given k", sub=paste("Number of clusters given:", k)) abline(h=result\$centers, col=1:k, lty="dashed", lwd=2) legend("bottomleft", paste("Cluster", 1:k), col=1:k, pch=1:k, cex=1.5, bty="n") # Ex. 2 The number of clusters is determined by Bayesian # information criterion # Generate data from a Gaussian mixture model of three components x <- c(rnorm(50, mean=-3, sd=1), rnorm(50, mean=0, sd=.5), rnorm(50, mean=3, sd=1)) # Divide x into k clusters, k automatically selected (default: 1~9) result <- Ckmedian.1d.dp(x) plot(result, main="Optimal univariate k-median with k estimated") result <- Ckmeans.1d.dp(x) plot(result, main="Optimal univariate k-means with k estimated") k <- max(result\$cluster) plot(x, col=result\$cluster, pch=result\$cluster, cex=1.5, main="Optimal univariate k-means clustering with k estimated", sub=paste("Number of clusters is estimated to be", k)) abline(h=result\$centers, col=1:k, lty="dashed", lwd=2) legend("topleft", paste("Cluster", 1:k), col=1:k, pch=1:k, cex=1.5, bty="n") # Ex. 3 Segmenting a time course using optimal weighted # univariate clustering n <- 160 t <- seq(0, 2*pi*2, length=n) n1 <- 1:(n/2) n2 <- (max(n1)+1):n y1 <- abs(sin(1.5*t[n1]) + 0.1*rnorm(length(n1))) y2 <- abs(sin(0.5*t[n2]) + 0.1*rnorm(length(n2))) y <- c(y1, y2) w <- y^8 # stress the peaks res <- Ckmeans.1d.dp(t, k=c(1:10), w) plot(res) plot(t, w, main = "Time course weighted k-means", col=res\$cluster, pch=res\$cluster, xlab="Time t", ylab="Transformed intensity w", type="h") abline(v=res\$centers, col="chocolate", lty="dashed") text(res\$centers, max(w) * .95, cex=0.5, font=2, paste(round(res\$size / sum(res\$size) * 100), "/ 100")) ```

### Example output        ```
```

Ckmeans.1d.dp documentation built on July 22, 2020, 5:09 p.m.