Ckmeans.1d.dp: Optimal and Fast Univariate Clustering

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/Ckmeans.1d.dp.R

Description

Perform optimal univariate k-means, k-median, or k-segments clustering in linear (fastest), loglinear, or quadratic (slowest) time.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
Ckmeans.1d.dp(x, k=c(1,9), y=1,
              method=c("linear", "loglinear", "quadratic"),
              estimate.k=c("BIC", "BIC 3.4.12"))

Ckmedian.1d.dp(x, k=c(1,9), y=1,
               method=c("linear", "loglinear", "quadratic"),
               estimate.k=c("BIC", "BIC 3.4.12"))

Cksegs.1d.dp(y, k=c(1,9), x=seq_along(y),
             method=c("quadratic", "linear", "loglinear"),
             estimate.k=c("BIC", "BIC 3.4.12"))

Arguments

x

a numeric vector of data to be clustered. All NA elements must be removed from x before calling this function. The function will run faster on sorted x (in non-decreasing order) than an unsorted input.

k

either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined. The default is c(1,9). When k is a range, the actual number of clusters is determined by Bayesian information criterion.

y

a value of 1 to specify equal weights or a numeric vector of unequal weights for each element. The default weight is one. It is highly recommended to use positive (instead of zero) weights to account for the influence of every element. The weights have a strong impact on the clustering result, except in Ckmedian.1d.dp where weights are forced to be equal. Only Cksegs.1d.dp can take negative weights.

method

a character string to specify the speedup method to the original cubic runtime dynamic programming. The default is "linear". All methods generate the same optimal results but differ in runtime or memory usage. See Details.

estimate.k

a character string to specify the method to estimate optimal k. The default is "BIC". See Details.

Details

Ckmean.1d.dp minimizes unweighted or weighted within-cluster sum of squared distance (L2).

In contrast to the heuristic k-means algorithms implemented in function kmeans, this function optimally assigns elements in numeric vector x into k clusters by dynamic programming (Wang and Song, 2011). It minimizes the total of within-cluster sums of squared distances (withinss) between each element and its corresponding cluster mean. When a range is provided for k, the exact number of clusters is determined by Bayesian information criterion. Different from the heuristic k-means algorithms whose results may be non-optimal or change from run to run, the result of Ckmeans.1d.dp is guaranteed to be optimal and reproducible, and its advantage in efficiency and accuracy over heuristic k-means methods is most pronounced at large k.

The Bayesian information criterion (BIC) method to select optimal k is updated to deal with duplicates in the data. Otherwise, the estimated k would be the same with previous versions. Set estimate.k="BIC" to use the latest method; use estimate.k="BIC 3.4.12" to use the BIC method in version 3.4.12 or earlier to estimated k from the given range. This option is effective only when a range for k is provided.

method specifies one of three options to speed up the original dynamic programming taking a runtime cubic in sample size n. The default "linear" option, giving a total runtime of O(n lg n + kn) or O(kn) (if x is already sorted in ascending order) is the fastest option but uses the most memory (still O(kn)); the "loglinear" option, with a runtime of O(kn lg n), is slightly slower but uses the least memory; the slowest "quadratic" option, with a runtime of O(kn^2), is provided for the purpose of testing on small data sets.

Ckmedian.1d.dp minimizes within-cluster sum of distance (L1). Only unweighted solution is implemented and guarantees optimality.

Cksegs.1d.dp minimizes within-cluster sum of squared distance on y. It offers optimal piece-wise constant approximation of y within clusters of x. Only method="quadratic" guarantees optimality. The "linear" and "loglinear" options are faster but not always optimal and are provided for comparison purposes.

When the sample size n is too large to create two k x n dynamic programming matrices in memory, we recommend the heuristic solutions implemented in the kmeans function in package stats.

Value

An object of class "Ckmeans.1d.dp", "Ckmedian.1d.dp", or "Cksegs.1d.dp". It is a list containing the following components:

cluster

a vector of clusters assigned to each element in x. Each cluster is indexed by an integer from 1 to k.

centers

a numeric vector of the (weighted) means for each cluster.

withinss

a numeric vector of the (weighted) within-cluster sum of squares for each cluster.

size

a vector of the (weighted) number of elements in each cluster.

totss

total sum of (weighted) squared distances between each element and the sample mean. This statistic is not dependent on the clustering result.

tot.withinss

total sum of (weighted) within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters.

betweenss

sum of (weighted) squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters.

xname

a character string. The actual name of the x argument.

yname

a character string. The actual name of the y argument.

Each of the three classes has a print and a plot method, which are described along with print.Ckmeans.1d.dp and plot.Ckmeans.1d.dp.

Author(s)

Joe Song and Haizhou Wang

References

Wang, H. and Song, M. (2011) Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. The R Journal 3(2), 29–33. Retrieved from https://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf

See Also

ahist, plot.Ckmeans.1d.dp, print.Ckmeans.1d.dp in this package.

kmeans in package stats that implements several heuristic k-means algorithms.

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
# Ex. 1 The number of clusters is provided.
# Generate data from a Gaussian mixture model of three components
x <- c(rnorm(50, sd=0.2), rnorm(50, mean=1, sd=0.3), rnorm(100,
       mean=-1, sd=0.25))
# Divide x into 3 clusters
k <- 3

result <- Ckmedian.1d.dp(x, k)

plot(result, main="Optimal univariate k-median given k")

result <- Ckmeans.1d.dp(x, k)

plot(result, main="Optimal univariate k-means given k")

plot(x, col=result$cluster, pch=result$cluster, cex=1.5,
     main="Optimal univariate k-means clustering given k",
     sub=paste("Number of clusters given:", k))
abline(h=result$centers, col=1:k, lty="dashed", lwd=2)
legend("bottomleft", paste("Cluster", 1:k), col=1:k, pch=1:k,
       cex=1.5, bty="n")

# Ex. 2 The number of clusters is determined by Bayesian
#       information criterion
# Generate data from a Gaussian mixture model of three components
x <- c(rnorm(50, mean=-3, sd=1), rnorm(50, mean=0, sd=.5),
       rnorm(50, mean=3, sd=1))
# Divide x into k clusters, k automatically selected (default: 1~9)

result <- Ckmedian.1d.dp(x)

plot(result, main="Optimal univariate k-median with k estimated")

result <- Ckmeans.1d.dp(x)

plot(result, main="Optimal univariate k-means with k estimated")

k <- max(result$cluster)
plot(x, col=result$cluster, pch=result$cluster, cex=1.5,
     main="Optimal univariate k-means clustering with k estimated",
     sub=paste("Number of clusters is estimated to be", k))
abline(h=result$centers, col=1:k, lty="dashed", lwd=2)
legend("topleft", paste("Cluster", 1:k), col=1:k, pch=1:k,
       cex=1.5, bty="n")

# Ex. 3 Segmenting a time course using optimal weighted
#       univariate clustering
n <- 160
t <- seq(0, 2*pi*2, length=n)
n1 <- 1:(n/2)
n2 <- (max(n1)+1):n
y1 <- abs(sin(1.5*t[n1]) + 0.1*rnorm(length(n1)))
y2 <- abs(sin(0.5*t[n2]) + 0.1*rnorm(length(n2)))
y <- c(y1, y2)

w <- y^8 # stress the peaks
res <- Ckmeans.1d.dp(t, k=c(1:10), w)
plot(res)
plot(t, w, main = "Time course weighted k-means",
     col=res$cluster, pch=res$cluster,
     xlab="Time t", ylab="Transformed intensity w",
     type="h")
abline(v=res$centers, col="chocolate", lty="dashed")
text(res$centers, max(w) * .95, cex=0.5, font=2,
     paste(round(res$size / sum(res$size) * 100), "/ 100"))


# Ex 4. Segmenting by y

y <- c(1,1,1,2,2,2,4,4,4,4)

res <- Ckmeans.1d.dp(x=seq_along(y), k=c(1:10), y)
main <- "k-means gave one cluster\nbut failed to find segments"
plot(res, main=main)

res <- Cksegs.1d.dp(y, k=c(1:10))
plot(res)


# Ex 5. Segmenting by y

y <- c(1,1,1.1,1, 2,2.5,2, 4,5,4,4)
res <- Cksegs.1d.dp(y, k=c(1:10))
plot(res)


# Ex 6. Segmenting by y

y <- rep(c(1,-3,4,-2), each=20)
y <- y + 0.5*rnorm(length(y))
k <- 1:10
res.q <- Cksegs.1d.dp(y, k=k, method="quadratic")
main <- paste("Cksegs (method=\"quadratic\"):\ntot.withinss =",
              format(res.q$tot.withinss, digits=4),
              "\nGUARANTEE TO BE OPTIMAL")
plot(res.q, main=main)
res.l <- Cksegs.1d.dp(y, k=k, method="linear")
main <- paste("Cksegs (method=\"linear\"):\ntot.withinss =",
               format(res.l$tot.withinss, digits=4),
               "\nFAST BUT MAY NOT BE OPTIMAL")
plot(res.l, main=main)
res.g <- Cksegs.1d.dp(y, k=k, method="loglinear")
main <- paste("Cksegs (method=\"loglinear\"):\ntot.withinss =",
              format(res.g$tot.withinss, digits=4),
              "\nFAST BUT MAY NOT BE OPTIMAL")
plot(res.g, main=main)


# Ex 7. Segmenting a sinusoidal curve by y
x <- 1:125
y <- sin(x * .2)
res.q <- Cksegs.1d.dp(y, k=8, x=x)
plot(res.q, lwd=3, xlab="x")

Ckmeans.1d.dp documentation built on May 31, 2017, 5:17 a.m.

Search within the Ckmeans.1d.dp package
Search all R packages, documentation and source code