ward: Create a hierarchy by Ward's method

View source: R/mining.R

ward R Documentation

Create a hierarchy by Ward's method

Description

Produces a hierarchical clustering of one-dimensional data via Ward's method.

Usage

ward(
  x,
  n = rep(1, length(x)),
  s = rep(1, length(x)),
  sortx = TRUE,
  same.var = T
)

Arguments

x

a numerical vector, or a list of vectors.

n

if x is a vector of cluster means, n is the size of each cluster.

s

if x is a vector of cluster means, s is the sum of squares in each cluster. Only needed if same.var=F.

sortx

if sortx=F, only clusters which are adjacent in x can be merged. Used by break.ts.

same.var

if same.var=T, clusters are assumed to have the same true variance, otherwise not. This affects the cost function for merging.

Details

Repeatedly merges clusters in order to minimize the clustering cost. By default, it is the same as hclust(method="ward"). If same.var=T, the cost is the sum of squares:

sum_c sum_{i in c} (x_i - m_c)^2

The incremental cost of merging clusters c_a and c_b, with sizes n_a, n_b and means m_a, m_b, is

(n_a*n_b)/(n_a+n_b)*(m_a - m_b)^2

It prefers to merge clusters which are small and have similar means.
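The incremental cost can be checked numerically. The following is a minimal sketch in base R, not the package's implementation; the helper names ss and merge_cost are hypothetical and exist only for this illustration. It confirms that the increase in the total sum of squares caused by a merge matches the closed-form expression, and that small clusters are cheaper to merge than large ones with the same means.

```r
# Hypothetical helper: within-cluster sum of squared deviations from the mean.
ss <- function(v) sum((v - mean(v))^2)

# Hypothetical helper: incremental cost of a merge, from sizes and means only.
merge_cost <- function(na, nb, ma, mb) na * nb / (na + nb) * (ma - mb)^2

a <- c(1.0, 1.5, 2.0)   # cluster a: n_a = 3, m_a = 1.5
b <- c(4.0, 5.0)        # cluster b: n_b = 2, m_b = 4.5

# Increase in the total sum of squares when a and b are merged ...
increase <- ss(c(a, b)) - (ss(a) + ss(b))
# ... compared with the closed-form incremental cost.
cost <- merge_cost(length(a), length(b), mean(a), mean(b))
stopifnot(isTRUE(all.equal(increase, cost)))

# For the same pair of means, small clusters are cheaper to merge than large ones.
small_pair <- merge_cost(1, 1, 0, 1)
large_pair <- merge_cost(10, 10, 0, 1)
stopifnot(small_pair < large_pair)
```

Because the cost depends only on the cluster sizes and means, a merge can be evaluated without revisiting the individual observations.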

If same.var=F, the cost is the sum of log-variances:

sum_c n_c*log(1/n_c*sum_{i in c} (x_i - m_c)^2)

It prefers to merge clusters which are small, have similar means, and have similar variances.
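The effect of the log-variance cost can be illustrated with a small base R sketch. This is an illustration of the formula above under stated assumptions, not the package's code: the helper names log_var_cost and merge_increase are hypothetical, and each cluster is given as a raw vector of observations rather than as a mean, size, and sum of squares.

```r
# Hypothetical helper: cost of one cluster, n_c * log(mean squared deviation).
log_var_cost <- function(v) {
  n <- length(v)
  n * log(sum((v - mean(v))^2) / n)
}

# Hypothetical helper: change in total cost when clusters a and b are merged.
merge_increase <- function(a, b) {
  log_var_cost(c(a, b)) - (log_var_cost(a) + log_var_cost(b))
}

tight1 <- c(0.0, 0.1, 0.2)    # low variance, mean 0.1
tight2 <- c(0.3, 0.4, 0.5)    # low variance, mean 0.4
wide   <- c(-3, 0.25, 3.5)    # high variance, mean 0.25

# tight2 is closer in mean to wide than to tight1, yet merging it with the
# other low-variance cluster is cheaper under this cost.
similar_var   <- merge_increase(tight1, tight2)
different_var <- merge_increase(tight2, wide)
stopifnot(similar_var < different_var)
```

A cluster with a single observation has zero sum of squares, so this naive version returns -Inf for it; the sketch therefore uses clusters with at least two distinct values.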

If x is a list of vectors, each vector is assumed to be a cluster. n and s are computed for each cluster and x is replaced by the cluster means. Thus you can say ward(split(x,f)) to cluster the data for different factors.

Value

The same type of object returned by hclust.

Bugs

Because of the adjacency constraint used in the implementation, the clustering that results from sortx=T and same.var=F may occasionally be suboptimal.

Author(s)

Tom Minka

See Also

hclust, plot_hclust_trace, hist.hclust, boxplot.hclust, break_ward, break.ts, merge_factor

Examples

x <- c(rnorm(700,-2,1.5),rnorm(300,3,0.5))
hc <- ward(x)
opar <- par(mfrow=c(2,1))
# use dev.new() in RStudio
plot_hclust_trace(hc)
hist(hc,x)
par(opar)

x <- c(rnorm(700,-2,0.5),rnorm(1000,2.5,1.5),rnorm(500,7,0.1))
hc <- ward(x)
opar <- par(mfrow=c(2,1))
plot_hclust_trace(hc)
hist(hc,x)
par(opar)

data(OrchardSprays)
x <- OrchardSprays$decrease
f <- factor(OrchardSprays$treatment)
# shuffle levels
#lev <- levels(OrchardSprays$treatment)
#f <- factor(OrchardSprays$treatment,levels=sample(lev))
hc <- ward(split(x,f))
# is equivalent to:
#n <- tapply(x,f,length)
#m <- tapply(x,f,mean)
#s <- tapply(x,f,var)*n
#hc <- ward(m,n,s)
boxplot(hc,split(x,f))

paulemms/datamining documentation built on March 1, 2023, 4:01 p.m.