optimal_phyloregion: Determine optimal number of clusters

View source: R/optimal_bioregion.R

optimal_phyloregionR Documentation

Determine optimal number of clusters

Description

This function divides the hierarchical dendrogram into meaningful clusters ("phyloregions"), based on the ‘elbow’ or ‘knee’ of an evaluation graph that corresponds to the point of optimal curvature.

Usage

optimal_phyloregion(x, method = "average", k = 20)

Arguments

x

a numeric matrix, data frame or “dist” object.

method

the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).

k

numeric, the upper bound of the number of clusters to compute. DEFAULT: 20 or the number of observations (if less than 20).

Value

a list containing the following as returned from the GMD package (Zhao et al. 2011):

  • k: optimal number of clusters (bioregions)

  • totbss: total between-cluster sum-of-square

  • tss: total sum of squares of the data

  • ev: explained variance given k

References

Salvador, S. & Chan, P. (2004) Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. Proceedings of the Sixteenth IEEE International Conference on Tools with Artificial Intelligence, pp. 576–584. Institute of Electrical and Electronics Engineers, Piscataway, New Jersey, USA.

Zhao, X., Valen, E., Parker, B.J. & Sandelin, A. (2011) Systematic clustering of transcription start site landscapes. PLoS ONE 6: e23409.

Examples

data(africa)
tree <- africa$phylo
bc <- beta_diss(africa$comm)
(d <- optimal_phyloregion(bc[[1]], k=15))
plot(d$df$k, d$df$ev, ylab = "Explained variances",
  xlab = "Number of clusters")
lines(d$df$k[order(d$df$k)], d$df$ev[order(d$df$k)], pch = 1)
points(d$optimal$k, d$optimal$ev, pch = 21, bg = "red", cex = 3)
points(d$optimal$k, d$optimal$ev, pch = 21, bg = "red", type = "h")

phyloregion documentation built on Aug. 15, 2023, 9:07 a.m.