fviz_nbclust: Determining and Visualizing the Optimal Number of Clusters

Description

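Partitioning methods, such as k-means clustering, require the user to specify the number of clusters to be generated. fviz_nbclust() helps choose that number by computing and plotting the average silhouette width, the total within-cluster sum of squares, or the gap statistic for 1 to k.max clusters. fviz_gap_stat() plots a gap statistic computed by clusGap() [in cluster package].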

Read more: Determining the optimal number of clusters

Usage

fviz_nbclust(
  x,
  FUNcluster = NULL,
  method = c("silhouette", "wss", "gap_stat"),
  diss = NULL,
  k.max = 10,
  nboot = 100,
  verbose = interactive(),
  barfill = "steelblue",
  barcolor = "steelblue",
  linecolor = "steelblue",
  print.summary = TRUE,
  ...
)

fviz_gap_stat(
  gap_stat,
  linecolor = "steelblue",
  maxSE = list(method = "firstSEmax", SE.factor = 1)
)

Arguments

x

numeric matrix or data frame. In fviz_nbclust(), x can also be the result of the function NbClust::NbClust().

FUNcluster

a partitioning function which accepts as first argument a (data) matrix like x, second argument, say k, k >= 2, the number of clusters desired, and returns a list with a component named cluster which contains the grouping of observations. Allowed values include: kmeans, cluster::pam, cluster::clara, cluster::fanny, hcut, etc. This argument is not required when x is an output of the function NbClust::NbClust().
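
The contract described above can be illustrated with a small wrapper. This is a minimal sketch in base R: hc_cut is a hypothetical name introduced only for illustration, and the choice of Ward linkage is an assumption, not something this page prescribes.

```r
# Hypothetical partitioning function that follows the FUNcluster contract:
# first argument is the data, second is k, and the return value is a list
# with a component named `cluster`.
hc_cut <- function(x, k, ...) {
  tree <- hclust(dist(x), method = "ward.D2")  # linkage is an arbitrary choice
  list(cluster = cutree(tree, k = k))
}

x <- scale(iris[, -5])
res <- hc_cut(x, 3)
table(res$cluster)  # cluster sizes for k = 3

# With factoextra loaded, such a function could be passed as FUNcluster, e.g.
# fviz_nbclust(x, hc_cut, method = "silhouette")
```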

method

the method to be used for estimating the optimal number of clusters. Possible values are "silhouette" (average silhouette width), "wss" (total within-cluster sum of squares) and "gap_stat" (gap statistic).
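
To make the "wss" criterion concrete, the following sketch computes the total within-cluster sum of squares for k = 1, ..., 10 with base R only. It assumes kmeans as the partitioning function and is not factoextra's internal code; the variable names are illustrative.

```r
# Total within-cluster sum of squares for a range of k (elbow method input).
x <- scale(iris[, -5])
k.max <- 10

set.seed(123)
wss <- sapply(seq_len(k.max), function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Plot the curve; the "elbow" is where adding clusters stops paying off.
plot(seq_len(k.max), wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For k = 1 the value equals the total sum of squares of the scaled data, which gives a quick sanity check on the curve's starting point.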

diss

dist object as produced by dist(), e.g. diss = dist(x, method = "euclidean"). Used to compute the average silhouette width of clusters, the within-cluster sum of squares, and hierarchical clustering. If NULL, dist(x) is computed with the default method = "euclidean".
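
The role of the dissimilarity object in the silhouette criterion can be sketched as follows. This example assumes kmeans as the partitioning function and uses silhouette() from the cluster package; it illustrates the quantity being averaged and is not taken from factoextra's source.

```r
library(cluster)  # provides silhouette()

x <- scale(iris[, -5])
d <- dist(x, method = "euclidean")  # the kind of object expected by `diss`

set.seed(123)
k_values <- 2:10  # the silhouette width is undefined for k = 1
avg_sil <- sapply(k_values, function(k) {
  km  <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])          # average silhouette width for this k
})
names(avg_sil) <- k_values

best_k <- k_values[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```

Swapping in another metric, for instance dist(x, method = "manhattan"), changes only the line that builds d.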

k.max

the maximum number of clusters to consider; must be at least 2.

nboot

integer, number of Monte Carlo ("bootstrap") samples. Used only for determining the number of clusters using gap statistic.

verbose

logical value. If TRUE, progress messages are printed.

barfill, barcolor

fill color and outline color for bars

linecolor

color for lines

print.summary

logical value. If TRUE, the optimal number of clusters is printed in fviz_nbclust().

...

optional further arguments passed to FUNcluster()

gap_stat

an object of class "clusGap" returned by the function clusGap() [in cluster package]

maxSE

a list containing the parameters (method and SE.factor) for determining the location of the maximum of the gap statistic (Read the documentation ?cluster::maxSE). Allowed values for maxSE$method include:

  • "globalmax": simply corresponds to the global maximum, i.e., is which.max(gap)

  • "firstmax": gives the location of the first local maximum

  • "Tibs2001SEmax": uses the criterion proposed by Tibshirani et al. (2001): "the smallest k such that gap(k) >= gap(k+1) - s_{k+1}". It is also possible to use "the smallest k such that gap(k) >= gap(k+1) - SE.factor * s_{k+1}", where SE.factor is a numeric value which can be 1 (default), 2, 3, etc.

  • "firstSEmax": location of the first f() value which is not smaller than the first local maximum minus SE.factor * SE.f[], i.e., within an "f S.E." range of that maximum.

  • see ?cluster::maxSE for more options
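
The rules listed above can be compared directly with cluster::maxSE(). The sketch below reuses the gap and SE.sim columns from the example output on this page (clusGap() on the scaled iris data with B = 10); the variable names are illustrative.

```r
library(cluster)  # provides maxSE()

# Gap values and simulation standard errors for k = 1..10, copied from the
# example output on this page.
gap <- c(0.2185345, 0.4686203, 0.4907552, 0.4418565, 0.4606189,
         0.4476734, 0.4513843, 0.4480656, 0.4658256, 0.5081600)
se  <- c(0.03145767, 0.02397553, 0.03038244, 0.02263960, 0.02153819,
         0.02451182, 0.02816061, 0.02557573, 0.02313226, 0.02195875)

# Apply each selection rule to the same curve.
methods <- c("globalmax", "firstmax", "Tibs2001SEmax", "firstSEmax")
k_hat <- sapply(methods, function(m) {
  maxSE(f = gap, SE.f = se, method = m, SE.factor = 1)
})
print(k_hat)
```

The rules need not agree on the same curve: here "firstmax" selects 3 clusters, matching the value reported in the example output, while "globalmax" picks the largest gap value regardless of where it occurs.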

Value

fviz_nbclust() and fviz_gap_stat() return a ggplot2 plot object, so further layers such as geom_vline() can be added with +.

Author(s)

Alboukadel Kassambara alboukadel.kassambara@gmail.com

See Also

fviz_cluster, eclust

Examples

library(factoextra)  # also attaches ggplot2, needed for geom_vline()
set.seed(123)

# Data preparation
# +++++++++++++++
data("iris")
head(iris)
# Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])


# Optimal number of clusters in the data
# ++++++++++++++++++++++++++++++++++++++
# Examples are provided only for kmeans, but
# you can also use cluster::pam (for pam) or
#  hcut (for hierarchical clustering)
 
### Elbow method (look at the knee)
# Elbow method for kmeans
fviz_nbclust(iris.scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)

# Average silhouette for kmeans
fviz_nbclust(iris.scaled, kmeans, method = "silhouette")

### Gap statistic
library(cluster)
set.seed(123)
# Compute gap statistic for kmeans
# B = 10 is used here to keep the demo fast; the recommended value is ~500
gap_stat <- clusGap(iris.scaled, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 10)
print(gap_stat, method = "firstmax")
fviz_gap_stat(gap_stat)
 
# Gap statistic for hierarchical clustering
gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 10)
fviz_gap_stat(gap_stat)

 

Example output

Loading required package: ggplot2
Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = iris.scaled, FUNcluster = kmeans, K.max = 10, B = 10,     nstart = 25)
B=10 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
 --> Number of clusters (method 'firstmax'): 3
          logW   E.logW       gap     SE.sim
 [1,] 4.534565 4.753100 0.2185345 0.03145767
 [2,] 4.021316 4.489937 0.4686203 0.02397553
 [3,] 3.806577 4.297333 0.4907552 0.03038244
 [4,] 3.699263 4.141120 0.4418565 0.02263960
 [5,] 3.589284 4.049903 0.4606189 0.02153819
 [6,] 3.519726 3.967399 0.4476734 0.02451182
 [7,] 3.448288 3.899672 0.4513843 0.02816061
 [8,] 3.398210 3.846276 0.4480656 0.02557573
 [9,] 3.334279 3.800104 0.4658256 0.02313226
[10,] 3.250246 3.758406 0.5081600 0.02195875

factoextra documentation built on April 2, 2020, 1:09 a.m.