eclust: Visual enhancement of clustering analysis


View source: R/eclust.R

Description

Provides a solution for streamlining the workflow of clustering analyses, with elegant ggplot2-based data visualization. Read more: Visual enhancement of clustering analysis.

Usage

eclust(
  x,
  FUNcluster = c("kmeans", "pam", "clara", "fanny", "hclust", "agnes", "diana"),
  k = NULL,
  k.max = 10,
  stand = FALSE,
  graph = TRUE,
  hc_metric = "euclidean",
  hc_method = "ward.D2",
  gap_maxSE = list(method = "firstSEmax", SE.factor = 1),
  nboot = 100,
  verbose = interactive(),
  seed = 123,
  ...
)

Arguments

x

numeric vector, data matrix or data frame

FUNcluster

a clustering function including "kmeans", "pam", "clara", "fanny", "hclust", "agnes" and "diana". Abbreviation is allowed.

k

the number of clusters to be generated. If NULL, the gap statistic is used to estimate the appropriate number of clusters. In the case of kmeans, k can be either the number of clusters, or a set of initial (distinct) cluster centers.

k.max

the maximum number of clusters to consider; must be at least two.

stand

logical value; default is FALSE. If TRUE, then the data will be standardized using the function scale(). Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's standard deviation.

graph

logical value. If TRUE, the cluster plot is displayed.

hc_metric

character string specifying the metric to be used for calculating dissimilarities between observations. Allowed values are those accepted by the function dist() [including "euclidean", "manhattan", "maximum", "canberra", "binary", "minkowski"] and correlation based distance measures ["pearson", "spearman" or "kendall"]. Used only when FUNcluster is a hierarchical clustering function such as one of "hclust", "agnes" or "diana".

hc_method

the agglomeration method to be used (?hclust): "ward.D", "ward.D2", "single", "complete", "average", ...

gap_maxSE

a list containing the parameters (method and SE.factor) for determining the location of the maximum of the gap statistic (see ?cluster::maxSE).

nboot

integer, number of Monte Carlo ("bootstrap") samples. Used only for determining the number of clusters using gap statistic.

verbose

logical value. If TRUE, progress messages are printed.

seed

integer used for seeding the random number generator.

...

other arguments to be passed to FUNcluster.
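
When k = NULL, the arguments k.max, gap_maxSE, nboot and seed together control how the number of clusters is estimated. The sketch below reproduces that kind of estimate by calling the cluster package directly. It assumes that the procedure corresponds to cluster::clusGap() followed by cluster::maxSE(), which is suggested by the gap_maxSE description but is not confirmed here; the object names gap, k_hat and km, and the nstart value, are illustrative only.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(123)            # mirrors the default seed = 123
df <- scale(USArrests)   # the same standardization that stand = TRUE describes

# Gap statistic for k = 1..10, using B Monte Carlo reference samples
# (K.max plays the role of k.max, B the role of nboot).
gap <- clusGap(df, FUNcluster = kmeans, K.max = 10, B = 100,
               nstart = 25, verbose = FALSE)

# Pick k with the rule named in the default gap_maxSE argument.
k_hat <- maxSE(
  f = gap$Tab[, "gap"],
  SE.f = gap$Tab[, "SE.sim"],
  method = "firstSEmax",
  SE.factor = 1
)

# Fit k-means with the estimated number of clusters.
km <- kmeans(df, centers = k_hat, nstart = 25)
table(km$cluster)
```

A small B keeps the run fast but makes the estimate noisy, which is why the Examples section recommends nboot >= 500 for real analyses.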

Value

Returns an object of class "eclust" containing the result of the standard function used (e.g., kmeans, pam, hclust, agnes, diana, etc.).

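It also includes additional components; for a k-means result, the example output below lists clust_plot, silinfo, nbclust, data and gap_stat.

The "eclust" class has methods for fviz_silhouette(), fviz_dend() and fviz_cluster().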

Author(s)

Alboukadel Kassambara alboukadel.kassambara@gmail.com

See Also

fviz_silhouette, fviz_dend, fviz_cluster

Examples

# Load and scale data
data("USArrests")
df <- scale(USArrests)

# Enhanced k-means clustering
# nboot >= 500 is recommended
res.km <- eclust(df, "kmeans", nboot = 2)
# Silhouette plot
fviz_silhouette(res.km)
# Optimal number of clusters using the gap statistic
res.km$nbclust
# Print result
res.km

## Not run: 
# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust", nboot = 2) # compute hclust
fviz_dend(res.hc) # dendrogram
fviz_silhouette(res.hc) # silhouette plot

## End(Not run)
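
The hc_metric argument also accepts correlation-based distances ("pearson", "spearman" or "kendall"), which dist() does not provide. As a rough base-R sketch of what such a metric means, the code below uses one common definition, 1 minus the correlation between observations, and passes it to hclust() with the default hc_method. Whether eclust computes the distance in exactly this way is an assumption; d_cor, hc and grp are illustrative names.

```r
df <- scale(USArrests)

# Pearson correlation between observations (rows), converted to a dissimilarity:
# highly correlated profiles get a distance near 0.
d_cor <- as.dist(1 - cor(t(df), method = "pearson"))

# Hierarchical clustering on that dissimilarity, then cut into two groups.
hc <- hclust(d_cor, method = "ward.D2")
grp <- cutree(hc, k = 2)
table(grp)
```

Replacing method = "pearson" with "spearman" or "kendall" in cor() gives the corresponding rank-based variants.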
 

Example output

Loading required package: ggplot2
Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
  cluster size ave.sil.width
1       1   30          0.43
2       2   20          0.37
[1] 2
K-means clustering with 2 clusters of sizes 30, 20

Cluster means:
     Murder    Assault   UrbanPop       Rape
1 -0.669956 -0.6758849 -0.1317235 -0.5646433
2  1.004934  1.0138274  0.1975853  0.8469650

Clustering vector:
       Alabama         Alaska        Arizona       Arkansas     California 
             2              2              2              1              2 
      Colorado    Connecticut       Delaware        Florida        Georgia 
             2              1              1              2              2 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
             1              1              2              1              1 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
             1              1              2              1              2 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
             1              2              1              2              2 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
             1              1              2              1              1 
    New Mexico       New York North Carolina   North Dakota           Ohio 
             2              2              2              1              1 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
             1              1              1              1              2 
  South Dakota      Tennessee          Texas           Utah        Vermont 
             1              2              2              1              1 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
             1              1              1              1              1 

Within cluster sum of squares by cluster:
[1] 56.11445 46.74796
 (between_SS / total_SS =  47.5 %)

Available components:

 [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
 [6] "betweenss"    "size"         "iter"         "ifault"       "clust_plot"  
[11] "silinfo"      "nbclust"      "data"         "gap_stat"    
  cluster size ave.sil.width
1       1   19          0.37
2       2   31          0.42

factoextra documentation built on April 2, 2020, 1:09 a.m.