silhouette_analysis: Wrapper function to perform silhouette analysis on different...
In m-jahn/R-tools: Utility and wrapper functions for bioinformatics work

silhouette_analysis

R Documentation

Wrapper function to perform silhouette analysis on different cluster numbers

Description

Silhouette analysis helps identify the number of clusters with the highest explanatory power. It addresses the question of how many clusters are needed to best separate each cluster from its neighbors. Good cluster separation results in a higher average silhouette width, which is the decisive metric for choosing the cluster number. This function applies silhouette analysis iteratively over a vector of different cluster numbers and stores the results in a list.
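
To make the metric concrete, the following minimal sketch computes the average silhouette width for a single cluster number. It relies on silhouette() from the cluster package and on kmeans() and dist() from base R. The simulated data and the names toy_mat, km, sil and avg_width are illustrative assumptions and not part of this package.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# toy data: two well separated groups of 50 rows each (illustrative only)
toy_mat <- rbind(
  matrix(rnorm(500, mean = 0), ncol = 10),
  matrix(rnorm(500, mean = 5), ncol = 10)
)

# cluster into k = 2 groups and compute the silhouette width of every row
km <- kmeans(toy_mat, centers = 2, nstart = 5)
sil <- silhouette(km$cluster, dist(toy_mat))

# average silhouette width; values closer to 1 indicate better separation
avg_width <- mean(sil[, "sil_width"])
avg_width
```

Repeating this calculation for several cluster numbers and comparing the averages is the basic idea behind the analysis described here.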

Usage

silhouette_analysis(
  mat,
  cluster_object = NULL,
  n_clusters = 2:10,
  n_repeats = 5,
  plot = TRUE
)

Arguments

`mat`	(numeric matrix) data matrix that clustering was performed on (or will be performed using k-means clustering)
`cluster_object`	(hclust) a cluster object obtained from running hclust(), optional
`n_clusters`	(numeric) a vector of cluster numbers for which silhouette analysis is performed
`n_repeats`	(numeric) scalar indicating the number of random permutations used for the analysis (default: 5)
`plot`	(logical) whether the function should also return a list of summary plots (default: TRUE)

Details

A prerequisite for silhouette analysis is a cluster object, which can be obtained by e.g. running hclust(d = dist(mat), method = "ward.D"). Alternatively, if no cluster object is supplied, the function performs kmeans() clustering for each of the indicated numbers of clusters.
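
The two routes described above can be sketched roughly as follows. This is not the package's implementation, only an illustration of the idea using silhouette() from the cluster package together with base R's hclust(), cutree() and kmeans(). The simulated data and all variable names (toy_mat, sil_hclust, sil_kmeans, best_hclust, best_kmeans) are assumptions made for this example.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# illustrative data: three groups of 30 rows each, 10 columns
toy_mat <- rbind(
  matrix(rnorm(300, mean = 0), ncol = 10),
  matrix(rnorm(300, mean = 4), ncol = 10),
  matrix(rnorm(300, mean = 8), ncol = 10)
)
d <- dist(toy_mat)
n_clusters <- 2:6

# route 1: cut an existing hclust tree at each cluster number
clust <- hclust(d, method = "ward.D")
sil_hclust <- sapply(n_clusters, function(k) {
  mean(silhouette(cutree(clust, k = k), d)[, "sil_width"])
})

# route 2: no cluster object, so run kmeans() for each cluster number
sil_kmeans <- sapply(n_clusters, function(k) {
  km <- kmeans(toy_mat, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# cluster number with the highest average silhouette width for each route
best_hclust <- n_clusters[which.max(sil_hclust)]
best_kmeans <- n_clusters[which.max(sil_kmeans)]
```

Because k-means starts from random centers, its result can vary between runs. Setting a seed and using several random starts (nstart) makes the sketch more reproducible.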

Value

A list with five objects:

`data`	silhouette analysis data for each iteration
`data_summary`	concise summary of the silhouette analysis data
`optimal_n_clust`	optimal number of clusters
`plot_clusters`	plot of silhouette widths for each number of clusters separately
`plot_summary`	summary plot of silhouette widths

Examples

# generate a random matrix that we use for clustering, with
# 100 rows (e.g. measured gene expression) and 10
# columns (conditions)
mat <- matrix(rnorm(1000), ncol = 10)

# we can perform clustering on this matrix using e.g. hclust:
# there is clearly no good separation between different clusters of 'genes'
clust <- hclust(dist(mat))
plot(clust)

# perform silhouette analysis for 2 to 10 different clusters;
# no cluster_object is supplied here, so kmeans() clustering is used
sil_result <- silhouette_analysis(mat, n_clusters = 2:10)

# plot results
print(sil_result$plot_clusters, split = c(1,1,2,1), more = TRUE)
print(sil_result$plot_summary, split = c(2,1,2,1))

m-jahn/R-tools documentation built on Feb. 5, 2023, 1:05 p.m.