In hestiri/kluster: A package for scalable approximation of the number of clusters

This is a full example of using the kluster package.

# install kluster
devtools::install_github("hestiri/kluster")

# #loading the required packages.
if (!require("easypackages")) install.packages('easypackages', repos = "http://cran.rstudio.com/")
packages("kluster","mclust","dplyr","DT","ggplot2","tictoc","clusterGeneration","factoextra","apcluster","fpc","vegan",prompt = F)

#loading the data
data(Breast_Cancer_Wisconsin)

The kluster package has three main functions:

kluster is the main kluster function. If an algorithm is not pre-defined, it will use the best implementation of kluster (most frequent product on BIC) for the production purpose. If a sample size is not pre-defined, it will use the recommended sample size (if n> 3000, sample size = 500, otherwise, sample size = 100) as default. If an iteration is not pre-set, it will iterate 100 times, as recommended through our simulation analyses.
kluster_sim performs simulation analysis to compare results of applying the original algorithm with kluster products. If a specific algorithm is not specified by the user, it will perform all original cluster number approximation algorithms and their associated kluster forms and will provide data for comparative analysis of the results as well as the processing time. The actual number of clusters needs to be provided for the function to calculate approximation error. Please not that if the dataset is large (i.2., > 50k), the original algorithms may not work and R will crash.
kluster_eval performs evaluation analysis on kluster implementations. If a specific algorithm is not specified by the user, it will perform the kluster implementations of all cluster number approximation algorithms will provide data for evaluation of the best algorithms as well as the processing time. The actual number of clusters needs to be provided for the function to calculate approximation error.

I will use data(Breast_Cancer_Wisconsin) for this demo. Below is a plot of the 2 clusters of benign and malignant tumors by texture mean in this dataset.

ggplot(dat)+
  geom_point(aes(x=area_mean, y= texture_mean,colour=diagnosis),size=3,alpha=0.3)+
  ggtitle("") +
  theme_bw() +
  theme(panel.grid.major.y = element_line(colour = "gray"),
        panel.grid.minor.y = element_blank(),
        axis.line = element_line(size=0.5, colour = "black"),
        panel.border = element_blank(), panel.background = element_blank(),
        plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
        text=element_text(family="Tahoma", face = "bold"),
        axis.text.x=element_text(colour="black", size = 10, face="plain"),
        axis.text.y=element_text(colour="black", size = 10, face="plain"),
        legend.position="bottom")

Now let's get the best estimate using the main kluster function -- 10 itertations of the BIC algorithm on 100 random samples drawn from data with replacement.

kluster.results <- kluster(data = dat[,c("area_mean","texture_mean")],
            iter_klust = 10, 
            smpl=100, 
            algorithm = "BIC")

The result is a dataframe including both mean and most frequent kluster product on BIC.

kluster.results

In production, use f_BIC_k directly use the most frequent output:

kluster.results$f_BIC_k

There are 4 algorithms implemented in kluster evaluation functions: BIC (Bayesian Information Criterion), PAMK (Partitioning Around Medoids), CAL (Calinski and Harabasz index), and AP (Affinity Propagation).

To test performance of the other algorithms against a known gold standard number of clusters implementation results:

eval <- data.frame(kluster_eval(data = dat[,c("area_mean","texture_mean")], 
                              clusters = 2,#known gold standard number of clusters
                              iter_sim = 1,#number of simulation iterations if need be more than 1
                              iter_klust = 10,#iteration for each algorithm
                              algorithm = "Default", #select analysis algorithm from BIC, PAMK, CAL, and AP
                              smpl = 100)$sim)

datatable(eval, options = list(pageLength = 8), filter = 'bottom')

e_mean and e_freq represent the respective error terms for the mean and most frequent kluster products on each algorithms.

Now we can test results of kluster across with the implementation of original algorithms using the kluster_sim function.

simulation.results <- kluster_sim(data = dat[,c("area_mean","texture_mean")], 
                       clusters = 2, 
                       iter_sim = 2, 
                       iter_klust = 10, 
                       smpl = 100, 
                       algorithm = "Default")$sim

datatable(simulation.results, options = list(pageLength = 12), filter = 'bottom')

e is the error term -- i.e., the difference between the known number of cluster and the approximated value based on each algorithm. Processing time (ptime) is also demonstrated.

email your questions/suggestions to estiri.hossein at gmail

hestiri/kluster documentation built on May 28, 2019, 8:55 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com