This is a full example of using the kluster package.
# install kluster
devtools::install_github("hestiri/kluster")

# load the required packages
if (!require("easypackages")) install.packages("easypackages", repos = "http://cran.rstudio.com/")
library(easypackages)  # make sure packages() is available after a fresh install
packages("kluster", "mclust", "dplyr", "DT", "ggplot2", "tictoc", "clusterGeneration",
         "factoextra", "apcluster", "fpc", "vegan", prompt = F)

# load the data
data(Breast_Cancer_Wisconsin)
The kluster package has three main functions:
kluster is the main kluster function. If an algorithm is not specified, it uses the best implementation of kluster (the most frequent product on BIC), which is intended for production use. If a sample size is not specified, it uses the recommended default (sample size = 500 if n > 3000, otherwise sample size = 100). If the number of iterations is not set, it iterates 100 times, as recommended by our simulation analyses.
kluster_sim performs simulation analysis to compare the results of the original algorithms with their kluster products. If the user does not specify an algorithm, it runs all of the original cluster number approximation algorithms and their associated kluster forms, and returns data for comparing both the results and the processing time. The actual number of clusters must be provided so that the function can calculate the approximation error. Please note that if the dataset is large (i.e., > 50k), the original algorithms may not work and R will crash.
kluster_eval performs evaluation analysis on the kluster implementations. If the user does not specify an algorithm, it runs the kluster implementations of all cluster number approximation algorithms and returns data for evaluating the best algorithms as well as the processing time. The actual number of clusters must be provided so that the function can calculate the approximation error.
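The default sample-size rule described for kluster above can be written as a one-line function. This is only a sketch of the rule as stated in the text; default_sample_size is a hypothetical helper name and is not part of the kluster package.

```r
# Hypothetical helper (not part of the kluster API): the default
# sample-size rule as described in the text.
default_sample_size <- function(n) {
  # n is the number of observations in the data
  if (n > 3000) 500 else 100
}

default_sample_size(1000)   # 100
default_sample_size(50000)  # 500
```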
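I will use data(Breast_Cancer_Wisconsin) for this demo. Below is a plot of the 2 clusters of benign and malignant tumors by area mean and texture mean in this dataset. The code below refers to the data as dat, so assign the loaded data frame to dat first if it is not already available under that name.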
ggplot(dat) +
  geom_point(aes(x = area_mean, y = texture_mean, colour = diagnosis), size = 3, alpha = 0.3) +
  ggtitle("") +
  theme_bw() +
  theme(panel.grid.major.y = element_line(colour = "gray"),
        panel.grid.minor.y = element_blank(),
        axis.line = element_line(size = 0.5, colour = "black"),
        panel.border = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
        text = element_text(family = "Tahoma", face = "bold"),
        axis.text.x = element_text(colour = "black", size = 10, face = "plain"),
        axis.text.y = element_text(colour = "black", size = 10, face = "plain"),
        legend.position = "bottom")
Now let's get the best estimate using the main kluster function: 10 iterations of the BIC algorithm, each on a random sample of 100 observations drawn from the data with replacement.
kluster.results <- kluster(data = dat[,c("area_mean","texture_mean")], iter_klust = 10, smpl=100, algorithm = "BIC")
The result is a dataframe including both mean and most frequent kluster product on BIC.
kluster.results
In production, use f_BIC_k directly, which is the most frequent output:
kluster.results$f_BIC_k
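To make the idea of a "mean" and a "most frequent" product concrete, here is a minimal base R sketch of the resampling procedure described above. It is not the kluster source code: resample_estimates and toy_estimator are hypothetical names, and the estimator is a stand-in that would be replaced by a real cluster number approximation method such as a BIC-based one.

```r
# Hypothetical sketch, not the kluster source: summarise repeated
# cluster-number estimates taken on samples drawn with replacement.
resample_estimates <- function(data, estimate_k, iter_klust = 100, smpl = 100) {
  # One estimate per iteration, each on `smpl` rows drawn with replacement.
  ks <- vapply(seq_len(iter_klust), function(i) {
    rows <- sample(nrow(data), size = smpl, replace = TRUE)
    as.numeric(estimate_k(data[rows, , drop = FALSE]))
  }, numeric(1))

  counts <- table(ks)
  list(
    mean_k = mean(ks),                                     # mean product
    freq_k = as.numeric(names(counts)[which.max(counts)])  # most frequent product
  )
}

# Stand-in estimator for illustration only: always reports 2 clusters.
toy_estimator <- function(x) 2

set.seed(1)
toy_data <- data.frame(a = rnorm(300), b = rnorm(300))
res <- resample_estimates(toy_data, toy_estimator, iter_klust = 10, smpl = 100)
res$freq_k
```

In this sketch, ties for the most frequent value are resolved in favour of the smallest estimate, because which.max returns the first maximum. How kluster itself handles ties is not stated here.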
There are 4 algorithms implemented in kluster evaluation functions: BIC (Bayesian Information Criterion), PAMK (Partitioning Around Medoids), CAL (Calinski and Harabasz index), and AP (Affinity Propagation).
To test the performance of the kluster implementations of these algorithms against a known gold-standard number of clusters:
eval <- data.frame(kluster_eval(data = dat[, c("area_mean", "texture_mean")],
                                clusters = 2,          # known gold standard number of clusters
                                iter_sim = 1,          # number of simulation iterations if need be more than 1
                                iter_klust = 10,       # iteration for each algorithm
                                algorithm = "Default", # select analysis algorithm from BIC, PAMK, CAL, and AP
                                smpl = 100)$sim)
datatable(eval, options = list(pageLength = 8), filter = 'bottom')
e_mean and e_freq are the error terms for the mean and most frequent kluster products, respectively, for each algorithm.
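As a rough illustration of these two error terms, the sketch below computes them for made-up values. The helper name approx_error and the sign convention (estimate minus known value) are assumptions, and the numbers are illustrative rather than output from kluster_eval.

```r
# Hypothetical helper: error of an estimate against the known cluster count.
# The sign convention (estimate minus truth) is an assumption.
approx_error <- function(estimated_k, true_k) estimated_k - true_k

true_k <- 2                          # known gold standard number of clusters
e_mean <- approx_error(2.4, true_k)  # illustrative mean product of 2.4
e_freq <- approx_error(2, true_k)    # illustrative most frequent product of 2
```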
Now we can compare the kluster results with those of the original algorithms using the kluster_sim function.
simulation.results <- kluster_sim(data = dat[,c("area_mean","texture_mean")], clusters = 2, iter_sim = 2, iter_klust = 10, smpl = 100, algorithm = "Default")$sim
datatable(simulation.results, options = list(pageLength = 12), filter = 'bottom')
e is the error term, i.e., the difference between the known number of clusters and the value approximated by each algorithm. Processing time (ptime) is also reported.
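For readers who want to record the same two quantities for their own estimator, the following base R sketch shows one way to do it. run_with_timing and toy_estimator are hypothetical names, the error sign is an assumption, and the timing uses proc.time rather than whatever kluster_sim uses internally.

```r
# Hypothetical wrapper: run an estimator once and record its error against
# the known cluster count plus the elapsed time in seconds.
run_with_timing <- function(estimate_k, data, true_k) {
  start <- proc.time()[["elapsed"]]
  k <- estimate_k(data)
  ptime <- proc.time()[["elapsed"]] - start
  data.frame(k = k, e = k - true_k, ptime = ptime)
}

# Stand-in estimator for illustration only: always reports 3 clusters.
toy_estimator <- function(x) 3

timing <- run_with_timing(toy_estimator, data.frame(a = 1:10), true_k = 2)
timing
```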
Email your questions/suggestions to estiri.hossein at gmail.