This is a full example of using the kluster
package.
# install kluster
devtools::install_github("hestiri/kluster")

# loading the required packages
if (!require("easypackages")) install.packages('easypackages', repos = "http://cran.rstudio.com/")
packages("kluster","mclust","dplyr","DT","ggplot2","tictoc","clusterGeneration","factoextra","apcluster","fpc","vegan",prompt = F)

# loading the data
data(Breast_Cancer_Wisconsin)
The kluster
package has three main functions:
kluster
is the main kluster function. If no algorithm is specified, it uses the best kluster implementation (the most frequent product on BIC), which is the one intended for production use. If no sample size is specified, it uses the recommended default: 500 if n > 3000, and 100 otherwise. If the number of iterations is not set, it iterates 100 times, as recommended by our simulation analyses.
kluster_sim
performs simulation analysis to compare the results of the original algorithms with their kluster products. If the user does not specify an algorithm, it runs all of the original cluster number approximation algorithms and their associated kluster forms, and returns data for comparing both the results and the processing time. The actual number of clusters must be provided so that the function can calculate the approximation error. Please note that if the dataset is large (i.e., > 50k), the original algorithms may not work and R may crash.
kluster_eval
performs evaluation analysis on kluster implementations. If the user does not specify an algorithm, it runs the kluster implementations of all cluster number approximation algorithms and returns data for identifying the best algorithms, along with the processing time. The actual number of clusters must be provided so that the function can calculate the approximation error.
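The default sample-size rule described for the main kluster function can be written as a one-line helper. This is only a sketch of the rule as stated in the text; `default_sample_size` is a hypothetical name and is not part of the package's API.

```r
# Hypothetical helper (not exported by kluster): mirrors the documented
# default, i.e. sample size 500 when n > 3000 and 100 otherwise.
default_sample_size <- function(n) {
  if (n > 3000) 500 else 100
}

default_sample_size(1000)   # 100
default_sample_size(50000)  # 500
```

In practice you would only need this logic if you wanted to reproduce the default explicitly, for example when passing the same value of smpl to several functions.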
I will use data(Breast_Cancer_Wisconsin)
for this demo. Below is a plot of the 2 clusters of benign and malignant tumors by area mean and texture mean in this dataset.
ggplot(dat) +
  geom_point(aes(x = area_mean, y = texture_mean, colour = diagnosis), size = 3, alpha = 0.3) +
  ggtitle("") +
  theme_bw() +
  theme(panel.grid.major.y = element_line(colour = "gray"),
        panel.grid.minor.y = element_blank(),
        axis.line = element_line(size = 0.5, colour = "black"),
        panel.border = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
        text = element_text(family = "Tahoma", face = "bold"),
        axis.text.x = element_text(colour = "black", size = 10, face = "plain"),
        axis.text.y = element_text(colour = "black", size = 10, face = "plain"),
        legend.position = "bottom")
Now let's get the best estimate using the main kluster
function: 10 iterations of the BIC algorithm, each on a random sample of 100 observations drawn from the data with replacement.
kluster.results <- kluster(data = dat[,c("area_mean","texture_mean")], iter_klust = 10, smpl=100, algorithm = "BIC")
The result is a data frame containing both the mean and the most frequent kluster product on BIC.
kluster.results
In production, use f_BIC_k
to use the most frequent output directly:
kluster.results$f_BIC_k
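To make the difference between the mean and the most frequent product concrete, here is a minimal base R sketch of the resample-and-aggregate idea. It is an illustration under stated assumptions, not the package's internal code: `resample_estimates`, `summarise_estimates`, and `estimate_k` are hypothetical names, and `estimate_k` stands in for any cluster number approximation algorithm.

```r
# Hypothetical illustration, not kluster's internal implementation.
# `estimate_k` is any function that takes a data frame and returns
# a single estimated number of clusters.
resample_estimates <- function(data, estimate_k, iter = 10, smpl = 100) {
  vapply(seq_len(iter), function(i) {
    # draw `smpl` rows with replacement, as described in the text
    idx <- sample(nrow(data), size = smpl, replace = TRUE)
    estimate_k(data[idx, , drop = FALSE])
  }, numeric(1))
}

# Aggregate the per-iteration estimates into the two products.
summarise_estimates <- function(k) {
  counts <- table(k)
  data.frame(
    mean_k = mean(k),
    # which.max() picks the first maximum, so ties go to the smallest value
    freq_k = as.numeric(names(counts)[which.max(counts)])
  )
}

summarise_estimates(c(2, 2, 3, 2, 4))  # mean_k = 2.6, freq_k = 2
```

The most frequent value is always one of the observed estimates, whereas the mean can land between two cluster counts, which is one reason to prefer the most frequent product when a single integer is needed.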
There are 4 algorithms implemented in kluster evaluation functions: BIC (Bayesian Information Criterion), PAMK (Partitioning Around Medoids), CAL (Calinski and Harabasz index), and AP (Affinity Propagation).
To test the performance of the kluster implementations of these algorithms against a known gold-standard number of clusters:
eval <- data.frame(kluster_eval(data = dat[,c("area_mean","texture_mean")],
                                clusters = 2,          # known gold standard number of clusters
                                iter_sim = 1,          # number of simulation iterations if need be more than 1
                                iter_klust = 10,       # iteration for each algorithm
                                algorithm = "Default", # select analysis algorithm from BIC, PAMK, CAL, and AP
                                smpl = 100)$sim)
datatable(eval, options = list(pageLength = 8), filter = 'bottom')
e_mean
and e_freq
represent the respective error terms for the mean and most frequent kluster
products for each algorithm.
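As a small worked example of these error terms, the sketch below computes them from made-up values. The numbers are hypothetical, and the sign convention (estimate minus known value) is an assumption; the package may define the difference the other way around.

```r
# Hypothetical values, for illustration only
clusters <- 2    # known gold standard number of clusters
mean_k   <- 2.6  # mean kluster product across iterations
freq_k   <- 2    # most frequent kluster product across iterations

# Assumed sign convention: estimate minus the known number of clusters
e_mean <- mean_k - clusters
e_freq <- freq_k - clusters
```

With these values the most frequent product recovers the known number of clusters exactly, while the mean overshoots it slightly.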
Now we can compare the results of kluster
with those of the original algorithms using the kluster_sim
function.
simulation.results <- kluster_sim(data = dat[,c("area_mean","texture_mean")], clusters = 2, iter_sim = 2, iter_klust = 10, smpl = 100, algorithm = "Default")$sim
datatable(simulation.results, options = list(pageLength = 12), filter = 'bottom')
e
is the error term, i.e., the difference between the known number of clusters and the value approximated by each algorithm. Processing time (ptime
) is also reported.
email your questions/suggestions to estiri.hossein at gmail