This is a full example of using the kluster package.

# install kluster
devtools::install_github("hestiri/kluster")

# load the required packages
if (!require("easypackages")) {
  install.packages("easypackages", repos = "http://cran.rstudio.com/")
  library(easypackages)  # attach after a fresh install so packages() is available
}
packages("kluster","mclust","dplyr","DT","ggplot2","tictoc","clusterGeneration","factoextra","apcluster","fpc","vegan",prompt = FALSE)

# load the data (the code below refers to the loaded data frame as dat)
data(Breast_Cancer_Wisconsin)

The kluster package has three main functions: kluster, kluster_eval, and kluster_sim. Each is demonstrated below.

I will use the Breast_Cancer_Wisconsin dataset for this demo. Below is a plot of its two clusters, benign and malignant tumors, by area mean and texture mean.

ggplot(dat)+
  geom_point(aes(x=area_mean, y= texture_mean,colour=diagnosis),size=3,alpha=0.3)+
  ggtitle("") +
  theme_bw() +
  theme(panel.grid.major.y = element_line(colour = "gray"),
        panel.grid.minor.y = element_blank(),
        axis.line = element_line(size=0.5, colour = "black"),
        panel.border = element_blank(), panel.background = element_blank(),
        plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
        text=element_text(family="Tahoma", face = "bold"),
        axis.text.x=element_text(colour="black", size = 10, face="plain"),
        axis.text.y=element_text(colour="black", size = 10, face="plain"),
        legend.position="bottom")

Now let's get the best estimate using the main kluster function: 10 iterations of the BIC algorithm, each on a random sample of 100 observations drawn from the data with replacement.

kluster.results <- kluster(data = dat[,c("area_mean","texture_mean")],
            iter_klust = 10, 
            smpl=100, 
            algorithm = "BIC")

The result is a data frame that includes both the mean and the most frequent number of clusters estimated by BIC across the iterations.

kluster.results

In production, use the most frequent output, f_BIC_k, directly:

kluster.results$f_BIC_k
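
The resample-and-aggregate idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not kluster's actual implementation: `resample_k` and the simulated `toy` data are hypothetical, the search range `G = 1:9` is an arbitrary choice, and the BIC-based estimate comes from mclust's `Mclust()`.

```r
library(mclust)  # Mclust() picks the number of mixture components by BIC

set.seed(42)
# Simulated stand-in data: two well-separated 2-D Gaussian clusters
toy <- rbind(
  matrix(rnorm(400, mean = 0), ncol = 2),
  matrix(rnorm(400, mean = 6), ncol = 2)
)

# Hypothetical helper (not part of kluster): estimate k on repeated small samples
resample_k <- function(x, iter = 10, smpl = 100) {
  k <- vapply(seq_len(iter), function(i) {
    idx <- sample(nrow(x), smpl, replace = TRUE)        # sample rows with replacement
    fit <- Mclust(x[idx, , drop = FALSE], G = 1:9, verbose = FALSE)
    as.numeric(fit$G)                                   # BIC-optimal number of clusters
  }, numeric(1))
  c(mean_k = mean(k),                                   # mean estimate across iterations
    freq_k = as.numeric(names(which.max(table(k)))))    # most frequent estimate
}

est <- resample_k(toy)
print(est)
```

Each iteration fits the model on only `smpl` rows, so the cost per fit stays small regardless of the size of the full dataset; the two summaries play the same role as the mean and most frequent values in the kluster output.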

Four algorithms are implemented in the kluster evaluation functions: BIC (Bayesian Information Criterion), PAMK (Partitioning Around Medoids), CAL (Calinski and Harabasz index), and AP (Affinity Propagation).

To test the performance of these algorithms against a known gold-standard number of clusters, use kluster_eval:

eval <- data.frame(kluster_eval(data = dat[,c("area_mean","texture_mean")], 
                              clusters = 2,#known gold standard number of clusters
                              iter_sim = 1,#number of simulation iterations if need be more than 1
                              iter_klust = 10,#iteration for each algorithm
                              algorithm = "Default", #select analysis algorithm from BIC, PAMK, CAL, and AP
                              smpl = 100)$sim)
datatable(eval, options = list(pageLength = 8), filter = 'bottom')

e_mean and e_freq are the error terms for the mean and the most frequent kluster estimates, respectively, for each algorithm.
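
As a rough illustration of how such error terms could be computed, here is a minimal base-R sketch. The per-iteration estimates in `k_hat` are made up, and the sign convention (estimate minus known value) is an assumption; the package may define the difference the other way around.

```r
known_k <- 2                                # gold-standard number of clusters
k_hat   <- c(2, 2, 3, 2, 2, 4, 2, 2, 3, 2)  # made-up per-iteration estimates

mean_k <- mean(k_hat)                                  # mean estimate
freq_k <- as.numeric(names(which.max(table(k_hat))))   # most frequent estimate

# Assumed convention: error = estimate - known number of clusters
e_mean <- mean_k - known_k
e_freq <- freq_k - known_k
print(c(e_mean = e_mean, e_freq = e_freq))
```

Here the mean estimate is 2.4 and the most frequent estimate is 2, so e_mean is 0.4 while e_freq is 0. The most frequent value is less sensitive to occasional overestimates than the mean.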

Now we can compare the kluster results with those of the original algorithm implementations using the kluster_sim function.

simulation.results <- kluster_sim(data = dat[,c("area_mean","texture_mean")], 
                       clusters = 2, 
                       iter_sim = 2, 
                       iter_klust = 10, 
                       smpl = 100, 
                       algorithm = "Default")$sim
datatable(simulation.results, options = list(pageLength = 12), filter = 'bottom')

e is the error term, i.e., the difference between the known number of clusters and the value estimated by each algorithm. Processing time (ptime) is also reported.

Email your questions/suggestions to estiri.hossein at gmail.



hestiri/kluster documentation built on May 28, 2019, 8:55 p.m.