OTclust is an R package for computing a mean partition of an ensemble of clustering results by optimal transport alignment (OTA) and for assessing uncertainty at the levels of both partition and individual clusters. To measure uncertainty, set relationships between clusters in multiple clustering results are revealed. Functions are provided to compute the Covering Point Set (CPS), Cluster Alignment and Points based (CAP) separability, and Wasserstein distance between partitions.
library(OTclust)
data(sim1)
Here, we illustrate the usage of OTclust for ensemble clustering on the simulated toy example sim1.
C = 4
# load the saved objects stored in ens.data.rda and OTA.rda
load('ens.data.rda')
load('OTA.rda')
# the number of clusters.
C = 4
# generate an ensemble of perturbed partitions.
# if perturb_method is 1, the data are perturbed by bootstrap resampling;
# if it is 0, they are perturbed by adding Gaussian noise.
ens.data = ensemble(sim1$X, nbs=100, clust_param=C, clustering="kmeans", perturb_method=1)
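To make the idea of a bootstrap-perturbed ensemble concrete, here is a minimal base R sketch. It is not OTclust's implementation: the helper name bootstrap_ensemble, the toy data, and the rule of labeling every original point by its nearest bootstrap centroid are all assumptions made for illustration.

```r
# Hypothetical helper, not part of OTclust: build an ensemble of k-means
# partitions by bootstrap resampling, using base R only.
bootstrap_ensemble <- function(X, k, nbs = 20) {
  X <- as.matrix(X)
  n <- nrow(X)
  labels <- matrix(NA_integer_, nrow = n, ncol = nbs)
  for (b in seq_len(nbs)) {
    # Draw a bootstrap sample of the rows and cluster it.
    idx <- sample.int(n, n, replace = TRUE)
    fit <- kmeans(X[idx, , drop = FALSE], centers = k, nstart = 5)
    # Assumption of this sketch: label every original point by the
    # nearest centroid fitted on the bootstrap sample.
    d2 <- sapply(seq_len(k), function(j) {
      rowSums(sweep(X, 2, fit$centers[j, ])^2)
    })
    labels[, b] <- max.col(-d2, ties.method = "first")
  }
  labels
}

set.seed(1)
# Toy data: two Gaussian blobs, a stand-in for sim1$X.
X_toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 6), ncol = 2))
ens_toy <- bootstrap_ensemble(X_toy, k = 2, nbs = 10)
dim(ens_toy)  # one row per point, one column per perturbed partition
```

Each column is one partition of the same points, which is the kind of input a consensus method then has to reconcile, since cluster labels are arbitrary from one column to the next.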
To find a consensus partition, the function otclust computes the mean partition of the ensemble together with uncertainty statistics:
# find mean partition and uncertainty statistics.
ota = otclust(ens.data)
# calculate baseline method for comparison.
kcl = kmeans(sim1$X, C)
# align clustering results for convenience of comparison.
compar = align(cbind(sim1$z, kcl$cluster, ota$meanpart))
lab.match = lapply(compar$weight, function(x) apply(x, 2, which.max))
kcl.algnd = match(kcl$cluster, lab.match[[1]])
ota.algnd = match(ota$meanpart, lab.match[[2]])
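The relabeling step above picks, for each cluster, the best-matching cluster via which.max on a matching-weight matrix. The same idea can be sketched in base R with a plain contingency table in place of optimal transport weights. The helper relabel_by_overlap is hypothetical and is a simpler stand-in, not what align computes.

```r
# Hypothetical helper: relabel `pred` so that each predicted cluster takes the
# label of the reference cluster it overlaps with most.
relabel_by_overlap <- function(ref, pred) {
  tab <- table(pred, ref)  # rows: predicted labels, columns: reference labels
  best <- colnames(tab)[apply(tab, 1, which.max)]
  names(best) <- rownames(tab)
  as.integer(unname(best[as.character(pred)]))
}

ref  <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
pred <- c(2, 2, 2, 3, 3, 1, 1, 1, 1)  # permuted labels, one point misassigned
relabel_by_overlap(ref, pred)
# returns 1 1 1 2 2 3 3 3 3
```

Majority-overlap matching can map two predicted clusters to the same reference label when clusters are split or merged; handling those cases in a principled way is what the optimal transport alignment is for.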
# plot the result on two dimensional space.
otplot(sim1$X, sim1$z, con=FALSE, title='Truth')              # ground truth
otplot(sim1$X, kcl.algnd, con=FALSE, title='Kmeans')          # baseline method
otplot(sim1$X, ota.algnd, con=FALSE, title='Mean partition')  # mean partition by OTclust
Here, as cluster-wise uncertainty measures, we briefly introduce the usage of the topological relationship statistics of the mean partition, Cluster Alignment and Points based (CAP) separability, and Covering Point Sets (CPS). Detailed definitions of these statistics can be found in [1]. To carry out CPS analysis, please see the next two sections.
# distance between ground truth and each partition
wassDist(sim1$z, kmeans(sim1$X, C)$cluster)  # baseline method
wassDist(sim1$z, ota$meanpart)               # mean partition by OTclust
# Topological relationships between mean partition and ensemble clusters
t(ota$match)
# Cluster Alignment and Points based (CAP) separability
ota$cap
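A Wasserstein distance between two partitions needs a ground cost between individual clusters. One natural choice is the Jaccard distance between the point sets of two clusters. The base R sketch below builds only that cost matrix. The helper jaccard_cost is hypothetical, and whether wassDist uses exactly this cost should be checked against [1] and the package documentation.

```r
# Hypothetical helper: pairwise Jaccard distance between the clusters of two
# partitions of the same points (1 - |A and B| / |A or B|).
jaccard_cost <- function(p1, p2) {
  l1 <- sort(unique(p1))
  l2 <- sort(unique(p2))
  cost <- matrix(0, length(l1), length(l2),
                 dimnames = list(as.character(l1), as.character(l2)))
  for (i in seq_along(l1)) {
    for (j in seq_along(l2)) {
      a <- p1 == l1[i]  # membership indicator of cluster i in p1
      b <- p2 == l2[j]  # membership indicator of cluster j in p2
      cost[i, j] <- 1 - sum(a & b) / sum(a | b)
    }
  }
  cost
}

p1 <- c(1, 1, 1, 2, 2, 2)
p2 <- c(1, 1, 2, 2, 2, 2)
cost_toy <- jaccard_cost(p1, p2)
cost_toy
```

An optimal transport solver would then move mass between the clusters of the two partitions, weighted for example by cluster proportions, so as to minimize total cost under this matrix. That step requires a linear programming routine and is left out of the sketch.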
# Covering Point Set (CPS)
otplot(sim1$X, ota$cps[lab.match[[2]][1],], legend.labels=c('','CPS'), add.text=FALSE, title='CPS for C1')
otplot(sim1$X, ota$cps[lab.match[[2]][2],], legend.labels=c('','CPS'), add.text=FALSE, title='CPS for C2')
otplot(sim1$X, ota$cps[lab.match[[2]][3],], legend.labels=c('','CPS'), add.text=FALSE, title='CPS for C3')
otplot(sim1$X, ota$cps[lab.match[[2]][4],], legend.labels=c('','CPS'), add.text=FALSE, title='CPS for C4')
The red areas in the plots above indicate the covering point set (CPS) of each cluster. CPS analysis is described in more detail in the following sections.
The functions used in this section are visCPS, mplot, and cplot.
# CPS analysis on selection of visualization methods
data(vis_pollen)
c = visCPS(vis_pollen$vis, vis_pollen$ref)
After the computation, we have the returned list c, which serves as the input to the functions mplot and cplot:
# visualization of the result
mplot(c, 2)
cplot(c, 2)
Furthermore, if you want to see the statistics, you can simply inspect the value returned by visCPS:
# overall tightness
c$tight_all
# cluster-wise tightness
c$tight
In this section, the functions used below are clustCPS, mplot, cplot, and pplot.
# CPS Analysis on validation of clustering result
data(YAN)
y = clustCPS(YAN, k=7, l=FALSE, pre=FALSE, noi="after", cmethod="kmeans", dimr="PCA", vis="tsne")
# visualization of the results
mplot(y, 4)
cplot(y, 4)
# point-wise stability assessment
p = pplot(y)
p$v
If you want to try other clustering method rather than
[1] Jia Li, Beomseok Seo, and Lin Lin. "Optimal transport, mean partition, and uncertainty assessment in cluster analysis." Statistical Analysis and Data Mining: The ASA Data Science Journal 12.5 (2019): 359-377.
[2] Lixiang Zhang, Lin Lin, and Jia Li. "CPS analysis: self-contained validation of biomedical data clustering." Bioinformatics 36.11 (2020): 3516-3521.