Stability and Significance Examples"

knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  comment = "#>"
)
library(clustAnalytics)

Scoring functions

To compute the scoring functions of a graph and its clusters, we call scoring_functions. Here we try it for the well known Zachary's karate club graph, using the faction memberships as clusters. type=global gives us the scores of each cluster, while type=global gives us the weighted mean of the local scores, plus some additional global scores

data(karate, package="igraphdata")
scoring_functions(karate, V(karate)$Faction, type="local")
scoring_functions(karate, V(karate)$Faction, type="global")

Additionally, each of the scores is available individually as its own function, grouped together in the package as the cluster scoring functions family. They are simply called as follows, with the graph and the membership vector as arguments, and return a vector with the scores for each community:

cut_ratio(karate, V(karate)$Faction)

Significance

To showcase the randomization process, we apply it to the Zachary's karate club graph, with the default settings (positive weights with no upper bound, which suits this graph):

data(karate, package="igraphdata")
E(karate)
rewired_karate <- rewireCpp(karate, weight_sel = "max_weight")
E(rewired_karate)

If the graph is directed, the rewireCpp function automatically detects it and internally runs the implementation for directed graphs. The following example is a food network (where edges indicate predator-prey relationships) from the igraphdata package:

data(foodwebs, package="igraphdata")
rewired_ChesLower <- rewireCpp(foodwebs$ChesLower, weight_sel = "max_weight")

Now we compute the scoring functions for the karate club graph. By default the clustering algorithms are Louvain, label propagation and Walktrap, but the function can take any list of clustering algorithms for igraph graphs.

# this corresponds to the club each member ended up with after the split, 
# which we could consider the ground truth clustering for this graph.
significance_table_karate <- evaluate_significance(karate,
                                                   alg_list=list(Louvain=cluster_louvain, 
                                                                 "label prop"= cluster_label_prop, 
                                                                 walktrap=cluster_walktrap),
                                                   gt_clustering=V(karate)$Faction)
significance_table_karate

With the function evaluate_significance_r we compute the scoring functions as above, and we compare the results to those of a distribution of randomized graphs.

significance_table_karate_r <- evaluate_significance_r(karate, 
                                                       alg_list=list(Louvain=cluster_louvain, 
                                                                 "label prop"= cluster_label_prop, 
                                                                 walktrap=cluster_walktrap),
                                                       gt_clustering=V(karate)$Faction,
                                                       weight_sel="max_weight",
                                                       n_reps=10)

Now we generate a graph from a stochastic block model in which we set very strong clusters (the elements in the diagonal of the matrix are much larger than the rest, so the probability of intra-cluster edges is much higher than that of inter-cluster edges).

pm <- matrix (c(.3, .001, .001, .003,
                .001, .2, .005, .002,
                .001, .005, .2, .001,
                .003, .002, .001, .3), nrow=4, ncol=4)
g_sbm <- sample_sbm(100, pref.matrix=pm, block.sizes=c(25,25,25,25))
E(g_sbm)$weight <- 1
significance_table_sbm <- evaluate_significance(g_sbm)
significance_table_sbm

And as an example of usage for a complete weighted graphs with weights bounded between 0 and 1, we have a graph built from correlations of currency exchange rate returns, in particular from January 2009 with the 13 most traded currencies. In this case we have to set w_max to 1 to keep the upper bound when rewiring the edges.

data(package="clustAnalytics",g_forex)
significance_table_karate_r <- evaluate_significance_r(karate,  
                                                   gt_clustering=V(karate)$Faction,
                                                   weight_sel = "const_var",
                                                   n_reps=5, w_max=1)

Stability

Here we perform a nonparametric bootstrap to the karate club graph and the same selection of algorithms. For each instance, the set of vertices is resampled, the induced graph is obtained by taking the new set of vertices with the induced edges from the original graph, and the clustering algorithms are applied. Then, these results are compared to the induced original clusterings using several metrics: the variation of information (VI), normalized reduced mutual information (NRMI) and both adjusted and regular Rand index (Rand and adRand):

b_karate <- boot_alg_list(g=karate, return_data=FALSE, R=99) 
b_karate

And the same for the stochastic block model graph:

b_sbm <- boot_alg_list(g=g_sbm, return_data=FALSE, R=99)
b_sbm

We can clearly see that for all metrics, the results are much more stable, which makes sense because we created the sbm graph with very strong clusters.



Try the clustAnalytics package in your browser

Any scripts or data that you put into this service are public.

clustAnalytics documentation built on May 29, 2024, 12:18 p.m.