#' Optimizing \code{k} And \code{I} For Clique Percolation Community Detection
#'
#' Function for determining threshold values (ratio of largest to second largest
#' community size, chi, entropy) across ranges of \code{k} and \code{I} values to help
#' decide on optimal \code{k} and \code{I}.
#'
#' @param W A qgraph object; see also \link[qgraph]{qgraph}
#' @param method A string indicating the method to use
#' (\code{"unweighted"}, \code{"weighted"}, or \code{"weighted.CFinder"}).
#' See \link{cpAlgorithm} for more information
#' @param k.range Integer or vector of \code{k} value(s) for which threshold(s) are determined;
#' see \link{cpAlgorithm} for more information
#' @param I.range Integer or vector of \code{I} value(s) for which threshold(s) are determined;
#' see \link{cpAlgorithm} for more information
#' @param threshold A string or vector indicating which threshold(s) to determine
#' (\code{"largest.components.ratio"}, \code{"chi"}, \code{"entropy"}, or
#' \code{"modularity"}); see Details
#'
#' @return
#' A data frame with columns for \code{k}, \code{I} (if \code{method = "weighted"}
#' or \code{method = "weighted.CFinder"}), number of communities, number of isolated
#' nodes, and results of the specified threshold(s).
#'
#' @details
#' Optimizing \code{k} (clique size) and \code{I} (intensity threshold) in clique percolation
#' community detection is a difficult task. Farkas et al. (2007) recommend looking at the
#' ratio of the largest to second largest community sizes
#' (\code{threshold = "largest.components.ratio"}) for very large networks, or at
#' the variance of the community sizes after removing the largest
#' community (\code{threshold = "chi"}) for somewhat smaller networks. These thresholds were
#' derived from percolation theory. If \code{I} for a certain \code{k} is too high, no
#' community will be identified. If \code{I} is too low, a giant community with all nodes
#' emerges. Just above this \code{I}, the distribution of community sizes often follows a
#' power law, which constitutes a broad community size distribution. Farkas et al. (2007)
#' point out that for such \code{I}, the ratio of the largest to second largest community
#' sizes is approximately 2, constituting one way to optimize \code{I} for each possible
#' \code{k}. For somewhat smaller networks, the ratio can be rather unstable. Instead,
#' Farkas et al. (2007, p. 8) propose looking at the variance of the community sizes after
#' removing the largest community. The idea is that when \code{I} is rather low, one giant
#' community and multiple equally small ones occur. Then, the variance of the community
#' sizes of the small communities (removing the giant community) is low. When \code{I}
#' is high, only a few equally small communities occur. Then, the variance of the
#' community sizes (after removing the largest community) is also low. In between,
#' the variance is maximal at some point, namely when the community size
#' distribution is maximally broad (power law-distributed). Thus, the maximal variance
#' can be used to optimize \code{I} for various \code{k}.
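#'
#' For illustration, a minimal R sketch of the chi computation as implemented in the
#' internal \code{formula_chi} helper below (the vector of community sizes is assumed
#' to be given):
#' \preformatted{
#' sizes <- c(2, 3, 7)                  # community sizes of one solution
#' sizes <- sort(sizes)[-length(sizes)] # remove the largest community
#' chi <- sum(sapply(seq_along(sizes),
#'                   function(v) sizes[v]^2 / sum(sizes[-v])^2))
#' chi                                  # 2^2/3^2 + 3^2/2^2 = 2.69
#' }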
#'
#' For very small networks, optimizing \code{k} and \code{I} based on the distribution of the
#' community sizes will be impossible, as too few communities will occur. Another possible
#' threshold for such networks is based on the entropy of the community sizes
#' (\code{threshold = "entropy"}). Entropy can be interpreted as an indicator of how
#' surprising the respective solution is. The formula used here is based on Shannon
#' information, namely
#' \deqn{-\sum_{i=1}^N p_i * \log_2 p_i}
#' with \eqn{p_i} being the probability that a node is part of community \eqn{i}. For instance,
#' if there are two communities, one of size 5 and one of size 3, the result would be
#' \deqn{-((5/8 * \log_2 5/8) + (3/8 * \log_2 3/8)) \approx 0.95}
#' When calculating entropy, the isolated nodes identified by clique percolation are treated
#' as a separate community, and nodes shared between communities contribute fractionally to
#' each of their communities. If there is only one community or there are only isolated
#' nodes, entropy is zero, indicating a minimally surprising solution. Compared to the ratio
#' and chi thresholds, entropy favors communities of equal size. Thus, it should not be
#' used for larger networks, for which a broader community size distribution is preferred.
#' Note that the entropy threshold has not yet been validated for clique percolation. However,
#' initial simulation studies indicate that it consistently detects surprising community
#' partitions in smaller networks, especially if there are cliques of larger \code{k}.
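#'
#' The example above can be verified in R (a minimal sketch; the two probabilities
#' are the community sizes divided by the total number of nodes):
#' \preformatted{
#' p <- c(5, 3) / 8   # membership probabilities
#' -sum(p * log2(p))  # 0.9544
#' }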
#'
#' The ratio threshold can be determined only if there are at least two communities; the
#' chi threshold can be determined only if there are at least three communities. If there
#' are not enough communities for the respective threshold, its value is NA in the data
#' frame. Entropy can always be determined. In addition, requesting \code{"modularity"}
#' computes the weighted modularity of each solution, treating shared nodes as members
#' of all of their communities.
#'
#' @examples
#' ## Example for unweighted networks
#'
#' # create qgraph object
#' W <- matrix(c(0,1,1,1,0,0,0,0,
#'               0,0,1,1,0,0,0,0,
#'               0,0,0,0,0,0,0,0,
#'               0,0,0,0,1,1,1,0,
#'               0,0,0,0,0,1,1,0,
#'               0,0,0,0,0,0,1,0,
#'               0,0,0,0,0,0,0,1,
#'               0,0,0,0,0,0,0,0), nrow = 8, ncol = 8, byrow = TRUE)
#' W <- Matrix::forceSymmetric(W)
#' W <- qgraph::qgraph(W)
#'
#' # determine entropy threshold for k = 3 and k = 4
#' results <- cpThreshold(W = W, method = "unweighted", k.range = c(3,4), threshold = "entropy")
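#'
#' # inspect the results; with threshold = "entropy", the data frame has one row per k
#' # and an Entropy.Threshold column
#' results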
#'
#' ## Example for weighted networks; three large communities
#' ## with I = 0.3, 0.2, and 0.1, respectively
#'
#' # create qgraph object
#' W <- matrix(c(0,0.10,0,0,0,0,0.10,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
#'               0,0,0.10,0,0,0,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
#'               0,0,0,0.10,0,0,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
#'               0,0,0,0,0.10,0,0,0.10,0.20,0,0,0,0,0.20,0.20,0,0,0,0,0,0,0,
#'               0,0,0,0,0,0.10,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
#'               0,0,0,0,0,0,0.10,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
#'               0,0,0,0,0,0,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
#'               0,0,0,0,0,0,0,0,0,0.20,0,0,0,0,0.20,0,0,0,0,0,0,0,
#'               0,0,0,0,0,0,0,0,0,0,0.20,0,0,0,0.20,0,0,0,0,0,0,0,
#'               0,0,0,0,0,0,0,0,0,0,0,0.20,0,0,0.20,0,0,0,0,0,0,0,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0.20,0,0.20,0,0,0,0,0,0,0,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0,0.20,0.20,0.30,0,0,0,0,0.30,0.30,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.20,0,0,0,0,0,0,0,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0,0,0,0.30,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0,0,0.30,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0,0.30,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0.30,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0.30,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,
#'               0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), nrow = 22, ncol = 22, byrow = TRUE)
#' W <- Matrix::forceSymmetric(W)
#' W <- qgraph::qgraph(W, layout = "spring", edge.labels = TRUE)
#'
#' # determine ratio, chi, and entropy thresholds for k = 3 and I from 0.3 to 0.09
#' results <- cpThreshold(W = W, method = "weighted", k.range = 3,
#'                        I.range = c(seq(0.3, 0.09, by = -0.01)),
#'                        threshold = c("largest.components.ratio","chi","entropy"))
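#'
#' # inspect the ratio column; Farkas et al. (2007) suggest choosing I where the ratio
#' # of the largest to second largest community size is approximately 2
#' results[which.min(abs(results$Ratio.Threshold - 2)), ]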
#'
#' @references
#' Farkas, I., Abel, D., Palla, G., & Vicsek, T. (2007). Weighted network modules.
#' \emph{New Journal of Physics, 9}, 180. https://doi.org/10.1088/1367-2630/9/6/180
#'
#' @author Jens Lange, \email{lange.jens@@outlook.com}
#'
#' @export cpThreshold
cpThreshold <- function(W, method = c("unweighted","weighted","weighted.CFinder"),
                        k.range, I.range,
                        threshold = c("largest.components.ratio","chi","entropy")){
  method <- match.arg(method) #resolve to a single, valid method string
  #convert a qgraph object to its weights matrix; otherwise use the input as is
  if (methods::is(W, "qgraph")) {
    Wmat <- qgraph::getWmat(W)
  } else {
    Wmat <- W
  }
  #helper implementing the chi formula: after dropping the largest community,
  #each remaining community size is squared and divided by the squared sum of
  #the other remaining sizes; the values are then summed
  formula_chi <- function(size_comm){
    size_comm_sort <- sort(size_comm)
    size_comm_sort_not_max <- size_comm_sort[-length(size_comm_sort)]
    value_chi <- c()
    for (v in seq_along(size_comm_sort_not_max)) {
      denominator <- sum(size_comm_sort_not_max[-v])^2
      value_chi[v] <- (size_comm_sort_not_max[v]^2)/denominator
    }
    result_formula_chi <- sum(value_chi)
    return(result_formula_chi)
  }
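  #worked example (illustrative): formula_chi(c(2, 3, 7)) drops the largest
  #size (7) and returns 2^2/3^2 + 3^2/2^2 = 2.69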
  #helper implementing the Shannon entropy formula over community sizes
  formula_entropy <- function(size_all){
    value_ent <- c()
    for (w in seq_along(size_all)) {
      value_ent[w] <- -((size_all[w] / sum(size_all)) * log2((size_all[w] / sum(size_all))))
    }
    result_formula_ent <- sum(value_ent)
    return(result_formula_ent)
  }
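  #worked example (illustrative): formula_entropy(c(5, 3)) returns 0.95,
  #matching the example in the Details section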
  #if input is a weighted network...
  #all threshold strategies can be implemented and they depend on k.range and I.range
  if (method == "weighted" | method == "weighted.CFinder") {
    ratio <- c()
    chi <- c()
    entropy <- c()
    modularity <- c()
    I_cp <- c()
    k_cp <- c()
    community <- c()
    isolated <- c()
    count <- 1
    min_I <- min(I.range)
    #prune edges that cannot be part of any k-clique with intensity >= min(I.range),
    #assuming edge weights of at most 1; the bound min_I^(k*(k-1)/2) is evaluated for
    #every k in k.range and the smallest bound is taken so pruning is safe for all k
    weight_threshold <- min(min_I ^ (k.range * (k.range - 1) / 2))
    Wmat[Wmat < weight_threshold] <- 0
    progress_bar <- utils::txtProgressBar(min = 0, max = length(k.range)*length(I.range),
                                          style = 3) #progress bar definition
    progress_bar_counter <- 0 #counter for progress bar
    for (k in k.range) {
      all_k_cliques <- calculate_all_clique_intensities(Wmat, k)
      #restrict to cliques at or above the minimum intensity threshold
      all_k_cliques <- threshold_cliques(all_k_cliques, min_I)
      for (i in I.range) {
        #as.numeric(as.character(i)) guards against floating point artifacts from seq()
        results <- cpAlgorithmRaw(Wmat, k = k, method = method,
                                  I = as.numeric(as.character(i)),
                                  all_k_cliques = all_k_cliques)
        #if there are at least two communities...
        #ratio threshold can be determined if requested
        if (length(results$list.of.communities.numbers) > 1 & "largest.components.ratio" %in% threshold) {
          size_dist_lcr <- c()
          for (j in seq_along(results$list.of.communities.numbers)) {
            size_dist_lcr[j] <- length(results$list.of.communities.numbers[[j]])
          }
          n_sizes_lcr <- length(size_dist_lcr)
          ratio[count] <- sort(size_dist_lcr)[n_sizes_lcr] / sort(size_dist_lcr)[n_sizes_lcr - 1]
        }
        #if there are fewer than two communities...
        #ratio threshold cannot be determined even if requested
        if (length(results$list.of.communities.numbers) < 2 & "largest.components.ratio" %in% threshold) {
          ratio[count] <- NA
        }
        #if there are at least three communities...
        #chi threshold can be determined if requested
        if (length(results$list.of.communities.numbers) > 2 & "chi" %in% threshold) {
          size_dist_chi <- c()
          for (j in seq_along(results$list.of.communities.numbers)) {
            size_dist_chi[j] <- length(results$list.of.communities.numbers[[j]])
          }
          chi[count] <- formula_chi(size_dist_chi)
        }
        #if there are fewer than three communities...
        #chi threshold cannot be determined even if requested
        if (length(results$list.of.communities.numbers) < 3 & "chi" %in% threshold) {
          chi[count] <- NA
        }
        #entropy can always be calculated
        #when all nodes are isolated, entropy is zero, as the formula reduces to -(1 * log2(1)) = 0
        #the entropy formula needs a vector with the size of each community plus the number of isolated nodes
        #with shared nodes, the probabilities in the formula would add up to a value > 1
        #to circumvent that, shared nodes are divided between their communities
        #(e.g., if two communities share a node, each community gets 0.5 instead of 1)
        #the code below creates the size vector and subtracts the respective values for shared nodes
        if ("entropy" %in% threshold) {
          if (length(results$list.of.communities.numbers) > 0) {
            size_dist_ent <- c()
            for (j in seq_along(results$list.of.communities.numbers)) {
              size_dist_ent[j] <- length(results$list.of.communities.numbers[[j]])
            }
          } else {size_dist_ent <- c()}
          if (length(results$shared.nodes.numbers) > 0) {
            for (s in results$shared.nodes.numbers) {
              counter <- c()
              for (c in seq_along(results$list.of.communities.numbers)) {
                if (s %in% results$list.of.communities.numbers[[c]]) {
                  counter <- c(counter,c)
                }
              }
              for (a in counter) {
                size_dist_ent[a] <- size_dist_ent[a] - (1 - 1/length(counter))
              }
            }
          }
          if (length(results$isolated.nodes.numbers) > 0) {
            isolated_ent <- length(results$isolated.nodes.numbers)
          } else {isolated_ent <- c()}
          size_combined <- c(size_dist_ent,isolated_ent)
          entropy[count] <- formula_entropy(size_combined)
        }
        if ("modularity" %in% threshold) {
          modularity[count] <- weighted_modularity(results$list.of.communities.numbers, Wmat)
        }
        I_cp[count] <- i
        k_cp[count] <- k
        community[count] <- length(results$list.of.communities.numbers)
        isolated[count] <- length(results$isolated.nodes.numbers)
        count <- count + 1
        #progress bar update
        progress_bar_counter <- progress_bar_counter + 1
        utils::setTxtProgressBar(progress_bar, progress_bar_counter)
      }
    }
    close(progress_bar)
  }
  #if input is an unweighted network...
  #all threshold strategies can be implemented and they depend only on k.range
  if (method == "unweighted") {
    ratio <- c()
    chi <- c()
    entropy <- c()
    modularity <- c()
    k_cp <- c()
    community <- c()
    isolated <- c()
    count <- 1
    progress_bar <- utils::txtProgressBar(min = 0, max = length(k.range),
                                          style = 3) #progress bar definition
    progress_bar_counter <- 0 #counter for progress bar
    for (k in k.range) {
      results <- cpAlgorithmRaw(Wmat, k = k, method = method)
      #if there are at least two communities...
      #ratio threshold can be determined if requested
      if (length(results$list.of.communities.numbers) > 1 & "largest.components.ratio" %in% threshold) {
        size_dist_lcr <- c()
        for (j in seq_along(results$list.of.communities.numbers)) {
          size_dist_lcr[j] <- length(results$list.of.communities.numbers[[j]])
        }
        n_sizes_lcr <- length(size_dist_lcr)
        ratio[count] <- sort(size_dist_lcr)[n_sizes_lcr] / sort(size_dist_lcr)[n_sizes_lcr - 1]
      }
      #if there are fewer than two communities...
      #ratio threshold cannot be determined even if requested
      if (length(results$list.of.communities.numbers) < 2 & "largest.components.ratio" %in% threshold) {
        ratio[count] <- NA
      }
      #if there are at least three communities...
      #chi threshold can be determined if requested
      if (length(results$list.of.communities.numbers) > 2 & "chi" %in% threshold) {
        size_dist_chi <- c()
        for (j in seq_along(results$list.of.communities.numbers)) {
          size_dist_chi[j] <- length(results$list.of.communities.numbers[[j]])
        }
        chi[count] <- formula_chi(size_dist_chi)
      }
      #if there are fewer than three communities...
      #chi threshold cannot be determined even if requested
      if (length(results$list.of.communities.numbers) < 3 & "chi" %in% threshold) {
        chi[count] <- NA
      }
      #entropy can always be calculated
      #when all nodes are isolated, entropy is zero, as the formula reduces to -(1 * log2(1)) = 0
      #the entropy formula needs a vector with the size of each community plus the number of isolated nodes
      #with shared nodes, the probabilities in the formula would add up to a value > 1
      #to circumvent that, shared nodes are divided between their communities
      #(e.g., if two communities share a node, each community gets 0.5 instead of 1)
      #the code below creates the size vector and subtracts the respective values for shared nodes
      if ("entropy" %in% threshold) {
        if (length(results$list.of.communities.numbers) > 0) {
          size_dist_ent <- c()
          for (j in seq_along(results$list.of.communities.numbers)) {
            size_dist_ent[j] <- length(results$list.of.communities.numbers[[j]])
          }
        } else {size_dist_ent <- c()}
        if (length(results$shared.nodes.numbers) > 0) {
          for (s in results$shared.nodes.numbers) {
            counter <- c()
            for (c in seq_along(results$list.of.communities.numbers)) {
              if (s %in% results$list.of.communities.numbers[[c]]) {
                counter <- c(counter,c)
              }
            }
            for (a in counter) {
              size_dist_ent[a] <- size_dist_ent[a] - (1 - 1/length(counter))
            }
          }
        }
        if (length(results$isolated.nodes.numbers) > 0) {
          isolated_ent <- length(results$isolated.nodes.numbers)
        } else {isolated_ent <- c()}
        size_combined <- c(size_dist_ent,isolated_ent)
        entropy[count] <- formula_entropy(size_combined)
      }
      if ("modularity" %in% threshold) {
        modularity[count] <- weighted_modularity(results$list.of.communities.numbers, Wmat)
      }
      k_cp[count] <- k
      community[count] <- length(results$list.of.communities.numbers)
      isolated[count] <- length(results$isolated.nodes.numbers)
      count <- count + 1
      #progress bar update
      progress_bar_counter <- progress_bar_counter + 1
      utils::setTxtProgressBar(progress_bar, progress_bar_counter)
    }
    close(progress_bar)
  }
  #return data frame with the respective values and add thresholds depending on request
  #first, create data set with information about simulation parameters depending on method
  if (method == "weighted" | method == "weighted.CFinder") {
    data <- data.frame(cbind(k_cp,I_cp,community,isolated))
    names(data) <- c("k","Intensity","Number.of.Communities","Number.of.Isolated.Nodes")
  }
  if (method == "unweighted") {
    data <- data.frame(cbind(k_cp,community,isolated))
    names(data) <- c("k","Number.of.Communities","Number.of.Isolated.Nodes")
  }
  #second, add requested thresholds
  if ("largest.components.ratio" %in% threshold) {
    names_data_lcr <- c(names(data),"Ratio.Threshold")
    data <- data.frame(cbind(data,ratio))
    names(data) <- names_data_lcr
  }
  if ("chi" %in% threshold) {
    names_data_chi <- c(names(data),"Chi.Threshold")
    data <- data.frame(cbind(data,chi))
    names(data) <- names_data_chi
  }
  if ("entropy" %in% threshold) {
    names_data_ent <- c(names(data),"Entropy.Threshold")
    data <- data.frame(cbind(data,entropy))
    names(data) <- names_data_ent
  }
  if ("modularity" %in% threshold) {
    names_data_mod <- c(names(data),"Modularity.Threshold")
    data <- data.frame(cbind(data,modularity))
    names(data) <- names_data_mod
  }
  return(data)
}
#Newman's weighted modularity for a (possibly overlapping) community partition;
#each node pair within at least one shared community counts exactly once
weighted_modularity <- function(comms, A) {
  two_m <- sum(A) #sum(A) counts every undirected edge twice, i.e., 2m
  strength <- rowSums(A)
  expected <- outer(strength, strength, "*") / two_m #null model term k_i*k_j/(2m)
  Cij <- matrix(0, nrow(A), ncol(A), dimnames = dimnames(A))
  for (i in seq_along(comms))
    Cij[comms[[i]], comms[[i]]] <- 1
  diag(Cij) <- 0 #self-pairs are excluded
  sum((A - expected) * Cij) / two_m
}
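
#minimal usage sketch (illustrative only; A_toy is a hypothetical toy matrix, not
#part of the package): two weighted communities joined by one weak edge
#A_toy <- matrix(0, 4, 4)
#A_toy[1, 2] <- A_toy[2, 1] <- 1    #community 1: nodes 1 and 2
#A_toy[3, 4] <- A_toy[4, 3] <- 1    #community 2: nodes 3 and 4
#A_toy[2, 3] <- A_toy[3, 2] <- 0.5  #weak between-community edge
#weighted_modularity(list(c(1, 2), c(3, 4)), A_toy) #0.56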