ExoLabel | R Documentation
Runs Fast Label Propagation using disk space for constant memory complexity.
ExoLabel(edgelistfiles,
outfile=tempfile(),
mode=c("undirected", "directed"),
add_self_loops=FALSE,
ignore_weights=FALSE,
normalize_weights=FALSE,
iterations=0L,
inflation=1.05,
return_table=FALSE,
consensus_cluster=FALSE,
verbose=interactive(),
sep='\t',
tempfiledir=tempdir())
edgelistfiles
Character vector of files to be processed. Each entry should be a machine-interpretable path to an edgelist file. See Details for expected format.
outfile
File to write final clusters to. Optional, defaults to a temporary file.
mode
String specifying whether edges should be interpreted as undirected (default) or directed. Can be "undirected", "directed", or an unambiguous abbreviation.
add_self_loops
Should self loops be added to the network? Defaults to FALSE; if TRUE, a self loop is added to every vertex.
ignore_weights
Should weights be ignored? Defaults to FALSE; if TRUE, all edges are treated as unweighted and any weights present in the input files are ignored.
normalize_weights
Should weights be normalized? Defaults to FALSE; if TRUE, edge weights are normalized prior to clustering.
iterations
Maximum number of times to process each node. If set to zero or a negative value, no cap is applied and nodes are processed until the algorithm converges. Defaults to 0L. See "Algorithm Convergence" below.
inflation
Inflation parameter for edges. See "Algorithm Convergence" below for a description of this parameter. Higher values speed up algorithm convergence but produce smaller clusters. Defaults to 1.05; set to 1.0 to disable inflation.
return_table
Should the result of clustering be returned as a file or as a data.frame? If TRUE, a data.frame is returned; if FALSE (the default), the path to the output file is returned.
consensus_cluster
Should consensus clustering be used? Defaults to FALSE; if TRUE, consensus clustering is performed with the default parameters. Can also be a numeric vector of sigmoid shape parameters. See "Consensus Clustering" below.
verbose
Should status messages (output, progress, etc.) be displayed while running?
sep
Character that separates entries on a line in each file in edgelistfiles. Defaults to a tab ("\t").
tempfiledir
Character vector corresponding to the location where temporary files used during execution should be stored. Defaults to R's tempdir().
Very large graphs require too much RAM for processing on some machines. In a graph containing billions of nodes and edges, loading the entire structure into RAM is rarely feasible. This implementation uses disk space for storing representations of each graph. While this is slower than computing on RAM, it allows this algorithm to scale to graphs of enormous size while only using a comparatively small amount of memory. See "Memory Consumption" for details on the total disk/memory consumption of ExoLabel.
This function expects a set of edgelist files, provided as a vector of filepaths. Each line of each file is expected to be in the following format:
VERTEX1<sep>VERTEX2<sep>WEIGHT<linesep>
This line defines a single edge between vertices VERTEX1 and VERTEX2 with weight WEIGHT. VERTEX1 and VERTEX2 are strings corresponding to vertex names, and WEIGHT is a numeric value that can be interpreted as a double. The separator <sep> corresponds to the sep argument, and <linesep> is the line separator (a newline). The default arguments work for standard .tsv formatting, i.e., a file of three columns of tab-separated values.
If ignore_weights=TRUE, the file can instead be formatted as:
VERTEX1<sep>VERTEX2<linesep>
Note that the three-column VERTEX1<sep>VERTEX2<sep>WEIGHT format is still accepted when ignore_weights=TRUE, but the specified weights will be ignored.
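As a minimal sketch, a valid input file could be written from R like so (the vertex names, weights, and file paths below are made up for illustration):

## three tab-separated columns: VERTEX1, VERTEX2, WEIGHT
weighted_file <- tempfile(fileext=".tsv")
writeLines(c("geneA\tgeneB\t0.75",
             "geneB\tgeneC\t0.40",
             "geneC\tgeneA\t0.90"),
           weighted_file)

## two columns suffice when ignore_weights=TRUE
unweighted_file <- tempfile(fileext=".tsv")
writeLines(c("geneA\tgeneB",
             "geneB\tgeneC"),
           unweighted_file)

res     <- ExoLabel(weighted_file)
res_unw <- ExoLabel(unweighted_file, ignore_weights=TRUE)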
If return_table=TRUE, returns a data.frame object with two columns. The first column contains the name of each vertex, and the second column contains the cluster it was assigned to.
If return_table=FALSE, returns a character vector of length 1 containing the path to the file the clusters were written to. The file is formatted as a .tsv, with each line containing two tab-separated columns (vertex name, assigned cluster).
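When return_table=FALSE, the resulting file can be read back into R with standard tools. A minimal sketch, reusing weighted_file from the sketch above and assuming the output has no header row:

resfile  <- ExoLabel(weighted_file)               # path to the results .tsv
clusters <- read.delim(resfile, header=FALSE,
                       col.names=c("vertex", "cluster"))
head(clusters)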
One of the main issues of Label Propagation algorithms is that they can fail to converge. Consider an unweighted directed graph with four nodes connected in a loop: A->B, B->C, C->D, D->A. If A,C are in cluster 1 and B,D are in cluster 2, the algorithm could keep processing all the nodes in a loop and never converge. To solve this issue, we introduce two measures for convergence: inflation and iterations.
iterations is the simpler parameter to understand. If iterations=x, then the algorithm processes each node at most x times. Once a given node has been seen x times, it is no longer updated.
inflation gradually increases the influence of stronger weighted edges as a node is seen more often. In other words, the more often we see a node, the more bias we add towards its strongest weighted edges. For each node, we use the following weighting:

w' = w^{1 + \log_2(n-1)}

Here w is the original edge weight, w' is the new edge weight, and n is the number of times we've already processed the node. After this transformation, the edge weights are renormalized, meaning that large weights tend to get larger and small weights tend to get smaller. The logarithm prevents the exponent from growing too large, and base 2 is chosen for computational efficiency.
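To see how this transformation biases a node toward its strongest edges, the short sketch below applies the formula above to a toy set of outgoing edge weights (a standalone illustration, not a call into ExoLabel's internals):

## w' = w^(1 + log2(n-1)), renormalized, as described above
reweight <- function(w, n) {
  wprime <- w^(1 + log2(n - 1))
  wprime / sum(wprime)
}
w <- c(0.5, 0.3, 0.2)
reweight(w, n=2)   # exponent is 1, so weights are unchanged
reweight(w, n=5)   # exponent is 3, so the strongest edge dominates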
Consensus clustering can be enabled by setting consensus_cluster=TRUE. Consensus clustering runs ExoLabel on the input graph multiple times, transforming weight values according to a sigmoid function. By default, this runs nine times for sigmoids with scale 0.5 and shape c(0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.33, 1.67, 2.0), collapsing weights below 0.1 to zero. The resulting clusters form a network such that the edge weight between any two nodes connected in the initial graph is the proportion of clustering runs in which they were assigned to the same cluster. This network is used for a final label propagation run, which identifies the consensus clusters. Users can instead supply a numeric vector to consensus_cluster, which overrides the default shape parameters and number of runs.
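As a hedged sketch of how this option might be used (assuming edgefile is the path to an edgelist file, such as the one built in the Examples below; the custom shape values are purely illustrative):

## default consensus clustering (nine runs with the shapes listed above)
res_cc  <- ExoLabel(edgefile, consensus_cluster=TRUE, return_table=TRUE)

## custom shape parameters: five runs instead of nine
res_cc2 <- ExoLabel(edgefile, consensus_cluster=c(0, 0.5, 1, 1.5, 2),
                    return_table=TRUE)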
While this algorithm can scale to very large graphs, it does have some internal limitations. First, node labels must be no more than 254 characters long. If this limitation is restrictive, please feel free to contact me. Alternatively, you can increase the limit yourself by changing the definition of MAX_NODE_NAME_SIZE in src/outmem_graph.c. This limit is in place to decrease memory overhead and improve runtime, but larger values are possible.
Second, nodes are indexed using 64-bit unsigned integers, with 0 reserved for other values. This means that the maximum possible number of nodes available is 2^64-2, which is about 18.5 quintillion.
Third, this algorithm uses disk space to store large objects. As such, please ensure you have sufficient disk space for the graph you intend to process. I've tried to put safeguards in the code itself, but funky stuff can happen when the OS runs out of space. See "Memory Consumption" for details on the total disk/memory consumption of ExoLabel.
Let v be the number of unique nodes, d the average outdegree of nodes, and l the average length of node labels.

Specific calculations for memory/disk consumption are detailed below. In summary, the absolute worst-case memory consumption is roughly (24l+34)v bytes, the disk consumption during computation is (8+12d)v bytes, and the final table returned consumes (2+l+\log_{10}{v})v bytes.
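These formulas can be evaluated directly for a rough back-of-the-envelope estimate; the helper below is a sketch (the function name and example values are made up, and EstimateExoLabel provides the packaged estimator):

## rough upper bounds from the formulas above, in bytes
exo_footprint <- function(v, d, l) {
  c(ram_bytes   = (24*l + 34) * v,          # worst-case RAM
    disk_bytes  = (8 + 12*d) * v,           # scratch disk during computation
    table_bytes = (2 + l + log10(v)) * v)   # final output table
}
## e.g. 10 million vertices, average outdegree 20, 12-character labels
exo_footprint(v=1e7, d=20, l=12)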
ExoLabel builds a trie to keep track of vertex names. Each internal node of the trie consumes 24 bytes, and each leaf node consumes 16 bytes. The lowest possible RAM consumption of the trie (if every label is of length l and shares the same prefix of length l-1) is roughly 40v bytes, and the maximum RAM consumption (if no two node labels have any prefix in common) is (24l + 16)v bytes. We can generalize this to estimate the total memory consumption as roughly (24(l-p) + 16)v bytes, where p is the average length of the common prefix between any two node labels.
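For example, the best-case, worst-case, and generalized trie estimates can be compared for illustrative values of v, l, and p (the numbers below are arbitrary):

v <- 1e7; l <- 12; p <- 4                # p = average shared prefix length
best_case   <- 40 * v                    # all labels share a prefix of length l-1
worst_case  <- (24*l + 16) * v           # no two labels share a prefix
general_est <- (24*(l - p) + 16) * v     # interpolated estimate
c(best_case, worst_case, general_est)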
ExoLabel also uses a number of internal caches to speed up reads and writes from files; these caches take less than 100MB of RAM in total. In addition, it uses an internal queue for processing nodes, which consumes roughly 10v bytes, and an internal index of size 8v bytes.
As for disk space, ExoLabel transforms the graph into a CSR-compressed network, which is split across three files: a header, a neighbors list, and a weights list. The header file contains v+1 entries of 8 bytes each, and the other two files consume a total of 12 bytes per outgoing edge. The number of edges to record is vd. Thus, the total disk consumption in bytes is 8(v+1) + 12vd \approx (8+12d)v.
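To make the layout concrete, the sketch below builds the three CSR components for a toy directed graph in plain R; this mirrors the structure described above but is not the binary format ExoLabel actually writes:

## toy directed graph with 4 vertices, edges given as (from, to, weight)
from   <- c(1, 1, 2, 3, 4)
to     <- c(2, 3, 3, 4, 1)
weight <- c(0.5, 0.2, 0.9, 0.4, 0.7)

ord <- order(from)                                  # group edges by source vertex
header    <- c(0, cumsum(tabulate(from, nbins=4)))  # v+1 offsets into the edge arrays
neighbors <- to[ord]                                # one entry per outgoing edge
weights   <- weight[ord]                            # parallel weight list

## outgoing neighbors of vertex 1 sit at positions (header[1]+1):header[2]
neighbors[(header[1] + 1):header[2]]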
The final table returned is a tab-separated table containing vertex names and cluster numbers in human-readable format. Each line consumes at most l + 2 + \log_{10}{v} bytes. In the worst case, the number of clusters equals the number of vertices, so the largest cluster number has about \log_{10}{v} digits; because digit counts grow so slowly, the average number of digits is close to that of the largest number. The extra two bytes per line are for the tab and newline characters. Thus, the total size of the file is at most (2+l+\log_{10}{v})v bytes.
Aidan Lakshman <AHL27@pitt.edu>
Traag, V.A., Subelj, L. Large network community detection by fast label propagation. Sci Rep 13, 2701 (2023). https://doi.org/10.1038/s41598-023-29610-z
EstimateExoLabel
## simulate a small graph: 20 random edges among 20 random vertices,
## written as a tab-separated edgelist (vertex1, vertex2, weight)
num_verts <- 20L
num_edges <- 20L
all_verts <- sample(letters, num_verts)
all_edges <- vapply(seq_len(num_edges),
                    \(i) paste(c(sample(all_verts, 2L),
                                 as.character(round(runif(1), 3))),
                               collapse='\t'),
                    character(1L))

## write the edgelist to a temporary file
edgefile <- tempfile()
if(file.exists(edgefile)) file.remove(edgefile)
writeLines(all_edges, edgefile)

## cluster the graph and return the result as a data.frame
res <- ExoLabel(edgefile, return_table=TRUE)
print(res)