ExoLabel | R Documentation |
Runs Fast Label Propagation using disk space for constant memory complexity.
ExoLabel(edgelistfiles, outfile=tempfile(),
mode=c("undirected", "directed"),
add_self_loops=FALSE,
ignore_weights=FALSE,
normalize_weights=TRUE,
iterations=0L,
return_table=FALSE,
consensus_cluster=FALSE,
verbose=interactive(),
sep='\t',
tempfiledir=tempdir(),
cleanup_files=TRUE)
edgelistfiles |
Character vector of files to be processed. Each entry should be a machine-interpretable path to an edgelist file. See Details for expected format. |
outfile |
File to write final clusters to. Optional, defaults to a temporary file. |
mode |
String specifying whether edges should be interpreted as undirected (default) or directed. Can be "undirected", "directed", or an unambiguous abbreviation. |
add_self_loops |
Should self loops be added to the network? If |
ignore_weights |
Should weights be ignored? If |
normalize_weights |
Should weights be normalized? If |
iterations |
Number of iterations to run fast label propagation algorithm for. Set to a value of 0 or less for infinite iterations. |
return_table |
Should result of clustering be returned as a file, or a |
consensus_cluster |
Should consensus clustering be used? If |
verbose |
Should status messages (output, progress, etc.) be displayed while running? |
sep |
Character that separates entries on a line in each file in |
tempfiledir |
Character vector corresponding to the location where temporary files used during execution should be stored. Defaults to R's |
cleanup_files |
Should intermediary files be deleted when the process completes? Note that |
Very large graphs require too much RAM for processing on some machines. In a graph containing billions of nodes and edges, loading the entire structure into RAM is rarely feasible. This implementation uses disk space for storing representations of each graph. While this is slower than computing on RAM, it allows this algorithm to scale to graphs of enormous size while only taking a small amount of RAM. Local benchmarking resulting in a plateau of approximately 100MB of RAM consumption for arbitrarily sized networks. If your graph is small enough to fit into RAM, consider using the LP_igraph()
function instead.
This function expects a set of edgelist files, provided as a vector of filepaths. Each entry in the file is expected to be in the following:
VERTEX1<sep>VERTEX2<sep>WEIGHT<linesep>
This line defines a single edge between vertices VERTEX1
and VERTEX2
with weight WEIGHT
. VERTEX1
and VERTEX2
are strings corresponding to vertex names, WEIGHT
is a numeric value that can be interpreted as a double
. The separators <sep>
and <linesep>
correspond to the arguments sep
and linesep
, respectively. The default arguments work for standard .tsv
formatting, i.e., a file of three columns of tab-separated values.
If ignore_weight=TRUE
, the file can be formatted as:
VERTEX1<sep>VERTEX2<linesep>
Note that the v1 v2 w
format is still accepted for ignore_weight=FALSE
, but the specified weights will be ignored.
Consensus clustering can be enabled by setting consensus_cluster=TRUE
. Consensus clustering runs this algorithm on the graph multiple times, transforming weight values according to a sigmoid function. By default, this runs nine times for sigmoids with scale 0.5 and shape c(0,0.2,0.4,0.6,0.8,1.0,1.33,1.67,2.0)
, collapsing weights below 0.1 to zero. The resulting clusters form a network such that the edge weight between any two nodes connected in the initial graph is the proportion of clusters they shared over clustering runs. This network is used for a final label propagation run, which identifies the consensus clusters. Users can specify a numeric vector as input to consensus_cluster
, which will override the default shape parameters and number of iterations.
If return_table=TRUE
, returns a data.frame
object with two columns. The first column contains the name of each vertex, and the second column contains the cluster it was assigned to.
If return_table=FALSE
, returns a character vector of length 1. This vector contains the path to the file where clusters were written to. The file is formatted as a .tsv
, with each line containing two tab separated columns (vertex name, assigned cluster)
While this algorithm can scale to very large graphs, it does have some internal limitations. First, nodes must be comprised of no more than 254 characters. If this limitation is restrictive, please feel free to contact me. Alternatively, you can increase the size yourself by changing the definition of MAX_NODE_NAME_SIZE
in src/outmem_graph.c
. This limitation is provided to decrease memory overhead and improve runtime, but arbitrary values are possible.
Second, nodes are indexed using 64-bit unsigned integers, with 0 reserved for other values. This means that the maximum possible number of nodes available is 2^64-2, which is about 18.5 quintillion.
Third, this algorithm uses disk space to store large objects. As such, please ensure you have sufficient disk space for the graph you intend to process. I've tried to put safeguards in the code itself, but funky stuff can happen when the OS runs out of space.
The disk space required by this algorithm is equivalent to 56V + LV + 12E
, where V
the number of unique vertices, L
the average length of each vertex label, and E
the number of edges. We can instead represent this purely in terms of vertices by setting D
to the average degree of each node. The total disk space required is thus (56+L+12D)V
. If you'd like to estimate the disk consumption of this algorithm, use the EstimateExoLabel
function.
Aidan Lakshman <AHL27@pitt.edu>
Traag, V.A., Subelj, L. Large network community detection by fast label propagation. Sci Rep 13, 2701 (2023). https://doi.org/10.1038/s41598-023-29610-z
EstimateExoLabel
num_verts <- 20L
num_edges <- 20L
all_verts <- sample(letters, num_verts)
all_edges <- vapply(seq_len(num_edges),
\(i) paste(c(sample(all_verts, 2L),
as.character(round(runif(1),3))),
collapse='\t'),
character(1L))
edgefile <- tempfile()
if(file.exists(edgefile)) file.remove(edgefile)
writeLines(all_edges, edgefile)
res <- ExoLabel(edgefile, return_table=TRUE)
print(res)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.