ExoLabel: ExoLabel: Out of Memory Fast Label Propagation

View source: R/ExoLabel.R

ExoLabelR Documentation

ExoLabel: Out of Memory Fast Label Propagation

Description

Runs Fast Label Propagation using disk space for constant memory complexity.

Usage

ExoLabel(edgelistfiles, outfile=tempfile(),
              mode=c("undirected", "directed"),
              add_self_loops=FALSE,
              ignore_weights=FALSE,
              normalize_weights=TRUE,
              iterations=0L,
              return_table=FALSE,
              consensus_cluster=FALSE,
              verbose=interactive(),
              sep='\t',
              tempfiledir=tempdir(),
              cleanup_files=TRUE)

Arguments

edgelistfiles

Character vector of files to be processed. Each entry should be a machine-interpretable path to an edgelist file. See Details for expected format.

outfile

File to write final clusters to. Optional, defaults to a temporary file.

mode

String specifying whether edges should be interpreted as undirected (default) or directed. Can be "undirected", "directed", or an unambiguous abbreviation.

add_self_loops

Should self loops be added to the network? If TRUE, adds self loops of weight 1.0 to all vertices. If set to numeric value w, adds self loops of weight w to all nodes. Note that this takes place prior to edge weight normalization (if requested).

ignore_weights

Should weights be ignored? If TRUE, all edges will be treated as an edge of weight 1. Must be set to TRUE if any of edgelistfiles are two-column tables (start->end only, lacking a weights column).

normalize_weights

Should weights be normalized? If TRUE, each vertex's edge weights are normalized such that the outgoing edges have weight 1. This normalization is done after adding self loops.

iterations

Number of iterations to run fast label propagation algorithm for. Set to a value of 0 or less for infinite iterations.

return_table

Should result of clustering be returned as a file, or a data.frame object? If FALSE, returns a character vector corresponding to the path of outfile. If TRUE, parses outfile using read.table and returns the result. Not recommended for very large graphs.

consensus_cluster

Should consensus clustering be used? If TRUE, runs the clustering algorithm nine times and forms a consensus clustering based on the agreement of each run. Can be set to a double to control the number of iterations, see Details for more information.

verbose

Should status messages (output, progress, etc.) be displayed while running?

sep

Character that separates entries on a line in each file in edgelistfiles. Defaults to tab, as would be expected in a .tsv formatted file. Set to ',' for a .csv file.

tempfiledir

Character vector corresponding to the location where temporary files used during execution should be stored. Defaults to R's tempdir.

cleanup_files

Should intermediary files be deleted when the process completes? Note that outfile will only be deleted if return_table=TRUE AND cleanup_files=TRUE.

Details

Very large graphs require too much RAM for processing on some machines. In a graph containing billions of nodes and edges, loading the entire structure into RAM is rarely feasible. This implementation uses disk space for storing representations of each graph. While this is slower than computing on RAM, it allows this algorithm to scale to graphs of enormous size while only taking a small amount of RAM. Local benchmarking resulting in a plateau of approximately 100MB of RAM consumption for arbitrarily sized networks. If your graph is small enough to fit into RAM, consider using the LP_igraph() function instead.

This function expects a set of edgelist files, provided as a vector of filepaths. Each entry in the file is expected to be in the following:

VERTEX1<sep>VERTEX2<sep>WEIGHT<linesep>

This line defines a single edge between vertices VERTEX1 and VERTEX2 with weight WEIGHT. VERTEX1 and VERTEX2 are strings corresponding to vertex names, WEIGHT is a numeric value that can be interpreted as a double. The separators <sep> and <linesep> correspond to the arguments sep and linesep, respectively. The default arguments work for standard .tsv formatting, i.e., a file of three columns of tab-separated values.

If ignore_weight=TRUE, the file can be formatted as:

VERTEX1<sep>VERTEX2<linesep>

Note that the v1 v2 w format is still accepted for ignore_weight=FALSE, but the specified weights will be ignored.

Consensus clustering can be enabled by setting consensus_cluster=TRUE. Consensus clustering runs this algorithm on the graph multiple times, transforming weight values according to a sigmoid function. By default, this runs nine times for sigmoids with scale 0.5 and shape c(0,0.2,0.4,0.6,0.8,1.0,1.33,1.67,2.0), collapsing weights below 0.1 to zero. The resulting clusters form a network such that the edge weight between any two nodes connected in the initial graph is the proportion of clusters they shared over clustering runs. This network is used for a final label propagation run, which identifies the consensus clusters. Users can specify a numeric vector as input to consensus_cluster, which will override the default shape parameters and number of iterations.

Value

If return_table=TRUE, returns a data.frame object with two columns. The first column contains the name of each vertex, and the second column contains the cluster it was assigned to.

If return_table=FALSE, returns a character vector of length 1. This vector contains the path to the file where clusters were written to. The file is formatted as a .tsv, with each line containing two tab separated columns (vertex name, assigned cluster)

Warning

While this algorithm can scale to very large graphs, it does have some internal limitations. First, nodes must be comprised of no more than 254 characters. If this limitation is restrictive, please feel free to contact me. Alternatively, you can increase the size yourself by changing the definition of MAX_NODE_NAME_SIZE in src/outmem_graph.c. This limitation is provided to decrease memory overhead and improve runtime, but arbitrary values are possible.

Second, nodes are indexed using 64-bit unsigned integers, with 0 reserved for other values. This means that the maximum possible number of nodes available is 2^64-2, which is about 18.5 quintillion.

Third, this algorithm uses disk space to store large objects. As such, please ensure you have sufficient disk space for the graph you intend to process. I've tried to put safeguards in the code itself, but funky stuff can happen when the OS runs out of space.

The disk space required by this algorithm is equivalent to 56V + LV + 12E, where V the number of unique vertices, L the average length of each vertex label, and E the number of edges. We can instead represent this purely in terms of vertices by setting D to the average degree of each node. The total disk space required is thus (56+L+12D)V. If you'd like to estimate the disk consumption of this algorithm, use the EstimateExoLabel function.

Author(s)

Aidan Lakshman <AHL27@pitt.edu>

References

Traag, V.A., Subelj, L. Large network community detection by fast label propagation. Sci Rep 13, 2701 (2023). https://doi.org/10.1038/s41598-023-29610-z

See Also

EstimateExoLabel

Examples

num_verts <- 20L
num_edges <- 20L
all_verts <- sample(letters, num_verts)
all_edges <- vapply(seq_len(num_edges),
      \(i) paste(c(sample(all_verts, 2L),
                   as.character(round(runif(1),3))),
                 collapse='\t'),
                    character(1L))
edgefile <- tempfile()
if(file.exists(edgefile)) file.remove(edgefile)
writeLines(all_edges, edgefile)
res <- ExoLabel(edgefile, return_table=TRUE)
print(res)

npcooley/SynExtend documentation built on May 17, 2024, 1:50 p.m.