EstimateExoLabel: Estimate ExoLabel Disk Consumption
In npcooley/SynExtend: Tools for Comparative Genomics

EstimateExoLabel

R Documentation

Estimate ExoLabel Disk Consumption

Description

Estimate the total disk consumption for ExoLabel.

Usage

EstimateExoLabel(num_v, avg_degree=2,
              is_undirected=TRUE,
              num_edges=num_v*avg_degree,
              node_name_length=10L)

Arguments

`num_v`	Integer; approximate number of total unique nodes in the network.
`avg_degree`	Numeric; average degree of nodes in the network (i.e., the average number of neighbors for each node).
`is_undirected`	Logical; indicates whether edges are undirected (`TRUE`) or directed (`FALSE`). Undirected edges consume twice as much disk space internally because they need to be recorded twice.
`num_edges`	Integer; approximate total number of edges in the network.
`node_name_length`	Integer; approximate average length of each node name, in characters.

Details

This function provides a rough estimate of the total disk space required to run ExoLabel for a given input network. Only one of avg_degree and num_edges needs to be provided; by default, num_edges is computed as num_v*avg_degree. The function prints the estimated size of the original edgelist files, the estimated disk space and RAM to be consumed by ExoLabel, and the approximate ratio of disk space relative to the original file.
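The relationship between avg_degree and num_edges can be illustrated with a short sketch. It relies only on the default shown in the Usage section (num_edges=num_v*avg_degree) and assumes SynExtend is installed; the calls are skipped otherwise. The two calls describe the same network, so they should report the same estimates.

```r
# Default from the Usage section: num_edges = num_v * avg_degree
num_v <- 100000
avg_degree <- 2
num_edges <- num_v * avg_degree

if (requireNamespace("SynExtend", quietly = TRUE)) {
  # Specify the network by average degree...
  SynExtend::EstimateExoLabel(num_v = num_v, avg_degree = avg_degree)
  # ...or by total edge count; only one of the two is needed
  SynExtend::EstimateExoLabel(num_v = num_v, num_edges = num_edges)
}
```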

node_name_length specifies the average length of the node names. Since the names themselves must be stored on disk, this contributes to the overall size. For relatively short node names (1-16 characters) this has a negligible impact on overall disk consumption, though it may impact the worst-case RAM consumption. Expected RAM consumption is determined by the average length of the prefix that a random pair of vertex labels has in common, and should be closer to the minimum usage in most scenarios (see ExoLabel for more details).

Value

Invisibly returns a vector of length six, containing the estimated RAM consumption, estimated input edgelist file size, estimated disk consumption using in-place sort (use_fast_sort=FALSE), estimated disk consumption using fast sort (use_fast_sort=TRUE), estimated final file size, and ratio of the input file size to total ExoLabel disk usage. All sizes are in bytes; the ratio is unitless.
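Because the value is returned invisibly, it has to be assigned to be inspected. The following is a minimal sketch, assuming SynExtend is installed and that the elements appear in the order listed above; the positional indexing is an assumption, not something this page confirms.

```r
if (requireNamespace("SynExtend", quietly = TRUE)) {
  # Assign the invisible return value so it can be examined
  est <- SynExtend::EstimateExoLabel(num_v = 100000, avg_degree = 2)

  # Documented as a vector of length six
  length(est)

  # Assumed positions, following the order in the Value section
  ram_bytes <- est[1]
  input_bytes <- est[2]

  # Convert bytes to mebibytes for easier reading
  input_bytes / 1024^2
}
```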

Note

Estimating the average node label size is challenging, and unfortunately it does have a relatively large effect on the estimated edgelist file size. This function should be used for rough estimations of sizing, not absolute values. Errors in the estimated node name size will have a larger impact on the edgelist file estimate than on the ExoLabel disk usage estimate, so users can have higher confidence in the estimated ExoLabel consumption.
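One way to gauge this sensitivity is to run the estimate with two different values of node_name_length and compare the results. This is a rough sketch, assuming SynExtend is installed; the name lengths 10L and 20L are arbitrary illustration values.

```r
if (requireNamespace("SynExtend", quietly = TRUE)) {
  # Same network, two assumed average node name lengths
  short_names <- SynExtend::EstimateExoLabel(num_v = 100000, avg_degree = 2,
                                             node_name_length = 10L)
  long_names <- SynExtend::EstimateExoLabel(num_v = 100000, avg_degree = 2,
                                            node_name_length = 20L)

  # Relative change in each returned element when name length doubles
  (long_names - short_names) / short_names
}
```

Elements that change little between the two runs are the ones least affected by a poor guess at the node name length.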

Author(s)

Aidan Lakshman <AHL27@pitt.edu>

Examples

# 100,000 nodes, average degree 2
EstimateExoLabel(num_v=100000, avg_degree=2)

# 10,000 nodes, 50,000 edges
EstimateExoLabel(num_v=10000, num_edges=50000)

npcooley/SynExtend documentation built on June 8, 2025, 5:24 a.m.