EstimateExoLabel: Estimate ExoLabel Disk Consumption

View source: R/ExoLabel.R

EstimateExoLabel R Documentation

Description

Estimate the total disk consumption for ExoLabel.

Usage

EstimateExoLabel(num_v, avg_degree=2,
              is_undirected=TRUE,
              num_edges=num_v*avg_degree,
              node_name_length=10L)

Arguments

num_v

Approximate total number of unique nodes in the network.

avg_degree

Average degree of each node in the network.

is_undirected

Logical indicating whether edges are undirected (TRUE) or directed (FALSE). Undirected edges consume twice as much disk space internally because they need to be recorded twice.

num_edges

Approximate total number of edges in the network.

node_name_length

Approximate average length of each node name, in characters.

Details

This function provides a rough estimate of the total disk space required to run ExoLabel for a given input network. avg_degree and num_edges need not both be specified: if num_edges is omitted, it defaults to num_v*avg_degree. The function prints the estimated size of the original edgelist files, the estimated disk space and RAM that ExoLabel will consume, and the approximate ratio of disk space relative to the original file.
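The relationship between the two sizing arguments can be checked by hand. The sketch below only reproduces the default shown in the Usage signature (num_edges = num_v*avg_degree); the variable names are illustrative and nothing here calls into ExoLabel.

```r
# Default from the Usage signature: num_edges = num_v * avg_degree
num_v <- 100000
avg_degree <- 2
default_num_edges <- num_v * avg_degree

# Going the other way: the average degree implied by that same formula
# when the edge count is known instead (illustrative arithmetic only)
known_num_edges <- 50000
known_num_v <- 10000
implied_avg_degree <- known_num_edges / known_num_v

print(default_num_edges)
print(implied_avg_degree)
```

Supplying whichever quantity is better known for the network is enough; the other follows from the formula above.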

node_name_length specifies the average length of the node names. Because the names themselves must be stored on disk, they contribute to the overall size. For relatively short node names (1-16 characters) this has a negligible impact on overall disk consumption, though it may affect the worst-case RAM consumption. Expected RAM consumption is determined by the average length of the prefix that a random pair of vertex labels has in common, and should be closer to the minimum usage in most scenarios (see ExoLabel for more details).

Value

Invisibly returns a vector of length six containing, in order: the estimated RAM consumption, the estimated input edgelist file size, the estimated disk consumption using in-place sort (use_fast_sort=FALSE), the estimated disk consumption using fast sort (use_fast_sort=TRUE), the estimated final file size, and the ratio of the input file size to total ExoLabel disk usage. All values except the ratio are in bytes.
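Because the result is returned invisibly, it must be assigned to a variable to be used programmatically. The helper below is a minimal sketch for attaching readable labels to such a vector; label_estimate and the label strings are hypothetical names chosen here, and the sketch assumes only the element order described above, not any names the function itself may set.

```r
# Hypothetical helper: label a length-six estimate vector by position.
# The label strings are illustrative, not part of SynExtend.
label_estimate <- function(est) {
  stopifnot(is.numeric(est), length(est) == 6L)
  labels <- c("ram", "input_file", "disk_inplace_sort",
              "disk_fast_sort", "final_file", "ratio")
  setNames(as.numeric(est), labels)
}

# Stand-in values used only to demonstrate the helper
demo <- label_estimate(c(1024, 2048, 4096, 8192, 512, 0.25))
print(demo[c("disk_inplace_sort", "disk_fast_sort")])
```

With SynExtend loaded, the same helper could be applied as label_estimate(EstimateExoLabel(num_v=100000, avg_degree=2)).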

Note

Estimating the average node label size is challenging, and unfortunately it has a relatively large effect on the estimated edgelist file size. This function should be used for rough sizing estimates, not absolute values. Errors in the estimated node name length affect the edgelist file estimate more than the ExoLabel disk usage estimate, so users can have higher confidence in the estimated ExoLabel consumption.
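To see why node name length matters so much for the edgelist file estimate, consider a back-of-the-envelope calculation. This is not the formula EstimateExoLabel uses; it assumes, purely for illustration, that each edge occupies one text line holding two node names, a short weight field, two single-character separators, and a newline. The function rough_edgelist_bytes and its weight_chars argument are hypothetical.

```r
# Illustrative only: bytes per line = two names + weight + 2 separators + newline
rough_edgelist_bytes <- function(num_edges, node_name_length, weight_chars = 4) {
  num_edges * (2 * node_name_length + weight_chars + 3)
}

# Same network, two different guesses for the average name length
short_names <- rough_edgelist_bytes(200000, node_name_length = 10)
long_names  <- rough_edgelist_bytes(200000, node_name_length = 20)

print(short_names)
print(long_names)
print(long_names / short_names)
```

Under these assumptions the names account for most of each line, so a misjudged name length shifts the edgelist estimate almost proportionally.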

Author(s)

Aidan Lakshman <AHL27@pitt.edu>

See Also

ExoLabel

Examples

# 100,000 nodes, average degree 2
EstimateExoLabel(num_v=100000, avg_degree=2)

# 10,000 nodes, 50,000 edges
EstimateExoLabel(num_v=10000, num_edges=50000)

npcooley/SynExtend documentation built on Dec. 20, 2024, 4:03 p.m.