get_chunk_paths: Create Hive-style partition paths

View source: R/arrow.R

get_chunk_paths R Documentation

Create Hive-style partition paths

Description

To be used with read_delim_chunked_to_dataset().

Usage

get_chunk_paths(
  dataset_base_name,
  file_nrow,
  chunk_size,
  chunk_col_name = "chunk",
  chunk_file_name = "data.parquet"
)

Arguments

dataset_base_name

Path of the directory to write the Hive partitioned Parquet files to.

file_nrow

Number of data rows in the file. As there is no reliable, cross-platform way to get the exact number of lines in a compressed file, this has to be set manually so that the number of chunks and the partition names can be calculated. On a Unix-like system, use wc to determine the row count (zcat file.gz | wc -l, or similar). Count only rows that contain data; otherwise the dataset's partitioning scheme will contain empty directories. This does not cause errors, but it is undesirable for human readability. Subtract from the row count any header row(s), or the number of lines skipped with the skip argument (again, zcat file.gz | head, or similar, can be useful for inspection).
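For example, the row count could be obtained from within R like this (a hedged sketch; it assumes a Unix-like system with zcat on the PATH and a file with exactly one header row):

```r
# Count all lines in the compressed file (assumes zcat is available)
total <- as.integer(system("zcat file.gz | wc -l", intern = TRUE))

# Subtract the header row to get the number of data rows
file_nrow <- total - 1L
```

If more than one line is skipped when reading (e.g. via a skip argument), subtract that number instead.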

chunk_size

The number of rows to include in each chunk.

chunk_col_name

Name of the column indicating partition numbers in the Hive-style partition structure.

chunk_file_name

Name of the individual Parquet files in the Hive-style partition structure.

Value

A character vector with the paths to the partitions.
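The shape of the returned paths can be sketched as follows. This is a minimal illustration, not the actual implementation in R/arrow.R; the real function may differ in details such as zero-padding or ordering of the chunk numbers:

```r
# Hypothetical sketch of how Hive-style partition paths could be assembled
get_chunk_paths_sketch <- function(dataset_base_name, file_nrow, chunk_size,
                                   chunk_col_name = "chunk",
                                   chunk_file_name = "data.parquet") {
  # One partition per chunk of up to chunk_size rows
  n_chunks <- ceiling(file_nrow / chunk_size)
  # Hive-style layout: <base>/<col>=<value>/<file>
  file.path(dataset_base_name,
            paste0(chunk_col_name, "=", seq_len(n_chunks)),
            chunk_file_name)
}

get_chunk_paths_sketch("dataset", file_nrow = 10, chunk_size = 4)
#> [1] "dataset/chunk=1/data.parquet" "dataset/chunk=2/data.parquet"
#> [3] "dataset/chunk=3/data.parquet"
```

With 10 data rows and a chunk size of 4, the last partition holds the remaining 2 rows, which is why the chunk count is taken as the ceiling of file_nrow / chunk_size.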


svraka/asmisc documentation built on June 12, 2025, 12:04 p.m.