get_chunk_paths: Create Hive-style partition paths

View source: R/arrow.R

get_chunk_paths R Documentation

Create Hive-style partition paths

Description

To be used with read_delim_chunked_to_dataset().

Usage

get_chunk_paths(
  dataset_base_name,
  file_nrow,
  chunk_size,
  chunk_col_name = "chunk",
  chunk_file_name = "data.parquet"
)

Arguments

dataset_base_name

Path of the directory to write the Hive partitioned Parquet files to.

file_nrow

Number of data rows in the file. As there is no reliable, cross-platform way to get the exact number of lines in a compressed file, this has to be set manually so that the number of chunks and the partition names can be calculated. On a Unix-like system, use wc to determine the row count (zcat file.gz | wc -l, or similar). Count only rows that contain data; otherwise the dataset's partitioning scheme will contain empty directories. This does not cause errors, but it is undesirable for human readability. Subtract from the row count any header row(s), or the number of lines skipped with the skip argument (again, zcat file.gz | head, or similar, can be useful for inspection).
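For example, the row count could be obtained from within R like this (a hedged sketch; it assumes a Unix-like system with zcat on the PATH and a file with exactly one header row):

```r
# Count all lines in the compressed file (assumes zcat is available)
total <- as.integer(system("zcat file.gz | wc -l", intern = TRUE))

# Subtract the header row to get the number of data rows
file_nrow <- total - 1L
```

If more than one line is skipped when reading (e.g. via a skip argument), subtract that number instead.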

chunk_size

The number of rows to include in each chunk.

chunk_col_name

Name of the column indicating partition numbers in the Hive-style partition structure.

chunk_file_name

Name of the individual Parquet files in the Hive-style partition structure.

Value

A character vector with the paths to the partitions.
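The shape of the returned paths can be sketched as follows. This is a minimal illustration, not the actual implementation in R/arrow.R; the real function may differ in details such as zero-padding or ordering of the chunk numbers:

```r
# Hypothetical sketch of how Hive-style partition paths could be assembled
get_chunk_paths_sketch <- function(dataset_base_name, file_nrow, chunk_size,
                                   chunk_col_name = "chunk",
                                   chunk_file_name = "data.parquet") {
  # One partition per chunk of up to chunk_size rows
  n_chunks <- ceiling(file_nrow / chunk_size)
  # Hive-style layout: <base>/<col>=<value>/<file>
  file.path(dataset_base_name,
            paste0(chunk_col_name, "=", seq_len(n_chunks)),
            chunk_file_name)
}

get_chunk_paths_sketch("dataset", file_nrow = 10, chunk_size = 4)
#> [1] "dataset/chunk=1/data.parquet" "dataset/chunk=2/data.parquet"
#> [3] "dataset/chunk=3/data.parquet"
```

With 10 data rows and a chunk size of 4, the last partition holds the remaining 2 rows, which is why the chunk count is taken as the ceiling of file_nrow / chunk_size.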


svraka/asmisc documentation built on June 12, 2025, 12:04 p.m.