read_delim_chunked_to_dataset: Read a delimited file by chunks and write into Hive-style Parquet files

View source: R/arrow.R

read_delim_chunked_to_dataset                                R Documentation

Read a delimited file by chunks and write into Hive-style Parquet files

Description

Read a single delimited file in chunks using readr::read_delim_chunked() and write each chunk to a Parquet file under a simple Hive-style partitioned directory (i.e. dataset_base_name/chunk=XX/data.parquet), which can then be used as the source of a multi-file Apache Arrow dataset.
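
With the default chunk_col_name and chunk_file_name, and a hypothetical dataset_base_name of "my_dataset", the resulting layout looks roughly like this (the exact numbering or zero-padding of the chunk values is not specified here):

my_dataset/
  chunk=01/
    data.parquet
  chunk=02/
    data.parquet
  ...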

Usage

read_delim_chunked_to_dataset(
  file,
  dataset_base_name,
  file_nrow,
  chunk_size,
  processing_function = NULL,
  chunk_col_name = "chunk",
  chunk_file_name = "data.parquet",
  ...
)

write_single_partition_dataset(
  df,
  dataset_base_name,
  chunk_col_name = "chunk",
  chunk_file_name = "data.parquet"
)
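
As a minimal sketch of typical usage (the file name, row count, column names, and the dplyr-based processing step are hypothetical; delim and col_types are simply forwarded to readr::read_delim_chunked()):

# Hypothetical input: a gzipped CSV with one header row and 1,000,000 data
# rows, split into ten Parquet partitions under "my_dataset/".
read_delim_chunked_to_dataset(
  file                = "data.csv.gz",
  dataset_base_name   = "my_dataset",
  file_nrow           = 1000000,
  chunk_size          = 100000,
  # Optional per-chunk processing; this type conversion is purely illustrative.
  processing_function = function(chunk) dplyr::mutate(chunk, id = as.integer(id)),
  # Remaining arguments are passed on to readr::read_delim_chunked().
  delim     = ",",
  col_types = readr::cols(.default = readr::col_character())
)

# The partitioned output can then be opened as a multi-file Arrow dataset.
ds <- arrow::open_dataset("my_dataset")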

Arguments

file

Either a path to a file, a connection, or literal data (either a single string or a raw vector).

Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed.

Literal data is most useful for examples and tests. To be recognised as literal data, the input must be either wrapped with I(), be a string containing at least one new line, or be a vector containing at least one string with a new line.

Using a value of clipboard() will read from the system clipboard.

dataset_base_name

Path of the directory to write the Hive partitioned Parquet files to.

file_nrow

Number of data rows in file. As there is no reliable, cross-platform way to get the exact number of lines in a compressed file, this has to be set manually so that the number of chunks and the names of the partitions can be calculated. Use wc on a Unix-like system to determine the row count (zcat file.gz | wc -l, or similar). Only count rows that are actually data; otherwise the dataset's partitioning scheme will contain empty directories. This does not cause errors, but it is undesirable for human readability. Subtract from the line count any header row(s), as well as the number of lines skipped with the skip argument (again, zcat file.gz | head, or similar, can be useful); see the sketch after this argument list.

chunk_size

The number of rows to include in each chunk.

processing_function

A function that takes each chunk and does arbitrary data processing on it before writing the resulting data frame into its Parquet partition.

chunk_col_name

Name of the column indicating partition numbers in the Hive-style partition structure.

chunk_file_name

Name of the individual Parquet files in the Hive-style partition structure.

...

Passed on to readr::read_delim_chunked().

df

A data frame
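
A hedged sketch of obtaining file_nrow from R on a Unix-like system, shelling out to the pipeline suggested above (assumes zcat and wc are available and the hypothetical data.csv.gz has a single header row and no skipped lines):

# Count all lines of the compressed file, then subtract the single header row.
n_lines   <- as.integer(system("zcat data.csv.gz | wc -l", intern = TRUE))
file_nrow <- n_lines - 1L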

Details

The main goal of this function is to read a single, large, unpartitioned delimited file into a partitioned Arrow dataset on a RAM-limited machine; the resulting Arrow partitions therefore have no inherent meaning. Although processing_function allows flexible changes while reading, this function is intended for workflows where only minimal data processing is done and the original structure of the delimited file is kept unchanged. Thus read_delim_chunked_to_dataset creates a partitioning that preserves the original row order of the delimited file. Within-partition ordering can, however, be changed through processing_function.

Value

Invisibly returns a tibble with parsing problems caught by readr (see readr::problems()), or NULL if no parsing problems occurred.
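
A hedged sketch, reusing the hypothetical file and dataset names from the Usage sketch: capture the invisibly returned problems and inspect them.

# Capture the invisible return value and report any parsing problems.
probs <- read_delim_chunked_to_dataset(
  file = "data.csv.gz", dataset_base_name = "my_dataset",
  file_nrow = 1000000, chunk_size = 100000, delim = ","
)
if (!is.null(probs)) print(probs)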

Functions

  • write_single_partition_dataset(): Write a single data frame as one partition under dataset_base_name.
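
A hedged sketch of calling this helper directly; it assumes, based on the argument descriptions above, that the partition value is taken from the column named by chunk_col_name (here the default "chunk"):

# Hypothetical data frame carrying its partition value in the `chunk` column.
df <- data.frame(chunk = 11L, x = 1:5)

# Expected to write my_dataset/chunk=11/data.parquet under the layout
# described above (an assumption, not confirmed by this documentation).
write_single_partition_dataset(df, dataset_base_name = "my_dataset")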

See Also

vignette(topic = "dataset", package = "arrow") on how to use multi-file Apache Arrow datasets.
