read_delim_chunked_to_dataset | R Documentation
Read a single delimited file in chunks using readr::read_delim_chunked() and save the chunks as Parquet files under a simple Hive-style partitioned directory (i.e. dataset_base_name/chunk=XX/data.parquet), to be used as the source of a multi-file Apache Arrow dataset.
read_delim_chunked_to_dataset(
file,
dataset_base_name,
file_nrow,
chunk_size,
processing_function = NULL,
chunk_col_name = "chunk",
chunk_file_name = "data.parquet",
...
)
write_single_partition_dataset(
df,
dataset_base_name,
chunk_col_name = "chunk",
chunk_file_name = "data.parquet"
)
file: Either a path to a file, a connection, or literal data (either a single string or a raw vector). Literal data is most useful for examples and tests. To be recognised as literal data, the input must be either wrapped with I(), be a string containing at least one new line, or be a vector containing at least one string with a new line.

dataset_base_name: Path of the directory to write the Hive-partitioned Parquet files to.

file_nrow: Number of data rows in file.

chunk_size: The number of rows to include in each chunk.

processing_function: A function that takes each chunk and performs arbitrary data processing on it before the resulting data frame is written to its Parquet partition.

chunk_col_name: Name of the column indicating partition numbers in the Hive-style partition structure.

chunk_file_name: Name of the individual Parquet files in the Hive-style partition structure.

...: Passed to readr::read_delim_chunked().

df: A data frame to write as a single-partition dataset.
The main goal of this function is to read a single, large, unpartitioned delimited file into a partitioned Arrow dataset on a RAM-limited machine; the resulting Arrow partitions therefore have no inherent meaning. Although processing_function allows flexible changes while reading, this function is intended for workflows where only minimal data processing is done and the original structure of the delimited file is kept unchanged. Thus read_delim_chunked_to_dataset creates a partitioning that preserves the original row order of the delimited file. Within-partition ordering can, however, be changed through processing_function.
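A hedged sketch of what a processing_function might look like: it receives one chunk as a data frame and returns the (possibly modified) data frame to be written to that chunk's partition. The function name and cleaning steps below are invented for illustration.

```r
# Illustrative processing_function: receives one chunk, returns the data
# frame that will be written to that chunk's Parquet partition.
clean_chunk <- function(chunk) {
  names(chunk) <- tolower(names(chunk))     # normalise column names
  chunk[order(chunk[[1]]), , drop = FALSE]  # reorder rows *within* the chunk
}

# Columns are renamed to lower case and rows reordered by the first column
clean_chunk(data.frame(B = 3:1, A = c("x", "y", "z")))
```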
Invisibly returns a tibble with parsing problems caught by readr (see readr::problems()), or NULL if no parsing problems occurred.
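The returned tibble follows readr's usual parsing-problems convention; as a standalone readr illustration (independent of these functions):

```r
library(readr)

# "b" cannot be parsed with col_types = "i" (integer), so readr records
# a parsing problem instead of failing
res <- read_csv(I("x\n1\nb\n"), col_types = "i")
probs <- problems(res)  # tibble with row, col, expected, actual columns
```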
write_single_partition_dataset(): write df under dataset_base_name as a dataset consisting of a single partition.
See vignette(topic = "dataset", package = "arrow") for how to use multi-file Apache Arrow datasets.