read_delim_chunked_to_dataset: Read a delimited file by chunks and write into Hive-style Parquet files

View source: R/arrow.R

read_delim_chunked_to_dataset                                R Documentation

Read a delimited file by chunks and write into Hive-style Parquet files

Description

Read a single delimited file in chunks using readr::read_delim_chunked() and write each chunk to a Parquet file under a simple Hive-style partitioned directory (i.e. dataset_base_name/chunk=XX/data.parquet), which can then be used as the source of a multi-file Apache Arrow dataset.
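
With the default chunk_col_name and chunk_file_name, and a hypothetical dataset_base_name of "my_dataset", the resulting layout looks roughly like this (the exact numbering or zero-padding of the chunk values is not specified here):

my_dataset/
  chunk=01/
    data.parquet
  chunk=02/
    data.parquet
  ...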

Usage

read_delim_chunked_to_dataset(
  file,
  dataset_base_name,
  file_nrow,
  chunk_size,
  processing_function = NULL,
  chunk_col_name = "chunk",
  chunk_file_name = "data.parquet",
  ...
)

write_single_partition_dataset(
  df,
  dataset_base_name,
  chunk_col_name = "chunk",
  chunk_file_name = "data.parquet"
)
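
As a minimal sketch of typical usage (the file name, row count, column names, and the dplyr-based processing step are hypothetical; delim and col_types are simply forwarded to readr::read_delim_chunked()):

# Hypothetical input: a gzipped CSV with one header row and 1,000,000 data
# rows, split into ten Parquet partitions under "my_dataset/".
read_delim_chunked_to_dataset(
  file                = "data.csv.gz",
  dataset_base_name   = "my_dataset",
  file_nrow           = 1000000,
  chunk_size          = 100000,
  # Optional per-chunk processing; this type conversion is purely illustrative.
  processing_function = function(chunk) dplyr::mutate(chunk, id = as.integer(id)),
  # Remaining arguments are passed on to readr::read_delim_chunked().
  delim     = ",",
  col_types = readr::cols(.default = readr::col_character())
)

# The partitioned output can then be opened as a multi-file Arrow dataset.
ds <- arrow::open_dataset("my_dataset")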

Arguments

file

Either a path to a file, a connection, or literal data (either a single string or a raw vector).

Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed.

Literal data is most useful for examples and tests. To be recognised as literal data, the input must be either wrapped with I(), be a string containing at least one new line, or be a vector containing at least one string with a new line.

Using a value of clipboard() will read from the system clipboard.

dataset_base_name

Path of the directory to write the Hive partitioned Parquet files to.

file_nrow

Number of data rows in file. As there is no reliable, cross-platform way to get the exact number of lines in a compressed file, this has to be set manually so that the number of chunks and the names of the partitions can be calculated. Use wc on a Unix-like system to determine the row count (zcat file.gz | wc -l, or similar). Only count rows that are actually data; otherwise the dataset's partitioning scheme will contain empty directories. This does not cause errors, but it is undesirable for human readability. Subtract from the line count any header row(s), as well as the number of lines skipped with the skip argument (again, zcat file.gz | head, or similar, can be useful); see the sketch after this argument list.

chunk_size

The number of rows to include in each chunk.

processing_function

A function that takes each chunk and does arbitrary data processing on it before writing the resulting data frame into its Parquet partition.

chunk_col_name

Name of the column indicating partition numbers in the Hive-style partition structure.

chunk_file_name

Name of the individual Parquet files in the Hive-style partition structure.

...

Passed on to readr::read_delim_chunked().

df

A data frame
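
A hedged sketch of obtaining file_nrow from R on a Unix-like system, shelling out to the pipeline suggested above (assumes zcat and wc are available and the hypothetical data.csv.gz has a single header row and no skipped lines):

# Count all lines of the compressed file, then subtract the single header row.
n_lines   <- as.integer(system("zcat data.csv.gz | wc -l", intern = TRUE))
file_nrow <- n_lines - 1L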

Details

The main goal of this function is to read a single, large, unpartitioned delimited file into a partitioned Arrow dataset on a RAM-limited machine; the resulting Arrow partitions therefore have no inherent meaning. Although processing_function allows flexible changes while reading, this function is intended for workflows where only minimal data processing is done and the original structure of the delimited file is kept unchanged. Thus read_delim_chunked_to_dataset creates a partitioning that preserves the original row order of the delimited file. Within-partition ordering can, however, be changed through processing_function.

Value

Invisibly returns a tibble with parsing problems caught by readr (see readr::problems()), or NULL if no parsing problems occurred.
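
A hedged sketch, reusing the hypothetical file and dataset names from the Usage sketch: capture the invisibly returned problems and inspect them.

# Capture the invisible return value and report any parsing problems.
probs <- read_delim_chunked_to_dataset(
  file = "data.csv.gz", dataset_base_name = "my_dataset",
  file_nrow = 1000000, chunk_size = 100000, delim = ","
)
if (!is.null(probs)) print(probs)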

Functions

  • write_single_partition_dataset(): Write a single data frame as one partition under dataset_base_name.
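
A hedged sketch of calling this helper directly; it assumes, based on the argument descriptions above, that the partition value is taken from the column named by chunk_col_name (here the default "chunk"):

# Hypothetical data frame carrying its partition value in the `chunk` column.
df <- data.frame(chunk = 11L, x = 1:5)

# Expected to write my_dataset/chunk=11/data.parquet under the layout
# described above (an assumption, not confirmed by this documentation).
write_single_partition_dataset(df, dataset_base_name = "my_dataset")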

See Also

vignette(topic = "dataset", package = "arrow") on how to use multi-file Apache Arrow datasets.
