View source: R/dat-to-arrow-formats.R
dat_to_datasets (R Documentation)
Some files in the SRE are too large to fit into memory. dat_to_datasets
reads such files in smaller chunks (controlled by the chunk_size argument)
and converts them into an Arrow Dataset. All additional arguments are passed
to dipr::read_dat, which is where column types can be specified.
dat_to_datasets(
  data_path,
  data_dict,
  chunk_size = 1e+06,
  path,
  partitioning,
  tz = "UTC",
  date_format = "%AD",
  time_format = "%AT",
  ...
)
data_path: A path or a vector of paths to a fixed-width .dat file (plain or compressed, e.g. .dat.gz).
data_dict: A data.frame data dictionary describing the columns of the fixed-width file; see dipr::read_dat for the expected format.
chunk_size: The number of rows to include in each chunk. The value you choose will depend on both the number of rows in the data you are trying to process and the RAM available, so check the RAM available on your system first.
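The chunk_size trade-off above can be made concrete with a simple sizing heuristic. This is an illustration only, not part of dipr; the bytes-per-row and RAM figures are assumptions you would replace with your own measurements.

```r
# Hypothetical sizing heuristic (not part of dipr): pick a chunk size so
# that each chunk consumes only a fraction of the RAM you have available.
bytes_per_row <- 200          # assumed average in-memory width of one parsed row
avail_ram    <- 8 * 1024^3    # assumed 8 GB available; measure on your system
budget       <- 0.25          # allow each chunk at most a quarter of that RAM

# Rows per chunk that fit within the memory budget
chunk_size <- floor(avail_ram * budget / bytes_per_row)
chunk_size
```

A larger chunk_size means fewer passes over the file but higher peak memory; the heuristic just caps peak memory per chunk.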
path: String path, URI, or SubTreeFileSystem referencing a directory to write to (the directory will be created if it does not exist).
partitioning: A character vector of column names to use as partition keys, written as path segments of the output Dataset.
tz: What timezone should datetime fields use? Default "UTC". Using UTC is recommended to avoid timezone pain, but remember that the data are in UTC when doing analysis. See OlsonNames() for the list of available timezones.
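Because datetimes are stored in UTC, analysis code often needs to display them in a local timezone. A minimal base-R sketch (not dipr-specific; the timezone is an arbitrary example):

```r
# Timestamps written with the default tz = "UTC" represent instants in UTC.
ts_utc <- as.POSIXct("2020-01-01 12:00:00", tz = "UTC")

# Display in a local timezone without changing the underlying instant:
format(ts_utc, tz = "America/Vancouver", usetz = TRUE)
```

Only the printed representation changes; the stored instant is the same, which is exactly why keeping storage in UTC avoids timezone pain.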
date_format: Date format for columns where a date format is not specified in data_dict. Default "%AD".
time_format: Time format for columns where a time format is not specified in data_dict. Default "%AT".
...: Arguments passed on to dipr::read_dat.
## Not run:
data_dict_path <- dipr_example("starwars-dict.txt")
dict <- read.table(data_dict_path)
dat_path <- dipr_example("starwars-fwf.dat.gz")

## Create a partitioned Dataset in the "starwars_arrow" folder
dat_to_datasets(
  data_path = dat_path,
  data_dict = dict,
  path = "starwars_arrow",
  partitioning = "species",
  chunk_size = 2
)
## End(Not run)
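Once written, the partitioned Dataset can be queried lazily with the arrow package. A sketch, assuming the example above has already been run so that the "starwars_arrow" directory exists:

```r
library(arrow)
library(dplyr)

# Open the Dataset without reading it into memory; partition columns
# written by dat_to_datasets are discovered from the directory layout.
ds <- open_dataset("starwars_arrow")

# Filters are pushed down to Arrow; collect() materializes the result
# as a regular data.frame/tibble.
humans <- ds %>%
  filter(species == "Human") %>%
  collect()
```

Working against the Dataset rather than the original .dat file means subsequent analyses never need to re-parse the fixed-width source.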