The goal of purrrow is to provide out-of-memory data collation into Arrow datasets. It provides a set of functions with a logic similar to purrr, except that the result is an Arrow dataset on disk. Each of these functions iterates the function passed to `.f` over `.x` and builds an Arrow dataset on disk that contains all the data returned by `.f` as it iterates over `.x`.

For a primer on Arrow datasets and how to work with them in dplyr, see the dataset vignette in arrow (`vignette('dataset', 'arrow')`).
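As a quick taste of that workflow (a minimal sketch using only the arrow and dplyr APIs; the temporary path is made up for illustration):

```r
library(arrow)
library(dplyr)

# Write a small partitioned dataset to a temporary directory,
# then query it lazily with dplyr verbs and collect the result.
td <- file.path(tempdir(), "aq_demo")
write_dataset(airquality, td, partitioning = "Month")

open_dataset(td) %>%
  filter(Month == 5) %>%
  select(Ozone, Temp, Month) %>%
  collect()
```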
The key advantage is that, unlike with `purrr::map_dfr()` followed by `arrow::write_dataset()`, you do not need to have all the data in memory at one time; the binding into one dataset happens out of memory.

As in {purrr}, the functions come in multiple flavours, with a suffix indicating the output:
- `marrow_dir()`: returns the directory in which the resulting dataset is stored (which is also its `.path` parameter)
- `marrow_ds()`: returns an arrow `Dataset` object
- `marrow_files()`: returns paths to all the files in the Arrow dataset. This can be useful for tracking the files in make-like workflows, e.g. with {targets}.

This is not yet on CRAN. You can install the development version from GitHub with:
```r
# install.packages("remotes")
remotes::install_github("petrbouchal/purrrow")
```
The basic logic is as follows:
```r
library(purrrow)

months <- unique(airquality$Month)

# return the subset of airquality for a given month
part_of_mpg <- function(month) {
  airquality[airquality$Month == month, ]
}

td <- file.path(tempdir(), "arrowmp")
aq_arrow_dir <- marrow_dir(.x = months, .f = part_of_mpg,
                           .partitioning = "Month",
                           .path = td)

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
open_dataset(aq_arrow_dir)
#> FileSystemDataset with 5 Parquet files
#> Ozone: int32
#> Solar.R: int32
#> Wind: double
#> Temp: int32
#> Day: int32
#> Month: int32
#> 
#> See $metadata for additional Schema metadata
```
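Since `marrow_dir()` returns the dataset path, its result can feed straight into an arrow/dplyr query; a sketch, assuming the objects created above:

```r
library(arrow)
library(dplyr)

# Filtering on the partitioning column ("Month") lets arrow skip
# whole partition directories instead of reading every file.
open_dataset(aq_arrow_dir) %>%
  filter(Month == 5) %>%
  summarise(mean_temp = mean(Temp, na.rm = TRUE)) %>%
  collect()
```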
```r
td <- file.path(tempdir(), "arrowmp2")
aq_arrow_ds <- marrow_ds(.x = months, .f = part_of_mpg,
                         .partitioning = "Month",
                         .path = td)
aq_arrow_ds
#> FileSystemDataset with 5 Parquet files
#> Ozone: int32
#> Solar.R: int32
#> Wind: double
#> Temp: int32
#> Day: int32
#> Month: int32
#> 
#> See $metadata for additional Schema metadata
```
```r
td <- file.path(tempdir(), "arrowmp3")
aq_arrow_files <- marrow_files(.x = months, .f = part_of_mpg,
                               .partitioning = "Month",
                               .path = td)
aq_arrow_files
#> [1] "/var/folders/c8/pj33jytj233g8vr0tw4b2h7m0000gn/T//Rtmpc5tgdL/arrowmp3/Month=5/part-2.parquet"
#> [2] "/var/folders/c8/pj33jytj233g8vr0tw4b2h7m0000gn/T//Rtmpc5tgdL/arrowmp3/Month=6/part-3.parquet"
#> [3] "/var/folders/c8/pj33jytj233g8vr0tw4b2h7m0000gn/T//Rtmpc5tgdL/arrowmp3/Month=7/part-0.parquet"
#> [4] "/var/folders/c8/pj33jytj233g8vr0tw4b2h7m0000gn/T//Rtmpc5tgdL/arrowmp3/Month=8/part-1.parquet"
#> [5] "/var/folders/c8/pj33jytj233g8vr0tw4b2h7m0000gn/T//Rtmpc5tgdL/arrowmp3/Month=9/part-4.parquet"
```
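One way the file paths can be used in a {targets} pipeline is to declare them as file targets; a sketch of a `_targets.R` entry (the path and target name are made up, and this assumes familiarity with {targets}):

```r
library(targets)

# Declaring the parquet files with format = "file" makes {targets}
# watch them, so downstream targets rebuild whenever any partition
# file changes.
list(
  tar_target(
    aq_files,
    marrow_files(.x = unique(airquality$Month),
                 .f = function(m) airquality[airquality$Month == m, ],
                 .partitioning = "Month",
                 .path = "data/aq_arrow"),
    format = "file"
  )
)
```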
```r
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
all_equal(aq_arrow_ds %>% collect(), airquality)
#> [1] TRUE
```
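For comparison, the in-memory equivalent of the above (a sketch, not part of purrrow) binds everything in RAM before writing, which is exactly the step purrrow avoids:

```r
library(arrow)
library(purrr)

# In-memory route: every per-month data frame returned by part_of_mpg
# is bound into one data frame in RAM before anything reaches disk.
td <- file.path(tempdir(), "arrowmp_inmem")
aq_all <- map_dfr(months, part_of_mpg)
write_dataset(aq_all, td, partitioning = "Month")
```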
Compared to `arrow::write_dataset()`: [TO DO]

Unlike `write_dataset()`, the `marrow_*()` functions do not infer partitioning from grouping in the dataset returned by `.f`. This is intentional, so as not to introduce confusion in the partitioning.

The functions do not support `purrr::*_dfr()`'s `.id` parameter yet [TO DO].
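Until `.id` is supported, one workaround (a sketch; the function and column names `part_with_id` and `source_month` are made up) is to add the identifier inside `.f` itself:

```r
# Add the iteration value as a column inside .f, so each chunk
# carries its own identifier even without an .id parameter.
part_with_id <- function(month) {
  out <- airquality[airquality$Month == month, ]
  out$source_month <- month
  out
}

td <- file.path(tempdir(), "arrowmp_id")
aq_ids <- marrow_dir(.x = months, .f = part_with_id,
                     .partitioning = "Month", .path = td)
```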