purrrow

Lifecycle: experimental

The goal of purrrow is to provide out-of-memory data collation into Arrow datasets.

It provides a set of functions that work much like those in purrr, except that the result is an Arrow dataset on disk. Each function applies the function passed as .f to every element of .x and writes everything .f returns into a single Arrow dataset on disk.

For a primer on Arrow datasets and how to work with them in dplyr, see the vignette in Arrow (vignette('dataset', 'arrow')).

This approach has two advantages:

  1. it is shorter than manually building and writing a number of Arrow datasets and then collating them
  2. unlike purrr::map_dfr() followed by arrow::write_dataset(), it does not require all the data to be in memory at one time; the binding into one dataset happens out of memory.
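
To make the second point concrete, here is a minimal sketch of the manual route using only {arrow}: each piece is written to disk as soon as it is produced, and the pieces are opened as one dataset afterwards. This is not purrrow's implementation; the helper get_month() and the directory name manual_arrow are hypothetical and exist only for illustration.

```r
library(arrow)

# Hypothetical helper: return the rows of airquality for a single month
get_month <- function(month) {
  airquality[airquality$Month == month, ]
}

months <- unique(airquality$Month)
manual_dir <- file.path(tempdir(), "manual_arrow")

# Write each piece straight to disk, so only one piece is held in
# memory at a time; partitioning by Month puts each piece in its own
# subdirectory
for (m in months) {
  write_dataset(get_month(m), manual_dir, partitioning = "Month")
}

# Open all pieces as a single dataset; no rows are read at this point
manual_ds <- open_dataset(manual_dir)
```

With purrr::map_dfr() followed by arrow::write_dataset(), by contrast, the complete data frame would have to exist in memory before anything is written.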

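As in {purrr}, the functions come in multiple flavours, with a suffix indicating the output. The Examples section uses three of them:

  - marrow_dir(), whose result is passed to arrow::open_dataset() as the location of the dataset
  - marrow_ds(), whose result is an Arrow dataset that can be collected directly
  - marrow_files(), whose suffix suggests that it returns the files making up the dataset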

Installation

purrrow is not yet on CRAN. You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("petrbouchal/purrrow")

Getting started

The basic logic is as follows:

  1. pass a function as the .f argument to one of the functions above, together with the values to iterate over as .x

Examples

library(purrrow)
library(arrow)
library(dplyr)

# Values to iterate over: the months present in airquality
months <- unique(airquality$Month)

# Return the rows of airquality for a single month
part_of_mpg <- function(month) {
  airquality[airquality$Month == month, ]
}

# marrow_dir(): the result is passed to open_dataset()
td <- file.path(tempdir(), "arrowmp")
aq_arrow_dir <- marrow_dir(.x = months, .f = part_of_mpg,
                           .partitioning = "Month",
                           .path = td)
open_dataset(aq_arrow_dir)

# marrow_ds(): the result is used directly as a dataset
td <- file.path(tempdir(), "arrowmp2")
aq_arrow_ds <- marrow_ds(.x = months, .f = part_of_mpg,
                         .partitioning = "Month",
                         .path = td)
aq_arrow_ds

# marrow_files()
td <- file.path(tempdir(), "arrowmp3")
aq_arrow_files <- marrow_files(.x = months, .f = part_of_mpg,
                               .partitioning = "Month",
                               .path = td)
aq_arrow_files

# Check that the collated dataset contains the same data as the original
all_equal(aq_arrow_ds %>% collect(), airquality)
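
Once a dataset is on disk, it can be queried with dplyr without loading all of it. The sketch below uses only {arrow} and {dplyr}, so it runs without purrrow; the directory name arrow_query is an assumption made for illustration.

```r
library(arrow)
library(dplyr)

# Illustrative dataset: airquality written to a temporary directory
query_dir <- file.path(tempdir(), "arrow_query")
write_dataset(airquality, query_dir, partitioning = "Month")

# filter() and select() are evaluated lazily by Arrow; only collect()
# pulls the matching rows into an in-memory data frame
may_temps <- open_dataset(query_dir) %>%
  filter(Month == 5) %>%
  select(Day, Temp) %>%
  collect()
```

Because the data is partitioned by Month, a filter on that column lets Arrow skip the files belonging to other months.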

Status and limitations



petrbouchal/purrrow documentation built on March 1, 2021, 12:07 a.m.