knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

rddt

The R package rddt is an attempt at providing a native distributed data.frame to R, inspired by distributed DataFrames in Spark. An R package with similar intent and scope is big.data.table. The main difference between the two packages is how closely the data structure is coupled to the technology providing parallelism: while big.data.table builds on Rserve, rddt provides a layer of abstraction with backend implementations for parallel fork clusters and snow MPI clusters.

Installation

You can install the development version of rddt from GitHub by running

source("https://install-github.me/nbenn/rddt")

Alternatively, if you have the remotes package installed and want the latest release rather than the development version, you can install it from GitHub using install_github():

# install.packages("remotes")
remotes::install_github("nbenn/rddt@*release")

Example

Distributed data.frames can be instantiated as rddt objects by calling rddt(), as_rddt(), or read_rddt(). If all data is available on the master process, it can be distributed as follows:

library(rddt)
set_cl(fork_cluster, n_nodes = 2L)

# if the individual columns are available as vectors
dat <- rddt(
  a = rnorm(n = 1e5),
  b = sample(letters, size = 1e5, replace = TRUE)
)

# if a complete data.frame type structure is available
dat <- as_rddt(nycflights13::flights, partition_by = c("origin", "dest"))
print(dat, n = 5)
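
To give an intuition for what partitioning by key columns means, here is a minimal base R sketch. It is not how rddt implements partition_by; the toy data, the round-robin assignment, and the names df, n_nodes, and partitions are assumptions made for illustration. The idea shown is only that rows sharing the same (origin, dest) combination end up in the same partition.

```r
# Toy data standing in for a larger table (hypothetical values)
df <- data.frame(
  origin = c("JFK", "LGA", "JFK", "EWR", "LGA", "EWR"),
  dest   = c("LAX", "ORD", "LAX", "MIA", "ORD", "SFO"),
  delay  = c(5, 12, -3, 20, 7, 0)
)

n_nodes <- 2L

# One integer id per distinct (origin, dest) combination
key <- interaction(df$origin, df$dest, drop = TRUE)
group_id <- as.integer(key)

# Assign every key group to a node in round-robin fashion
# (an assumption for this sketch, not rddt's actual strategy)
node <- (group_id - 1L) %% n_nodes + 1L

# One data.frame per node
partitions <- split(df, node)
```

Because the node is derived from the key alone, grouped operations on origin and dest could in principle run on each partition independently.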

In most practical settings, it is likely more sensible to have each process read its share of the data from file in parallel, rather than reading all data on the master process and distributing it afterwards.

# set up files to be read
tmp <- split(data.table::as.data.table(nycflights13::flights), by = "month")
files <- file.path(tempdir(), paste0("nyc_flights_", names(tmp), ".csv"))
invisible(Map(write.csv, tmp, files))

dat <- read_rddt(files, read.csv, partition = "month")
print(dat, n = 5)

# cleanup
unlink(files)
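
For comparison, the same idea of letting worker processes read files in parallel can be sketched with the base parallel package alone. This is a sketch under stated assumptions, not a description of what read_rddt() does internally: it uses a local socket cluster and the built-in mtcars data, and it collects the chunks back on the master, whereas a distributed data.frame would presumably keep them on the workers.

```r
library(parallel)

# Write three small CSV files to a temporary directory (toy data)
parts <- split(mtcars, mtcars$cyl)
paths <- file.path(tempdir(), paste0("mtcars_cyl_", names(parts), ".csv"))
invisible(Map(function(x, f) write.csv(x, f, row.names = FALSE), parts, paths))

# Start two local worker processes
cl <- makeCluster(2L)

# The files are spread over the workers and read there; the results are
# returned to the master only because this sketch collects them in a list
chunks <- parLapply(cl, paths, read.csv)

stopCluster(cl)
unlink(paths)

# Number of rows read from each file
vapply(chunks, nrow, integer(1))
```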


nbenn/rddt documentation built on May 7, 2019, 3:10 p.m.