The R package `rddt` is an attempt at providing a native distributed `data.frame` to R, inspired by distributed `DataFrame`s in Spark. An R package with similar intent and scope is `big.data.table`. The main difference between the two R packages is how closely the data structure is coupled to the technology providing parallelism. While `big.data.table` builds on `Rserve`, `rddt` provides a layer of abstraction with backend implementations for `parallel` fork clusters and `snow` MPI clusters.
You can install the development version of `rddt` from GitHub by running

```r
source("https://install-github.me/nbenn/rddt")
```

Alternatively, if you have the `remotes` package available and are interested in the latest release, you can install from GitHub using `install_github()` as

```r
# install.packages("remotes")
remotes::install_github("nbenn/rddt@*release")
```
Distributed `data.frame`s can be instantiated as `rddt` objects by calling `rddt()`, `as_rddt()` or `read_rddt()`. If all data is available on the master process, it can be distributed as follows:
```r
library(rddt)

set_cl(fork_cluster, n_nodes = 2L)

# if the individual columns are available as vectors
dat <- rddt(
  a = rnorm(n = 1e5),
  b = sample(letters, size = 1e5, TRUE)
)

# if a complete data.frame type structure is available
dat <- as_rddt(nycflights13::flights, partition_by = c("origin", "dest"))

print(dat, n = 5)
```
In most practical settings, it will probably make more sense to have each process read its share of the data from file in parallel, instead of reading all data on the master process and subsequently distributing it.
```r
# set up files to be read
tmp <- split(data.table::as.data.table(nycflights13::flights), by = "month")
files <- file.path(tempdir(), paste0("nyc_flights_", names(tmp), ".csv"))
invisible(Map(write.csv, tmp, files))

dat <- read_rddt(files, read.csv, partition = "month")

print(dat, n = 5)

# cleanup
unlink(files)
```
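The idea of letting each worker read its own file, rather than reading everything on the master process, can be illustrated without `rddt`, using only the base `parallel` package. The following is a minimal sketch of that general pattern and not a description of how `rddt` is implemented; the `mtcars` data, the file names and the choice of a two-node PSOCK cluster are assumptions made for the example.

```r
library(parallel)

# Write one small CSV file per group (illustrative data, not from rddt)
parts <- split(mtcars, mtcars$cyl)
files <- file.path(tempdir(), paste0("mtcars_cyl_", names(parts), ".csv"))
invisible(Map(write.csv, parts, files, row.names = FALSE))

# A PSOCK cluster works on all platforms; on Unix-alikes,
# makeForkCluster(2L) would create a fork cluster instead
cl <- makeCluster(2L)

# Each worker reads the files assigned to it and returns only a row count,
# so the full data never has to pass through the master process
n_rows <- parLapply(cl, files, function(f) nrow(utils::read.csv(f)))

stopCluster(cl)
unlink(files)

# The per-file counts add up to the size of the original data
sum(unlist(n_rows)) == nrow(mtcars)
```

Only the small per-file summaries travel back to the master here, which is the main reason reading in parallel tends to scale better than reading centrally and distributing afterwards.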