
# cloudburst

Cloudburst brings cloud compute resources to your R project, allowing data scientists and engineers to build complex data processing pipelines with whatever resources are required.

There's no need to pre-provision clusters of machines or configure auto-scaling to keep costs down for variable workloads; compute resources are spun up exactly when needed, and stopped once their work is done.

You split your work into stages that look just like functions and flow data between them; Cloudburst automatically wires the stages up into a DAG (directed acyclic graph) so that independent stages run in parallel where possible.

Currently, Amazon Web Services (AWS) is the only supported provider, leveraging Fargate/ECS for compute and S3 for transient storage.

## Demo

```r
library(magrittr)

# you need to initialise a provider; we'll use AWS for this example
cloudburst::init_aws(
  # s3 is the default storage backend for AWS; we need this to marshal results between stages
  storage_bucket = "my-s3-bucket",
  # let's indicate which ECS cluster we're running in and which subnets to use
  compute_cluster = "data",
  compute_subnets = c("subnet-abcdef", "subnet-ghijkl"),
  compute_assign_public_ip = TRUE,
  # you need a Docker image; see the "Managing Dependencies" section below
  compute_image = "12345.dkr.ecr.us-east-1.amazonaws.com/my-cloudburst-image:latest",
  # the execution role can just be the default ECS task execution role for your account
  compute_execution_role = "arn:aws:iam::12345:role/ecsTaskExecutionRole",
  # the task role gives your R code access to any AWS services it might need, like S3
  compute_task_role = "arn:aws:iam::12345:role/my-cloudburst-role"
)

# variables are transparently made available to stages as needed
num_observations <- 1000

# let's pretend we've got two stages that build large datasets somehow
get_data_x <- cloudburst::stage(cpu = 1024, memory = 2048, function() {
  data.frame(x = runif(num_observations))
})

get_data_y <- cloudburst::stage(cpu = 1024, memory = 2048, function() {
  data.frame(y = rnorm(num_observations))
})

# and a third stage that does some "intensive" computation over the two
build_model <- cloudburst::stage(cpu = 2048, memory = 4096, function(data_x, data_y) {
  data <- cbind(data_x, data_y)
  lm(y ~ x, data)
})

# stages are called just like regular functions
# we just have to call 'execute' at the end to bring the result back to R
build_model(get_data_x(), get_data_y()) %>%
  cloudburst::execute("super-complex-pipeline") -> result
```

This would spin up two tasks to run `get_data_x` and `get_data_y` in parallel, each with 1 vCPU and 2 GB of RAM, and then a third task with 2 vCPUs and 4 GB of RAM on completion of both those stages to build the linear model.

On completion, if we were to inspect `result`, we'd see a standard linear model, just as we'd expect from running `lm` in a normal R process.
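
Since `result` is an ordinary `lm` object, anything you'd normally do with a locally fitted model works as usual; a quick sketch (the x values here are arbitrary):

```r
# inspect the fit, exactly as for a model trained in-process
summary(result)

# use the model for prediction, e.g. at a few arbitrary x values
predict(result, newdata = data.frame(x = c(0.25, 0.5, 0.75)))
```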

## Managing Dependencies

Most projects aren't just base R; they depend on packages installed from CRAN or elsewhere, so those same packages need to be available wherever your R code runs. To do this, we can use Packrat to track our dependencies, and Docker to bundle up an R environment with the same packages you're using locally.
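
If your project isn't already under Packrat, getting it tracking your dependencies takes two calls (both are standard Packrat functions):

```r
# start tracking this project's dependencies with Packrat
packrat::init()

# after installing or updating packages, snapshot the exact versions in use
packrat::snapshot()
```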

We can tie this all together and automate it using the `containr` package.

You can use `containr::docker_deploy` to automatically create a Docker image based on your installed version of R with all your Packrat dependencies, and push it to a Docker repository of your choosing.
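
For the example above, where we used AWS ECR to store the container, deployment is a single call from inside the project; the image URI matches the one passed to `init_aws` earlier:

```r
# build an image matching the local R version and Packrat library,
# then push it to the ECR repository referenced in the demo
containr::docker_deploy("12345.dkr.ecr.us-east-1.amazonaws.com/my-cloudburst-image:latest")
```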

## Installation

This package and some of its dependencies are not yet available on CRAN and so must be installed directly from GitHub.

```r
# install remotes from CRAN first if you don't already have it:
# install.packages("remotes")
lapply(
  c("aws.ecs", "containr", "cloudburst"),
  function(p) remotes::install_github(paste0("hypothesci/", p))
)
```

