An R package that makes parallel processing on Amazon Web Services (AWS) easier. Born in 2016 at the Brisbane rOpenSci Unconference. This is a work in progress and is still under development.
Authors:

- Dan Pagendam
- Jonathan Carroll
- Daniel Thomas
- Zoé van Havre
- Cameron Roach
- Felix Leung
- Suren Rathnayake
snowball automatically sets up and starts a cluster of AWS workers, runs the parallel processing, and saves the output to an S3 bucket.
# Install
`devtools::install_github("ropenscilabs/snowball")`
`snowball` takes the location of the data, a user-defined function, and some basic instructions, then sets up and runs virtual machines in parallel on Amazon and saves the results in an S3 bucket as an `.rds` file.

`snowball(function, bucketName, ...)`
Save a `.snowball` file in your current working directory with the following configuration:
AWS_ACCESS_KEY_ID: \<YOURACCESSKEYID>
AWS_SECRET_ACCESS_KEY: \<YOURSECRETACCESSKEY>
AWS_DEFAULT_REGION: \<YOURDEFAULTREGION>
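To make the role of this file concrete, here is a minimal sketch of how `KEY: value` pairs like these could be read and exported as environment variables in base R. The helper `load_aws_config()` is hypothetical and is not part of snowball; the key, secret, and region values are placeholders, and the package's own `snowball_setup` may work differently.

```r
# Hypothetical helper (not part of snowball): read "KEY: value" pairs from a
# config file and export them as environment variables for this R session.
load_aws_config <- function(config_file = ".snowball") {
  # read.dcf() parses "KEY: value" lines into a character matrix,
  # one column per key
  cfg <- read.dcf(config_file)
  vars <- setNames(as.list(cfg[1, ]), colnames(cfg))
  do.call(Sys.setenv, vars)
  invisible(names(vars))
}

# Demonstration with placeholder values written to a temporary file
tmp <- tempfile()
writeLines(c(
  "AWS_ACCESS_KEY_ID: EXAMPLEKEYID",
  "AWS_SECRET_ACCESS_KEY: EXAMPLESECRET",
  "AWS_DEFAULT_REGION: ap-southeast-2"
), tmp)
loaded <- load_aws_config(tmp)
Sys.getenv("AWS_DEFAULT_REGION")
```

Keeping credentials in a separate file like this means they stay out of analysis scripts; the file itself should not be committed to version control.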
Next, run `snowball_setup` to set global variables.

`snowball_setup(config_file, echo)`
Start an AWS instance with buckets while setting up the data/feature split:

`snowpack(fn, listItem, bucketNameString, rdsInputObjectString, rdsOutputString)`

Give the data location and the user function:

`throwSnowball(...)`

Combine all results into one file:

`avalanche(...)`
For background, check out the documentation for the Snow and Snowfall packages.
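The split, apply, and combine pattern behind this workflow can be tried locally before any AWS setup. The sketch below is only an analogy: it uses base R's `parallel` package (which builds on snow) with two local worker processes, and it does not call snowball or AWS at all. The function `square_sum`, the `inputs` list, and the output directory are hypothetical stand-ins for a user function, the list items, and an S3 bucket.

```r
library(parallel)

# Hypothetical user function and input list, for illustration only
square_sum <- function(x) sum(x^2)
inputs <- list(a = 1:3, b = 4:6, c = 7:9)

# A temporary directory stands in for the S3 bucket
out_dir <- tempfile("snow_results_")
dir.create(out_dir)

cl <- makeCluster(2)  # two local worker processes
# Each task applies the function to one list item and saves an .rds file
paths <- parLapply(cl, names(inputs), function(nm, inputs, fn, out_dir) {
  path <- file.path(out_dir, paste0(nm, ".rds"))
  saveRDS(fn(inputs[[nm]]), path)
  path
}, inputs = inputs, fn = square_sum, out_dir = out_dir)
stopCluster(cl)

# Combine the per-item results into a single object
combined <- lapply(paths, readRDS)
names(combined) <- names(inputs)
```

Writing one `.rds` file per item and combining afterwards mirrors the idea of workers saving results independently, so a failed item can be rerun without repeating the others.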
We assume you have a (very) basic understanding of what an S3 bucket is (it's like Dropbox, for data). Amazon's S3 documentation has more information. Creating a bucket is very easy: you just click `create bucket`.
Setting up the 'bucket policy allowing an IAM user full access' is harder:
- In the top left of an AWS window, click on `Services`, then `IAM`, then click on the user you want to give access to (you, most likely).
- Copy the User ARN to your clipboard.
- Go to the newly created bucket and click on `Properties`.
- Click on `add policy`, which opens a window called "AWS Policy Generator".
- Select policy type: S3 Bucket Policy.
- AWS Service should be Amazon S3.
- Actions: tick `All Actions`.
- Paste your ARN into Principal (I know... logical.)
- Paste this (with YOUR bucket name) into the ARN box: `arn:aws:s3:::bucketName`
- Click `Add Statement`, then copy the contents to your clipboard.

Go back to the bucket page, click "Edit bucket policy", and paste the clipboard contents into it.
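For orientation, the policy produced by the steps above should look roughly like the JSON assembled in the sketch below. This is an illustration, not output copied from the generator: the user ARN and bucket name are hypothetical placeholders, and the generator may add extra fields such as `Id` and `Sid`.

```r
# Hypothetical placeholders: substitute your own user ARN and bucket name
user_arn    <- "arn:aws:iam::123456789012:user/your-user-name"
bucket_name <- "bucketName"

# Fill the two placeholders into a bucket policy that allows all S3 actions
policy <- sprintf('{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "%s" },
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::%s"
    }
  ]
}', user_arn, bucket_name)

cat(policy, "\n")
```

Comparing the pasted policy against this shape is a quick way to spot a missing Principal or a mistyped bucket ARN. Note that actions on objects inside the bucket are usually granted on `arn:aws:s3:::bucketName/*` as well, which the steps above do not list.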