The AWSParallel package provides functionality to perform parallel
evaluation using AWS infrastructure, most importantly EC2. It also
internally uses StarCluster
to deploy jobs on SGE. It extends
BatchJobsParam
class in BiocParallel, and works with the same range of R
and Bioconductor objects as BatchJobsParam
.
The goal of the AWSParallel package is allow the user to create a cluster
of EC2 machines on AWS, and run bplapply
from one "master" node (EC2
instance) to submit jobs to a set of "worker" nodes (multiple EC2
instances). It is important to note that, both master and worker nodes are
AWS EC2 machines. A side-effect of the way we configure the required
software to enable batch job submission is, the "master" and "workers"
which act as the cluster, need to be spawned (started) from a Bioconductor
AMI. The user will have to start an instance, manually, and use this
machine as cluster starter (primary machine where AWSParallel is being
run).
This package requires that the user have an Amazon AWS account that costs money and requires a credit card to access. The AWS credentials provided by the user also need access, to other AWS services as well, namely,
1. IAM 2. VPC 3. EC2
We leave the responsibility to the user to figure out AWS, although many helpful tutorials are pointed out in the References section.
The quick start guide assumes you have your AWS access key, and secret
key in ~/.aws/credentials
. Please refer to the detailed section of
the vignette if these settings are not present. We are expecting at
this point, that you have launched the AMI (ami-18c0f562) provided by Bioconductor.
You have to use the AMI created by Bioconductor, which includes starcluster and Bioc-devel to use this package. This will be your HOST instance.
Load the AWSParallel library and create an AWSBatchJobsParam
object. This step is needed to setup your cluster on AWS.
The AMI required for correct configuration (Bioc-devel with
starcluster) is "ami-18c0f562". The AWS credntials are needed on the
launched instance as well. They need to be in the default location, or
the path needs to be specified. Specify the instance type of your AWS
Cluster, the same instance type will apply to your master and
workers. Specify the subnet you want to use from your AWS account,
note that the master and workers need to be on the same
subnet. Specify the SSH key pair, if you don't have a key you use,
just create a new one for AWSParallel, and use that throughout. The
quickstart will launch a small instance for demonstration t2.micro
as larger instances cost more money.
## Load the library library(AWSParallel) ## Construct AWSBatchJobsParam class aws <- AWSBatchJobsParam(workers = 2, awsCredentialsPath = "~/.aws/credentials", awsInstanceType = "t2.micro", awsSubnet = "subnet-d66a05ec", awsAmiId = "ami-18c0f562", awsSshKeyPair = "mykey", awsProfile="default") ## Print object to show structure aws
Start, use, and stop a cluster.
## This is a conditional param used to evaluate the vignette awsDetectMaster <- function() { ## Get list of all instances on AWS account instances <- aws.ec2::describe_instances() ## Get hostname of local machine code is being run on hostname <- system2("hostname", stdout=TRUE) hostname <- gsub("-",".", sub("ip-","", hostname)) bool <- FALSE for (i in seq_along(instances)) { instancesSet = instances[[i]][["instancesSet"]] for (j in seq_along(instancesSet)) { privateIpAddress <- instancesSet[[j]][["privateIpAddress"]] if (privateIpAddress == hostname) { bool <- TRUE } } } bool } onMaster <- awsDetectMaster()
We now should start an AWS Cluster from our HOST instance. The setup step usually takes a few mins to start the cluster. The code chunk below shows you the functionality of controlling the AWS Cluster from the HOST node.
You can setup an AWS cluster with a master and a few workers using
bpsetup()
. Given that our current configuration has workers=2
, the
cluster contains 1 master and 1 worker node. This step
produces quite a bit of verbose output about the launch of your
cluster (don't be alarmed).
You can suspend the AWS cluster using bpsuspend()
, this stops
your AWS cluster but does not terminate the instances in the cluster.
You can teardown or terminate the AWS cluster using
bpteardown()
, this will terminate your AWS cluster and remove any
data on the master or worker nodes that are not saved. You should use
this option with caution.
Setup the AWS cluster using bpsetup
## Setup AWS cluster (takes a few mins) bpsetup(aws)
Once the cluster is setup, you should log into your master node of you AWS cluster. This can be done using the following step on your host's command line.
NOTE: This is not an R command, you need to exit your R session and use this command once your AWS Cluster has successfully launched. If this command fails, then your key has not been setup properly or the previous previous commands haven't been used correctly.
starcluster sshmaster -u ubuntu awsparallel
Once you have logged into your master node, you may launch your jobs on the cluster. Start a new R session,
## Load the AWSParallel library library(AWSParallel) ## Get the AWSBatchJobsParam which is already registered on your master node aws <- registered()[[1]] ## Test the bplapply command with some function, this function ## just prints out the hostname of the machine. FUN <- function(i) system("hostname", intern=TRUE) ## Run a bplapply command with FUN, set the BPPARAM to aws ## This will submit jobs to your AWS cluster xx <- bplapply(1:100, FUN, BPPARAM = aws) ## See the hostname of your cluster and how the jobs have been divided. table(unlist(xx))
Once you are done with submitting jobs on the AWS cluster, you need to suspend, or teardown, to avoid being charged by amazon when it is not in use. This is done on the HOST machine again.
Teardown or terminate the cluster once you have finished using it.
bpteardown(aws)
Settings required to get the package working on AWS.
To use AWSParallel, AWS credentials are a requirement. The credentials are given through AWS Identity and Access management - IAM. The user needs an AWS access key and an AWS secret key.
These AWS credentials should be stored on your HOST machine accessible
at ~/.aws/credentials
, as given in the documentation for
configuring AWS credentials.
Example AWS credentials, which need to be in the file, "~/.aws/credentials".
[default] aws_access_key_id=AKIAIOSFODNN7EXAMPLE aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
The AWS credentials provided to the package need access to a few components of the AWS account,
The setting [default]
which you see on the AWS credentials,
determines the profile of the AWS credentials. If you have multiple
profiles, be sure to specify the correct profile in the argument to
instantiate the AWSBatchJobsParam.
User's also need to create a KeyPair which needs to be accessible on the machine being used. The Key Pair can be created using the AWS-EC2 interface.
It can also be done programmatically using the AWSParallel
package
but the functionality is imported from aws.ec2
.
starcluster createkey mykey
The same key pair needs to be passed to your AWSBatchJobsParam. Notice that I pass in the same key "mykey" to my AWSBatchJobsParam as well.
Every AWS account has a default VPC created when the account is started. This VPC is usually contained to one AWS Region. Most of the Bioconductor AMI's are located on the AWS-Region US-EAST, so starting your account with a VPC located in that region makes most sense.
If the VPC is created(by the user or amazon default), the account gets
Subnets as well by default. For the AWSBatchJobsParam
class to be
created, the user has to specify the Subnet. If the subnet is not
given, we use, the first one on the AWS account.
NOTE: The HOST, the master instance, and the worker instances need to be on the same VPC and subnet with permissible security groups. Without this the connection established between the machines launched on AWS does not work.
This section describes the usage of AWSBatchJobsParam in a more detailed way. The steps below highlight the ideal way to use this package.
The user needs to start a HOST instance on AWS directly from his AWS account using the AWS UI to set up the host. This host instance will be used as an interface to control the AWS Cluster.
The HOST instance is just a machine to interact with your AWSCluster. It also allows windows users to be able to use the AWSParallel package without any issues.
This process can be done ONE time, and the instance can be stopped without being terminated. This HOST instance can be reused.
Create a new amazon EC2 instance which is going to be the HOST node, by choosing the AMI-ID with starcluster and bioc-devel, as given on this page, http://bioconductor.org/help/bioconductor-cloud-ami/#ami_ids.
The size of the HOST can be a t2. micro, which is a free tier of the AWS instance sizes.
Follow the steps in the ec2 management console to launch the image.
You are required to create a Keypair if you don't have one already. This can be done using this AWS-EC2 interface console.
Name your HOST instance. This is important for getting your instances settings. Call it "AWSParallelHOST", to make it easier to recognize.
SSH into the instance, which will be
ssh -i ~/.ssh/mykey.pem ubuntu@34.239.248.175
Once you are logged in, there are a few things you need to set
up. Install or update the AWSParallel
package in your R prompt.
biocLite(`AWSParallel`)
If you have a dependency error installing AWSParallel, because of
missing dependencies, i.e aws.ec2
and aws.signature
, try
install.packages( "aws.ec2", repos = c(getOption("repos"), "http://cloudyr.github.io/drat") )
Copy your AWS credentials to this machine, by writing your
credentials in the ~/.aws/credentials
directory.
Copy you AWS SSH keypair file (.pem) file to the machine as
well. The .pem
file needs to have permissions to read only, i.e
run chmod 400 AWSparallel-test-keypair.pem
if you get a
permissions error.
Choosing a t2.micro is enough for this instance, as it is only required to switch on and off your AWS Cluster.
If you have data which needs to be used on the AWS Cluster, you'd transfer it on to the HOST before you teardown(terminate) your AWS Cluster. Remember the storage size has no relation with instance compute size. If you add a storage volume to you t2.micro, it will cost money, and does not remain in the free tier anymore.
It used to start an AWS Cluster and takes as an argument the AWSBatchJobsParam object, which needs to be initialized with the correct credentials.
The output of a successful bpstart()
is quite verbose as it acts as
the interface to starcluster and the users AWS account to start the
master and worker nodes, and set them up with the correct
configuration.
bpstart(aws)
If the user intends to reuse the AWS Cluster after finishing the compute jobs, the cluster can be suspended and restarted at a later time. The bpsuspend functionality is similar to the stop in the AWS account.
bpsuspend(aws)
Resume your AWS Cluster from the HOST machine using bpresume
with
the registered AWS param.
When you log into your HOST node, your object for your
AWSBatchJobsParam should register automatically after loading the
library(AWSParallel)
. It should be the first object in your
registry.
bpresume
will fail if the cluster was previously torn down. You
might have to bpsetup
again.
library(AWSParallel) aws <- registered()[[1]] bpresume(aws)
NOTE: bpresume
is not currently functional. It will be operational
in other iterations of the package.
Purge your AWS cluster from the HOST, using bpteardown
. This removes
the master and the worker nodes. Remember that there is no coming back
after this step, your cluster is permanently lost.
bpteardown(aws)
Start by SSH-ing into to the master node of your AWS Cluster
starcluster sshmaster -u ubuntu awsparallel
Once you are logged into the master node of your AWS cluster, you can start an R session and start sending jobs through the AWSBatchJobsParam available on your registry.
library(AWSParallel) aws <- registered()[[1]] FUN <- function(i) system("hostname", intern=TRUE) xx <- bplapply(1:100, FUN, BPPARAM=aws) table(unlist(xx))
The size of an AWS-EC2 Instance gives you access to the required amount of compute power. Larger instances usually have a higher capacity for computing, but also cost more money. The AWS Pricing is given in the documentation, and we recommend you take a look at it.
The Bioconductor AMI's have been built using the m4.xlarge machine. So ideally to run a large computation, and use every package available in Bioconductor you would use your worker of size m4.xlarge. If you are using a limited set of packages, or you just need to run a job in parallel, it would be easier to take a look at the Instance types and decide the appropriate size for your needs.
The size of the master and worker nodes are going to be the same in this version of the package. In the following releases we will allow for master and workers to be of different size.
There should be no difference for a windows user vs any other operating system as the entire process takes place in the AWS ecosystem.
If you choose to keep your cluster isolated from any other work you have going on your VPC. Please create a new VPC and use the subnets in that VPC to start your AWSParallel param.
Give the AWSParallel param, the new subnet, and a security group as described in the security group section.
It is important that this cluster configuration is used only to launch parallel jobs. If you launch a large job only on the "master" node, there is no increment in speed if the job is not parallelized. It is vital that large jobs are not launched on the master node.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.