The AWSParallel package provides functionality to perform parallel
evaluation using AWS infrastructure, most importantly EC2. It extends
the SnowParam
class in BiocParallel, and works with the same range
of R and Bioconductor objects as SnowParam
.
The goal of the AWSParallel package is allow the user to create a
cluster of EC2 machines on AWS, and run bplapply
from one "master"
node (EC2 instance) to send a task to a set of "worker" nodes
(multiple EC2 instances). It is important to note, that both master
and worker nodes are on EC2. The user will have to start an instance,
manually, and use this machine as master.
This package requires that the user have an Amazon AWS account that costs money, and requires a credit card to access. The AWS credentials provided by the user also need access, to other AWS services as well, namely,
1. IAM 2. VPC 3. EC2
The quick start guide assumes you have your AWS access key, and secret
key in ~/.aws/credentials
. Please refer to the detailed section of
the vignette if these settings are not present.
Load the AWSParallel library and create an AWSSnowParam
library(AWSParallel) # Number of workers workers = 2 ## bioc-release 3.6 on US-EAST-2 (Ohio region), this is not default. ## Look up AMI's you want to use, at ## http://bioconductor.org/help/bioconductor-cloud-ami/#ami_ids image <- "ami-a90c23cc" ## SSH key pair awsSshKeyPair = getOption("aws_ssh_key_pair") ## Launch a small instance for demonstration `t2.micro`, ## larger instances cost more money. cluster <- AWSSnowParam( workers=workers, awsInstanceType="t2.micro", awsAmiId= image, awsSshKeyPair = awsSshKeyPair )
Start, use, and stop a cluster.
## This is a conditional param used to evaluate the vignette awsDetectMaster <- function() { ## Get list of all instances on AWS account instances <- aws.ec2::describe_instances() ## Get hostname of local machine code is being run on hostname <- system2("hostname", stdout=TRUE) hostname <- gsub("-",".", sub("ip-","", hostname)) bool <- FALSE for (i in seq_along(instances)) { instancesSet = instances[[i]][["instancesSet"]] for (j in seq_along(instancesSet)) { privateIpAddress <- instancesSet[[j]][["privateIpAddress"]] if (privateIpAddress == hostname) { bool <- TRUE } } } bool } onMaster <- FALSE
Eval only on master
## Start instance bpstart(cluster) ## Start an AWSParam job- ## Notice that the hostnames are different. We can run a computation now, and ## the job gets spread across the different machines, and collected back into ## the variable xx. xx <- bplapply(1:4, function(i) system("hostname", intern=TRUE), BPPARAM=cluster) ## Stop aws instance(s) bpstop(cluster)
Settings required to get the package working on AWS.
To use AWSParallel, AWS credentials are a requirement. The credentials are given through AWS Identity and Access management - IAM. The user needs an AWS access key and an AWS secret key.
These AWS credntials should be stored on your machine accessible at
~/.aws/credentials
, as given in the documentation for
configuring AWS credentials.
Example AWS credentials, which need to be in the file, "~/.aws/credentials".
[default] aws_access_key_id=AKIAIOSFODNN7EXAMPLE aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
The AWS credentials provided to the package need access to a few components of the AWS account,
User's also need to create a KeyPair which needs to be accessible on the machine being used. The Key Pair can be created using the AWS-EC2 interface.
It can also be done programmatically using the AWSParallel
package
but the functionality is imported from aws.ec2
.
library(AWSParallel) ## Use credentials in your ~/.aws/crentials file aws.signature::use_credentials() ## This saves the `.pem` file in your your path ~/AWSParallel-keypair.pem aws.ec2::create_keypair( keypair = "AWSParallel-key", path="~/AWSParallel-keypair.pem" ) ## Key pair to be passed in to your AWSSnowParam class awsSshKeyPair = "~/AWSParallel-keypair.pem"
Every AWS account has a default VPC created when the account is started. This VPC is usually contained to one AWS Region. Most of the Bioconductor AMI's are located on the AWS-Region US-EAST, so starting your account with a VPC located in that region makes most sense.
If the VPC is created(by the user or amazon default), the account gets
Subnets as well by default. For the AWSSnowParam
class to be
created, the user has to specify the Subnet. If the subnet is not
given, we use, the first one on the AWS account.
NOTE: The master instance and the worker instances need to be on the same VPC and subnet with permissible security groups. Without this the socket connection established with other machines launched on AWS does not work well.
Security groups are probably the most important AWS setting required in AWSParallel. It's easier if
Once you have the following AWS components set up, you are ready to use the package
We need to launch the Bioconductor AMI for the release version. Do
this by creating an AWSSnowParam
object
FIXME: this code chunk should be written to be evaluated
## Load the package library(AWSParallel) ## bioc-release 3.6 as found in https://www.bioconductor.org/config.yaml image <- "ami-a90c23cc" ## Number of workers to be started in the AWS cluster workers = 4 ## Set the AWS SSH key pair for your machine, or ## If you already have a keypair, just set the path_to_key, to the pem file. path_to_key = "~/aws-parallel.pem" ## TODO: This can also be automated awsSshKeyPair = aws.ec2::create_keypair("AWSParallelKeyPair", path_to_key) ## sg <- "sg-748dcd07" subnet <- "subnet-d66a05ec" ## Create AWS instance aws <- AWSSnowParam( workers=workers, awsInstanceType="t2.micro", awsSubnet = subnet, awsSecurityGroup = sg, awsAmiId= image, awsSshKeyPair = path_to_key, awsCredentialsPath="/home/ubuntu/credentials" ) aws ## Check if instance is up, awsInstanceStatus(aws)
Then perform diagnostics, start, use, and stop the cluster
## Start instance bpstart(aws) ## Return cluster which was started awsCluster() ## Check is instance is up awsInstanceStatus(aws) ## start an AWSParam job bplapply(1:4, function(i) system("hostname", intern=TRUE), BPPARAM=aws) ## Stop aws instance bpstop(aws)
The size of an AWS-EC2 Instance gives you access to the required amount of compute power. Larger instances usually have a higher capacity for computing, but also cost more money. The AWS Pricing is given in the documentation, and we recommend you take a look at it.
The Bioconductor AMI's have been built using the m4.xlarge machine. So ideally to run a large computation, and use every package available in Bioconductor you would use your worker of size m4.xlarge. If you are using a limited set of packages, or you just need to run a job in parallel, it would be easier to take a look at the Instance types and decide the appropriate size for your needs.
This process can be done ONE time, and the instance can be stopped without being terminated. This master instance can be reused.
Create a new amazon EC2 instance which is going to be the master node, by choosing the AMI-ID from this page, http://bioconductor.org/help/bioconductor-cloud-ami/#ami_ids. Follow the steps in the ec2 management console to launch the image.
You are required to create a Keypair if you don't have one already. This can be done using this AWS-EC2 interface console.
Name your master instance. This is important for getting your instances settings. Call it "AWSParallelMaster"
SSH into the instance, which will be
ssh -i ~/.ssh/AWSparallel-test-keypair.pem ubuntu@34.239.248.175
Once you are logged in, there are a few things you need to set
up. Install the AWSParallel
package in your R prompt.
biocLite(`AWSParallel`)
If you have a dependency error installing AWSParallel, because of
missing dependencies, i.e aws.ec2
and aws.signature
, try
install.packages( "aws.ec2", repos = c(getOption("repos"), "http://cloudyr.github.io/drat") )
Copy your AWS credentials to this machine, by writing your
credentials in the ~/.aws/credentials
directory.
Copy you AWS SSH keypair file (.pem) file to the machine as
well. The .pem
file needs to have permissions to read only, i.e
run chmod 400 AWSparallel-test-keypair.pem
if you get a
permissions error.
For windows machines, we recommend that you manually follow the steps to
prepare a master instance instead of using the awsLaunchMasterOnEc2
function.
Once you prepare the master instance, and are able to establish an SSH connection to the master you may use the rest of the package as described without any difference because of the platform.
If you choose to keep your cluster isolated from any other work you have going on your VPC. Please create a new VPC and use the subnets in that VPC to start your AWSParallel param.
Give the AWSParallel param, the new subnet, and a security group as described in the security group section.
If you want to get the verbose output of the SSH connections being
attempted to your worker nodes, please use the option
verbose=TRUE
when making your AWSSnowParam object.
It is important that this cluster configuration is used only to launch parallel jobs. If you launch a large job on the "master" node, there is no increment in speed. It is vital that large jobs are not launched on the master node.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.