Introduction

The AWSParallel package provides functionality to perform parallel evaluation using AWS infrastructure, most importantly EC2. It also internally uses StarCluster to deploy jobs on SGE. It extends BatchJobsParam class in BiocParallel, and works with the same range of R and Bioconductor objects as BatchJobsParam.

The goal of the AWSParallel package is allow the user to create a cluster of EC2 machines on AWS, and run bplapply from one "master" node (EC2 instance) to submit jobs to a set of "worker" nodes (multiple EC2 instances). It is important to note that, both master and worker nodes are AWS EC2 machines. A side-effect of the way we configure the required software to enable batch job submission is, the "master" and "workers" which act as the cluster, need to be spawned (started) from a Bioconductor AMI. The user will have to start an instance, manually, and use this machine as cluster starter (primary machine where AWSParallel is being run).

This package requires that the user have an Amazon AWS account that costs money and requires a credit card to access. The AWS credentials provided by the user also need access, to other AWS services as well, namely,

1. IAM
2. VPC
3. EC2

We leave the responsibility to the user to figure out AWS, although many helpful tutorials are pointed out in the References section.

Quick Start

The quick start guide assumes you have your AWS access key, and secret key in ~/.aws/credentials. Please refer to the detailed section of the vignette if these settings are not present. We are expecting at this point, that you have launched the AMI (ami-18c0f562) provided by Bioconductor.

You have to use the AMI created by Bioconductor, which includes starcluster and Bioc-devel to use this package. This will be your HOST instance.

Load the AWSParallel library and create an AWSBatchJobsParam object. This step is needed to setup your cluster on AWS.

The AMI required for correct configuration (Bioc-devel with starcluster) is "ami-18c0f562". The AWS credntials are needed on the launched instance as well. They need to be in the default location, or the path needs to be specified. Specify the instance type of your AWS Cluster, the same instance type will apply to your master and workers. Specify the subnet you want to use from your AWS account, note that the master and workers need to be on the same subnet. Specify the SSH key pair, if you don't have a key you use, just create a new one for AWSParallel, and use that throughout. The quickstart will launch a small instance for demonstration t2.micro as larger instances cost more money.

## Load the library
library(AWSParallel)

## Construct AWSBatchJobsParam class
aws <- AWSBatchJobsParam(workers = 2,
                  awsCredentialsPath = "~/.aws/credentials",
                  awsInstanceType = "t2.micro",
                  awsSubnet = "subnet-d66a05ec",
                  awsAmiId = "ami-18c0f562",
                  awsSshKeyPair = "mykey",
                  awsProfile="default")

## Print object to show structure
aws

Start, use, and stop a cluster.

## This is a conditional param used to evaluate the vignette
awsDetectMaster <- function()
{
        ## Get list of all instances on AWS account
    instances <- aws.ec2::describe_instances()
    ## Get hostname of local machine code is being run on
    hostname <- system2("hostname", stdout=TRUE)
    hostname <- gsub("-",".", sub("ip-","", hostname))
    bool <- FALSE
    for (i in seq_along(instances)) {
        instancesSet = instances[[i]][["instancesSet"]]
        for (j in seq_along(instancesSet)) {
            privateIpAddress <- instancesSet[[j]][["privateIpAddress"]]
            if (privateIpAddress == hostname) {
                bool <- TRUE
            }
        }
    }
    bool
}

onMaster <- awsDetectMaster()

We now should start an AWS Cluster from our HOST instance. The setup step usually takes a few mins to start the cluster. The code chunk below shows you the functionality of controlling the AWS Cluster from the HOST node.

You can setup an AWS cluster with a master and a few workers using bpsetup(). Given that our current configuration has workers=2, the cluster contains 1 master and 1 worker node. This step produces quite a bit of verbose output about the launch of your cluster (don't be alarmed).

You can suspend the AWS cluster using bpsuspend(), this stops your AWS cluster but does not terminate the instances in the cluster.

You can teardown or terminate the AWS cluster using bpteardown(), this will terminate your AWS cluster and remove any data on the master or worker nodes that are not saved. You should use this option with caution.

Setup AWS Cluster

Setup the AWS cluster using bpsetup

## Setup AWS cluster (takes a few mins)
bpsetup(aws)

Use AWS Cluster

Once the cluster is setup, you should log into your master node of you AWS cluster. This can be done using the following step on your host's command line.

NOTE: This is not an R command, you need to exit your R session and use this command once your AWS Cluster has successfully launched. If this command fails, then your key has not been setup properly or the previous previous commands haven't been used correctly.

starcluster sshmaster -u ubuntu awsparallel

Once you have logged into your master node, you may launch your jobs on the cluster. Start a new R session,

## Load the AWSParallel library
library(AWSParallel)

## Get the AWSBatchJobsParam which is already registered on your master node
aws <- registered()[[1]]

## Test the bplapply command with some function, this function 
## just prints out the hostname of the machine.
FUN <- function(i) system("hostname", intern=TRUE)

## Run a bplapply command with FUN, set the BPPARAM to aws
## This will submit jobs to your AWS cluster
xx <- bplapply(1:100, FUN, BPPARAM = aws)

## See the hostname of your cluster and how the jobs have been divided.
table(unlist(xx))

Once you are done with submitting jobs on the AWS cluster, you need to suspend, or teardown, to avoid being charged by amazon when it is not in use. This is done on the HOST machine again.

Teardown AWS Cluster

Teardown or terminate the cluster once you have finished using it.

bpteardown(aws)

AWS settings

Settings required to get the package working on AWS.

Get AWS Credentials

To use AWSParallel, AWS credentials are a requirement. The credentials are given through AWS Identity and Access management - IAM. The user needs an AWS access key and an AWS secret key.

These AWS credentials should be stored on your HOST machine accessible at ~/.aws/credentials, as given in the documentation for configuring AWS credentials.

Example AWS credentials, which need to be in the file, "~/.aws/credentials".

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

The AWS credentials provided to the package need access to a few components of the AWS account,

  1. IAM - This is used to get the credentials, change credentials, activate/deactivate credentials for security reasons.
  2. VPC - This is used to detect VPC's in the region, so that all the instances launched are within the same VPC, and same subnet.
  3. EC2 - This is used to launch, run, stop, and terminate instances as needed.

AWS Profile

The setting [default] which you see on the AWS credentials, determines the profile of the AWS credentials. If you have multiple profiles, be sure to specify the correct profile in the argument to instantiate the AWSBatchJobsParam.

AWS Key Pair

User's also need to create a KeyPair which needs to be accessible on the machine being used. The Key Pair can be created using the AWS-EC2 interface.

It can also be done programmatically using the AWSParallel package but the functionality is imported from aws.ec2.

starcluster createkey mykey

The same key pair needs to be passed to your AWSBatchJobsParam. Notice that I pass in the same key "mykey" to my AWSBatchJobsParam as well.

Subnet

Every AWS account has a default VPC created when the account is started. This VPC is usually contained to one AWS Region. Most of the Bioconductor AMI's are located on the AWS-Region US-EAST, so starting your account with a VPC located in that region makes most sense.

If the VPC is created(by the user or amazon default), the account gets Subnets as well by default. For the AWSBatchJobsParam class to be created, the user has to specify the Subnet. If the subnet is not given, we use, the first one on the AWS account.

NOTE: The HOST, the master instance, and the worker instances need to be on the same VPC and subnet with permissible security groups. Without this the connection established between the machines launched on AWS does not work.

Working with the AWSParallel - AWSBatchJobsParam

This section describes the usage of AWSBatchJobsParam in a more detailed way. The steps below highlight the ideal way to use this package.

Starting a HOST instance on AWS

The user needs to start a HOST instance on AWS directly from his AWS account using the AWS UI to set up the host. This host instance will be used as an interface to control the AWS Cluster.

The HOST instance is just a machine to interact with your AWSCluster. It also allows windows users to be able to use the AWSParallel package without any issues.

Steps to prepare HOST instance.

This process can be done ONE time, and the instance can be stopped without being terminated. This HOST instance can be reused.

  1. Create a new amazon EC2 instance which is going to be the HOST node, by choosing the AMI-ID with starcluster and bioc-devel, as given on this page, http://bioconductor.org/help/bioconductor-cloud-ami/#ami_ids.

    The size of the HOST can be a t2. micro, which is a free tier of the AWS instance sizes.

    Follow the steps in the ec2 management console to launch the image.

    Choose EC2 Instance Type

    Configure EC2 instance with security settings

    Add storage as per usage requirements

    You are required to create a Keypair if you don't have one already. This can be done using this AWS-EC2 interface console.

    Review and launch your HOST instance

  2. Name your HOST instance. This is important for getting your instances settings. Call it "AWSParallelHOST", to make it easier to recognize.

  3. SSH into the instance, which will be

    ssh -i ~/.ssh/mykey.pem ubuntu@34.239.248.175
    
  4. Once you are logged in, there are a few things you need to set up. Install or update the AWSParallel package in your R prompt.

    biocLite(`AWSParallel`)
    

    If you have a dependency error installing AWSParallel, because of missing dependencies, i.e aws.ec2 and aws.signature, try

    install.packages(
        "aws.ec2", repos = c(getOption("repos"), 
        "http://cloudyr.github.io/drat")
    )
    
  5. Copy your AWS credentials to this machine, by writing your credentials in the ~/.aws/credentials directory.

  6. Copy you AWS SSH keypair file (.pem) file to the machine as well. The .pem file needs to have permissions to read only, i.e run chmod 400 AWSparallel-test-keypair.pem if you get a permissions error.

HOST instance size

Choosing a t2.micro is enough for this instance, as it is only required to switch on and off your AWS Cluster.

If you have data which needs to be used on the AWS Cluster, you'd transfer it on to the HOST before you teardown(terminate) your AWS Cluster. Remember the storage size has no relation with instance compute size. If you add a storage volume to you t2.micro, it will cost money, and does not remain in the free tier anymore.

Functionality of HOST instance

bpstart()

It used to start an AWS Cluster and takes as an argument the AWSBatchJobsParam object, which needs to be initialized with the correct credentials.

The output of a successful bpstart() is quite verbose as it acts as the interface to starcluster and the users AWS account to start the master and worker nodes, and set them up with the correct configuration.

bpstart(aws)

bpsuspend()

If the user intends to reuse the AWS Cluster after finishing the compute jobs, the cluster can be suspended and restarted at a later time. The bpsuspend functionality is similar to the stop in the AWS account.

bpsuspend(aws)

bpresume()

Resume your AWS Cluster from the HOST machine using bpresume with the registered AWS param.

When you log into your HOST node, your object for your AWSBatchJobsParam should register automatically after loading the library(AWSParallel). It should be the first object in your registry.

bpresume will fail if the cluster was previously torn down. You might have to bpsetup again.

library(AWSParallel)
aws <- registered()[[1]]
bpresume(aws)

NOTE: bpresume is not currently functional. It will be operational in other iterations of the package.

bpteardown()

Purge your AWS cluster from the HOST, using bpteardown. This removes the master and the worker nodes. Remember that there is no coming back after this step, your cluster is permanently lost.

bpteardown(aws)

Using AWS Cluster from HOST instance

Start by SSH-ing into to the master node of your AWS Cluster

starcluster sshmaster -u ubuntu awsparallel

Using AWS Cluster from MASTER mode

Once you are logged into the master node of your AWS cluster, you can start an R session and start sending jobs through the AWSBatchJobsParam available on your registry.

library(AWSParallel)

aws <- registered()[[1]]

FUN  <- function(i) system("hostname", intern=TRUE)

xx  <- bplapply(1:100, FUN, BPPARAM=aws)

table(unlist(xx))

Choosing AWS EC2 Instance Size -- MASTER

The size of an AWS-EC2 Instance gives you access to the required amount of compute power. Larger instances usually have a higher capacity for computing, but also cost more money. The AWS Pricing is given in the documentation, and we recommend you take a look at it.

The Bioconductor AMI's have been built using the m4.xlarge machine. So ideally to run a large computation, and use every package available in Bioconductor you would use your worker of size m4.xlarge. If you are using a limited set of packages, or you just need to run a job in parallel, it would be easier to take a look at the Instance types and decide the appropriate size for your needs.

Choosing AWS EC2 Instance size -- WORKER

The size of the master and worker nodes are going to be the same in this version of the package. In the following releases we will allow for master and workers to be of different size.

Instructions for Windows machines.

There should be no difference for a windows user vs any other operating system as the entire process takes place in the AWS ecosystem.

Advanced Tips

  1. If you choose to keep your cluster isolated from any other work you have going on your VPC. Please create a new VPC and use the subnets in that VPC to start your AWSParallel param.

  2. Give the AWSParallel param, the new subnet, and a security group as described in the security group section.

  3. It is important that this cluster configuration is used only to launch parallel jobs. If you launch a large job only on the "master" node, there is no increment in speed if the job is not parallelized. It is vital that large jobs are not launched on the master node.

Session Info

sessionInfo()


Bioconductor/AWSParallel documentation built on May 28, 2019, 11:58 a.m.