Introduction

The AWSParallel package provides functionality for parallel evaluation on AWS infrastructure, most importantly EC2. It extends the SnowParam class from BiocParallel and works with the same range of R and Bioconductor objects as SnowParam.

The goal of the AWSParallel package is to allow the user to create a cluster of EC2 machines on AWS and run bplapply from one "master" node (an EC2 instance) that sends tasks to a set of "worker" nodes (multiple EC2 instances). It is important to note that both master and worker nodes run on EC2. The user has to start an instance manually and use that machine as the master.

This package requires an Amazon AWS account, which costs money and requires a credit card to set up. The AWS credentials provided by the user also need access to other AWS services, namely:

1. IAM
2. VPC
3. EC2

Quick Start

The quick start guide assumes your AWS access key and secret key are in ~/.aws/credentials. Please refer to the detailed section of the vignette if these settings are not present.

Load the AWSParallel library and create an AWSSnowParam

library(AWSParallel)

# Number of workers
workers = 2

## Bioconductor release 3.6 AMI in US-EAST-2 (Ohio region); note this is not
## the default region. Look up the AMI IDs you want to use at
## http://bioconductor.org/help/bioconductor-cloud-ami/#ami_ids
image <- "ami-a90c23cc"

## SSH key pair
awsSshKeyPair = getOption("aws_ssh_key_pair")

## Launch a small instance for demonstration `t2.micro`,
## larger instances cost more money.
cluster <- AWSSnowParam(
    workers=workers,
    awsInstanceType="t2.micro",
    awsAmiId= image,
    awsSshKeyPair = awsSshKeyPair
)

Start, use, and stop a cluster.

## Helper used to decide whether to evaluate the AWS chunks in this vignette:
## returns TRUE when the code is running on an EC2 instance that belongs to
## this AWS account (i.e., a potential master node).
awsDetectMaster <- function()
{
    ## Get list of all instances on the AWS account
    instances <- aws.ec2::describe_instances()
    ## Get hostname of the local machine the code is being run on
    hostname <- system2("hostname", stdout=TRUE)
    hostname <- gsub("-", ".", sub("ip-", "", hostname))
    bool <- FALSE
    for (i in seq_along(instances)) {
        instancesSet <- instances[[i]][["instancesSet"]]
        for (j in seq_along(instancesSet)) {
            privateIpAddress <- instancesSet[[j]][["privateIpAddress"]]
            ## identical() also handles instances without a private IP
            if (identical(privateIpAddress, hostname)) {
                bool <- TRUE
            }
        }
    }
    bool
}

## Gate evaluation of the AWS chunks below; this falls back to FALSE when the
## vignette is not being built on an EC2 master instance.
onMaster <- tryCatch(awsDetectMaster(), error = function(e) FALSE)

The following chunks are evaluated only on the master instance (i.e., when onMaster is TRUE).

## Start instance
bpstart(cluster)

## Start an AWSSnowParam job.
## We can now run a computation: the job is spread across the different worker
## machines (notice that the hostnames returned differ) and the results are
## collected back into the variable xx.
xx <- bplapply(1:4, function(i) system("hostname", intern=TRUE), BPPARAM=cluster)

## Stop aws instance(s)
bpstop(cluster) 

AWS settings

Settings required to get the package working on AWS.

Get AWS Credentials

To use AWSParallel, AWS credentials are required. Credentials are issued through AWS Identity and Access Management (IAM); the user needs an AWS access key and an AWS secret key.

These AWS credentials should be stored on your machine in ~/.aws/credentials, as described in the AWS documentation for configuring credentials.

Example AWS credentials, as they should appear in the file ~/.aws/credentials:

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

The AWS credentials provided to the package need access to a few components of the AWS account; a quick read-only check is sketched after this list.

  1. IAM - This is used to get the credentials, change credentials, activate/deactivate credentials for security reasons.
  2. VPC - This is used to detect VPCs in the region, so that all the instances launched are within the same VPC and the same subnet.
  3. EC2 - This is used to launch, run, and terminate instances as needed.
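
As a quick, read-only way to confirm that the credentials can reach these services, calls like the following should complete without an authorization error. This is only a sketch: describe_vpcs() is assumed to be available in your installed version of aws.ec2, and the IAM requirement is not exercised by these calls.

library(AWSParallel)

## Load the [default] profile from ~/.aws/credentials
aws.signature::use_credentials()

## Read-only calls: each should return without an authorization error
instances <- aws.ec2::describe_instances()   # EC2 access
vpcs      <- aws.ec2::describe_vpcs()        # VPC access (assumed to exist in aws.ec2)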

AWS Key Pair

Users also need to create a key pair, which must be accessible on the machine being used. The key pair can be created from the AWS EC2 console.

It can also be created programmatically with the AWSParallel package; the functionality is imported from aws.ec2.

library(AWSParallel)
## Use credentials in your ~/.aws/credentials file
aws.signature::use_credentials()
## This saves the `.pem` file at the path ~/AWSParallel-keypair.pem
aws.ec2::create_keypair(
    keypair = "AWSParallel-key", path="~/AWSParallel-keypair.pem"
)

## Key pair to be passed in to your AWSSnowParam class
awsSshKeyPair = "~/AWSParallel-keypair.pem"

VPC and Subnet (Optional)

Every AWS account has a default VPC created when the account is opened. This VPC is usually confined to one AWS region. Most of the Bioconductor AMIs are located in the US-EAST region, so using a VPC located in that region makes the most sense.

When a VPC is created (by the user or as the Amazon default), the account also gets subnets by default. A subnet can be specified when constructing the AWSSnowParam object; if no subnet is given, the first one on the AWS account is used.

NOTE: The master instance and the worker instances need to be in the same VPC and subnet, with permissible security groups. Without this, the socket connections to the other machines launched on AWS cannot be established reliably.
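
To choose a subnet explicitly rather than relying on the first-subnet default, the subnets visible to the account can be listed first. This is a sketch only: it assumes your installed aws.ec2 provides describe_subnets(), and the structure of the returned object may differ between versions.

## List the subnets on the account and inspect them to pick a subnet (and its
## VPC) for the cluster; assumes aws.ec2::describe_subnets() is available.
subnets <- aws.ec2::describe_subnets()
str(subnets, max.level = 2)

The chosen subnet ID (for example "subnet-d66a05ec", used later in this vignette) is then passed as the awsSubnet argument of AWSSnowParam().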

Security Group settings (Optional)

Security groups are probably the most important AWS setting required for AWSParallel. It is easiest if the master and the worker instances share a single security group that allows them to reach each other (for the socket connections) and allows SSH access from your machine.

Security Group Inbound Rules

Typically, inbound rules allow SSH (TCP port 22) from your IP address and all traffic from instances that belong to the same security group, so that master and workers can open socket connections to each other.

Security Group Outbound Rules

Outbound rules can usually be left at the AWS default, which allows all outbound traffic.
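
As a rough, hypothetical sketch of setting this up programmatically: the function names and arguments below assume the cloudyr aws.ec2 API (check ?create_sgroup and ?authorize_ingress in your installed version before running), and the VPC ID and CIDR are placeholders to replace with your own values.

## Hypothetical sketch: one security group shared by master and workers
sg <- aws.ec2::create_sgroup(
    "AWSParallel-cluster",
    "Security group for AWSParallel master and workers",
    vpc = "vpc-xxxxxxxx"                    # placeholder: your VPC ID
)

## Inbound: allow SSH from your own machine (replace the placeholder CIDR)
aws.ec2::authorize_ingress(sg, port = 22, protocol = "tcp",
                           cidr = "203.0.113.10/32")

## Inbound: traffic between master and workers is easiest to allow by adding a
## rule with the security group itself as the source, e.g. in the EC2 console.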

Once you have these AWS components set up, you are ready to use the package.

Working with the package AWSParallel

We need to launch the Bioconductor AMI for the release version. Do this by creating an AWSSnowParam object.

FIXME: this code chunk should be written to be evaluated

## Load the package
library(AWSParallel)

## bioc-release 3.6 as found in https://www.bioconductor.org/config.yaml
image <- "ami-a90c23cc"

## Number of workers to be started in the AWS cluster
workers = 4

## Create an AWS SSH key pair for your machine, or, if you already have a
## key pair, just set path_to_key to the existing .pem file.
path_to_key = "~/aws-parallel.pem"

## TODO: This can also be automated
awsSshKeyPair = aws.ec2::create_keypair("AWSParallelKeyPair", path_to_key)

## Security group and subnet IDs (example values; replace with your own)
sg <- "sg-748dcd07"
subnet <- "subnet-d66a05ec"

## Construct the AWSSnowParam object describing the cluster
aws <- AWSSnowParam(
    workers=workers,
    awsInstanceType="t2.micro",
    awsSubnet = subnet,
    awsSecurityGroup = sg,
    awsAmiId= image,
    awsSshKeyPair = path_to_key,
    awsCredentialsPath="/home/ubuntu/credentials"
    )

aws
## Check if the instances are up
awsInstanceStatus(aws)

Then perform diagnostics, start, use, and stop the cluster

## Start instance
bpstart(aws)

## Return cluster which was started
awsCluster()

## Check if the instances are up
awsInstanceStatus(aws)

## start an AWSParam job
bplapply(1:4, function(i) system("hostname", intern=TRUE), BPPARAM=aws)

## Stop aws instance
bpstop(aws)

Choosing AWS EC2 Instance Size

The size of an AWS EC2 instance determines how much compute power you have access to. Larger instances usually have higher computing capacity, but also cost more money. AWS pricing is given in the documentation, and we recommend you take a look at it.

The Bioconductor AMIs have been built on m4.xlarge machines. So ideally, to run a large computation and use every package available in Bioconductor, you would use workers of size m4.xlarge. If you are using a limited set of packages, or just need to run a job in parallel, take a look at the instance types and decide the appropriate size for your needs.
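
As a sketch, the only change from the constructor used above is the awsInstanceType argument; the other arguments reuse the placeholder values defined earlier in this section.

## Same cluster definition, but with workers matching the AMI build machine
aws <- AWSSnowParam(
    workers=workers,
    awsInstanceType="m4.xlarge",
    awsSubnet = subnet,
    awsSecurityGroup = sg,
    awsAmiId= image,
    awsSshKeyPair = path_to_key
)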

Steps to prepare a master instance

This process only needs to be done once; the instance can then be stopped without being terminated, and the same master instance can be reused.

  1. Create a new Amazon EC2 instance which is going to be the master node by choosing an AMI ID from this page: http://bioconductor.org/help/bioconductor-cloud-ami/#ami_ids. Follow the steps in the EC2 management console to launch the image.

    Choose EC2 Instance Type

    Configure EC2 instance with security settings

    Add storage as per usage requirements

    You are required to create a key pair if you don't have one already. This can be done in the AWS EC2 console.

    Review and launch your master instance

  2. Name your master instance. This is important for finding your instance's settings later. Call it "AWSParallelMaster".

  3. SSH into the instance using its public IP address and your key pair, for example:

    ssh -i ~/.ssh/AWSparallel-test-keypair.pem ubuntu@34.239.248.175
    
  4. Once you are logged in, there are a few things you need to set up. Install the AWSParallel package at the R prompt.

    source("https://bioconductor.org/biocLite.R")
    biocLite("AWSParallel")
    

    If you get an error installing AWSParallel because of missing dependencies, i.e. aws.ec2 and aws.signature, try:

    install.packages(
        c("aws.ec2", "aws.signature"),
        repos = c(getOption("repos"), "http://cloudyr.github.io/drat")
    )
    
  5. Copy your AWS credentials to this machine by writing them to the file ~/.aws/credentials.

  6. Copy your AWS SSH key pair (.pem) file to the machine as well. The .pem file must be readable only by you; run chmod 400 AWSparallel-test-keypair.pem if you get a permissions error. A quick sanity check from R is sketched after this list.
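
A minimal sanity check, run from R on the master once steps 5 and 6 are done, might look like this; the key-pair filename is the example used above, so substitute your own.

## Confirm the credentials file and the key pair are in place on the master
file.exists("~/.aws/credentials")
file.exists("~/AWSparallel-test-keypair.pem")

## Restrict key permissions from R (equivalent to `chmod 400` in the shell)
Sys.chmod("~/AWSparallel-test-keypair.pem", mode = "0400")

## Load the [default] profile and make a read-only call to confirm API access
aws.signature::use_credentials()
instances <- aws.ec2::describe_instances()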

Instructions for Windows machines

For Windows machines, we recommend that you manually follow the steps to prepare a master instance instead of using the awsLaunchMasterOnEc2 function.

Once you have prepared the master instance and are able to establish an SSH connection to it, you can use the rest of the package as described, with no platform-specific differences.

Advanced Tips

  1. If you want to keep your cluster isolated from any other work on your VPC, create a new VPC and use the subnets in that VPC to start your AWSSnowParam.

  2. Give the AWSSnowParam object the new subnet and a security group as described in the security group section.

  3. If you want verbose output from the SSH connections attempted to your worker nodes, use the option verbose=TRUE when constructing your AWSSnowParam object (see the sketch after this list).

  4. It is important that this cluster configuration is used only to launch parallel jobs: a large job launched on the "master" node alone gets no speedup, so large jobs should always be dispatched to the workers rather than run on the master itself.
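
A minimal sketch of tip 3, reusing the placeholder arguments defined earlier in the vignette:

## Same constructor as before, with verbose output for the SSH connections
aws <- AWSSnowParam(
    workers=workers,
    awsInstanceType="t2.micro",
    awsSubnet = subnet,
    awsSecurityGroup = sg,
    awsAmiId= image,
    awsSshKeyPair = path_to_key,
    verbose=TRUE
)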

Session Info

sessionInfo()

