In teamcgc/cgcR: NCI Cancer Genomics Cloud examples in R

knitr::opts_chunk$set(echo = TRUE)

FireCloud, developed and maintained by the Broad Institute, is one of the three National Cancer Institute's Cancer Genomics Cloud Pilots. https://software.broadinstitute.org/firecloud/. This documment provides step-by-step description of creating a project on the FireCloud and upload RNA-seq (BAM) data. Before completing this tutorial, please set up an user account and a billing account with the instructions on the Firecloud's website.

Account registration

https://software.broadinstitute.org/firecloud/guide/topic?name=firecloud-registration

- Go to <Portal.FireCloud.org>
- Click "Register"
- Sign In with a Gmail or Google Apps Account (e.g., FireCloudUser@gmail.com). When asked to allow FireCloud to view your email address and basic profile info, etc., please click Allow. An explanation of these requests is posted in the Help Forum. You will be prompted to enter New User Registration information.
- Please enter a Contact Email if it differs from your Gmail or Google Apps Account.
- Click Register
(After registering, you will see a message indicating your FireCloud account is inactive. A FireCloud administrator will activate your account within 24 hours.)

Billing

https://software.broadinstitute.org/firecloud/guide/topic?name=firecloud-google

Here is some information on estimating costs in FireCloud from the development team.

FireCloud runs on the Google Cloud Platform (GCP). All FireCloud costs (cloud storage, compute, data egress etc.) are ultimately billed via Google Billing Accounts. In order to be compatible with multiple cloud environments, institutional payment systems, and security requirements, the FireCloud interface does not directly display any part of the Google Billing Account interface.

Instead, FireCloud will connect a Google Billing Account to an associated FireCloud Billing Project. When you are using FireCloud, it is these FireCloud Billing Projects that you will see in the interface, and to which FireCloud will charge your usage costs.

In order to create or clone a new workspace in FireCloud, you must have access to at least one FireCloud Billing Project. To get started with a FireCloud Billing Project or Google Billing Account, you can check out this [Help Topic] https://software.broadinstitute.org/firecloud/guide/topic?name=firecloud-google in the FireCloud User Guide.

Google Cloud Platform (GCP) Pricing Structure

Data upload to GCP is free.
Data download from GCP is $120/TB with a 30% discount if over 10TB.
Compute costs about $0.05/core hour if less than 6.5GB RAM/core. More RAM will require running with more cores and will cost more.
Disk space (used only while VM is running) costs $40/TB/Month of total quota reserved.
General storage costs $26/TB/month.

Example - FireCloud testers ran a MuTect analysis on 1089 tumor/normal pairs - Compute: At $0.05/core hour and 23,305.4 core hours, the cost was $1,165.27 ($1.07 dollars/run) - Storage: There were 1089 output files with a total size of 0.001141 TB. At $26/TB/month, the storage/month was $0.03. Note: In this example, the pairs referred to BAMs that resided in the TCGA Open Access bucket, so no storage costs were incurred by the testers. - Download: To download these files, at $120/TB, the cost is $0.14. - Note: if the testers wanted to estimate costs, they could first run the analysis on a few pairs and calculate the core hours per run.

For more information, you can check out these links: - Google Compute Pricing - Google Storage Pricing - FireCloud Projects and Billing Accounts

FireCloud Basics

https://software.broadinstitute.org/firecloud/guide/topic?name=firecloud-basics

Login

https://portal.firecloud.org/
This tutorial is written using the Google Chrome as browser.

Click "Sign In"
Sign in with your google credential
(If you have multiple google accounts, log in in an incognito window by clicking File -> New Incognito Window)

TCGA access

After you log in, your profile is displayed.
If you have an NIH account, link your NIH account.

Click "Log-In to NIH to re-link your account"
Log in with your NIH credential
(You will need to renew/re-link your account periodically)

Create and Share New Workspace

https://software.broadinstitute.org/firecloud/guide/topic?name=firecloud-basics The FireCloud world is organized into workspaces.

A workspace is a computational sandbox where you can organize genomic data and tools, and run analyses. Users can create, share, and clone workspaces. Workspaces hold:

Data: pre-loaded or user-uploaded, open access or controlled access
Workflows: pre-loaded or user-created
Tools: pre-loaded or user-created
Results: from all runs, captured with provenance

After you log in, your workspaces will be displayed. Workspaces that are TCGA access-controlled are labeled "Restricted."

To create a new workspace

Click on the blue botton on the upper-right-hand-corner "Create New Workspace... +"
Make sure it is in the correct Google Project linked to the billing account (Refer to the Billing section above)
Enter workspace name "IRPworkshop"
Enter workspace description, if any.
If this project will contain TCGA control data, check the box "Workspace intended to contain NIH protected data." If not, leave it unchecked.
Click "Create Workspace"

Once a workspace is created, you can share with other users. Under "Summary" tab, under "Workspace Owner", you should see your gmail account listed. To share this workspace

Click "(Sharing...)" next to the gmail account
A pop-up window will appear
Click "Add new +"
Enter user gmail address
Set access level for the user (OWNER, WRITER, READER, NO ACCESS)
Click "Save"

Upload my data to the FireCloud

To upload file,

Click on the line under "Google Bucket" which opens the Google Cloud Storage browser for this workspace
You will be taken to the Google Cloud Platform
Click "Upload Files"
Select all the .bam and .bai files
Click "Open"
(Multiple file tends to only upload 3 files simoutaneously, repeat this process as necessary until all bam/bai files are uploaded and can be seen in the Google Console)

Once the files are uploaded to the Google console, you need to add them to the workspace.
To do this, you need two TSV files that describes the bam/bai files.
The first one, called "participants.tsv" has information about the participants and disease type.

participant_id/sample_id fields are set arbitrarily and should be unique.

participant.tsv

entity:participant_id   disease
G28029                  BRCA
G41676                  BRCA
G41659                  LCLL
G41707                  LCLL

The second one, called "sample.tsv" has information about the bam/bai files. sample.tsv has four columns, "sample_id", "BAM_index", "participant_id", and "BAM."
"sample_id" is arbitrarily set and should be unique.
"pariticpant_id" should match the participant_id in "participant.tsv" "BAM_index"/"BAM" are addresses where the bam/bai files are in the Google Cloud Storage.
The address should have the format "gs://[Google_Bucket]/[file_name]"
If your Google Bucket location is "fc-974b342c-b6f2-45e7-b901-xxxxxxxxxxxx" and file name is "NGS.bam", your address in that column should be "gs://c-974b342c-b6f2-45e7-b901-xxxxxxxxxxxx/NGS.bam" For clarity, full file address is not listed below.

sample.tsv

entity:sample_id  BAM_index     participant_id  BAM
G28029_sample     bai_address1  G28029          bam_address1  
G41676_sample     bai_address2  G41676          bam_address2  
G41659_sample     bai_address3  G41659          bam_address3  
G41707_sample     bai_address4  G41707          bam_address4

Once your have the two TSV files created, go back to the workspace.

Click on "Data" tab
Click "Import Data..."
Click "Importa from file"
Click "Choose file..."
Select "participant.tsv" that you created above
Click "Open"
Click "Upload" and you should see a "Upload Successful" message

Click "Import Data..."
Click "Importa from file"
Click "Choose file..."
Select "sample.tsv" that you created above
Click "Open"
Click "Upload" and you should see a "Upload Successful" message

That's it. You have uploaded the files to the workspace.

When you click on the address for the bam files, you will see

Object: G28029_pe.Aligned.sortedByCoord.out.sorted.bam
File size: 974.40 MBOpen (right-click to download)Warning: Downloading this file may incur a large data egress charge