In teamcgc/cgcR: NCI Cancer Genomics Cloud examples in R

Introduction

This tutorial describes steps involved in running STAR RNA-seq alignment on the FireCloud, include the following sections -

Creating a docker image of the tool (STAR aligner - https://github.com/alexdobin/STAR)
Creating a WiDdLe file to describe the workflow
Set up the alignment run on the FireCloud
Visualize the alignment using IGV (https://www.broadinstitute.org/igv/)

Please make sure you complete the following first.

Install the following:

Docker
Integrated Genome Viewer
samtools
Cromwell- Broad's workflow execution engine
Tutorial Files - zipped content below
chr21.fa - reference FASTA - containing only human hg 19 chr21 sequence
chr21.hg19.gtf - gene annotation file for chr21
sm2G28029pe1.fq - BRAC cancer cell line from CCLE - RNAseq pair-end #1 (random downsampled containg ~610K reads)
sm2G28029pe2.fq - BRAC cancer cell line from CCLE - RNAseq pair-end #2 (random downsampled containg ~610K reads)
tsvfiles.xlsx - template for creating tsv files for the FireCloud

Tutorial

Creating a Docker image of STAR (Optional)

There are several different ways of creating a docker image.
I used a Dockerfile, which is essentially a list of instructions to tell Docker how to build the image.

The STAR image is built and stored in the dockerhub - https://hub.docker.com/r/stevetsa/staralign/
Skip to the next section if you are just using the image, not creating a new one.

# This is the Dockerfile

FROM ubuntu:latest
RUN rm /bin/sh && ln -s /bin/bash /bin/sh
MAINTAINER Steve Tsang <mylagimail2004@yahoo.com>
RUN apt-get update

RUN apt-get install --yes \
 build-essential \
 gcc-multilib \
 apt-utils \
 zlib1g-dev

RUN apt-get install -y git
# Get latest STAR source from offical website https://github.com/alexdobin/STAR
RUN git clone https://github.com/alexdobin/STAR.git
WORKDIR /STAR

# Build STAR
#RUN pwd
RUN make STAR

# To include STAR-Fusion
RUN git submodule update --init --recursive

# If you have a TeX environment, you may like to build the documentation
# make manual
ENV PATH /STAR/source:$PATH
````
Build the image using the Dockerfile.

```{sh, eval=FALSE}
#building Docker image from Dockerfile
docker build -t stevetsa/staralign:v2.52a .

#push image to the dockerhub -a is author's name; -m is a commit message; and "4629cdb394d3" is the container ID
docker commit -m "first commit" -a "Steve Tsa" 4629cdb394d3 stevetsa/staralign:v2.52a
docker push stevetsa/staralign

Creating a WiDdLe file to describe the workflow

STAR is a fast aligner but uses a lot of memory. This tutorial is designed to run downsampled RNAseq data aligning only to chromosome 21. If you plan to run full RNA-seq data against full human genome, resource allocation in the runtime block needs to be changed accordingly.

task staralign {

  #List of Input Files
  File RefHg19
  File hg19GTF
  File read1
  File read2

  command <<<

    mkdir ./refgenome

    ####  Generate reference genome stored in the refgenome directory
    STAR --runThreadN 8 --runMode genomeGenerate --genomeDir ./refgenome --genomeFastaFiles ${RefHg19} --sjdbGTFfile ${hg19GTF}
    ####  Alignment and output BAM file with smG28029wdl2 prefix 
    STAR --runThreadN 8 --genomeDir ./refgenome --readFilesIn ${read1} ${read2} --outFileNamePrefix smG28029wdl2 --outSAMtype BAM SortedByCoordinate

  >>>

  runtime {
    ####  Docker image and resource allocation
    docker : "stevetsa/staralign:v2.52a"
    memory: "4G"
    cpu: "8"
  }

  output {
    File response_star = stdout()
    File outbam = "smG28029wdl2Aligned.sortedByCoord.out.bam"
  }
}


workflow star {
  File RefHg19
  File hg19GTF
  File read1
  File read2

  call staralign {
    input:
      RefHg19 = RefHg19,
      hg19GTF = hg19GTF,
      read1 = read1,
      read2 = read2
  }
} 

````

### WDL creation and testing

Create the JSON file.
```{sh, eval=FALSE}
java -jar /Users/<path file>/wdltool-0.5.jar inputs StarAlign_cfg2.wdl > StarAlign_cfg2.json

Modify the json file to include the location/names of the input files.

{
  "star.RefHg19": "chr21.fa",
  "star.hg19GTF": "chr21.hg19.gtf",
  "star.read1": "sm2G28029pe1.fq",
  "star.read2": "sm2G28029pe2.fq"
}

Testing the WDL file locally ```{sh, eval=FALSE} java -jar /Users//cromwell-0.20.jar run StarAlign_cfg2.wdl StarAlign_cfg2.json

If the run is successfully completed locally, you are ready to run this WDL file on the FireCloud.

Use the following command to push the WDL file to the FireCloud method repo.  
More info [here](https://github.com/broadinstitute/firecloud-cli).

```{sh, eval=FALSE}
docker run --rm -it -v "$HOME"/.config:/.config broadinstitute/firecloud-cli gcloud auth login

docker run --rm -it -v "$HOME"/.config:/.config -v "$PWD":/working broadinstitute/firecloud-cli firecloud -m push StarAlign_cfg2.wdl -t Workflow -s <your_FireCloud_email_account> -y "STARv2.52a RNAseq Alignment"

Set up STAR alignment run on the FireCloud

For detailed description of each step, please refer to FireCloud intro

Log in to FireCloud in a private browser such as Chrom Incognito Window <https://portal.firecloud.org>  
Create a new workspace called "StarAlignTest"  
Under the Summary tab, click on the Google Bucket  
Download and unzip [Tutorial Files](https://s3.amazonaws.com/teamcgc.nci.nih.gov/TutorialFiles/FireCloudRNAseq.zip)  
Upload the content of the folder, except tsvfiles.xlsx, to the Google bucket  

In FireCloud, Click on "Summary" tab
Add new "Workspace Attributes"
   RefHg19 -> gs://XXXXXXXXXXXXXXXXXXXXXXXX (copy-and-paste url from Google bucket)
   hg19GTF -> gs://XXXXXXXXXXXXXXXXXXXXXXXX (copy-and-paste url from Google bucket)

Open tsvfiles.xlsx in any spreadsheet application, save the participant.tsv sheet as a tab-delimited text file called "participant.tsv"  
In the sample.tsv sheet, add the urls of the fastq files in the fastq1/fastq2 columns and save as a tab-delimited text file called "sample.tsv"  

In FireCloud, click on "Data" tab and click "Import Data"  
Import participant.tsv and sample.tsv into workspace  

Click on "Method Configurations" tab  
Click "Import Configuration" and click on "StarAlign_cfg2 Snapshot ID: 1"
Change "Root Entity Type" to "Sample", click "Import" and a new page will appear

Click "Edit this page"
Make sure "Root Entity Type" is "Sample"
Change "Inputs"      
   star.hg19GTF -> workspace.chr21.hg19.gtf
   star.read1 -> this.fastq1         
   star.read2 -> this.fastq2
   star.RefHg19 -> workspace.chr21.fa
Change "Outputs""   
   star.staralign.outbam -> this.bam
   star.staralign.response_star -> this.response_star

Click "Launch Analysis"

````

You will be able to monitor the progree under the "Monitor" tab.  

After the run is successfully completed, you will be able to see two additional columns in "Data" -> "Sample."  
The "bam" column contains the BAM file and the "response_star"" column contains STDOUT output.

## Visualize the alignment using IGV (optional)

To visualize alignment, you need samtools (http://samtools.sourceforge.net/) and IGV  (https://www.broadinstitute.org/igv/).

Download the bam file to a local machine

```{sh, eval=FALSE}
samtools index smG28029wdl2Aligned.sortedByCoord.out.bam

A smG28029wdl2Aligned.sortedByCoord.out.bam.bai file will be created and you should be able to load the alignment in IGV.
Remember, only chr21 will have aligned reads.

teamcgc/cgcR documentation built on May 31, 2019, 7:56 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com