title: "SB-CGC-Overview-RNASeq-WF-WS2.rmd" author: "Durga Addepalli" date: "May 02, 2016" output: html_document

INTRODUCTION

This guide will get you started to use Seven Bridges API with the R client package sbgr, and guide you through the steps needed to run the "RNA-seq Alignment - STAR" and RNA-Seq alignment pipeline on the Seven Bridges Genomics platform.

The following primary steps will be covered: - Getting access to SB CGC - Setting up and sharing projects - File management - running the tools and workflows - downloading the results.

download the [R markdown file] and load it to your Rstudio or your favorite R IDE.

download.file("https://github.com/teamcgc/teamcgc.github.io/....SBCGC-R-CreateWorkflow.Rmd", destfile = "~/bioc-workflow.Rmd")

Installing the Package

Download and install the sevenbridges package from Bioconductor

source("http://bioconductor.org/biocLite.R")
useDevel(devel = TRUE)
biocLite("sevenbridges")

Alternatively, If you cannot install it because you are not running latest R, you can install the package from github:

## Install from github for development version  
# install.packages("devtools") if devtools was not installed
source("http://bioconductor.org/biocLite.R")
# install.packages("devtools") if devtools was not installed
library(devtools)
install_github("sbg/sevenbridges-r", build_vignettes=TRUE, 
  repos=BiocInstaller::biocinstallRepos(),
  dependencies=TRUE)

you can always browse the vignette

browseVignettes(package = 'sevenbridges')

Register on NCI Cancer Genomics Cloud

You can find login/registration on NCI Cancer Genomics Cloud homepage, follow the signup tutorial if you have an ERA Commons or NIH account.

Get your authentication Token

After you login, you can get your authentication under your account setting and 'developer' tab tutorial

Create a project under your account via API R client

Now lets use the 'sevenbridges' packages you just installed to create a project.

create an Auth object in order to create a project

Almost everything starts from this object.

On the SB platform/CGC GUI, Auth is your account, and it contains - projects, - billing groups, - users / collaborations - project contains tasks, apps, files etc.

To create an Auth object, run the command below and replace "fake_token" with your own token.

library(sevenbridges)
a <- Auth(token = "fake_token", url = "https://cgc-api.sbgenomics.com/v2/")

To create a project, you need to know your billing group id, cost related to this project will be charged from this billing group. - FOR THIS workshop – free credits

(b <- a$billing())
a$billing(id = b$id, breakdown = TRUE)

Infomation about a user

This call returns information about your account.

a$user()

Now let's create a new project called "IRP-SBCGC-WS2", save it to a 'p' object for convenient usage for any call related to this project.

(p <- a$project_new("IRP-SBCGC-WS2", b$id, 
                    description = "This project is for IRP-CGC R-Workshop@CBIIT"))

Now check it on CGC, you will see a fresh new project is created.

List all Projects

This call returns a list of all projects you are a member of. Each project’s project_id and URL on the CGC will be returned.

a$project()

To list the projects owned by and accessible to a particular user

a$project(owner = "Durga")

To get details about project(s), use detail = TRUE

a$project(detail = TRUE)
a$project(owner ="sdavis2", detail = TRUE)

Search by Partial Match Project Name

SB CGC also supports partial name match in this interface. The first argument for the call is “name”, users can provide part of the name.

a$project("data")

Delete a Project

Only single project could be deleted by call $delete(), so please pay attention to the returned object from a$project(), sometimes if you are using partial matching by name, it will return a list.

To delete it, just call, but I will suggest you keep it for following tutorial : )

a$project("IRP-SBCGC-WS2")$delete()

## will delete all projects matching the name
delete(a$project("IRP-SBCGC-WS2_donnot_delete_me"))

List all Files

For each file, the call returns: Its ID, Its filename. The project is specified as a query parameter in the call.

p$file()

To list all files belonging to a project (irp-cgc-with-r):

n <- a$project(id = "Durga/irp-cgc-with-r")
n$file()

n$file("bam", project = n, detail = TRUE)

Copy file/s

Let’s try to copy a file from DemoData

You must provide id file id, or list/vector of files ids. project parameter: project id.

To copy a single file

pid <- a$project("IRP-SBCGC-WS2")$id
fid <- "56f4592ae4b070c30853bbcb"
a$project("DemoData")$file(id = fid)$copyTo(pid)

To Copy multiple files

fid <- "5788524ce4b035347bd5c54c"
fid2 <- "57884b7fe4b01d432ccb8d97"
fid3 <- "57885ecce4b035347bd5c553"
a$copyFile(c(fid, fid2, fid3), project = pid)
p$file()

Delete file(s)

## Delete file
a$project("IRP-SBCGC-WS2")$file()[[1]]$delete()
## confirm the deletion
a$project("IRP-SBCGC-WS2")$file()

## Search for specific file by pattern match and delete, this can delete multiple files too.
a$project("IRP-SBCGC-WS2")$file("bam")$delete()

## You can also delete a group of files or FilesList object, be careful with this function!
a$project("IRP-SBCGC-WS2")$file("bam")

## delete all of them
delete(a$project("IRP-SBCGC-WS2")$file("fastq"))
a$project("IRP-SBCGC-WS2")$file("fastq")

Download Files: To get the download information, basically a url, please use

a$project("IRP-SBCGC-WS2")$file(id = "56f4592ae4b070c30853bbcb")$download_url()

## To download directly from R, use download call directly from single File object.
fid <- a$project("IRP-SBCGC-WS2")$file(id ="56f4592ae4b070c30853bbcb")$id
a$project("IRP-SBCGC-WS2")$file(id = fid)$download("~/Downloads/")

##To download all files from a project.
a$project("IRP-SBCGC-WS2")$download("~/Downloads")

## Use download function for FilesList object to save your time
fls <- a$project("IRP-SBCGC-WS2")$file()
download(fls, "~/Downloads/")

Upload Files

Seven Bridges platforms provide couple different ways for data import • command line uploader • graphic UI uploader • from ftp, http etc from interface directly • api uploader that you can directly call with sevenbridges package API client uploader is working like this, simply call project$upload function to upload a file a file list or a folder recursively…

fl <- system.file("extdata", "sample1.fastq", package = "sevenbridges")

## by default load .meta for the file
p$upload(fl, overwrite = TRUE)
## pass metadata
p$upload(fl, overwrite = TRUE, metadata = list(library_id = "testid2", platform = "Illumina x11"))
## rename
p$upload(fl, overwrite = TRUE, name = "sample_new_name.fastq", 
         metadata = list(library_id = "new_id"))

Lets Search and Copy some Apps into our Project IRP-SBCGC-WS2

You can list all Apps available to you.

a$app()
## or show details
a$app(detail = TRUE)

You can search by name, partial name or ID

## pattern match
a$app(name = "get_http")

## unique id
a$app(id = "tsang/temp/txfr")

## get a specific revision from an app
a$app(id = aid, revision = 0)

To list all apps belonging to one project use project argument

a$project("irp-cgc-with-r")$app()

## or alternatviely
a$app(project = pid)

To run the example RNA-Seq alignment STAR workflow, we need two apps in our project: get-http_file RNA-Seq Alignment STAR

Copy these Apps to your Project

This call copies the specified app to the specified project. The app should be from a project that you can access; this could be an app that has been uploaded to the CGC by a project member, or a publicly available app that has been copied to the project.

Need two arguments project: id character name: optional, to re-name your app

Copy app "get-http_file"

app.gets3fl <- a$copyApp(id =  "Durga/irp-cgc-with-r/Get-http-file-WS2", project = pid, name = "get-http-f-WS2")
apid <- app.gets3fl$id
## check it is copied
a$app(project = pid)

Search and copy the PUBLIC workflow RNA-Seq Alignment STAR

a$app("STAR", visibility = "public", complete = TRUE)

app.rnawf <- a$copyApp(id =  "djordje_klisic/public-apps-by-seven-bridges/rna-seq-alignment-star/3", project = pid, name = "STAR-WF-WS2")

rwid <- app.rnawf$id
## check it is copied
a$app(project = pid)

Upload files from Amazon bucket to our project

Here we demonstrate how to upload files directly from a S3 bucket to your CGC project. We will be using the Get_Http_file app to get files. ......

To create a draft task, you need to call the task_add function from Project object. And you need to pass following arguments

p$task_add(name = "getHttpF-WS2", 
           description = "Using get_http_file app to get files from S3 bucket", 
           app=apid,
           inputs = list(url = "http://teamcgc.nci.nih.gov.s3-website-us-east-1.amazonaws.com/20160325SBGworkshop/FastqFiles/G28029_pe_1.fastq", ofname = "G28029_pe_1.fastq"))

## confirm
p$task(status = "draft")

Run the task

This call runs (executes) the specified task. Only tasks whose status is “DRAFT” may be run.

tsk <- p$task(status = "draft")
tsk$run()
## run update without information just return latest information
tsk$update()

Run the RNA-Seq Alignment STAR Workflow

Now we will run the RNA-seq Alignment STAR workflow (on which we worked in the last workshop) taking the BAM file uploaded from the previous task, as an input.

WorkFlow Description:

  1. What tools are inbuilt in this workflow?

    • Picard Sam to FastQ (Tool)
    • SBG FASTQ Quality Detector: FASTQ Quality Scale Detector detects which quality encoding scheme was used in your reads and automatically enters the proper value in the "Quality Scale" metadata field.
    • STAR Genome Generate: STAR Genome Generate is a tool that generates genome index files. One set of files should be generated per each genome/annotation combination. Once produced, these files could be used as long as genome/annotation combination stays the same. Also, STAR Genome Generate which produced these files and STAR aligner using them must be the same toolkit version.
    • Picard SortSam: Picard SortSam sorts the input SAM or BAM. Input and output formats are determined by the file extension.
    • RNA-seq Alignment - STAR (Workflow)
  2. What Input, Output and Parameter you want to run the workflow

  3. Input

    • GTF file
    • BAM File
    • Genome Fasta Files

    • Output

    • Aligned reads (BAM)
    • Unmapped reads (fastQ)
    • Splice Junctions
    • Reads per Gene (TAB)
    • Chimeric Junctions (Junction)
    • Chimeric Alignments (SAM)
    • Aligned Sorted by Coord (BAI)
    • Log Files
    • Unpaired (fastQ)
    • First Strand (fastQ)
    • Second Strand (fastQ)

Running the RNA-Seq Alignment STAR which takes FASTQ file/s as input

fastq1 <- p$file(name = "G28029_pe_1.fastq")
fastq2 <- p$file(name = "G28029_pe_2.fastq")

## Set metadata / Pairend info

fastq1$setMeta(Paired_end = "1")
fastq2$setMeta(Paired_end = "2")


(fastq.in <- p$file(name = "G28029"))
(gtf.in <- p$file(".gtf"))
(gtf.in1 = new('FilesList',listData=list(gtf.in)))
(fasta.in <- p$file("HG19_Broad_variant.fasta"))


tsk = p$task_add(name = "RNA-WF_WS2", 
           description = "Testing the RNA-seq STAR workflow", 
           app = rwid,
           inputs =  list(fastq = fastq.in, 
                          genomeFastaFiles = fasta.in,
                          sjdbGTFfile = list(gtf.in1)))




## confirm
p$task(status = "draft")

## Run the Task
tsk1 <- p$task(status = "draft")
tsk1$run()
tsk1$update()
tsk1$monitor()

Running the RNA-Seq Alignment STAR which takes a BAM file as input

## Copy the app
app.rnawfbam <- a$copyApp(id = "Durga/irp-cgc-with-r/rna-seq-alignment-star-baseline-20160217", project = pid, name = "star-wfl-inbam")
rwbid <- app.rnawfbam$id
## check if it is copied
a$app(project = pid)

## Run the app
(bam.in <- p$file("Galaxy14-data13downsampled.bam"))
(gtf.in <- p$file(".gtf"))
(fasta.in <- p$file("HG19_Broad_variant.fasta"))

tsk3 = p$task_add(name = "rwfl-inbamtest2-R", 
           description = "Testing the RNA-seq STAR workflow with input BAM file", 
           app = rwbid,
           inputs =  list(input_file = bam.in, 
                          genomeFastaFiles = fasta.in,
                          sjdbGTFfile = sevenbridges:::FilesList(gtf.in)))

## confirm
p$task(status = "draft")

## Run the Task
tsk3 <- p$task(status = "draft")
tsk3$run()
tsk3$update()
tsk3$monitor()

BATCH - Running the RNA-Seq Alignment STAR which takes multiple BAM file as input

## Copy the app
## another app for batch
app.rnawfbamb <- a$copyApp(id = "Durga/irp-cgc-with-r/rna-seq-alignment-star-baseline-20160217", project = pid, name = "star-wf-inbam-batch-UIcopy")

rwbbid <- app.rnawfbamb$id
## check if it is copied
a$app(project = pid)

## Run the app

(bam.in <- p$file("downsampled.bam"))
(gtf.in <- p$file(".gtf"))
(fasta.in <- p$file("HG19_Broad_variant.fasta"))

tsk3 = p$task_add(name = "rwf-inbam_batch-R", 
           description = "Testing the RNA-seq STAR workflow with multiple input BAM files", 
           app = rwbbid,
           batch = batch(input = "input_file"),
           inputs =  list(input_file = bam.in, 
                          genomeFastaFiles = fasta.in,
                          sjdbGTFfile = sevenbridges:::FilesList(gtf.in)))

## confirm
p$task(status = "draft")

## Run the Task
tsk3 <- p$task(status = "draft")
tsk3$run()
tsk3$update()
tsk3$monitor()

To download all files from a project.

a$project("IRP-SBCGC-WS2")$download("~/Downloads")

## To Download specific files use their Ids
fid <- a$project("IRP-SBCGC-WS2")$file(id ="56f4592ae4b070c30853bbcb")$id
a$project("IRP-SBCGC-WS2")$file(id = fid)$download("~/Downloads/")

More tutorials

After you install the package

browseVignettes("sevenbridges")

or on Bioconductor devel branch sevenbridges page



teamcgc/cgcR documentation built on May 31, 2019, 7:56 a.m.