BiocStyle::markdown(css.files = "custom.css")
code <- function(...) {
    cat(paste(..., sep = "\n"))
}

code2 <- function(...) {
    cat(paste("```markdown", ..., "\n", "```", sep = "\n"))
}

Introduction

How hard it is to create a data analysis workflow, deploy it and distribute it? sometimes it's quite hard. For example, here is Stanford huge seq pipeline. Each node represents a tool with it's own dependencies, written in different languages. And a working analysis flow could as complicated as shown.

huge-seq

We are facing problems

rabix is an open-source toolkit for developing and running portable workflows based on the Common Workflow Language specification and Docker, rabix is developed and maintained by Seven Bridges Genomics engineers.

Common workflow languange is a community wise effort to create specifications that enable reproducible, portable data analysis flow. There is an Bioconductor package called cwl developed to provide implementation of objects and it is based on draf2, full details are described in the website.

rabix Bioconductor package is built based on package cwl, with its own adapter used for rabix interface. With rabix package, you can

In this tutorial, we will learn about command line tools, how to write R command line tool via docopt, creating your own command line tool and use it on rabix or SBG platform.

Understand Docker

Docker use lightweight containers to build, ship and run application,

Package an application with all of its dependencies into a standardized unit for software development.

To save thousands word here about how to create docker image/container how to pull your tools in the docker container

This tutorial assumes that you are familiar with the docker or at least knows which docker container has the tools you want to wrap.

Materials you should read about

Rule of Thumb: search existing tools before you create your own, use official images(Rocker or Bioc) as much as possible. For example, you can search "samtools" on dockerhub, there will be a list of containers.

Understand Command Line Tools

In this chapter we will work through some real examples to write R command line tools, learn to use docopt standard, describe your tools with rabix and creating tools with R interface etc.

What is Command Line Tool

Most of us are already familiar with command line tools, tons of unix tools and bioinformatics tools have command line interface, we give the command line tool parameters to launch the application and do some work for us.

You tools may also have multiple sub-commands, for examples, samtools

$ samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 1.2 (using htslib 1.2.1)

Usage:   samtools <command> [options]

Commands:
  -- indexing
         faidx       index/extract FASTA
         index       index alignment
  -- editing
         calmd       recalculate MD/NM tags and '=' bases
         fixmate     fix mate information
         reheader    replace BAM header
         rmdup       remove PCR duplicates
         targetcut   cut fosmid regions (for fosmid pool only)
  -- file operations
         bamshuf     shuffle and group alignments by name
         cat         concatenate BAMs
         merge       merge sorted alignments
         mpileup     multi-way pileup
         sort        sort alignment file
         split       splits a file by read group
         bam2fq      converts a BAM to a FASTQ
  -- stats
         bedcov      read depth per BED region
         depth       compute the depth
         flagstat    simple stats
         idxstats    BAM index stats
         phase       phase heterozygotes
         stats       generate stats (former bamcheck)
  -- viewing
         flags       explain BAM flags
         tview       text alignment viewer
         view        SAM<->BAM<->CRAM conversion

so sort or view are sub commands, to conform to rabix standard, let me quote some key here

Each command comes with its own parameters, for example, let's take a look for samtools sort

Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -o FILE    Write final output to FILE rather than standard output
  -O FORMAT  Write output as FORMAT ('sam'/'bam'/'cram')   (either -O or
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam       -T is required)
  -@ INT     Set number of sorting and compression threads [1]

Legacy usage: samtools sort [options...] <in.bam> <out.prefix>
Options:
  -f         Use <out.prefix> as full final filename rather than prefix
  -o         Write final output to stdout rather than <out.prefix>.bam
  -l,m,n,@   Similar to corresponding options above

We see the options, those are the inputs (some or all of them) we want to describe with the command line, and expose to users.

rabix: Portable Bioinformatics Pipelines

Introduction

rabix_homepage Please visit the website https://www.rabix.org for detailed information. This is a very nice implementation of Command Workflow Language, and has so far the best user experience for building your tools and flows. SBG platform also use the same interface.

Let's quote the features quickly from the website, I think it is summarized very well.

Please login with your github account, then it's ready to start describing tools and pipelines. For more detailed tutorial, a good read will be on SBG developer hub. [TODO]

Interface

You don't need to know about JSON, YAML or any language, the most easy way is that you can simply start describing your tools and create workflows with the graphical user interface. rabix_interface

Please follow the full SBG tutorials for rabix on the platform, the interface will be the same.

If you create your tool description files somewhere else, for example, use our R interface, you can simply import the JSON file rabix_import

And please paste your tool/workflow description JSON file into the input window and click "import" button, then all fields specified will be pasted to the interface automatically, click save

rabix_import_window

Now please click the "+Create" button, it will ask you to save the tools to the specific repos, you can create one as well.

rabix_create

Then you will be able to find your tools in your repos, try click "+New" button on the top right, and click "+New workflow", this will lead you to the workflow editor interface, with SBG's own drag-n-drop interface.

rabix_workflow

Example

Example 1: create some simple R command line

Let's just specify the r-base image

Example 2: create samtools tools

docopt: Command-line Interface Description Language

With docopt, you can define the interface for you command line applications and different implementation will provide parser for it. Please visit docopt website to read through the requirements before you started writing your own docopt

Actually the style will look very familiar to those who has been working with command line tools a lot in Linux. Here is an example from its website

Usage:
  naval_fate ship new <name>...
  naval_fate ship <name> move <x> <y> [--speed=<kn>]
  naval_fate ship shoot <x> <y>
  naval_fate mine (set|remove) <x> <y> [--moored|--drifting]
  naval_fate -h | --help
  naval_fate --version

Options:
  -h --help     Show this screen.
  --version     Show version.
  --speed=<kn>  Speed in knots [default: 10].
  --moored      Moored (anchored) mine.
  --drifting    Drifting mine.

Please do read the website for details, but in short, let's summarize here, you have

Writing R command line tools

In R, we also have a nice implementation in a package called docopt, developed by Edwin de Jonge. Check out its tutorial on github.

So let's quickly create a command line interface for our R scripts with a dummy example. Let's turn the uniform distribution function runif into a command line tool.

when you check out the help page for runif, here is the key information you want to mark down.

Usage

runif(n, min = 0, max = 1)

Arguments

n   
number of observations. If length(n) > 1, the length is taken to be the number required.

min, max    
lower and upper limits of the distribution. Must be finite.

I will add one more parameter to set seed, here is the R script file called runif.R.

At the beginning, I use docopt standard to write my tool help.

'usage: runif.R [--n=<int> --min=<float> --max=<float> --seed=<float>]

options:
 --n=<int>        number of observations. If length(n) > 1, the length is taken to be the number required [default: 1].
 --min=<float>   lower limits of the distribution. Must be finite [default: 0].
 --max=<float>   upper limits of the distribution. Must be finite [default: 1].
 --seed=<float>  seed for set.seed() function [default: 1]' -> doc

library(docopt)

Let's first do some testing in R session before you make it a full functional command line tool.

docopt(doc) #with no argumetns provided
docopt(doc, "--n 10 --min=3 --max=5")

Add my command line function

opts <- docopt(doc)
set.seed(opts$seed)
runif(n = as.integer(opts$n), 
      min = as.numeric(opts$min), 
      max = as.numeric(opts$max))

Add Shebang at the top of the file, and a complete example for runif.R command line will be like this

#!/usr/bin/Rscript
'usage: runif.R [--n=<int> --min=<float> --max=<float> --seed=<float>]

options:
 --n=<int>        number of observations. If length(n) > 1, the length is taken to be the number required [default: 1].
 --min=<float>   lower limits of the distribution. Must be finite [default: 0].
 --max=<float>   upper limits of the distribution. Must be finite [default: 1].
 --seed=<float>  seed for set.seed() function [default: 1]' -> doc

library(docopt)
opts <- docopt(doc)
set.seed(opts$seed)
runif(n = as.integer(opts$n), 
      min = as.numeric(opts$min), 
      max = as.numeric(opts$max))

OK seems good, now let's test it in our terminal, don't forget to make it executable by doing something like chmod 755 runif.R

$ ./runif.R --help
Loading required package: methods
usage: runif.R [--n=<int> --min=<float> --max=<float> --seed=<float>]

options:
 --n=<int>        number of observations. If length(n) > 1, the length is taken to be the number required [default: 1].
 --min=<float>   lower limits of the distribution. Must be finite [default: 0].
 --max=<float>   upper limits of the distribution. Must be finite [default: 1].
 --seed=<float>  seed for set.seed() function [default: 1]
$ ./runif.R
Loading required package: methods
[1] 0.2655087
$ ./runif.R
Loading required package: methods
[1] 0.2655087
$ ./runif.R --seed=123 --n 10 --min=1 --max=100
Loading required package: methods
 [1] 29.470174 79.042208 41.488715 88.418723 94.106261  5.510093 53.282443
 [8] 89.349485 55.592066 46.204859

Python Parser for docopt

SBG engineer team provide a python parser for docopt, it can parse a command line tool from its help manual if it conforms to docopt standard. It will become part of the rabix tool, but before that, I also include the python parser in this R package.

Let's use above example, the script is in the package folder.

runif.file <- system.file("cwl", "runif.R", package = "rabix")
system(runif.file)

[TODO]

R Parser for docopt


Describe your runif tools


Practise

Create Tools in R with rabix

You can use R interface to create objects in R and easily convert it into JSON/YAML for rabix interface or other implementation. Most importantly, it's possible for creating other applications around those objects in R.

RabixTool Object

This object extends the CommandLineTool object, with its own adapters, for example, with additional information like owner, contributor or requirements for cpu and memory. This object describes a command line tool with input, output, arguments and other information for an executor to understand and execute. rabix packages also provides validation, a set of short function constructor names for easy construction of the object.

Let's use the same example above for creating a command line tool for samtools sort

Load the package first

rbx <- RabixTool(id = "runif",
                 label = "Random number generator",
                 description = "Random number generator",
                 dockerPull = "tengfei/runif",
                 cpu = 1, mem = 1024,
                 baseCommand = "runif.R",
                 inputs = list(input(id = "number",
                     description = "number of observations",
                     type = "integer",
                     label = "number",
                     prefix = "--n",
                     default = 1,
                     required = TRUE),
                     input(id = "min",
                           description = "lower limits of the distribution",
                           type = "float",
                           label = "min",
                           prefix = "--min",
                           default = 0),
                     input(id = "max",
                           description = "upper limits of the distribution",
                           type = "float",
                           label = "max",
                           prefix = "--max",
                           default = 1),
                     input(id = "seed",
                           description = "seed with set.seed",
                           type = "float",
                           label = "seed",
                           prefix = "--seed",
                           default = 1)),
                 outputs = list(output(id = "random_file",
                     type = "file",
                     label = "output", 
                     description = "random number file",
                     glob = "*.txt")))
rbx$toJSON()

Or print it nicely to check it

rbx$toJSON(pretty = TRUE)

To write

rbx$toJSON("~/temp.json")

Practise

Acknoledgement

To the awesome Seven Bridges Genomics team for building and supporting open-source community.

Session Inforamtion

sessionInfo()


tengfei/rabix.R documentation built on May 31, 2019, 8:34 a.m.