BiocStyle::markdown(css.files = "custom.css")
code <- function(...) {
    cat(paste(..., sep = "\n"))
}
code2 <- function(...) {
    cat(paste("```markdown", ..., "\n", "```", sep = "\n"))
}
How hard is it to create a data analysis workflow, deploy it, and distribute it? Sometimes it is quite hard. For example, here is Stanford's HugeSeq pipeline. Each node represents a tool with its own dependencies, written in different languages, and a working analysis flow can be as complicated as the one shown.
We are facing problems
Rabix is an open-source toolkit for developing and running portable workflows based on the Common Workflow Language specification and Docker. Rabix is developed and maintained by Seven Bridges Genomics engineers.
The Common Workflow Language (CWL) is a community-wide effort to create specifications that enable reproducible, portable data analysis flows. There is a Bioconductor package called cwl, developed to provide an implementation of the CWL objects. It is based on draft 2 of the specification, and full details are described on the website.
The rabix Bioconductor package is built on top of the cwl package, with its own adapter for the rabix interface. With the rabix package, you can
In this tutorial, we will learn about command line tools, how to write an R command line tool with docopt, and how to create your own command line tool and use it on rabix or the SBG platform.
Docker uses lightweight containers to build, ship, and run applications. It packages an application with all of its dependencies into a standardized unit for software development.
To save a thousand words here, this tutorial does not explain how to create a Docker image or container, or how to put your tools into a Docker container.
This tutorial assumes that you are familiar with Docker, or at least know which Docker container has the tools you want to wrap.
Materials you should read about
Rule of thumb: search for existing tools before you create your own, and use official images (Rocker or Bioconductor) as much as possible. For example, if you search for "samtools" on Docker Hub, you will get a list of containers.
In this chapter we will work through some real examples: writing R command line tools, learning the docopt standard, describing your tools with rabix, and creating tools with the R interface.
Most of us are already familiar with command line tools. Many Unix tools and bioinformatics tools have a command line interface: we pass parameters to the command line tool to launch the application and have it do some work for us.
Your tools may also have multiple sub-commands, for example samtools:
$ samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 1.2 (using htslib 1.2.1)

Usage:   samtools <command> [options]

Commands:
  -- indexing
         faidx       index/extract FASTA
         index       index alignment
  -- editing
         calmd       recalculate MD/NM tags and '=' bases
         fixmate     fix mate information
         reheader    replace BAM header
         rmdup       remove PCR duplicates
         targetcut   cut fosmid regions (for fosmid pool only)
  -- file operations
         bamshuf     shuffle and group alignments by name
         cat         concatenate BAMs
         merge       merge sorted alignments
         mpileup     multi-way pileup
         sort        sort alignment file
         split       splits a file by read group
         bam2fq      converts a BAM to a FASTQ
  -- stats
         bedcov      read depth per BED region
         depth       compute the depth
         flagstat    simple stats
         idxstats    BAM index stats
         phase       phase heterozygotes
         stats       generate stats (former bamcheck)
  -- viewing
         flags       explain BAM flags
         tview       text alignment viewer
         view        SAM<->BAM<->CRAM conversion
So sort and view are sub-commands. To conform to the rabix standard, let me quote some key points here.
Each view or sort tool should reference the same samtools Docker image, and samtools sort will be your base command. Each command comes with its own parameters; for example, let's take a look at samtools sort:
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -o FILE    Write final output to FILE rather than standard output
  -O FORMAT  Write output as FORMAT ('sam'/'bam'/'cram')   (either -O or
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam       -T is required)
  -@ INT     Set number of sorting and compression threads [1]

Legacy usage: samtools sort [options...] <in.bam> <out.prefix>
Options:
  -f         Use <out.prefix> as full final filename rather than prefix
  -o         Write final output to stdout rather than <out.prefix>.bam
  -l,m,n,@   Similar to corresponding options above
Here we see the options. These are the inputs (some or all of them) that we want to describe for the command line tool and expose to users.
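To make the idea concrete, here is a minimal base R sketch of how described inputs map onto a command line: a base command, a set of prefixed options, and positional arguments. The helper build_command is hypothetical and written only for this illustration; it is not part of rabix or cwl, and the file names are made up.

```r
# Hypothetical helper (not part of rabix or cwl): assemble a command line
# from a base command, a named list of option values keyed by prefix,
# and positional arguments.
build_command <- function(base, options = list(), positional = character()) {
  parts <- base
  for (prefix in names(options)) {
    value <- options[[prefix]]
    if (isTRUE(value)) {
      # Boolean flag such as -n: emit the prefix only
      parts <- c(parts, prefix)
    } else if (!is.null(value) && !identical(value, FALSE)) {
      # Option that takes a value, such as -l 5
      parts <- c(parts, prefix, as.character(value))
    }
  }
  paste(c(parts, positional), collapse = " ")
}

# Illustrative values only; in.bam and sorted.bam are made-up file names.
cmd <- build_command(
  base = c("samtools", "sort"),
  options = list("-l" = 5, "-n" = TRUE, "-o" = "sorted.bam", "-T" = "tmp"),
  positional = "in.bam"
)
print(cmd)
```

With these values the assembled string is "samtools sort -l 5 -n -o sorted.bam -T tmp in.bam". A tool description carries the same pieces (base command, prefix, type, default) in a structured form so that an executor can do this assembly for you.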
Please visit the website https://www.rabix.org for detailed information. It is a very nice implementation of the Common Workflow Language and so far offers the best user experience for building your tools and flows. The SBG platform also uses the same interface.
Let's quickly quote the features from the website; I think they are summarized very well there.
Please log in with your GitHub account, and you are ready to start describing tools and pipelines. For a more detailed tutorial, a good read is the SBG developer hub. [TODO]
You don't need to know JSON, YAML, or any other language; the easiest way is to simply start describing your tools and creating workflows with the graphical user interface.
Please follow the full SBG tutorials for rabix on the platform; the interface will be the same.
If you create your tool description files somewhere else, for example with our R interface, you can simply import the JSON file.
Paste your tool or workflow description JSON into the input window and click the "import" button. All specified fields will be filled into the interface automatically. Then click save.
Now click the "+Create" button. It will ask you to save the tool to a specific repo, and you can create a new repo as well.
Then you will be able to find your tools in your repos. Click the "+New" button at the top right, then click "+New workflow". This leads you to the workflow editor, which uses SBG's own drag-and-drop interface.
Let's just specify the r-base image
With docopt, you define the interface for your command line application, and the different implementations provide a parser for it. Please visit the docopt website and read through the requirements before you start writing your own docopt description.
The style will look very familiar to anyone who has worked a lot with command line tools on Linux. Here is an example from the docopt website:
Usage:
  naval_fate ship new <name>...
  naval_fate ship <name> move <x> <y> [--speed=<kn>]
  naval_fate ship shoot <x> <y>
  naval_fate mine (set|remove) <x> <y> [--moored|--drifting]
  naval_fate -h | --help
  naval_fate --version

Options:
  -h --help     Show this screen.
  --version     Show version.
  --speed=<kn>  Speed in knots [default: 10].
  --moored      Moored (anchored) mine.
  --drifting    Drifting mine.
Please do read the website for details. In short, to summarize, you have
In R, we also have a nice implementation in a package called docopt, developed by Edwin de Jonge. Check out its tutorial on GitHub.
So let's quickly create a command line interface for our R scripts with a dummy example. Let's turn the uniform distribution function runif into a command line tool.
When you check out the help page for runif, here is the key information you want to note down.
Usage

runif(n, min = 0, max = 1)

Arguments

n          number of observations. If length(n) > 1, the length is taken to be the number required.
min, max   lower and upper limits of the distribution. Must be finite.
I will add one more parameter to set the seed. Here is the R script file, called runif.R.
At the beginning, I use the docopt standard to write the help text for my tool.
'usage: runif.R [--n=<int> --min=<float> --max=<float> --seed=<float>]

options:
 --n=<int>        number of observations. If length(n) > 1, the length is taken to be the number required [default: 1].
 --min=<float>   lower limits of the distribution. Must be finite [default: 0].
 --max=<float>   upper limits of the distribution. Must be finite [default: 1].
 --seed=<float>  seed for set.seed() function [default: 1]' -> doc

library(docopt)
Let's first do some testing in an R session before making it a fully functional command line tool.
docopt(doc) # with no arguments provided
docopt(doc, "--n 10 --min=3 --max=5")
Add my command line function
opts <- docopt(doc)
set.seed(opts$seed)
runif(n = as.integer(opts$n),
      min = as.numeric(opts$min),
      max = as.numeric(opts$max))
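Option values parsed from a command line arrive as character strings, which is why the call above converts them before passing them to runif. The sketch below shows one possible way to centralize that conversion and add basic validation. To keep it runnable without the docopt package, a plain list stands in for the parsed options; parse_runif_opts is a hypothetical helper written for this example, not part of docopt or rabix.

```r
# A plain list standing in for the result of docopt(doc); the values are
# strings, as they would be after parsing a command line.
opts <- list(n = "10", min = "3", max = "5", seed = "1")

# Hypothetical helper for this example: convert and validate the options.
parse_runif_opts <- function(opts) {
  n <- as.integer(opts$n)
  lower <- as.numeric(opts$min)
  upper <- as.numeric(opts$max)
  seed <- as.integer(opts$seed)
  if (anyNA(c(n, lower, upper, seed))) {
    stop("all options must be numeric")
  }
  if (lower > upper) {
    stop("--min must not be larger than --max")
  }
  list(n = n, min = lower, max = upper, seed = seed)
}

parsed <- parse_runif_opts(opts)
set.seed(parsed$seed)
x <- runif(n = parsed$n, min = parsed$min, max = parsed$max)
print(x)
```

Failing early with a clear message is usually friendlier to users of a command line tool than letting an NA propagate into runif.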
Add a shebang at the top of the file. A complete example of the runif.R command line tool looks like this:
#!/usr/bin/Rscript
'usage: runif.R [--n=<int> --min=<float> --max=<float> --seed=<float>]

options:
 --n=<int>        number of observations. If length(n) > 1, the length is taken to be the number required [default: 1].
 --min=<float>   lower limits of the distribution. Must be finite [default: 0].
 --max=<float>   upper limits of the distribution. Must be finite [default: 1].
 --seed=<float>  seed for set.seed() function [default: 1]' -> doc

library(docopt)
opts <- docopt(doc)
set.seed(opts$seed)
runif(n = as.integer(opts$n),
      min = as.numeric(opts$min),
      max = as.numeric(opts$max))
This looks good. Now let's test it in a terminal. Don't forget to make the script executable first, for example with chmod 755 runif.R
$ ./runif.R --help
Loading required package: methods
usage: runif.R [--n=<int> --min=<float> --max=<float> --seed=<float>]

options:
 --n=<int>        number of observations. If length(n) > 1, the length is taken to be the number required [default: 1].
 --min=<float>   lower limits of the distribution. Must be finite [default: 0].
 --max=<float>   upper limits of the distribution. Must be finite [default: 1].
 --seed=<float>  seed for set.seed() function [default: 1]
$ ./runif.R
Loading required package: methods
[1] 0.2655087
$ ./runif.R
Loading required package: methods
[1] 0.2655087
$ ./runif.R --seed=123 --n 10 --min=1 --max=100
Loading required package: methods
 [1] 29.470174 79.042208 41.488715 88.418723 94.106261  5.510093 53.282443
 [8] 89.349485 55.592066 46.204859
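The two identical results from running ./runif.R twice follow from the default seed of 1. A short sketch in plain R, assuming R's default random number generator settings, shows the same behaviour inside a session:

```r
# Same seed, same draws: this is why repeated runs print identical numbers.
set.seed(1)
first <- runif(1)
set.seed(1)
second <- runif(1)
print(identical(first, second))

# A different seed and range, matching the last terminal call above.
set.seed(123)
third <- runif(10, min = 1, max = 100)
print(third)
```

Exposing the seed as a parameter is a deliberate choice here: it lets a workflow reproduce a run exactly, or vary the seed to get independent draws.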
The SBG engineering team provides a Python parser for docopt. It can parse a command line tool from its help manual if the manual conforms to the docopt standard. It will become part of the rabix tool, but until then I also include the Python parser in this R package.
Let's use the example above; the script is in the package folder.
runif.file <- system.file("cwl", "runif.R", package = "rabix")
system(runif.file)
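When you want to pass arguments to a script and capture what it prints, base R's system2 is a convenient alternative to system. The sketch below is an illustration only: it writes a tiny stand-in script to a temporary file (not the packaged runif.R) and assumes the Rscript binary sits in the bin directory of the running R installation.

```r
# Write a tiny stand-in script to a temporary file (hypothetical example;
# it only echoes its trailing arguments, one per line).
script <- tempfile(fileext = ".R")
writeLines('cat(commandArgs(trailingOnly = TRUE), sep = "\\n")', script)

# Locate the Rscript binary that belongs to the running R installation.
rscript <- file.path(R.home("bin"), "Rscript")

# Run the script with arguments and capture standard output
# as a character vector, one element per printed line.
out <- system2(rscript,
               args = c(shQuote(script), "--n=10", "--seed=123"),
               stdout = TRUE)
print(out)
```

Capturing the output as a character vector makes it easy to check a command line tool from within R before wrapping it in a tool description.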
[TODO]
You can use the R interface to create objects in R and easily convert them into JSON/YAML for the rabix interface or other implementations. Most importantly, it makes it possible to build other applications around those objects in R.
This object extends the CommandLineTool object with its own adapters, for example with additional information such as owner, contributor, or requirements for CPU and memory. It describes a command line tool with its inputs, outputs, arguments, and other information that an executor needs to understand and execute it. The rabix package also provides validation and a set of short constructor function names for easy construction of the object.
Let's use the same runif example from above to create a command line tool description.
Load the package first
rbx <- RabixTool(id = "runif",
                 label = "Random number generator",
                 description = "Random number generator",
                 dockerPull = "tengfei/runif",
                 cpu = 1, mem = 1024,
                 baseCommand = "runif.R",
                 inputs = list(
                     input(id = "number",
                           description = "number of observations",
                           type = "integer",
                           label = "number",
                           prefix = "--n",
                           default = 1,
                           required = TRUE),
                     input(id = "min",
                           description = "lower limits of the distribution",
                           type = "float",
                           label = "min",
                           prefix = "--min",
                           default = 0),
                     input(id = "max",
                           description = "upper limits of the distribution",
                           type = "float",
                           label = "max",
                           prefix = "--max",
                           default = 1),
                     input(id = "seed",
                           description = "seed with set.seed",
                           type = "float",
                           label = "seed",
                           prefix = "--seed",
                           default = 1)),
                 outputs = list(
                     output(id = "random_file",
                            type = "file",
                            label = "output",
                            description = "random number file",
                            glob = "*.txt")))
rbx$toJSON()
Or print it nicely to check it
rbx$toJSON(pretty = TRUE)
To write it to a file
rbx$toJSON("~/temp.json")
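Note that the tool description above declares an output matched by the glob "*.txt", while runif.R as listed earlier prints its numbers to standard output. One possible way to reconcile the two, assuming the script is free to choose its own output file, is to write the numbers to a text file. The sketch below uses base R only; the file name random.txt and the temporary directory are assumptions made for this illustration.

```r
# Work in a temporary directory so the example leaves no files behind
# in the current working directory.
out_dir <- tempfile("runif_out")
dir.create(out_dir)

set.seed(1)
x <- runif(n = 5, min = 0, max = 1)

# Write one number per line; "random.txt" is an illustrative name.
out_file <- file.path(out_dir, "random.txt")
writeLines(format(x, digits = 7), out_file)

# The same pattern used in the tool description picks the file up.
matched <- Sys.glob(file.path(out_dir, "*.txt"))
print(basename(matched))
```

Inside a real tool the file would be written to the working directory of the job, so that the executor can collect it with the declared glob.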
To the awesome Seven Bridges Genomics team, for building and supporting the open-source community.
sessionInfo()