Requirements:
```r
## for the latest released version (from CRAN)
install.packages("flowr", repos = "http://cran.rstudio.com")

## OR the latest development version
devtools::install_github("sahilseth/flowr", ref = "master")
```
After installation, run setup(); this will copy flowr's helper script to ~/bin. Please make sure that this folder is in your $PATH variable.
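Not sure whether ~/bin is already on your PATH? Here is a quick, flowr-independent shell check (the PATH_DEMO value below is a made-up example; replace it with "$PATH" to test your own shell):

```shell
# Check whether $HOME/bin appears in a PATH-style string.
# PATH_DEMO is an illustrative value, not your real PATH.
PATH_DEMO="/usr/local/bin:$HOME/bin:/usr/bin"
case ":$PATH_DEMO:" in
  *":$HOME/bin:"*) echo "~/bin is in PATH" ;;
  *)               echo "~/bin is NOT in PATH; add: export PATH=\$HOME/bin:\$PATH" ;;
esac
```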
```r
library(flowr)
setup()
```
Running flowr from the terminal should now show the following:
```
Usage: flowr function [arguments]

status       Detailed status of a flow(s).
rerun        Rerun a previously failed flow.
kill         Kill the flow, upon providing the working directory.
fetch_pipes  Check which modules and pipelines are available; flowr fetch_pipes

Please use 'flowr -h function' to obtain further information about the usage of a specific function.
```
If you are interested, visit funr's github page for more details.
From this step on, one has the option of typing commands in an R console OR a bash shell (command line). For brevity, we will show examples using the shell.
Test a small pipeline on the cluster
This will run a three step pipeline, testing several different relationships between jobs. Initially, we can test this locally, and later on a specific HPCC platform.
```
## This may take about a minute or so.
flowr run x=sleep_pipe platform=local execute=TRUE

## corresponding R command:
run(x='sleep_pipe', platform='local', execute=TRUE)
```
If this completes successfully, we can try it on a computing cluster, where it would submit a few interconnected jobs. Several platforms are supported out of the box (torque, moab, sge, slurm and lsf); you may use the platform variable to switch between them.
```
flowr run x=sleep_pipe platform=lsf execute=TRUE
## other options for platform: torque, moab, sge, slurm, lsf
## this shows the folder being used as the working directory for this flow.
```
Once the submission is complete, we can check the status using status(), supplying it the full path recovered from the previous step.
```
flowr status x=~/flowr/runs/sleep_pipe-samp1-20150923-10-37-17-4WBiLgCm

## we expect to see a table like this when it completes successfully:
|               | total| started| completed| exit_status|status    |
|:--------------|-----:|-------:|---------:|-----------:|:---------|
|001.sleep      |     3|       3|         3|           0|completed |
|002.create_tmp |     3|       3|         3|           0|completed |
|003.merge      |     1|       1|         1|           0|completed |
|004.size       |     1|       1|         1|           0|completed |

## Also, we expect a few files to be created:
ls ~/flowr/runs/sleep_pipe-samp1-20150923-10-37-17-4WBiLgCm/tmp
samp1_merged  samp1_tmp_1  samp1_tmp_2  samp1_tmp_3

## If both these checks are fine, we are all set!
```
There are a few places where things may go wrong; you may follow the advanced configuration guide for more details. Feel free to post questions on the github issues page.
Support for several popular cluster platforms is built in. There is a template for each platform, which should work out of the box. Further, one may copy and edit them (and save to ~/flowr/conf) in case some changes are required. Templates from this folder (~/flowr/conf) would override the defaults.
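The override behaviour can be pictured with a small shell sketch. This is an illustration of the lookup idea only, not flowr's actual code (a temporary directory stands in for ~/flowr/conf):

```shell
# Illustration: prefer a user template in the conf dir, else fall back to the packaged default.
conf_dir=$(mktemp -d)   # stand-in for ~/flowr/conf
platform=lsf
if [ -f "$conf_dir/$platform.sh" ]; then
  echo "using user template: $conf_dir/$platform.sh"
else
  echo "using packaged default template for $platform"
fi
```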
Here are links to latest templates on github:
Not sure what platform you have?
You may check the version by running ONE of the following commands:
```
## for MOAB:
msub --version
## Version: moab client 8.1.1

## for LSF:
man bsub
## Submits a job to LSF by running the specified ...

## for torque/SGE:
qsub --help
```
Here are some helpful guides and details on the platforms:
Comparison_of_cluster_software
This needs expansion
flowr has a configuration file with parameters regarding default paths, verbosity, etc. flowr loads this default configuration from the package installation. In addition, to customize the parameters, simply create a tab-delimited file called ~/.flowr.
An example of this file is available here.
Additional files are loaded if available:

- ~/flowr/conf/flowr.conf
- ~/.flowr
3. Use a custom flowdef
We can copy an example flow definition and customize it to suit our needs. This is a tab-delimited text file, so make sure that the format is correct after you make any changes.
```
cd ~/flowr/pipelines
wget https://raw.githubusercontent.com/sahilseth/flowr/master/inst/pipelines/sleep_pipe.def

## check the format
flowr as.flowdef x=~/flowr/pipelines/sleep_pipe.def
```
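As a quick, flowr-independent sanity check after editing, you can confirm that every row of the file has the same number of tab-separated fields. The two-column file below is a toy example, not a complete flowdef:

```shell
# Toy example: write a small tab-delimited file, then verify all rows have equal field counts.
printf 'jobname\tsub_type\nsleep\tscatter\ncreate_tmp\tscatter\n' > /tmp/mini.def
awk -F'\t' 'NR==1 {n = NF} NF != n {bad = 1} END {print (bad ? "inconsistent columns" : "format looks ok")}' /tmp/mini.def
```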
Run the test with a custom flowdef:
```
flowr run x=sleep_pipe execute=TRUE def=~/flowr/pipelines/sleep_pipe.def
## platform=lsf [optional, picked up from the flowdef]
```
4. Use a custom submission template
If you need to customize the HPCC submission template, copy the file for your platform and make your desired changes.
For example, the MOAB-based cluster at our institution does not accept the queue argument, so we need to comment it out.
Download the template for a specific HPCC platform into ~/flowr/conf:
```
cd ~/flowr/conf
## flowr automatically picks up a template from this folder.

## for MOAB (msub)
wget https://raw.githubusercontent.com/sahilseth/flowr/master/inst/conf/moab.sh
## for Torque (qsub)
wget https://raw.githubusercontent.com/sahilseth/flowr/master/inst/conf/torque.sh
## for IBM LSF (bsub)
wget https://raw.githubusercontent.com/sahilseth/flowr/master/inst/conf/lsf.sh
## for SGE (qsub)
wget https://raw.githubusercontent.com/sahilseth/flowr/master/inst/conf/sge.sh
## for SLURM (sbatch) [untested]
wget https://raw.githubusercontent.com/sahilseth/flowr/master/inst/conf/slurm.sh
```
Make the desired changes using your favourite editor and submit again.
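For instance, commenting out a queue directive could be done with sed. The #PBS line below is a made-up stand-in for whatever your copied template actually contains, not the real moab.sh:

```shell
# Made-up template fragment; real templates live in ~/flowr/conf after the wget above.
printf '#PBS -q {{{QUEUE}}}\n#PBS -l nodes=1\n' > /tmp/moab_demo.sh
# Prefix the queue line with an extra '#' so the scheduler ignores it.
sed -i 's/^#PBS -q/##PBS -q/' /tmp/moab_demo.sh
cat /tmp/moab_demo.sh
```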
1. Parsing job ids
Flowr parses job IDs to keep a log of all submitted jobs, and also to pass them along as dependencies to subsequent jobs. This is taken care of by the parse_jobids() function. Each job scheduler shows the job ID when you submit a job, but it may show it in a slightly different fashion. To accommodate this, one can use regular expressions as described in the relevant section of the flowr config.
For example LSF may show a string such as:
```
Job <335508> is submitted to queue <transfer>.
```

```r
## test if it parses correctly
jobid = "Job <335508> is submitted to queue <transfer>."
set_opts(flow_parse_lsf = ".*(\\<[0-9]*\\>).*")
parse_jobids(jobid, platform = "lsf")
## [1] "335508"
```
In this case, 335508 was the job id and the regex worked well!
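The same extraction can be reproduced outside R, which is handy when experimenting with a pattern. This uses plain sed, not flowr's code:

```shell
# Extract the numeric job id from a captured LSF submission message.
jobid_line="Job <335508> is submitted to queue <transfer>."
echo "$jobid_line" | sed 's/.*<\([0-9][0-9]*\)>.*/\1/'
# prints: 335508
```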
Once you identify the correct regex for your platform, you may update the configuration file with it.
```
cd ~/flowr/conf
wget https://raw.githubusercontent.com/sahilseth/flowr/master/inst/conf/flowr.conf
## flowr automatically reads from this location; if you prefer to put it elsewhere, use load_opts("flowr.conf")
## visit sahilseth.github.io/params for more details.
```
Update the regex pattern and submit again.
2. Check dependency string
After collecting job ids from previous jobs, flowr renders them as a dependency for subsequent jobs. This is handled by render_dependency.PLATFORM functions.
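As an illustration of what such rendering does, here is a sketch that joins collected job ids into an LSF-style `-w` dependency expression. The ids and the exact formatting are made up; see the render_dependency.lsf function for the real logic:

```shell
# Sketch: build an LSF dependency expression from a list of parent job ids.
ids="335508 335509 335510"
dep=""
for id in $ids; do
  # append " && " between terms, but not before the first one
  dep="${dep:+$dep && }done($id)"
done
echo "-w \"$dep\""
# prints: -w "done(335508) && done(335509) && done(335510)"
```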
Confirm that the dependency parameter is specified correctly in the submission scripts:
```
## path to the most recent submission
wd=~/flowr/runs/sleep_pipe-samp1-20150923-11-20-39-dfvhp5CK
cat $wd/002.create_tmp/create_tmp_cmd_1.sh
```
There are several verbose levels available (0, 1, 2, 3, ...). One can change the verbose level in the configuration file (~/flowr/conf/flowr.conf); check the verbosity section in the help pages for more details.
The resource requirement columns of the flow definition are passed along to the final (cluster) submission script. For example, values in the cpu_reserved column would be populated as {{{CPU}}} in the submission template.
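A one-line sketch of that substitution, as a templating step would perform it (the #BSUB line is a made-up LSF-style example, and sed stands in for the actual templating engine):

```shell
# Sketch: fill a {{{CPU}}} placeholder with a flowdef value.
template='#BSUB -n {{{CPU}}}'
cpu_reserved=4
echo "$template" | sed "s/{{{CPU}}}/$cpu_reserved/"
# prints: #BSUB -n 4
```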
The following table provides a mapping between the flow definition columns and variables in the submission templates:
```r
# extdata = file.path(system.file(package = "flowr"), "extdata")
mat = read_sheet("files/flow_def_columns.txt")
kable(mat, col.names = c("flowdef variable", "submission template variable"))
```
* These are generated on the fly. ** This is gathered from the flow mat.
Adding a new platform involves a few steps; briefly, we need to consider the following areas where changes would be necessary.
parsing job ids: flowr keeps a log of all submitted jobs, and also passes their ids along as dependencies to subsequent jobs. This is taken care of by the parse_jobids() function. Each job scheduler shows the job id when you submit a job, but each shows it in a slightly different pattern. To accommodate this, one can use regular expressions as described in the relevant section of the flowr config.
render dependency: After collecting job ids from previous jobs, flowr renders them as a dependency for subsequent jobs. This is handled by render_dependency.PLATFORM functions.
Essentially this requires us to add a new line like: setClass("torque", contains = "job").
There are several job scheduling systems available and we try to support the major players. Adding support is quite easy if we have access to them. Is your favourite not in the list? Re-open this issue, with details on the platform: adding platforms.
```
## outfiles end with .out, and are placed in a folder like 00X.<jobname>/
## here is one example:
cat $wd/002.create_tmp/create_tmp_cmd_1.out

## final script:
cat $wd/002.create_tmp/create_tmp_cmd_1.sh
```
devtools::install_github("sahilseth/flowr") fails with: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
Solution:
This is basically an issue with httr (link). Try this:
```r
install.packages("RCurl")
devtools::install_github("sahilseth/flowr")
```

If that does not work, then try this:

```r
install.packages("httr")
library(httr)
set_config(config(ssl.verifypeer = 0L))
devtools::install_github("sahilseth/flowr")
```