2-ClusterManagement-BatchJobs.md

batchtools is the successor of BatchJobs and is now the recommended package, as it is more stable and more flexible. See the batchtools instructions.

BatchJobs package

BatchJobs is a useful package to communicate with a computing cluster: send jobs, check their status, rerun them if needed, and retrieve the results. PopSV has been designed as separate steps so that it can easily be run on a computing cluster using BatchJobs. Thanks to this multi-step workflow, the computation is parallelized, sometimes by sample, other times by genomic region.

Instead of running each step manually, we recommend using the two-command wrapper (see Automated run below). It is basically a wrapper around the basic analysis steps with some useful extras (running custom steps, stop/restart). It should be sufficient for most analyses, but it is less flexible: if you want to change some specific parameters you might have to tweak it.

Installation and configuration

The BatchJobs package can be installed from CRAN:

install.packages("BatchJobs")

The most important step is configuring it for your computing cluster. It doesn't take long but should be done carefully: once this is done correctly, the rest follows nicely.

You will need to create 3 files:

- a cluster template (e.g. guillimin.tmpl), the skeleton of the job script sent to the scheduler;
- an R script with the parser functions (e.g. makeClusterFunctionsAdaptive.R);
- a .BatchJobs.R configuration file that links the two others.

I would recommend putting these 3 files at the root of your personal space, i.e. ~/. You could put them in the project folder, but you would then have to copy them each time you create/run another project. Putting them in your home directory means they will always be used by default by BatchJobs.
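As a quick sanity check, you can verify from R that the three files are where BatchJobs expects them (file names here follow the examples on this page; adjust them to your own setup):

## TRUE for each configuration file found in the home directory
file.exists("~/.BatchJobs.R", "~/makeClusterFunctionsAdaptive.R", "~/guillimin.tmpl")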

Cluster template

A cluster template is a template version of the bash script that you would send through qsub/msub. In it, you define placeholders for the resources or parameters of the job. This file will be parsed by BatchJobs.

For our cluster, Guillimin, I created a guillimin.tmpl file like this:

#PBS -N <%= job.name %>
#PBS -j oe
#PBS -o <%= log.file %>
#PBS -l walltime=<%= resources$walltime %>
#PBS -l nodes=<%= resources$nodes %>:ppn=<%= resources$cores %>
#PBS -A bws-221-ae
#PBS -V

## Run R:
## we merge R output with stdout from PBS, which gets then logged via -o option
R CMD BATCH --no-save --no-restore "<%= rscript %>" /dev/stdout

Placeholders are of the form <%= resources$walltime %>. BatchJobs will insert there the value of the walltime element of the resources list (see the submitJobs command later). Although you most likely won't have to change these placeholders, you might need to update the other lines if your cluster uses a different syntax. For example, on our cluster, we need to give an ID for our lab with -A.

In order to easily use the pipelines provided with the PopSV package, I would recommend keeping exactly the placeholders walltime, cores and nodes, and hard-coding the rest (e.g. queue, lab ID) in the template.
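To make the link concrete, here is how a resources list passed to submitJobs fills these placeholders (the values are only illustrative, and reg stands for a registry created with makeRegistry, see Sending Jobs below):

## Each element of 'resources' fills the matching <%= resources$... %> placeholder
submitJobs(reg, resources=list(walltime="12:0:0", nodes="1", cores="6"))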

Parser functions

Parser functions are saved in an R script, called for example makeClusterFunctionsAdaptive.R. This script parses the template and creates the actual commands to submit, cancel and check jobs.

Most likely, you just need to check/replace qsub/qdel/qstat calls with the correct bash commands (sometimes msub/canceljob/showq).

From our file, these are the lines you might need to change:

res =  BatchJobs:::runOSCommandLinux("qsub", outfile, stop.on.exit.code = FALSE)
cfKillBatchJob("canceljob", batch.job.id)
BatchJobs:::runOSCommandLinux("showq", "-u $USER")$output
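For instance, on a cluster that uses the plain Torque/PBS commands, the same lines might become the following (this is only an illustration; use the commands available on your system):

res = BatchJobs:::runOSCommandLinux("qsub", outfile, stop.on.exit.code = FALSE)
cfKillBatchJob("qdel", batch.job.id)
BatchJobs:::runOSCommandLinux("qstat", "-u $USER")$output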

.BatchJobs.R configuration file

.BatchJobs.R is just the configuration file that links the two other files. You don't really need to change it, except perhaps for the email addresses.

It looks like this:

source("~/makeClusterFunctionsAdaptive.R")
cluster.functions <- makeClusterFunctionsAdaptive("~/guillimin.tmpl")
mail.start <- "none"
mail.done <- "none"
mail.error <- "none"
mail.from <- "<jean.monlong@mail.mcgill.ca>"
mail.to <- "<jean.monlong@mail.mcgill.ca>"

Note: If a .BatchJobs.R file is present both in ~/ and in the project folder, the one in the project folder overrides the other.
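To check that the configuration is picked up correctly, you can start R in the project folder and print the active BatchJobs configuration; the cluster functions defined above should be listed:

library(BatchJobs)
## Prints the active configuration, including the cluster functions
## sourced from .BatchJobs.R
getConfig()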

Sending Jobs

In practice, you won't have to write this part, as we provide full pipelines. You might still need to adjust the job resources (they can change from one cluster to another), more precisely the resources= parameter of the submitJobs command. After doing this, if you are not interested in more details, you can jump directly to the next section for an overview of a pipeline script.

Otherwise, here is a quick summary of the BatchJobs commands used in the scripts:

- makeRegistry creates (or reloads) a registry, i.e. the folder structure that keeps track of a set of jobs;
- batchMap defines one job per element of a vector, each calling a given function;
- submitJobs sends the jobs to the cluster with the specified resources;
- findNotDone returns the ids of the jobs that are not done yet;
- showStatus prints a summary of the jobs' status.

For example, the step that retrieves the bin counts in each BAM file looks like this:

getBC.reg <- makeRegistry(id="getBC")
getBC.f <- function(file.i, bins.f, files.df){
  library(PopSV)
  load(bins.f)
  bin.bam(files.df$bam[file.i], bins.df, files.df$bc[file.i])
}
batchMap(getBC.reg, getBC.f, 1:nrow(files.df), more.args=list(bins.f="bins.RData", files.df=files.df))
submitJobs(getBC.reg, findNotDone(getBC.reg), resources=list(walltime="20:0:0", nodes="1", cores="1"))
showStatus(getBC.reg)

Here we want to get the bin counts of each sample. We create a registry called getBC. Then we define the function that will compute the bin counts for one sample. The first parameter of this function (here file.i, the index of the sample) will be different for each job sent by BatchJobs; the other parameters are common to all jobs. Within the function, we load the package and the relevant data, and run the instructions we want. batchMap creates one job per sample index and links it to the function we've just defined. The jobs are finally submitted to the cluster with the desired number of cores, wall time, etc.

We can check the status of the jobs with the showStatus(getBC.reg) command.
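A few other BatchJobs commands come in handy to monitor and debug the jobs; for instance, with the registry from the example above:

showStatus(getBC.reg)    ## summary of the jobs' status
findErrors(getBC.reg)    ## ids of the jobs that raised an error
showLog(getBC.reg, 1)    ## display the log of job 1
## resubmit only the jobs that are not done yet
submitJobs(getBC.reg, findNotDone(getBC.reg), resources=list(walltime="20:0:0", nodes="1", cores="1"))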

Pipeline workflow

Automated run

Two wrapper functions around BatchJobs allow you to run PopSV without manually sending the jobs for each step. These two functions (autoGCcounts and autoNormTest) are located in automatedPipeline.R. A full analysis can then be run like this:

## Load package and wrapper
library(BatchJobs)
library(PopSV)
source("automatedPipeline.R")
## Set-up files and bins
bam.files = read.table("bams.tsv", as.is=TRUE, header=TRUE)
files.df = init.filenames(bam.files, code="example")
save(files.df, file="files.RData")
bin.size = 1e3
bins.df = fragment.genome.hg19(bin.size)
save(bins.df, file="bins.RData")
## Run PopSV
res.GCcounts = autoGCcounts("files.RData", "bins.RData")
res.df = autoNormTest("files.RData", "bins.RData")
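The res.df object returned by autoNormTest contains the final results; a simple way to keep them is to write them to disk (the file names here are only illustrative):

## Save the final results as a tab-delimited file and as an R object
write.table(res.df, file="PopSV-results.tsv", sep="\t", quote=FALSE, row.names=FALSE)
save(res.df, file="PopSV-results.RData")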

The advantage of this wrapper is an easier management of the cluster and pipeline. However, it is not as flexible: if a step needs to be changed for some reason, you might have to change it within the automatedPipeline.R script.

Still, a few parameters can be passed to the two functions for the user's convenience.

As an example, this script shows how PopSV is run using these wrappers. In addition, when we want to analyze the X and Y chromosomes, the samples have to be split, and these wrappers come in handy to easily run the three analyses (see this example).

Step-by-step manual run

The general idea is to have one script per analysis (e.g. per bin size or project). Each analysis should be in its own folder to avoid confusion between temporary files. Examples of pipeline scripts can be found in the scripts folder of the GitHub repository.

Because it manipulates large data (BAM files, genome-wide coverage) and large sample sizes, PopSV was designed to create and work with intermediate files. The management of these files is mostly handled automatically. In practice, all the important paths and the folder structure are saved in the files.df data.frame, originally created by the init.filenames function. For this reason, the results of each analysis step are saved as local files so that the next steps can be run later, without you having to think about what to save.

So the script doesn't need to be run from the start each time, but rather step by step. In practice you often have to wait a couple of hours for a step to compute. Think of R as a new shell: you open R, check the status of the jobs on the cluster, rerun them if necessary, or start the next step, etc. You can run R on this master script on a login node because nothing is directly computed there; the real computations are sent as actual jobs.

After one step sends jobs to the cluster, you can exit R, log out, have a coffee, think about all the time saved thanks to BatchJobs, and then open everything again later and continue. No need to rerun everything: just load the libraries and the registries you want to check (e.g. by running getBC.reg <- makeRegistry(id="getBC") again).
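In practice, resuming a session could look like this (a sketch following the getBC example above):

## New R session on a login node
library(BatchJobs)
library(PopSV)
load("files.RData")                    ## paths saved earlier by init.filenames
getBC.reg <- makeRegistry(id="getBC")  ## reloads the existing registry
showStatus(getBC.reg)                  ## check how the jobs are doing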


