OVERVIEW.md

Requirements

Required data

This pipeline requires paired tumor-normal data. This is because the allele-specific copy numbers are inferred from the allelic imbalance in the tumor at heterozygous SNPs. The heterozygous SNPs are identified from the allelic signals in the matched normal.

Required software

This pipeline is implemented in R and requires R packages aroma.seq, sequenza (Favero et al. 2015), and PSCBS (Bengtsson et al. 2010, Olshen et al. 2011). To install these packages and all of their dependencies to the current working directory, call the following from R:

if (!requireNamespace("pak")) install.packages("pak")
pak:::proj_create()
pak::pkg_install("sequenza@2.1.2")
pak::pkg_install("HenrikBengtsson/aroma.seq")
pak::pkg_install("HenrikBengtsson/CostelloPSCNSeq@develop")

In addition to the above R dependencies, the pipeline requires that samtools (Li et al. 2009) is on the PATH.

Important: The pipeline does not work with sequenza (>= 3.0.0; 2019-05-09). This is the reason why we install sequenza 2.1.2. The use of pak:::proj_create() causes all packages to be installed to ./r-packages/, which avoids clashing the R package library that is typically installed under ~/R/.

Setup (once)

  1. Run Rscript 0.setup.R once. This will setup links to shared annotation data sets and lab data files on the TIPCC compute cluster.

  2. Make sure ./config.yml is correct. It specify the default analysis settings. The individual entries can be overridden by individual command-line options to the below Rscript calls.

  3. Configure parallel processing following the instructions in Section 'Configure parallel processing' below.

Data processing

The following scripts should be run in order:

Rscript -e CostelloPSCNSeq::pscnseq --args --what=mpileup  --verbose=TRUE  # ~25 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=sequenza --verbose=TRUE  # ~60 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=pscbs    --verbose=TRUE  #  ~5 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=reports  --verbose=TRUE  #  ~2 min

Comment: The timings listed are typical run times for our test tumor-normal sample.

You may want to adjust inst/config.yml to process other data sets. Alternatively, you can specify another file that this default via command-line option --config, e.g. Rscript -e "CostelloPSCNSeq::pscnseq(what='mpileup', verbose=TRUE)" --config=config_set_a.yml.

Data processing via scheduler

To process the above four steps via the Torque/PBS scheduler, use:

$ qsub -d $(pwd) inst/scripts/1-4.submit_all.pbs

This will in turn submit the corresponding PBS scripts 1.mpileup.pbs, 2.sequenza.pbs, 3.pscbs.pbs, and 4.reports.pbs to the scheduler. Those PBS scripts are located in inst/scripts/ and "freeze" software versions to R 3.6.1 and samtools 1.3.1.

Configure parallel processing

The pipeline supports both sequential and parallel processing on a large number of backends and compute resources. By default the pipeline is configured to process the data sequentially on the current machine, but this can easily be changed to run in parallel, say, on a compute cluster. In order not to clutter up the analysis scripts, these settings are preferably done in a separate .future.R (loaded automatically by the future framework) in the project root directory.

To process data via a TORQUE / PBS job scheduler using the future.batchtools package, try with the configuration that we use for our UCSF TIPCC cluster;

# Copy to project directory
$ cp inst/future-configs/batchtools/.future.R .

# Copy to project directory
$ cp inst/future-configs/batchtools/batchtools.torque.tmpl .

# Install the future.batchtools package
$ Rscript -e "install.packages('future.batchtools')"

These should be generic enough to also run on other TORQUE / PBS systems. If not, see the batchtools package for how to configure the template file.

You can verify that it works by trying the following in the project directory:

> library("future")
Using future plan:
plan(list(samples = tweak(batchtools_torque, label = "sample", 
    resources = list(vmem = "4gb")), chromosomes = tweak(batchtools_torque, 
    label = "chr", resources = list(vmem = "8gb"))))

This confirms that as soon as the future package is loaded, it will source the .future.R script which in turn will setup the parallel settings. It is .future.R that reports on the future plan used.

Next, we can try to submit a job to the scheduler using these settings by:

> x %<-% Sys.info()[["nodename"]]

In this step, future.batchtools will import the batchtools.torque.tmpl file in that we copied to project working directory. If it fails to locate that file, there will be an error. If it succeeds, a batchtools job will be submitted to the job scheduler - which can be seen when if calling qstat -u $USER in another shell.

Finally, if we try to look at the value of x;

> x
[1] "n17"
> 

it will block until the job is finished and then its value will be printed. Here we see that the job was running on compute node n17.

Help

> help("pscnseq", package = "CostelloPSCNSeq")

<%
options(pager = function(files, header, title, delete.file) {
  bfr <- readLines(files, warn = FALSE)
  if (delete.file) file.remove(files)
  bfr <- gsub("_\b", "", bfr)
  cat(bfr, sep = "\n")
})
h <- help("pscnseq", package = "CostelloPSCNSeq", help_type = "text")
out <- utils::capture.output(print(h))
cat(out, sep = "\n")
%>

References



HenrikBengtsson/CostelloPSCNSeq documentation built on Feb. 28, 2021, 5:49 p.m.