This pipeline requires paired tumor-normal data. This is because the allele-specific copy numbers are inferred from the allelic imbalance in the tumor at heterozygous SNPs. The heterozygous SNPs are identified from the allelic signals in the matched normal.
This pipeline is implemented in R and requires R packages aroma.seq, sequenza (Favero et al. 2015), and PSCBS (Bengtsson et al. 2010, Olshen et al. 2011). To install these packages and all of their dependencies to the current working directory, call the following from R:
if (!requireNamespace("pak")) install.packages("pak")
pak:::proj_create()
pak::pkg_install("sequenza@2.1.2")
pak::pkg_install("HenrikBengtsson/aroma.seq")
pak::pkg_install("HenrikBengtsson/CostelloPSCNSeq@develop")
In addition to the above R dependencies, the pipeline requires that samtools (Li et al. 2009) is on the PATH
.
Important: The pipeline does not work with sequenza (>= 3.0.0; 2019-05-09). This is the reason why we install sequenza 2.1.2. The use of pak:::proj_create()
causes all packages to be installed to ./r-packages/
, which avoids clashing the R package library that is typically installed under ~/R/
.
Run Rscript 0.setup.R
once. This will setup links to shared annotation data sets and lab data files on the TIPCC compute cluster.
Make sure ./config.yml
is correct. It specify the default analysis settings. The individual entries can be overridden by individual command-line options to the below Rscript
calls.
Configure parallel processing following the instructions in Section 'Configure parallel processing' below.
The following scripts should be run in order:
Rscript -e CostelloPSCNSeq::pscnseq --args --what=mpileup --verbose=TRUE # ~25 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=sequenza --verbose=TRUE # ~60 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=pscbs --verbose=TRUE # ~5 min
Rscript -e CostelloPSCNSeq::pscnseq --args --what=reports --verbose=TRUE # ~2 min
Comment: The timings listed are typical run times for our test tumor-normal sample.
You may want to adjust inst/config.yml
to process other data sets. Alternatively, you can specify another file that this default via command-line option --config
, e.g. Rscript -e "CostelloPSCNSeq::pscnseq(what='mpileup', verbose=TRUE)" --config=config_set_a.yml
.
To process the above four steps via the Torque/PBS scheduler, use:
$ qsub -d $(pwd) inst/scripts/1-4.submit_all.pbs
This will in turn submit the corresponding PBS scripts 1.mpileup.pbs
, 2.sequenza.pbs
, 3.pscbs.pbs
, and 4.reports.pbs
to the scheduler. Those PBS scripts are located in inst/scripts/
and "freeze" software versions to R 3.6.1 and samtools 1.3.1.
The pipeline supports both sequential and parallel processing on a large number of backends and compute resources. By default the pipeline is configured to process the data sequentially on the current machine, but this can easily be changed to run in parallel, say, on a compute cluster. In order not to clutter up the analysis scripts, these settings are preferably done in a separate .future.R
(loaded automatically by the future framework) in the project root directory.
To process data via a TORQUE / PBS job scheduler using the future.batchtools package, try with the configuration that we use for our UCSF TIPCC cluster;
# Copy to project directory
$ cp inst/future-configs/batchtools/.future.R .
# Copy to project directory
$ cp inst/future-configs/batchtools/batchtools.torque.tmpl .
# Install the future.batchtools package
$ Rscript -e "install.packages('future.batchtools')"
These should be generic enough to also run on other TORQUE / PBS systems. If not, see the batchtools package for how to configure the template file.
You can verify that it works by trying the following in the project directory:
> library("future")
Using future plan:
plan(list(samples = tweak(batchtools_torque, label = "sample",
resources = list(vmem = "4gb")), chromosomes = tweak(batchtools_torque,
label = "chr", resources = list(vmem = "8gb"))))
This confirms that as soon as the future package is loaded, it will source the .future.R
script which in turn will setup the parallel settings. It is .future.R
that reports on the future plan used.
Next, we can try to submit a job to the scheduler using these settings by:
> x %<-% Sys.info()[["nodename"]]
In this step, future.batchtools will import the batchtools.torque.tmpl
file in that we copied to project working directory. If it fails to locate that file, there will be an error. If it succeeds, a batchtools job will be submitted to the job scheduler - which can be seen when if calling qstat -u $USER
in another shell.
Finally, if we try to look at the value of x
;
> x
[1] "n17"
>
it will block until the job is finished and then its value will be printed. Here we see that the job was running on compute node n17.
> help("pscnseq", package = "CostelloPSCNSeq")
<%
options(pager = function(files, header, title, delete.file) {
bfr <- readLines(files, warn = FALSE)
if (delete.file) file.remove(files)
bfr <- gsub("_\b", "", bfr)
cat(bfr, sep = "\n")
})
h <- help("pscnseq", package = "CostelloPSCNSeq", help_type = "text")
out <- utils::capture.output(print(h))
cat(out, sep = "\n")
%>
Bengtsson H, Neuvial P, Speed TP. TumorBoost: Normalization of allele-specific tumor copy numbers from a single pair of tumor-normal genotyping microarrays, BMC Bioinformatics, 2010. DOI: 10.1186/1471-2105-11-245. PMID: 20462408, PMCID: PMC2894037
Olshen AB, Bengtsson H, Neuvial P, Spellman PT, Olshen RA, Seshan VA. Parent-specific copy number in paired tumor-normal studies using circular binary segmentation, Bioinformatics, 2011. DOI: 10.1093/bioinformatics/btr329. PMID: 21666266. PMCID: PMC3137217
Favero F, Joshi T, Marquard AM, Birkbak NJ, Krzystanek M, Li Q, Szallasi Z and Eklund AC. Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data, Annals of Oncology, 2015. DOI: 10.1093/annonc/mdu479, PMID: 25319062, PMCID: PMC4269342
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup, The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 2009. DOI: 10.1093/bioinformatics/btp352, PMID: 19505943, PMCID: PMC2723002
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.