library(PureCN) library(BiocStyle)
r Biocpkg("PureCN") is backward compatible with input generated by
versions 1.16 and 1.18. For versions 1.8 to 1.14, please re-run NormalDB.R
(see also below): 
$ Rscript $PURECN/NormalDB.R --outdir $OUT_REF \ --coveragefiles example_normal.list \ --genome hg19 --normal_panel $NORMAL_PANEL --assay agilent_v6
For upgrades from version 1.6, we highly recommend starting from scratch following this tutorial.
For the command line scripts described in this tutorial, we will need to 
install r Biocpkg("PureCN") with suggested dependencies:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("PureCN", dependencies = TRUE)
Alternatively, manually install the packages required by the command line scripts:
BiocManager::install(c("PureCN", "optparse", 
    "TxDb.Hsapiens.UCSC.hg19.knownGene", "org.Hs.eg.db"))
(Replace hg19 with your genome version).
To use the alternative and in many cases recommended r CRANpkg("PSCBS")
segmentation:
# default PSCBS without support of interval weights
BiocManager::install("PSCBS")
# patched PSCBS with support of interval weights
BiocManager::install("lima1/PSCBS", ref="add_dnacopy_weighting")
To call mutational signatures, install the GitHub version of the
r CRANpkg("deconstructSigs") package:
BiocManager::install("raerose01/deconstructSigs")
For the experimental support of importing variant calls from GATK4 GenomicsDB, follow the installations instructions from GenomicsDB-R.
system.file("extdata", package="PureCN")
$ export PURECN="/path/to/PureCN/extdata" $ Rscript $PURECN/PureCN.R --help Usage: /path/to/PureCN/inst/extdata/PureCN.R [options] ...
# specify path where PureCN should store reference files
$ export OUT_REF="reference_files"
$ Rscript $PURECN/IntervalFile.R --infile baits_hg19.bed \ 
    --fasta hg19.fa --outfile $OUT_REF/baits_hg19_intervals.txt \
    --offtarget --genome hg19 \
    --export $OUT_REF/baits_optimized_hg19.bed \
    --mappability wgEncodeCrgMapabilityAlign100mer.bigWig \
    --reptiming wgEncodeUwRepliSeqK562WaveSignalRep1.bigWig
Internally, this script uses r Biocpkg("rtracklayer") to parse the
infile. Make sure that the file format matches the file extension.
See the r Biocpkg("rtracklayer") documentation for problems loading the file.
Check that the genome version of the baits file matches the reference.  Do not
include chrM baits in case the capture kit includes some.
The --offtarget flag will include off-target reads. Including them is
recommended except for Amplicon data.
The --genome version is needed to annotate exons with gene symbols.  Use
hg19/hg38 for human genomes, not b37/b38. You might get a warning that an
annotation package is missing.  For hg19, install
r Biocpkg("TxDb.Hsapiens.UCSC.hg19.knownGene") in R.
The --export argument is optional. If provided, this script will store the
modified intervals as BED file for example (again every 
r Biocpkg("rtracklayer") format is supported). This is useful when the coverages
are calculated with third-party tools like GATK.
The --mappability argument should provide a r Biocpkg("rtracklayer")
parsable file with a mappability score in the first meta data column. If
provided, off-target regions will be restricted to regions specified in this
file.  On-target regions with low mappability will be excluded. For hg19,
download the file from the UCSC website. Choose the kmer size that best fits
your average mapped read length. For hg38, download recommended 76-kmer or
100-kmer mappability files through the courtesy of the Waldron lab from: 
See the FAQ section of the main vignette for instruction how to generate such a file for other references.
Similarly, the --reptiming argument takes a replication timing score in the
same format.  If provided, GC-normalized and log-transformed coverage is tested
for a linear relationship with this score and normalized accordingly.  This is
optional and provides only a minor benefit for coverage normalization, but can
identify samples with high proliferation. Requires --offtarget to be useful. 
r Biocpkg("PureCN") does not ship with a variant caller. Use a third-party
tool to generate a VCF for each sample. 
Important recommendations:
Use MuTect 1.1.7 if possible; Mutect 2 from GATK 4.1.7+ is now out of alpha.
VCFs from most other tumor-only callers such as VarScan2 and FreeBayes are supported, but only very limited artifact filtering will be performed for these callers. Make sure to provide filtered VCFs. See the FAQ section in the main vignette for common problems and questions related to input data.
Since germline SNPs are needed to infer allele-specific copy numbers, the provided VCF needs to contain both somatic and germline variants. Make sure that upstream filtering does not remove high quality SNPs, in particular due to presence in germline databases.
Run the variant caller with a 50-75 base pair interval padding to increase the number of heterozygous SNPs
The following describes r Biocpkg("PureCN") runs with internal copy number
normalization and segmentation. 
What you will need:
The interval file generated above
BAM files of tumor samples.
BAM files of normal samples (see main vignette for recommendations). These normal samples are not required to be patient-matched to the tumor samples, but they need to be processed-matched (same assay run through the same alignment pipeline, ideally sequenced in the same lab)
VCF files generated above for all tumor and normal BAM files
For each sample, tumor and normal, calculate GC-normalized coverages:
# Calculate and GC-normalize coverage from a BAM file $ Rscript $PURECN/Coverage.R --outdir $OUT/$SAMPLEID \ --bam ${SAMPLEID}.bam \ --intervals $OUT_REF/baits_hg19_intervals.txt # GC-normalize coverage from a GATK DepthOfCoverage file Rscript $PURECN/Coverage.R --outdir $OUT/$SAMPLEID \ --coverage ${SAMPLEID}.coverage.sample_interval_summary \ --intervals $OUT_REF/baits_hg19_intervals.txt
Similar to GATK, this script also takes a text file containing a list of BAM or
coverage file names (one per line). The file extension must be .list:
# Calculate and GC-normalize coverage from a list of BAM files 
$ Rscript $PURECN/Coverage.R --outdir $OUT \ 
    --bam normals.list \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --cores 4
Important recommendations:
--keepduplicates or --removemapq0 if you know what you are
doing and always use the same command line arguments for tumor and the normalsTo build a normal database for coverage normalization, copy the paths to all GC-normalized normal coverage files in a single text file, line-by-line:
ls -a normal*loess.txt.gz | cat > example_normal.list
# From already GC-normalized files
$ Rscript $PURECN/NormalDB.R --outdir $OUT_REF \
    --coveragefiles example_normal.list \
    --genome hg19 --assay agilent_v6
# When normal panel VCF is available (highly recommended for
# unmatched samples)
$ Rscript $PURECN/NormalDB.R --outdir $OUT_REF \
    --coveragefiles example_normal.list \
    --normal_panel $NORMAL_PANEL \
    --genome hg19 \
    --assay agilent_v6
# For a Mutect2/GATK4 normal panel GenomicsDB (experimental)
$ Rscript $PURECN/NormalDB.R --outdir $OUT_REF \
    --coveragefiles example_normal.list \
    --normal_panel $GENOMICSDB-WORKSPACE-PATH/pon_db \
    --genome hg19 \
    --assay agilent_v6
Important recommendations:
Consider generating different databases when differences are significant, e.g. for samples with different read lengths or insert size distributions
In particular, do not mix normal data obtained with different capture kits (e.g. Agilent SureSelect v4 and v6)
Provide a normal panel VCF here to precompute mapping bias for faster
  runtimes. The only requirement for the VCF is an AD format field containing
  the number of reference and alt reads for all samples. See the example file
  $PURECN/normalpanel.vcf.gz.
For ideal results, examine the interval_weights.png file to find good
  off-target bin widths. You will need to re-run IntervalFile.R with the
  --offtargetwidth parameter and re-calculate the coverages.
The --assay argument is optional and is only used to add the provided assay
  name to all output files
A warning pointing to the likely use of a wrong baits file means that more
  than 5% of targets have close to 0 coverage in all normal samples. A BED file
  with the low coverage targets will be generated in --outdir.  If for any
  reason there is no access to the correct file, it is recommended to re-run the
  IntervalFile.R command and provide this BED file with --exclude.
Now that the assay-specific files are created and all coverages calculated, we
run PureCN.R to normalize, segment and determine purity and ploidy:
mkdir $OUT/$SAMPLEID # Without a matched normal (minimal test run) $ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \ --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt.gz \ --sampleid $SAMPLEID \ --vcf ${SAMPLEID}_mutect.vcf \ --normaldb $OUT_REF/normalDB_hg19.rds \ --intervals $OUT_REF/baits_hg19_intervals.txt \ --genome hg19 # Production pipeline run $ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \ --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt.gz \ --sampleid $SAMPLEID \ --vcf ${SAMPLEID}_mutect.vcf \ --statsfile ${SAMPLEID}_mutect_stats.txt \ --normaldb $OUT_REF/normalDB_hg19.rds \ --mappingbiasfile $OUT_REF/mapping_bias_hg19.rds \ --intervals $OUT_REF/baits_hg19_intervals.txt \ --snpblacklist hg19_simpleRepeats.bed \ --genome hg19 \ --force --postoptimize --seed 123 # With a matched normal (test run; for production pipelines we recommend the # unmatched workflow described above) $ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \ --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt.gz \ --normal $OUT/$SAMPLEID/${SAMPLEID_NORMAL}_coverage_loess.txt.gz \ --sampleid $SAMPLEID \ --vcf ${SAMPLEID}_mutect.vcf \ --normaldb $OUT_REF/normalDB_hg19.rds \ --intervals $OUT_REF/baits_hg19_intervals.txt \ --genome hg19 # Recreate output after manual curation of ${SAMPLEID}.csv $ Rscript $PURECN/PureCN.R --rds $OUT/$SAMPLEID/${SAMPLEID}.rds
Important recommendations:
Even if matched normals are available, it is often better to use the normal
  database for coverage normalization. When a matched normal coverage is
  provided with --normal then the pool of normal coverage normalization and
  denoising steps are skipped!
Always provide the normal coverage database to ignore low quality regions in the segmentation and to increase the sensitivity for homozygous deletions in high purity samples.
The normal panel VCF file is useful for mapping bias correction and especially recommended without matched normals. See the FAQ of the main vignette how to generate this file. It is not essential for test runs.
The MuTect 1.1.7 stats file (the main output file besides the VCF) should be provided for better artifact filtering. If the VCF was generated by a pipeline that performs good artifact filtering, this file is not needed.
The --postoptimize flag defines that purity should be optimized using both
  variant allelic fractions and copy number instead of copy number only.  This
  results in a significant runtime increase for whole-exome data.
If --out is a directory, it will use the sample id as file prefix for all
  output files. Otherwise r Biocpkg("PureCN") will use --out as prefix.   
The --parallel flag will enable the parallel fitting of local optima. See 
  r Biocpkg("BiocParallel") for details. This script will use the default
  backend.
--funsegmentation PSCBS will become the new default in 1.22.  Support for
  interval weights currently requires a patch (see Section
  \@ref(installation)).
--model betabin will become the new default in 1.22 with larger panel of
  normals (probably more than 10-15 normal samples). We encourage users to try
  this model and appreciate feedback.
Defaults are well calibrated and should produce close to ideal results for most samples. A few common cases where changing defaults makes sense:
High purity and high quality: For cancer types with a high expected
  purity, such as ovarian cancer, AND when quality is expected to be very
  good (high coverage, young samples), --maxcopynumber 8
Small panels with high coverage: --padding 100 (or higher), requires
  running the variant caller with this padding or without interval file. Use
  the same settings for the panel of normals VCF so that SNPs in the
  flanking regions have reliable mapping bias estimates. The
  --maxhomozygousloss parameter might also need some adjustment for very
  small panels with large gaps around captured deletions.
Cell lines: Safely skip the search for low purity solutions in cell lines:
  --maxcopynumber 8, --minpurity 0.9, --maxpurity 0.99.  Add
  --modelhomozygous to find regions of LOH in samples without normal
  contamination (do not provide this flag when matched normal data are
  available in the VCF).
cfDNA: --minpurity 0.1, --minaf 0.01 (or lower) and --error 0.0005
  (or lower, when there is UMI-based error correction). Note that the
  estimated purity can be very wrong when the true purity is below 5-7%;
  these samples are usually flagged as non-aberrant.
Amplicon data: --model betabin (Amplicon data is not officially
  supported)
All assays: --maxsegments should set to a value so that with few
  exceptions only poor quality samples exceed this cutoff. For cancer types
  with high heterogenity, it is also recommended to increase
  --maxnonclonal to 0.3-0.4 (this will increase the runtime significantly
  for whole-exome data).
What you will need:
Output of third-party tools (see details below)
VCF files for all tumor samples and some normal files (see main vignette for questions related to required normal samples)
If you already have a segmentation from third-party tools (for example CNVkit, GATK4, EXCAVATOR2). For a minimal test run:
Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \ --sampleid $SAMPLEID \ --segfile $OUT/$SAMPLEID/${SAMPLEID}.cnvkit.seg \ --vcf ${SAMPLEID}_mutect.vcf \ --intervals $OUT_REF/baits_hg19_intervals.txt \ --genome hg19
See the main vignette for more details and file formats.
For a production pipeline run we provide again more information about the assay and genome. Here an CNVkit example:
# Recommended: Provide a normal panel VCF to remove mapping biases, pre-compute # position-specific bias for much faster runtimes with large panels # This needs to be done only once for each assay Rscript $PURECN/NormalDB.R --outdir $OUT_REF --normal_panel $NORMAL_PANEL \ --assay agilent_v6 --genome hg19 --force # Export the segmentation in DNAcopy format cnvkit.py export seg $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cns --enumerate-chroms \ -o $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg # Run PureCN by providing the *.cnr and *.seg files Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \ --sampleid $SAMPLEID \ --tumor $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cnr \ --segfile $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg \ --mappingbiasfile $OUT_REF/mapping_bias_agilent_v6_hg19.rds \ --vcf ${SAMPLEID}_mutect.vcf \ --statsfile ${SAMPLEID}_mutect_stats.txt \ --snpblacklist hg19_simpleRepeats.bed \ --genome hg19 \ --funsegmentation Hclust \ --force --postoptimize --seed 123
Important recommendations:
The --funsegmentation argument controls if the data should to be
  re-segmented using germline BAFs (default). Set this value to none if the
  provided segmentation should be used as is.  The recommended Hclust will
  only cluster provided segments.
Since CNVkit provides all necessary information in the *.cnr output files,
  the --intervals argument is not required.
In test runs, especially when the input VCF contains matched normal
  information, --mappingbiasfile can be skipped
CNVkit runs without normal reference samples are not recommended
# Recommended: Provide a normal panel GenomicsDB to remove mapping # biases, pre-compute position-specific bias for much faster runtimes # with large panels. This needs to be done only once for each assay. Rscript $PURECN/NormalDB.R --outdir $OUT_REF \ --normal_panel $GENOMICSDB-WORKSPACE-PATH/pon_db \ --assay agilent_v6 --genome hg19 --force Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \ --sampleid $SAMPLEID \ --tumor $OUT/$SAMPLEID/${SAMPLEID}.hdf5 \ --logratiofile $OUT/$SAMPLEID/${SAMPLEID}.denoisedCR.tsv \ --segfile $OUT/$SAMPLEID/${SAMPLEID}.modelFinal.seg \ --mappingbiasfile $OUT_REF/mapping_bias_agilent_v6_hg19.rds \ --vcf ${SAMPLEID}_mutect2_filtered.vcf \ --snpblacklist hg19_simpleRepeats.bed \ --genome hg19 \ --funsegmentation Hclust \ --force --postoptimize --seed 123
Dx.R provides copy number and mutation metrics commonly used as biomarkers,
most importantly tumor mutational burden (TMB), chromosomal instability (CIN)
and mutational signatures.
# Provide a BED file with callable regions, for examples obtained by # GATK CallableLoci. Useful to calculate mutations per megabase and # to exclude low quality regions. grep CALLABLE ${SAMPLEID}_callable_status.bed > \ ${SAMPLEID}_callable_status_filtered.bed # Only count mutations in callable regions, also subtract what was # ignored in PureCN.R via --snpblacklist, like simple repeats, from the # mutation per megabase calculation # Also search for the COSMIC mutation signatures # (http://cancer.sanger.ac.uk/cosmic/signatures) Rscript $PureCN/Dx.R --out $OUT/$SAMPLEID/$SAMPLEID \ --rds $OUT/SAMPLEID/${SAMPLEID}.rds \ --callable ${SAMPLEID}_callable_status_filtered.bed \ --exclude hg19_simpleRepeats.bed \ --signatures # Restrict mutation burden calculation to coding sequences Rscript $PureCN/FilterCallableLoci.R --genome hg19 \ --infile ${SAMPLEID}_callable_status_filtered.bed \ --outfile ${SAMPLEID}_callable_status_filtered_cds.bed \ --exclude '^HLA' Rscript $PureCN/Dx.R --out $OUT/$SAMPLEID/${SAMPLEID}_cds \ --rds $OUT/SAMPLEID/${SAMPLEID}.rds \ --callable ${SAMPLEID}_callable_status_filtered_cds.bed \ --exclude hg19_simpleRepeats.bed
Important recommendations:
--minDepth N where N is roughly 20% of the mean
  target coverage of all samples.Argument name          | Corresponding PureCN argument | PureCN function 
-----------------------|-------------------------------|----------------
--fasta              | reference.file | preprocessIntervals 
--infile             | interval.file | preprocessIntervals 
--offtarget          |     off.target    | preprocessIntervals 
--targetwidth        | average.target.width | preprocessIntervals 
--mintargetwidth     | min.target.width | preprocessIntervals 
--smalltargets       | small.targets | preprocessIntervals 
--offtargetwidth     | average.off.target.width | preprocessIntervals 
--offtargetseqlevels | off.target.seqlevels | preprocessIntervals 
--mappability        | mappability   | preprocessIntervals 
--minmappability     | min.mappability   | preprocessIntervals 
--reptiming          | reptiming   | preprocessIntervals 
--reptimingwidth     | average.reptiming.width | preprocessIntervals
--genome             | txdb, org | annotateTargets 
--outfile            | | 
--export             | | rtracklayer::export
--version -v         | | 
--force -f           | | 
--help -h            | | 
: (#tab:intervalfile) IntervalFile
Argument name      | Corresponding PureCN argument | PureCN function 
-------------------|-------------------------------|----------------
--bam            | bam.file | calculateBamCoverageByInterval
--bai            | index.file | calculateBamCoverageByInterval 
--coverage       | coverage.file | correctCoverageBias
--intervals      | interval.file  | correctCoverageBias 
--method         | method  | correctCoverageBias
--keepduplicates | keep.duplicates  | calculateBamCoverageByInterval
--removemapq0    | mapqFilter  | ScanBamParam
--outdir         | | 
--cores          | | Number of CPUs to use when multiple BAMs are provided 
--parallel       | | Use default r Biocpkg("BiocParallel") backend when multiple BAMs are provided 
--seed           | | 
--version -v     | | 
--force -f       | | 
--help -h        | | 
: (#tab:coverage) Coverage
Argument name          | Corresponding PureCN argument | PureCN function 
-----------------------|-------------------------------|----------------
--coveragefiles   | normal.coverage.files | createNormalDatabase
--normal_panel    | normal.panel.vcf.file | calculateMappingBiasVcf 
--assay -a          | Optional assay name | Used in output file names. 
--genome -g        | Optional genome version | Used in output file names. 
--outdir -o         | | 
--version -v        | | 
--force -f          | | 
--help -h           | | 
: (#tab:normaldb) NormalDB
Argument name          | Corresponding PureCN argument | PureCN function 
-----------------------|-------------------------------|----------------
--sampleid -i      | sampleid  | runAbsoluteCN
--normal           | normal.coverage.file   | runAbsoluteCN
--tumor            | tumor.coverage.file     | runAbsoluteCN 
--vcf              | vcf.file          | runAbsoluteCN 
--rds              | file.rds          | readCurationFile 
--mappingbiasfile  | mapping.bias.file | setMappingBiasVcf 
--normaldb         | normalDB (serialized with saveRDS) | calculateTangentNormal, filterTargets 
--segfile          | seg.file           | runAbsoluteCN
--logratiofile     | log.ratio          | runAbsoluteCN
--additionaltumors | tumor.coverage.files | processMultipleSamples
--sex              | sex                | runAbsoluteCN
--genome           | genome             | runAbsoluteCN
--intervals        | interval.file      | runAbsoluteCN 
--statsfile        | stats.file | filterVcfMuTect      
--minaf            | af.range | filterVcfBasic 
--snpblacklist     | snp.blacklist | filterVcfBasic   
--error            | error | runAbsoluteCN   
--dbinfoflag       | DB.info.flag | runAbsoluteCN   
--popafinfofield   | POPAF.info.field | runAbsoluteCN   
--mincosmiccnt     | min.cosmic.cnt | setPriorVcf   
--funsegmentation  | fun.segmentation | runAbsoluteCN
--alpha            | alpha | segmentationCBS 
--undosd           | undo.SD | segmentationCBS 
--maxsegments      | max.segments | runAbsoluteCN 
--minpurity        | test.purity | runAbsoluteCN 
--maxpurity        | test.purity | runAbsoluteCN 
--minploidy        | min.ploidy | runAbsoluteCN 
--maxploidy        | max.ploidy | runAbsoluteCN 
--maxcopynumber   | test.num.copy | runAbsoluteCN 
--postoptimize     | post.optimize | runAbsoluteCN   
--bootstrapn       | n | bootstrapResults 
--modelhomozygous  | model.homozygous | runAbsoluteCN
--model            | model | runAbsoluteCN
--logratiocalibration | log.ratio.calibration | runAbsoluteCN
--maxnonclonal     | max.non.clonal | runAbsoluteCN
--maxhomozygousloss | max.homozygous.loss | runAbsoluteCN
--outvcf           | return.vcf | predictSomatic 
--out -o           | |                          
--parallel         | BPPARAM | runAbsoluteCN 
--cores            | BPPARAM | runAbsoluteCN 
--seed             | | 
--version -v       | | 
--force -f         | | 
--help -h          | | 
: (#tab:purecn) PureCN
Argument name   | Corresponding PureCN argument | PureCN function 
----------------|-------------------------------|----------------
--rds         | file.rds | readCurationFile 
--callable    | callable | callMutationBurden 
--exclude     | exclude  | callMutationBurden 
--maxpriorsomatic | max.prior.somatic  | callMutationBurden 
--signatures | | deconstructSigs::whichSignatures 
--signature_databases | | deconstructSigs::whichSignatures 
--out         | | 
--version -v  | | 
--force -f    | | 
--help -h     | | 
: (#tab:dx) Dx
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.