library(PureCN)
library(BiocStyle)

Prerequisites

Update from previous stable versions

r Biocpkg("PureCN") is backward compatible with input generated by versions 1.16 and later. For versions 1.8 to 1.14, please re-run NormalDB.R (see also below):

$ Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --coverage-files example_normal_coverages.list \
    --genome hg19 --normal-panel $NORMAL_PANEL --assay agilent_v6

When using --model betabin in PureCN.R, we recommend for all previous versions re-creating the mapping bias database by re-running NormalDB.R:

# only re-creating the mapping bias file
$ Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --genome hg19 --normal-panel $NORMAL_PANEL --assay agilent_v6

For upgrades from version 1.6, we highly recommend starting from scratch following this tutorial.

Installation

For the command line scripts described in this tutorial, we will need to install r Biocpkg("PureCN") with suggested dependencies:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("PureCN", dependencies = TRUE)

Alternatively, manually install the packages required by the command line scripts:

BiocManager::install(c("PureCN", "optparse", "R.utils",
    "TxDb.Hsapiens.UCSC.hg19.knownGene", "org.Hs.eg.db"))

(Replace hg19 with your genome version).

To use the alternative and in many cases recommended r CRANpkg("PSCBS") segmentation:

# default PSCBS without support of interval weights
BiocManager::install("PSCBS")

# patched PSCBS with support of interval weights
BiocManager::install("lima1/PSCBS", ref="add_dnacopy_weighting")

To call mutational signatures, install the GitHub version of the r CRANpkg("deconstructSigs") package:

BiocManager::install("raerose01/deconstructSigs")

For the experimental support of importing variant calls from GATK4 GenomicsDB, follow the installations instructions from GenomicsDB-R.

The GATK4 segmentation requires the gatk binary in path. Versions 4.1.7.0 and newer are supported.

Prepare environment and assay-specific reference files

system.file("extdata", package = "PureCN")
$ export PURECN="/path/to/PureCN/extdata"
$ Rscript $PURECN/PureCN.R --help
Usage: /path/to/PureCN/inst/extdata/PureCN.R [options] ...
# specify path where PureCN should store reference files
$ export OUT_REF="reference_files"
$ Rscript $PURECN/IntervalFile.R --in-file baits_hg19.bed \ 
    --fasta hg19.fa --out-file $OUT_REF/baits_hg19_intervals.txt \
    --off-target --genome hg19 \
    --export $OUT_REF/baits_optimized_hg19.bed \
    --mappability wgEncodeCrgMapabilityAlign100mer.bigWig \
    --reptiming wgEncodeUwRepliSeqK562WaveSignalRep1.bigWig

Internally, this script uses r Biocpkg("rtracklayer") to parse the --in-file. Make sure that the file format matches the file extension. See the r Biocpkg("rtracklayer") documentation for problems loading the file. Check that the genome version of the baits file matches the reference. Do not include chrM baits in case the capture kit includes some.

We do not recommend padding the baits file manually unless the coverages are very low (<30X) where the increased counts from the padded regions might decrease sampling variance slightly. Note that we do however strongly recommend running the variant caller with a padding of at least 50bp to increase the number of informative SNPs, see below in the VCF section. Double check that the genome version of the --in-file is correct - many assays are still designed using older references and might need to be lifted over to the pipeline reference. If possible, do NOT use a BED file that contains the targeted exons, instead use the coordinates of the baits. These are optimized for GC-content and mappability and will produce cleaner coverage profiles.

The --off-target flag will include off-target reads. Including them is recommended except for Amplicon data. For whole-exome data, the benefit is usually also limited unless the assay is inefficient with a high fraction of off-target reads (>10-15%).

The --genome version is needed to annotate exons with gene symbols. Use hg19/hg38 for human genomes, not b37/b38. You might get a warning that an annotation package is missing. For hg19, install r Biocpkg("TxDb.Hsapiens.UCSC.hg19.knownGene") in R.

The --export argument is optional. If provided, this script will store the modified intervals as BED file for example (again every r Biocpkg("rtracklayer") format is supported). This is useful when the coverages are calculated with third-party tools like GATK.

The --mappability argument should provide a r Biocpkg("rtracklayer") parsable file with a mappability score in the first meta data column. If provided, off-target regions will be restricted to regions specified in this file. On-target regions with low mappability will be excluded. For hg19, download the file from the UCSC website. Choose the kmer size that best fits your average mapped read length. For hg38, download recommended 76-kmer or 100-kmer mappability files through the courtesy of the Waldron lab from:

See the FAQ section of the main vignette for instruction how to generate such a file for other references.

Similarly, the --reptiming argument takes a replication timing score in the same format. If provided, GC-normalized and log-transformed coverage is tested for a linear relationship with this score and normalized accordingly. This is optional and provides only a minor benefit for coverage normalization, but can identify samples with high proliferation. Requires --off-target to be useful.

Create VCF files

r Biocpkg("PureCN") does not ship with a variant caller. Use a third-party tool to generate a VCF for each sample.

Important recommendations:

Run PureCN with internal segmentation

The following describes r Biocpkg("PureCN") runs with internal copy number normalization and segmentation.

What you will need:

Coverage

For each sample, tumor and normal, calculate GC-normalized coverages:

# Calculate and GC-normalize coverage from a BAM file 
$ Rscript $PURECN/Coverage.R --out-dir $OUT/$SAMPLEID \ 
    --bam ${SAMPLEID}.bam \
    --intervals $OUT_REF/baits_hg19_intervals.txt

Similar to GATK, this script also takes a text file containing a list of BAM or coverage file names (one per line). The file extension must be .list:

# Calculate and GC-normalize coverage from a list of BAM files 
$ Rscript $PURECN/Coverage.R --out-dir $OUT/normals \ 
    --bam normals.list \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --cores 4

Important recommendations:

# GC-normalize coverage from a GATK DepthOfCoverage file
Rscript $PURECN/Coverage.R --out-dir $OUT/$SAMPLEID \
    --coverage ${SAMPLEID}.coverage.sample_interval_summary \ 
    --intervals $OUT_REF/baits_hg19_intervals.txt

NormalDB

To build a normal database for coverage normalization, copy the paths to all (GC-normalized) normal coverage files in a single text file, line-by-line:

ls -a $OUT/normals/*_loess.txt.gz | cat > example_normal_coverages.list
# In case no GC-normalization is performed:
# ls -a $OUT/normals/*_coverage.txt.gz | cat > example_normal_coverages.list

$ Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --coverage-files example_normal_coverages.list \
    --genome hg19 --assay agilent_v6

# When normal panel VCF is available (highly recommended for
# unmatched samples)
$ Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --coverage-files example_normal_coverages.list \
    --normal-panel $NORMAL_PANEL \
    --genome hg19 \
    --assay agilent_v6

# For a Mutect2/GATK4 normal panel GenomicsDB (beta)
$ Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --coverage-files example_normal_coverages.list \
    --normal-panel $GENOMICSDB-WORKSPACE-PATH/pon_db \
    --genome hg19 \
    --assay agilent_v6

Important recommendations:

PureCN

Now that the assay-specific files are created and all coverages calculated, we run r Biocpkg("PureCN") to normalize, segment and determine purity and ploidy:

mkdir $OUT/$SAMPLEID

# Without a matched normal (minimal test run)
$ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt.gz \
    --sampleid $SAMPLEID \
    --vcf ${SAMPLEID}_mutect.vcf \
    --normaldb $OUT_REF/normalDB_hg19.rds \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --genome hg19 

# Production pipeline run
$ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt.gz \
    --sampleid $SAMPLEID \
    --vcf ${SAMPLEID}_mutect.vcf \
    --stats-file ${SAMPLEID}_mutect_stats.txt \
    --fun-segmentation PSCBS \
    --normaldb $OUT_REF/normalDB_hg19.rds \
    --mapping-bias-file $OUT_REF/mapping_bias_hg19.rds \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --snp-blacklist hg19_simpleRepeats.bed \
    --genome hg19 \
    --model betabin \
    --force --post-optimize --seed 123

# With a matched normal (test run; for production pipelines we recommend the
# unmatched workflow described above)
$ Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_coverage_loess.txt.gz \
    --normal $OUT/$SAMPLEID/${SAMPLEID_NORMAL}_coverage_loess.txt.gz \
    --sampleid $SAMPLEID \
    --vcf ${SAMPLEID}_mutect.vcf \
    --normaldb $OUT_REF/normalDB_hg19.rds \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --genome hg19

# Recreate output after manual curation of ${SAMPLEID}.csv
$ Rscript $PURECN/PureCN.R --rds $OUT/$SAMPLEID/${SAMPLEID}.rds

Important recommendations:

Run PureCN with third-party segmentation

Our internal r Biocpkg("PureCN") normalization combined with the PSCBS or GATK4 segmentation should produce highly competitive results and we encourage users to try it and compare it to their existing pipelines. However, we realize that often it is not an option to change tools in production pipelines and we therefore made it relatively easy to use r Biocpkg("PureCN") with third-party tools. We provide examples for CNVkit and GATK4 and it should be straightforward to adapt those for other tools.

What you will need:

General usage

If you already have a segmentation from third-party tools (for example CNVkit, GATK4, EXCAVATOR2). For a minimal test run:

Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID  \
    --sampleid $SAMPLEID \
    --seg-file $OUT/$SAMPLEID/${SAMPLEID}.cnvkit.seg \
    --vcf ${SAMPLEID}_mutect.vcf \
    --intervals $OUT_REF/baits_hg19_intervals.txt \
    --genome hg19 

See the main vignette for more details and file formats.

Recommended CNVkit usage

For a production pipeline run we provide again more information about the assay and genome. Here an CNVkit example:

# Recommended: Provide a normal panel VCF to remove mapping biases, pre-compute
# position-specific bias for much faster runtimes with large panels
# This needs to be done only once for each assay
Rscript $PURECN/NormalDB.R --out-dir $OUT_REF --normal-panel $NORMAL_PANEL \
    --assay agilent_v6 --genome hg19 --force

# Export the segmentation in DNAcopy format
cnvkit.py export seg $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cns --enumerate-chroms \
    -o $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg

# Run PureCN by providing the *.cnr and *.seg files 
Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID  \
    --sampleid $SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.cnr \
    --seg-file $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg \
    --mapping-bias-file $OUT_REF/mapping_bias_agilent_v6_hg19.rds \
    --vcf ${SAMPLEID}_mutect.vcf \
    --stats-file ${SAMPLEID}_mutect_stats.txt \
    --snp-blacklist hg19_simpleRepeats.bed \
    --genome hg19 \
    --fun-segmentation Hclust \
    --force --post-optimize --seed 123

Important recommendations:

Recommended GATK4 usage

# Recommended: Provide a normal panel GenomicsDB to remove mapping
# biases, pre-compute position-specific bias for much faster runtimes
# with large panels. This needs to be done only once for each assay. 
Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \
    --normal-panel $GENOMICSDB-WORKSPACE-PATH/pon_db \
    --assay agilent_v6 --genome hg19 --force

Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID  \
    --sampleid $SAMPLEID \
    --tumor $OUT/$SAMPLEID/${SAMPLEID}.hdf5 \
    --log-ratio-file $OUT/$SAMPLEID/${SAMPLEID}.denoisedCR.tsv \
    --seg-file $OUT/$SAMPLEID/${SAMPLEID}.modelFinal.seg \
    --mapping-bias-file $OUT_REF/mapping_bias_agilent_v6_hg19.rds \
    --vcf ${SAMPLEID}_mutect2_filtered.vcf \
    --snp-blacklist hg19_simpleRepeats.bed \
    --genome hg19 \
    --fun-segmentation Hclust \
    --force --post-optimize --seed 123

Important recommendations:

Biomarkers

Dx.R provides copy number and mutation metrics commonly used as biomarkers, most importantly tumor mutational burden (TMB), chromosomal instability (CIN) and mutational signatures.

# Provide a BED file with callable regions, for examples obtained by
# GATK CallableLoci. Useful to calculate mutations per megabase and
# to exclude low quality regions.
grep CALLABLE ${SAMPLEID}_callable_status.bed > \ 
    ${SAMPLEID}_callable_status_filtered.bed

# Only count mutations in callable regions, also subtract what was
# ignored in PureCN.R via --snp-blacklist, like simple repeats, from the
# mutation per megabase calculation
# Also search for the COSMIC mutation signatures
# (http://cancer.sanger.ac.uk/cosmic/signatures)
Rscript $PureCN/Dx.R --out $OUT/$SAMPLEID/$SAMPLEID \
    --rds $OUT/SAMPLEID/${SAMPLEID}.rds \
    --callable ${SAMPLEID}_callable_status_filtered.bed \
    --exclude hg19_simpleRepeats.bed \
    --signatures 

# Restrict mutation burden calculation to coding sequences
Rscript $PureCN/FilterCallableLoci.R --genome hg19 \
    --in-file ${SAMPLEID}_callable_status_filtered.bed \
    --out-file ${SAMPLEID}_callable_status_filtered_cds.bed \
    --exclude '^HLA'

Rscript $PureCN/Dx.R --out $OUT/$SAMPLEID/${SAMPLEID}_cds \
    --rds $OUT/SAMPLEID/${SAMPLEID}.rds \
    --callable ${SAMPLEID}_callable_status_filtered_cds.bed \
    --exclude hg19_simpleRepeats.bed

Important recommendations:

Reference

Argument name | Corresponding PureCN argument | PureCN function -----------------------|-------------------------------|---------------- --fasta | reference.file | preprocessIntervals --in-file | interval.file | preprocessIntervals --off-target | off.target | preprocessIntervals --average-target-width | average.target.width | preprocessIntervals --min-target-width | min.target.width | preprocessIntervals --small-targets | small.targets | preprocessIntervals --average-off-target-width | average.off.target.width | preprocessIntervals --off-target-seqlevels | off.target.seqlevels | preprocessIntervals --mappability | mappability | preprocessIntervals --min-mappability | min.mappability | preprocessIntervals --reptiming | reptiming | preprocessIntervals --average-reptiming-width | average.reptiming.width | preprocessIntervals
--genome | txdb, org | annotateTargets --out-file | | --export | | rtracklayer::export
--version -v | | --force -f | | --help -h | |

: (#tab:intervalfile) IntervalFile

Argument name | Corresponding PureCN argument | PureCN function -------------------|-------------------------------|---------------- --bam | bam.file | calculateBamCoverageByInterval --bai | index.file | calculateBamCoverageByInterval --coverage | coverage.file | correctCoverageBias --intervals | interval.file | correctCoverageBias --method | method | correctCoverageBias --keep-duplicates | keep.duplicates | calculateBamCoverageByInterval --chunks | chunks | calculateBamCoverageByInterval --remove-mapq0 | mapqFilter | ScanBamParam --skip-gc-norm | | correctCoverageBias --out-dir | | --cores | | Number of CPUs to use when multiple BAMs are provided --parallel | | Use default r Biocpkg("BiocParallel") backend when multiple BAMs are provided --seed | | --version -v | | --force -f | | --help -h | |

: (#tab:coverage) Coverage

Argument name | Corresponding PureCN argument | PureCN function -----------------------|-------------------------------|---------------- --coverage-files | normal.coverage.files | createNormalDatabase --normal-panel | normal.panel.vcf.file | calculateMappingBiasVcf --assay -a | Optional assay name | Used in output file names. --genome -g | Optional genome version | Used in output file names. --genomicsdb-af-field | For GenomicsDB import, allelic fraction field | calculateMappingBiasGatk4 --min-normals-position-specific-fit | min.normals.position.specific.fit | calculateMappingBiasVcf, calculateMappingBiasGatk4 --out-dir -o | | --version -v | | --force -f | | --help -h | |

: (#tab:normaldb) NormalDB

Argument name | Corresponding PureCN argument | PureCN function -----------------------|-------------------------------|---------------- --sampleid -i | sampleid | runAbsoluteCN
--normal | normal.coverage.file | runAbsoluteCN
--tumor | tumor.coverage.file | runAbsoluteCN --vcf | vcf.file | runAbsoluteCN
--rds | file.rds | readCurationFile
--mapping-bias-file | mapping.bias.file | setMappingBiasVcf --normaldb | normalDB (serialized with saveRDS) | calculateTangentNormal, filterTargets --seg-file | seg.file | runAbsoluteCN
--log-ratio-file | log.ratio | runAbsoluteCN
--additional-tumors | tumor.coverage.files | processMultipleSamples
--sex | sex | runAbsoluteCN
--genome | genome | runAbsoluteCN
--intervals | interval.file | runAbsoluteCN
--stats-file | stats.file | filterVcfMuTect
--min-af | af.range | filterVcfBasic --snp-blacklist | snp.blacklist | filterVcfBasic
--error | error | runAbsoluteCN
--db-info-flag | DB.info.flag | runAbsoluteCN
--popaf-info-field | POPAF.info.field | runAbsoluteCN
--cosmic-cnt-info-field | Cosmic.CNT.info.field | runAbsoluteCN
--min-cosmic-cnt | min.cosmic.cnt | setPriorVcf
--interval-padding | interval.padding | filterVcfBasic
--min-total-counts | min.total.counts | filterIntervals
--min-fraction-offtarget | min.fraction.offtarget | filterIntervals
--fun-segmentation | fun.segmentation | runAbsoluteCN
--alpha | alpha | segmentationCBS --undo-sd | undo.SD | segmentationCBS --changepoints-penalty| changepoints.penalty | segmentationGATK4 --additional-cmd-args| additional.cmd.args | segmentationGATK4 --max-segments | max.segments | runAbsoluteCN --min-logr-sdev | min.logr.sdev | runAbsoluteCN --min-purity | test.purity | runAbsoluteCN --max-purity | test.purity | runAbsoluteCN --min-ploidy | min.ploidy | runAbsoluteCN --max-ploidy | max.ploidy | runAbsoluteCN --max-copy-number | test.num.copy | runAbsoluteCN --post-optimize | post.optimize | runAbsoluteCN
--bootstrap-n | n | bootstrapResults --speedup-heuristics | speedup.heuristics | runAbsoluteCN
--model-homozygous | model.homozygous | runAbsoluteCN
--model | model | runAbsoluteCN
--log-ratio-calibration | log.ratio.calibration | runAbsoluteCN
--max-non-clonal | max.non.clonal | runAbsoluteCN
--max-homozygous-loss | max.homozygous.loss | runAbsoluteCN
--out-vcf | return.vcf | predictSomatic --out -o | |
--parallel | BPPARAM | runAbsoluteCN --cores | BPPARAM | runAbsoluteCN --seed | | --version -v | | --force -f | | --help -h | |

: (#tab:purecn) PureCN

Argument name | Corresponding PureCN argument | PureCN function ----------------|-------------------------------|---------------- --rds | file.rds | readCurationFile
--callable | callable | callMutationBurden --exclude | exclude | callMutationBurden --max-prior-somatic | max.prior.somatic | callMutationBurden --signatures | | deconstructSigs::whichSignatures --signature-databases | | deconstructSigs::whichSignatures --out | | --version -v | | --force -f | | --help -h | |

: (#tab:dx) Dx



lima1/PureCN documentation built on April 24, 2024, 8:23 p.m.