runAbsoluteCN: Run PureCN implementation of ABSOLUTE
In lima1/PureCN: Copy number calling and SNV classification using targeted short read sequencing

runAbsoluteCN

R Documentation

Run PureCN implementation of ABSOLUTE

Description

This function takes as input tumor and normal control coverage data and a VCF containing allelic fractions of germline variants and somatic mutations. Normal control does not need to be from the same patient. In case VCF does not contain somatic status, it should contain either dbSNP or population allele frequencies, and optionally COSMIC annotation. Returns purity and ploidy combinations, sorted by likelihood score. Provides copy number and LOH data, by both gene and genomic region.

Usage

runAbsoluteCN(
  normal.coverage.file = NULL,
  tumor.coverage.file = NULL,
  log.ratio = NULL,
  seg.file = NULL,
  seg.file.sdev = 0.4,
  vcf.file = NULL,
  normalDB = NULL,
  genome,
  centromeres = NULL,
  sex = c("?", "F", "M", "diploid"),
  fun.filterVcf = filterVcfMuTect,
  args.filterVcf = list(),
  fun.setPriorVcf = setPriorVcf,
  args.setPriorVcf = list(),
  fun.setMappingBiasVcf = setMappingBiasVcf,
  args.setMappingBiasVcf = list(),
  fun.filterIntervals = filterIntervals,
  args.filterIntervals = list(),
  fun.segmentation = segmentationCBS,
  args.segmentation = list(),
  fun.focal = findFocal,
  args.focal = list(),
  sampleid = NULL,
  min.ploidy = 1.4,
  max.ploidy = 6,
  test.num.copy = 0:7,
  test.purity = seq(0.15, 0.95, by = 0.01),
  prior.purity = NULL,
  prior.K = 0.999,
  prior.contamination = 0.01,
  max.candidate.solutions = 20,
  candidates = NULL,
  min.coverage = 15,
  max.coverage.vcf = 300,
  max.non.clonal = 0.2,
  max.homozygous.loss = c(0.05, 1e+07),
  non.clonal.M = 1/3,
  max.mapping.bias = 0.8,
  max.pon = 3,
  iterations = 30,
  min.variants.segment = 5,
  log.ratio.calibration = 0.1,
  smooth.log.ratio = TRUE,
  model.homozygous = FALSE,
  error = 0.001,
  interval.file = NULL,
  max.dropout = c(0.95, 1.1),
  min.logr.sdev = 0.15,
  max.logr.sdev = 0.6,
  max.segments = 300,
  min.gof = 0.8,
  min.variants = 20,
  plot.cnv = TRUE,
  vcf.field.prefix = "",
  cosmic.vcf.file = NULL,
  DB.info.flag = "DB",
  POPAF.info.field = "POP_AF",
  Cosmic.CNT.info.field = "Cosmic.CNT",
  min.pop.af = 0.001,
  model = c("beta", "betabin"),
  post.optimize = FALSE,
  speedup.heuristics = 2,
  BPPARAM = NULL,
  log.file = NULL,
  verbose = TRUE
)

Arguments

`normal.coverage.file`	Coverage file of normal control (optional if log.ratio is provided - then it will be only used to filter low coverage exons). Should be already GC-normalized with `correctCoverageBias`. Needs to be either a file name or data read with the `readCoverageFile` function.
`tumor.coverage.file`	Coverage file of tumor. If `NULL`, requires `seg.file` and an interval file via `interval.file`. Should be already GC-normalized with `correctCoverageBias`. Needs to be either a file name or data read with the `readCoverageFile` function.
`log.ratio`	Copy number log-ratios for all exons in the coverage files. If `NULL`, calculated based on coverage files.
`seg.file`	Segmented data. Optional, to support third-pary segmentation tools. If `NULL`, use coverage files or `log.ratio` to segment the data.
`seg.file.sdev`	If `seg.file` provided, the log-ratio standard deviation, used to model likelihood of sub-clonal copy number events.
`vcf.file`	VCF file. Optional, but typically needed to select between local optima of similar likelihood. Can also be a `CollapsedVCF`, read with the `readVcf` function. Requires a DB info flag for likely germline status. The default `fun.setPriorVcf` function will also look for a Cosmic.CNT slot (see `cosmic.vcf.file`), containing the hits in the COSMIC database. Again, do not expect very useful results without a VCF file.
`normalDB`	Normal database, created with `createNormalDatabase`. If provided, used to calculate gene-level p-values (requires `Gene` column in `interval.file`) and to filter targets with low coverage in the pool of normal samples.
`genome`	Genome version, for example hg19. See `readVcf`.
`centromeres`	A `GRanges` object with centromere positions. If `NULL`, use pre-stored positions for genome versions hg18, hg19 and hg38.
`sex`	Sex of sample. If `?`, detect using `getSexFromCoverage` function and default parameters. Default parameters might not work well with every assay and might need to be tuned. If set to diploid, then PureCN will assume all chromosomes are diploid and will not try to detect sex.
`fun.filterVcf`	Function for filtering variants. Expected output is a list with elements `vcf` (`CollapsedVCF`), flag (`logical(1)`) and `flag_comment` (`character(1)`). The flags will be added to the output data and can be used to warn users, for example when samples look too noisy. Default filter will remove variants flagged by MuTect, but will keep germline variants. If ran in matched normal mode, it will by default use somatic status of variants and filter non-somatic calls with allelic fraction significantly different from 0.5 in normal. Defaults to `filterVcfMuTect`, which in turn also calls `filterVcfBasic`.
`args.filterVcf`	Arguments for variant filtering function. Arguments `vcf`, `tumor.id.in.vcf`, `min.coverage`, `model.homozygous` and `error` are required in the filter function and are automatically set.
`fun.setPriorVcf`	Function to set prior for somatic status for each variant in the VCF. Defaults to `setPriorVcf`.
`args.setPriorVcf`	Arguments for somatic prior function.
`fun.setMappingBiasVcf`	Function to set mapping bias for each variant in the VCF. Defaults to `setMappingBiasVcf`.
`args.setMappingBiasVcf`	Arguments for mapping bias function.
`fun.filterIntervals`	Function for filtering low-quality intervals in the coverage files. Needs to return a `logical` vector whether an interval should be used for segmentation. Defaults to `filterIntervals`.
`args.filterIntervals`	Arguments for target filtering function. Arguments `normal`, `tumor`, `log.ratio`, `min.coverageseg.file` and `normalDB` are required and automatically set.
`fun.segmentation`	Function for segmenting the copy number log-ratios. Expected return value is a `data.frame` representation of the segmentation. Defaults to `segmentationCBS`.
`args.segmentation`	Arguments for segmentation function. Arguments `normal`, `tumor`, `log.ratio`, `plot.cnv`, `sampleid`, `vcf`, `tumor.id.in.vcf`, `centromeres` are required in the segmentation function and automatically set.
`fun.focal`	Function for identifying focal amplifications. Defaults to `findFocal`.
`args.focal`	Arguments for focal amplification function.
`sampleid`	Sample id, provided in output files etc.
`min.ploidy`	Minimum ploidy to be considered.
`max.ploidy`	Maximum ploidy to be considered.
`test.num.copy`	Copy numbers tested in the grid search. Note that focal amplifications can have much higher copy numbers, but they will be labeled as subclonal (because they do not fit the integer copy numbers).
`test.purity`	Considered tumor purity values.
`prior.purity`	`numeric(length(test.purity))` with priors for tested purity values. If `NULL`, use flat priors.
`prior.K`	This defines the prior probability that the multiplicity of a SNV corresponds to either the maternal or the paternal copy number (for somatic variants additionally to a multiplicity of 1). For perfect segmentations, this value would be 1; values smaller than 1 thus may provide some robustness against segmentation errors.
`prior.contamination`	The prior probability that a known SNP is from a different individual.
`max.candidate.solutions`	Number of local optima considered in optimization and variant fitting steps. If there are too many local optima, it will use specified number of top candidate solutions, but will also include all optima close to diploid, because silent genomes have often lots of local optima.
`candidates`	Candidates to optimize from a previous run (`return.object$candidates`). If `NULL`, do 2D grid search and find local optima.
`min.coverage`	Minimum coverage in both normal and tumor. Intervals and variants with lower coverage are ignored. This value is provided to the `args.filterIntervals` and `args.filterVcf` lists, but can be overwritten in these lists if different cutoffs for the coverage and variant filters are wanted. To increase the sensitivity of homozygous deletions in high purity samples, the coverage cutoff in tumor is automatically lowered by 50 percent if the normal coverage is high.
`max.coverage.vcf`	This will set the maximum number of reads in the SNV fitting. This is to avoid that small non-reference biases that come apparent only at high coverages have a dramatic influence on likelihood scores. Only relevant for `model = "beta"`.
`max.non.clonal`	Maximum genomic fraction assigned to a subclonal copy number state.
`max.homozygous.loss`	`double(2)` with maximum chromosome fraction assigned to homozygous loss and maximum size of a homozygous loss segment.
`non.clonal.M`	Average expected cellular fraction of sub-clonal somatic mutations. This is to calculate expected allelic fractions of a single sub-clonal bin for variants. For all somatic variants, more accurate cellular fractions are calculated.
`max.mapping.bias`	Exclude variants with high mapping bias from the likelihood score calculation. Note that bias is reported on an inverse scale; a variant with mapping bias of 1 has no bias.
`max.pon`	Exclude variants found more than `max.pon` times in pool of normals and not in germline databases. Requires `mapping.bias.file` in `setMappingBiasVcf`. Should be set to a value high enough to be much more likely an artifact and not a true germline variant not present in germline databases.
`iterations`	Maximum number of iterations in the Simulated Annealing copy number fit optimization. Note that this an integer optimization problem that should converge quickly. Allowed range is 10 to 250.
`min.variants.segment`	Flag segments with fewer variants. The minor copy number estimation is not reliable with insufficient variants.
`log.ratio.calibration`	Re-calibrate log-ratios in the window `purity*log.ratio.calibration`.
`smooth.log.ratio`	Smooth `log.ratio` using the `DNAcopy` package.
`model.homozygous`	Homozygous germline SNPs are uninformative and by default removed. In 100 percent pure samples such as cell lines, however, heterozygous germline SNPs appear homozygous in case of LOH. Setting this parameter to `TRUE` will keep homozygous SNPs and include a homozygous SNP state in the likelihood model. Not necessary when matched normal samples are available.
`error`	Estimated sequencing error rate. Used to calculate minimum number of supporting reads for variants using `calculatePowerDetectSomatic`. Also used to calculate the probability of homozygous SNP allelic fractions (assuming reference reads are sequencing errors).
`interval.file`	A mapping file that assigns GC content and gene symbols to each exon in the coverage files. Used for generating gene-level calls. First column in format CHR:START-END. Second column GC content (0 to 1). Third column gene symbol. This file is generated with the `preprocessIntervals` function.
`max.dropout`	Measures GC bias as ratio of coverage in AT-rich (GC < 0.5) versus GC-rich on-target regions (GC >= 0.5). High coverage drop-out might indicate that data was not GC-normalized (optional with larger pool of normal samples). A warning pointing to a normalized log-ratio drop-out likely indicates that the sample quality is insufficient. For log-ratio drop-out, a warning is thrown when half the `max.dropout` is reached since it is calculated using both tumor and normal. Requires `interval.file`.
`min.logr.sdev`	Minimum log-ratio standard deviation used in the model. Useful to make fitting more robust to outliers in very clean data.
`max.logr.sdev`	Flag noisy samples with segment log-ratio standard deviation larger than this. Assay specific and needs to be calibrated.
`max.segments`	Flag noisy samples with a large number of segments. Assay specific and needs to be calibrated.
`min.gof`	Flag purity/ploidy solutions with poor fit.
`min.variants`	Do not attempt to fit allelic fractions for samples with fewer variants passing all filters.
`plot.cnv`	Generate segmentation plots.
`vcf.field.prefix`	Prefix all newly created VCF field names with this string.
`cosmic.vcf.file`	Add a `Cosmic.CNT` info field to the provided `vcf.file` using a VCF file containing the COSMIC database. The default `fun.setPriorVcf` function will give variants found in the COSMIC database a higher prior probability of being somatic. Not used in likelhood model when matched normal is available in `vcf.file`. Should be compressed and indexed with bgzip and tabix, respectively.
`DB.info.flag`	Flag in INFO of VCF that marks presence in common germline databases. Defaults to `DB` that may contain somatic variants if it is from an unfiltered germline database.
`POPAF.info.field`	As alternative to a flag, use an info field that contains population allele frequencies. The `DB` info flag has priority over this field when both exist.
`Cosmic.CNT.info.field`	Info field containing hits in the Cosmic database
`min.pop.af`	Minimum population allele frequency in `POPAF.info.field` to set a high germline prior probability.
`model`	Use either a beta or a beta-binomial distribution for fitting observed to expected allelic fractions of alterations in `vcf.file`. The latter can be useful to account for significant overdispersion, for example due to mapping biases when no pool of normals is available or due to other unmodeled biases, e.g. amplification biases. The beta-binomial model is only recommended with a sufficiently sized pool of normal samples (more than 10 normals)
`post.optimize`	Optimize purity using final SCNA-fit and variants. This might take a long time when lots of variants need to be fitted, but will typically result in a slightly more accurate purity, especially for rather silent genomes or very low purities. Otherwise, it will just use the purity determined via the SCNA-fit.
`speedup.heuristics`	Tries to avoid spending computation time on local optima that are unlikely correct. Set to 0 to turn this off, to 1 to only apply heuristics that in worst case will decrease accuracy slightly or to 2 to turn on all heuristics.
`BPPARAM`	`BiocParallelParam` object. If `NULL`, does not use parallelization for fitting local optima.
`log.file`	If not `NULL`, store verbose output to file.
`verbose`	Verbose output.

Value

A list with elements

`candidates`	Results of the grid search.
`results`	All local optima, sorted by final rank.
`input`	The input data.

Author(s)

Markus Riester

References

Riester et al. (2016). PureCN: Copy number calling and SNV classification using targeted short read sequencing. Source Code for Biology and Medicine, 11, pp. 13.

Carter et al. (2012), Absolute quantification of somatic DNA alterations in human cancer. Nature Biotechnology.

Examples


normal.coverage.file <- system.file('extdata', 'example_normal_tiny.txt',
    package = 'PureCN')
tumor.coverage.file <- system.file('extdata', 'example_tumor_tiny.txt',
    package = 'PureCN')
vcf.file <- system.file('extdata', 'example.vcf.gz',
    package = 'PureCN')
interval.file <- system.file('extdata', 'example_intervals_tiny.txt',
    package = 'PureCN')

# The max.candidate.solutions, max.ploidy and test.purity parameters are set to
# non-default values to speed-up this example.  This is not a good idea for real
# samples.
ret <-runAbsoluteCN(normal.coverage.file = normal.coverage.file,
    tumor.coverage.file = tumor.coverage.file, genome = 'hg19',
    vcf.file = vcf.file, sampleid = 'Sample1',
    interval.file = interval.file, max.ploidy = 4,
    test.purity = seq(0.3, 0.7, by = 0.05), max.candidate.solutions = 1)


# If a high-quality segmentation was obtained with third-party tools:
seg.file <- system.file('extdata', 'example_seg.txt',
    package = 'PureCN')

# By default, PureCN will re-segment the data, for example to identify
# regions of copy number neutral LOH. If this is not wanted, we can provide
# a minimal segmentation function which just returns the provided one:
funSeg <- function(seg, ...) return(seg)

res <- runAbsoluteCN(seg.file = seg.file, fun.segmentation = funSeg,
    max.ploidy = 4, test.purity = seq(0.3, 0.7, by = 0.05),
    max.candidate.solutions = 1,
    genome='hg19', interval.file = interval.file)

lima1/PureCN documentation built on June 15, 2025, 5:30 a.m.