runAbsoluteCN: Run PureCN implementation of ABSOLUTE

View source: R/runAbsoluteCN.R

runAbsoluteCNR Documentation

Run PureCN implementation of ABSOLUTE

Description

This function takes as input tumor and normal control coverage data and a VCF containing allelic fractions of germline variants and somatic mutations. Normal control does not need to be from the same patient. In case VCF does not contain somatic status, it should contain either dbSNP or population allele frequencies, and optionally COSMIC annotation. Returns purity and ploidy combinations, sorted by likelihood score. Provides copy number and LOH data, by both gene and genomic region.

Usage

runAbsoluteCN(
  normal.coverage.file = NULL,
  tumor.coverage.file = NULL,
  log.ratio = NULL,
  seg.file = NULL,
  seg.file.sdev = 0.4,
  vcf.file = NULL,
  normalDB = NULL,
  genome,
  centromeres = NULL,
  sex = c("?", "F", "M", "diploid"),
  fun.filterVcf = filterVcfMuTect,
  args.filterVcf = list(),
  fun.setPriorVcf = setPriorVcf,
  args.setPriorVcf = list(),
  fun.setMappingBiasVcf = setMappingBiasVcf,
  args.setMappingBiasVcf = list(),
  fun.filterIntervals = filterIntervals,
  args.filterIntervals = list(),
  fun.segmentation = segmentationCBS,
  args.segmentation = list(),
  fun.focal = findFocal,
  args.focal = list(),
  sampleid = NULL,
  min.ploidy = 1.4,
  max.ploidy = 6,
  test.num.copy = 0:7,
  test.purity = seq(0.15, 0.95, by = 0.01),
  prior.purity = NULL,
  prior.K = 0.999,
  prior.contamination = 0.01,
  max.candidate.solutions = 20,
  candidates = NULL,
  min.coverage = 15,
  max.coverage.vcf = 300,
  max.non.clonal = 0.2,
  max.homozygous.loss = c(0.05, 1e+07),
  non.clonal.M = 1/3,
  max.mapping.bias = 0.8,
  max.pon = 3,
  iterations = 30,
  min.variants.segment = 5,
  log.ratio.calibration = 0.1,
  smooth.log.ratio = TRUE,
  model.homozygous = FALSE,
  error = 0.001,
  interval.file = NULL,
  max.dropout = c(0.95, 1.1),
  min.logr.sdev = 0.15,
  max.logr.sdev = 0.6,
  max.segments = 300,
  min.gof = 0.8,
  min.variants = 20,
  plot.cnv = TRUE,
  vcf.field.prefix = "",
  cosmic.vcf.file = NULL,
  DB.info.flag = "DB",
  POPAF.info.field = "POP_AF",
  Cosmic.CNT.info.field = "Cosmic.CNT",
  min.pop.af = 0.001,
  model = c("beta", "betabin"),
  post.optimize = FALSE,
  speedup.heuristics = 2,
  BPPARAM = NULL,
  log.file = NULL,
  verbose = TRUE
)

Arguments

normal.coverage.file

Coverage file of normal control (optional if log.ratio is provided - then it will be only used to filter low coverage exons). Should be already GC-normalized with correctCoverageBias. Needs to be either a file name or data read with the readCoverageFile function.

tumor.coverage.file

Coverage file of tumor. If NULL, requires seg.file and an interval file via interval.file. Should be already GC-normalized with correctCoverageBias. Needs to be either a file name or data read with the readCoverageFile function.

log.ratio

Copy number log-ratios for all exons in the coverage files. If NULL, calculated based on coverage files.

seg.file

Segmented data. Optional, to support third-pary segmentation tools. If NULL, use coverage files or log.ratio to segment the data.

seg.file.sdev

If seg.file provided, the log-ratio standard deviation, used to model likelihood of sub-clonal copy number events.

vcf.file

VCF file. Optional, but typically needed to select between local optima of similar likelihood. Can also be a CollapsedVCF, read with the readVcf function. Requires a DB info flag for likely germline status. The default fun.setPriorVcf function will also look for a Cosmic.CNT slot (see cosmic.vcf.file), containing the hits in the COSMIC database. Again, do not expect very useful results without a VCF file.

normalDB

Normal database, created with createNormalDatabase. If provided, used to calculate gene-level p-values (requires Gene column in interval.file) and to filter targets with low coverage in the pool of normal samples.

genome

Genome version, for example hg19. See readVcf.

centromeres

A GRanges object with centromere positions. If NULL, use pre-stored positions for genome versions hg18, hg19 and hg38.

sex

Sex of sample. If ?, detect using getSexFromCoverage function and default parameters. Default parameters might not work well with every assay and might need to be tuned. If set to diploid, then PureCN will assume all chromosomes are diploid and will not try to detect sex.

fun.filterVcf

Function for filtering variants. Expected output is a list with elements vcf (CollapsedVCF), flag (logical(1)) and flag_comment (character(1)). The flags will be added to the output data and can be used to warn users, for example when samples look too noisy. Default filter will remove variants flagged by MuTect, but will keep germline variants. If ran in matched normal mode, it will by default use somatic status of variants and filter non-somatic calls with allelic fraction significantly different from 0.5 in normal. Defaults to filterVcfMuTect, which in turn also calls filterVcfBasic.

args.filterVcf

Arguments for variant filtering function. Arguments vcf, tumor.id.in.vcf, min.coverage, model.homozygous and error are required in the filter function and are automatically set.

fun.setPriorVcf

Function to set prior for somatic status for each variant in the VCF. Defaults to setPriorVcf.

args.setPriorVcf

Arguments for somatic prior function.

fun.setMappingBiasVcf

Function to set mapping bias for each variant in the VCF. Defaults to setMappingBiasVcf.

args.setMappingBiasVcf

Arguments for mapping bias function.

fun.filterIntervals

Function for filtering low-quality intervals in the coverage files. Needs to return a logical vector whether an interval should be used for segmentation. Defaults to filterIntervals.

args.filterIntervals

Arguments for target filtering function. Arguments normal, tumor, log.ratio, min.coverageseg.file and normalDB are required and automatically set.

fun.segmentation

Function for segmenting the copy number log-ratios. Expected return value is a data.frame representation of the segmentation. Defaults to segmentationCBS.

args.segmentation

Arguments for segmentation function. Arguments normal, tumor, log.ratio, plot.cnv, sampleid, vcf, tumor.id.in.vcf, centromeres are required in the segmentation function and automatically set.

fun.focal

Function for identifying focal amplifications. Defaults to findFocal.

args.focal

Arguments for focal amplification function.

sampleid

Sample id, provided in output files etc.

min.ploidy

Minimum ploidy to be considered.

max.ploidy

Maximum ploidy to be considered.

test.num.copy

Copy numbers tested in the grid search. Note that focal amplifications can have much higher copy numbers, but they will be labeled as subclonal (because they do not fit the integer copy numbers).

test.purity

Considered tumor purity values.

prior.purity

numeric(length(test.purity)) with priors for tested purity values. If NULL, use flat priors.

prior.K

This defines the prior probability that the multiplicity of a SNV corresponds to either the maternal or the paternal copy number (for somatic variants additionally to a multiplicity of 1). For perfect segmentations, this value would be 1; values smaller than 1 thus may provide some robustness against segmentation errors.

prior.contamination

The prior probability that a known SNP is from a different individual.

max.candidate.solutions

Number of local optima considered in optimization and variant fitting steps. If there are too many local optima, it will use specified number of top candidate solutions, but will also include all optima close to diploid, because silent genomes have often lots of local optima.

candidates

Candidates to optimize from a previous run (return.object$candidates). If NULL, do 2D grid search and find local optima.

min.coverage

Minimum coverage in both normal and tumor. Intervals and variants with lower coverage are ignored. This value is provided to the args.filterIntervals and args.filterVcf lists, but can be overwritten in these lists if different cutoffs for the coverage and variant filters are wanted. To increase the sensitivity of homozygous deletions in high purity samples, the coverage cutoff in tumor is automatically lowered by 50 percent if the normal coverage is high.

max.coverage.vcf

This will set the maximum number of reads in the SNV fitting. This is to avoid that small non-reference biases that come apparent only at high coverages have a dramatic influence on likelihood scores. Only relevant for model = "beta".

max.non.clonal

Maximum genomic fraction assigned to a subclonal copy number state.

max.homozygous.loss

double(2) with maximum chromosome fraction assigned to homozygous loss and maximum size of a homozygous loss segment.

non.clonal.M

Average expected cellular fraction of sub-clonal somatic mutations. This is to calculate expected allelic fractions of a single sub-clonal bin for variants. For all somatic variants, more accurate cellular fractions are calculated.

max.mapping.bias

Exclude variants with high mapping bias from the likelihood score calculation. Note that bias is reported on an inverse scale; a variant with mapping bias of 1 has no bias.

max.pon

Exclude variants found more than max.pon times in pool of normals and not in germline databases. Requires mapping.bias.file in setMappingBiasVcf. Should be set to a value high enough to be much more likely an artifact and not a true germline variant not present in germline databases.

iterations

Maximum number of iterations in the Simulated Annealing copy number fit optimization. Note that this an integer optimization problem that should converge quickly. Allowed range is 10 to 250.

min.variants.segment

Flag segments with fewer variants. The minor copy number estimation is not reliable with insufficient variants.

log.ratio.calibration

Re-calibrate log-ratios in the window purity*log.ratio.calibration.

smooth.log.ratio

Smooth log.ratio using the DNAcopy package.

model.homozygous

Homozygous germline SNPs are uninformative and by default removed. In 100 percent pure samples such as cell lines, however, heterozygous germline SNPs appear homozygous in case of LOH. Setting this parameter to TRUE will keep homozygous SNPs and include a homozygous SNP state in the likelihood model. Not necessary when matched normal samples are available.

error

Estimated sequencing error rate. Used to calculate minimum number of supporting reads for variants using calculatePowerDetectSomatic. Also used to calculate the probability of homozygous SNP allelic fractions (assuming reference reads are sequencing errors).

interval.file

A mapping file that assigns GC content and gene symbols to each exon in the coverage files. Used for generating gene-level calls. First column in format CHR:START-END. Second column GC content (0 to 1). Third column gene symbol. This file is generated with the preprocessIntervals function.

max.dropout

Measures GC bias as ratio of coverage in AT-rich (GC < 0.5) versus GC-rich on-target regions (GC >= 0.5). High coverage drop-out might indicate that data was not GC-normalized (optional with larger pool of normal samples). A warning pointing to a normalized log-ratio drop-out likely indicates that the sample quality is insufficient. For log-ratio drop-out, a warning is thrown when half the max.dropout is reached since it is calculated using both tumor and normal. Requires interval.file.

min.logr.sdev

Minimum log-ratio standard deviation used in the model. Useful to make fitting more robust to outliers in very clean data.

max.logr.sdev

Flag noisy samples with segment log-ratio standard deviation larger than this. Assay specific and needs to be calibrated.

max.segments

Flag noisy samples with a large number of segments. Assay specific and needs to be calibrated.

min.gof

Flag purity/ploidy solutions with poor fit.

min.variants

Do not attempt to fit allelic fractions for samples with fewer variants passing all filters.

plot.cnv

Generate segmentation plots.

vcf.field.prefix

Prefix all newly created VCF field names with this string.

cosmic.vcf.file

Add a Cosmic.CNT info field to the provided vcf.file using a VCF file containing the COSMIC database. The default fun.setPriorVcf function will give variants found in the COSMIC database a higher prior probability of being somatic. Not used in likelhood model when matched normal is available in vcf.file. Should be compressed and indexed with bgzip and tabix, respectively.

DB.info.flag

Flag in INFO of VCF that marks presence in common germline databases. Defaults to DB that may contain somatic variants if it is from an unfiltered germline database.

POPAF.info.field

As alternative to a flag, use an info field that contains population allele frequencies. The DB info flag has priority over this field when both exist.

Cosmic.CNT.info.field

Info field containing hits in the Cosmic database

min.pop.af

Minimum population allele frequency in POPAF.info.field to set a high germline prior probability.

model

Use either a beta or a beta-binomial distribution for fitting observed to expected allelic fractions of alterations in vcf.file. The latter can be useful to account for significant overdispersion, for example due to mapping biases when no pool of normals is available or due to other unmodeled biases, e.g. amplification biases. The beta-binomial model is only recommended with a sufficiently sized pool of normal samples (more than 10 normals)

post.optimize

Optimize purity using final SCNA-fit and variants. This might take a long time when lots of variants need to be fitted, but will typically result in a slightly more accurate purity, especially for rather silent genomes or very low purities. Otherwise, it will just use the purity determined via the SCNA-fit.

speedup.heuristics

Tries to avoid spending computation time on local optima that are unlikely correct. Set to 0 to turn this off, to 1 to only apply heuristics that in worst case will decrease accuracy slightly or to 2 to turn on all heuristics.

BPPARAM

BiocParallelParam object. If NULL, does not use parallelization for fitting local optima.

log.file

If not NULL, store verbose output to file.

verbose

Verbose output.

Value

A list with elements

candidates

Results of the grid search.

results

All local optima, sorted by final rank.

input

The input data.

Author(s)

Markus Riester

References

Riester et al. (2016). PureCN: Copy number calling and SNV classification using targeted short read sequencing. Source Code for Biology and Medicine, 11, pp. 13.

Carter et al. (2012), Absolute quantification of somatic DNA alterations in human cancer. Nature Biotechnology.

See Also

correctCoverageBias segmentationCBS calculatePowerDetectSomatic

Examples


normal.coverage.file <- system.file('extdata', 'example_normal_tiny.txt',
    package = 'PureCN')
tumor.coverage.file <- system.file('extdata', 'example_tumor_tiny.txt',
    package = 'PureCN')
vcf.file <- system.file('extdata', 'example.vcf.gz',
    package = 'PureCN')
interval.file <- system.file('extdata', 'example_intervals_tiny.txt',
    package = 'PureCN')

# The max.candidate.solutions, max.ploidy and test.purity parameters are set to
# non-default values to speed-up this example.  This is not a good idea for real
# samples.
ret <-runAbsoluteCN(normal.coverage.file = normal.coverage.file,
    tumor.coverage.file = tumor.coverage.file, genome = 'hg19',
    vcf.file = vcf.file, sampleid = 'Sample1',
    interval.file = interval.file, max.ploidy = 4,
    test.purity = seq(0.3, 0.7, by = 0.05), max.candidate.solutions = 1)


# If a high-quality segmentation was obtained with third-party tools:
seg.file <- system.file('extdata', 'example_seg.txt',
    package = 'PureCN')

# By default, PureCN will re-segment the data, for example to identify
# regions of copy number neutral LOH. If this is not wanted, we can provide
# a minimal segmentation function which just returns the provided one:
funSeg <- function(seg, ...) return(seg)

res <- runAbsoluteCN(seg.file = seg.file, fun.segmentation = funSeg,
    max.ploidy = 4, test.purity = seq(0.3, 0.7, by = 0.05),
    max.candidate.solutions = 1,
    genome='hg19', interval.file = interval.file)


lima1/PureCN documentation built on Sept. 17, 2024, 5:48 a.m.