svevalOl: SV evaluation based on overlap and variant size

View source: R/svevalOl.R

svevalOlR Documentation

SV evaluation based on overlap and variant size

Description

Compares SVs from a call-set to SVs from a truth-set.

Usage

svevalOl(
  calls.gr,
  truth.gr,
  max.ins.dist = 20,
  min.ol = 0.5,
  min.del.rol = 0.1,
  range.seq.comp = FALSE,
  ins.seq.comp = FALSE,
  nb.cores = 1,
  min.size = 50,
  max.size = Inf,
  bed.regions = NULL,
  bed.regions.ol = 0.5,
  qual.field = c("GQ", "QUAL"),
  sample.name = NULL,
  outfile = NULL,
  out.bed.prefix = NULL,
  qual.ths = c(0, 2, 3, 4, 5, 7, 10, 12, 14, 21, 27, 35, 45, 50, 60, 75, 90, 99, 110,
    133, 167, 180, 250, 350, 450, 550, 650),
  qual.quantiles = seq(0, 1, 0.1),
  check.inv = FALSE,
  geno.eval = FALSE,
  stitch.hets = FALSE,
  stitch.dist = 20,
  merge.hets = FALSE,
  merge.rol = 0.8,
  method = c("coverage", "bipartite"),
  simprep = NULL,
  log.level = c("CRITICAL", "WARNING", "INFO")
)

Arguments

calls.gr

call set. A GRanges or the path to a VCF file.

truth.gr

truth set. A GRanges or the path to a VCF file.

max.ins.dist

maximum distance for insertions to be clustered. Default is 20.

min.ol

the minimum overlap/coverage to be considered a match. Default is 0.5

min.del.rol

minimum reciprocal overlap for deletions. Default is 0.1

range.seq.comp

compare sequence instead of only overlapping deletions/inversion/etc. Default is FALSE.

ins.seq.comp

compare sequence instead of insertion sizes. Default is FALSE.

nb.cores

number of processors to use. Default is 1.

min.size

the minimum SV size to be considered. Default 50.

max.size

the maximum SV size to be considered. Default is Inf.

bed.regions

If non-NULL, a GRanges object or path to a BED file (no headers) with regions of interest.

bed.regions.ol

minimum proportion of sv.gr that must overlap regions.gr. Default is 0.5

qual.field

fields to use as quality.

sample.name

the name of the sample to use if VCF files given as input. If NULL (default), use first sample.

outfile

the TSV file to output the results. If NULL (default), returns a data.frame.

out.bed.prefix

prefix for the output BED files. If NULL (default), no BED output.

qual.ths

the quality thresholds for the PR curve. If NULL, will use quantiles. see the qual.quantiles parameter below.

qual.quantiles

the quality quantiles for the PR curve, if qual.ths is NULL. Default is (0, .1, ..., .9, 1).

check.inv

should the sequence of MNV be compared to identify inversions.

geno.eval

should het/hom be evaluated separately (genotype evaluation). Default FALSE.

stitch.hets

should clustered hets be stitched together before genotype evatuation. Default is FALSE.

stitch.dist

the maximum distance to stitch hets during genotype evaluation.

merge.hets

should similar hets be merged into homs before genotype evaluation. Default is FALSE.

merge.rol

the minimum reciprocal overlap to merge hets before genotype evaluation.

method

the method to annotate the overlap. Either 'coverage' (default) for the cumulative coverage (e.g. to deal with fragmented calls); or 'bipartite' for a 1-to-1 matching of variants in the calls and truth sets.

simprep

optional simple repeat annotation. Default is NULL. If non-NULL, GRanges to be used to extend variants when overlapping/clustering

log.level

the level of information in the log. Default is "CRITICAL" (basically no log).

Details

Different overlapping approaches are available. See svOverlap for more details. We recommend using the default method='coverage' when evaluating the calls (absence/presence of SVs) and method='bipartite' when evaluating genotypes. The latter will match a variant in the call-set with at most one variant in the truth-set which will penalyze. one-to-many configurations as is usually preferred when comparing exact genotypes. To further switch on the 'genotype evaluation' mode, use geno.eval=TRUE. It can also help to stitch fragmented calls and merge heterozygotes into homozygotes using merge.hets=TRUE and stitch.hets=TRUE (see example below).

The evaluation will be performed for different quality thresholds on the call-set in order to make a precision-recall curve. If the VCF are to be read it will look for the field specified in qual.field (GQ by default, and QUAL if not found). If you don't want the PR curve, the evaluation can be sped up by running with for example qual.ths=0.

Equivalent SVs are sometimes recorded as quite different variants because placed at different locations of a short tandem repeat. For example, imagine a large 100 bp tandem repeat in the reference genome. An expansion of 50 bp might be represented as a 50 bp insertion at the beginning of the repeat in the callset but at the end of the repeat in the truth set. Because they are distant by 100 bp they might not match. Instead of increasing the distance threshold too much, passing an annotation of known simple repeats in the simprep= parameter provides a more flexible way of matching variants by first extending them with nearby simple repeats. In this example, because we know of this tandem repeat, both insertions will be extended to span the full annotated reference repeat, hence ensuring that they are matched and compared (e.g. by reciprocal size or sequence alignment distance) short tandem repeat.

Value

a list with

eval

a data.frame with TP, FP and FN for each SV type when including all variants

curve

a data.frame with TP, FP and FN for each SV type when using different quality thesholds

svs

a list of GRanges object with FP, TP and FN for each SV type (using quality threshold with best F1).

mqual.bestf1

the quality threshold that produces best F1 (and corresponding to 'svs' GRanges).

Author(s)

Jean Monlong

Examples

## Not run: 
## From VCF files
eval = svevalOl('calls.vcf', 'truth.vcf')

## From GRanges
calls.gr = readSVvcf('calls.vcf')
truth.gr = readSVvcf('truth.vcf')
eval = svevalOl(calls.gr, truth.gr)

## Genotype evaluation
eval = svevalOl(calls.gr, truth.gr, geno.eval=TRUE, merge.hets=TRUE,
                stitch.hets=TRUE, method='bipartite')

## End(Not run)

jmonlong/sveval documentation built on July 31, 2023, 7:50 p.m.