svevalOl | R Documentation |
Compares SVs from a call-set to SVs from a truth-set.
svevalOl(
calls.gr,
truth.gr,
max.ins.dist = 20,
min.ol = 0.5,
min.del.rol = 0.1,
range.seq.comp = FALSE,
ins.seq.comp = FALSE,
nb.cores = 1,
min.size = 50,
max.size = Inf,
bed.regions = NULL,
bed.regions.ol = 0.5,
qual.field = c("GQ", "QUAL"),
sample.name = NULL,
outfile = NULL,
out.bed.prefix = NULL,
qual.ths = c(0, 2, 3, 4, 5, 7, 10, 12, 14, 21, 27, 35, 45, 50, 60, 75, 90, 99, 110,
133, 167, 180, 250, 350, 450, 550, 650),
qual.quantiles = seq(0, 1, 0.1),
check.inv = FALSE,
geno.eval = FALSE,
stitch.hets = FALSE,
stitch.dist = 20,
merge.hets = FALSE,
merge.rol = 0.8,
method = c("coverage", "bipartite"),
simprep = NULL,
log.level = c("CRITICAL", "WARNING", "INFO")
)
calls.gr |
call set. A GRanges or the path to a VCF file. |
truth.gr |
truth set. A GRanges or the path to a VCF file. |
max.ins.dist |
maximum distance for insertions to be clustered. Default is 20. |
min.ol |
the minimum overlap/coverage to be considered a match. Default is 0.5 |
min.del.rol |
minimum reciprocal overlap for deletions. Default is 0.1 |
range.seq.comp |
compare sequence instead of only overlapping deletions/inversion/etc. Default is FALSE. |
ins.seq.comp |
compare sequence instead of insertion sizes. Default is FALSE. |
nb.cores |
number of processors to use. Default is 1. |
min.size |
the minimum SV size to be considered. Default 50. |
max.size |
the maximum SV size to be considered. Default is Inf. |
bed.regions |
If non-NULL, a GRanges object or path to a BED file (no headers) with regions of interest. |
bed.regions.ol |
minimum proportion of sv.gr that must overlap regions.gr. Default is 0.5 |
qual.field |
fields to use as quality. |
sample.name |
the name of the sample to use if VCF files given as input. If NULL (default), use first sample. |
outfile |
the TSV file to output the results. If NULL (default), returns a data.frame. |
out.bed.prefix |
prefix for the output BED files. If NULL (default), no BED output. |
qual.ths |
the quality thresholds for the PR curve. If NULL, will use quantiles. see the |
qual.quantiles |
the quality quantiles for the PR curve, if qual.ths is NULL. Default is (0, .1, ..., .9, 1). |
check.inv |
should the sequence of MNV be compared to identify inversions. |
geno.eval |
should het/hom be evaluated separately (genotype evaluation). Default FALSE. |
stitch.hets |
should clustered hets be stitched together before genotype evatuation. Default is FALSE. |
stitch.dist |
the maximum distance to stitch hets during genotype evaluation. |
merge.hets |
should similar hets be merged into homs before genotype evaluation. Default is FALSE. |
merge.rol |
the minimum reciprocal overlap to merge hets before genotype evaluation. |
method |
the method to annotate the overlap. Either 'coverage' (default) for the cumulative coverage (e.g. to deal with fragmented calls); or 'bipartite' for a 1-to-1 matching of variants in the calls and truth sets. |
simprep |
optional simple repeat annotation. Default is NULL. If non-NULL, GRanges to be used to extend variants when overlapping/clustering |
log.level |
the level of information in the log. Default is "CRITICAL" (basically no log). |
Different overlapping approaches are available. See svOverlap
for more details.
We recommend using the default method='coverage'
when evaluating the calls (absence/presence
of SVs) and method='bipartite'
when evaluating genotypes. The latter will match
a variant in the call-set with at most one variant in the truth-set which will penalyze.
one-to-many configurations as is usually preferred when comparing exact genotypes. To further
switch on the 'genotype evaluation' mode, use geno.eval=TRUE
. It can also help to stitch
fragmented calls and merge heterozygotes into homozygotes using merge.hets=TRUE
and
stitch.hets=TRUE
(see example below).
The evaluation will be performed for different quality thresholds on the call-set in order to make
a precision-recall curve. If the VCF are to be read it will look for the field specified in
qual.field
(GQ by default, and QUAL if not found). If you don't want the PR curve,
the evaluation can be sped up by running with for example qual.ths=0
.
Equivalent SVs are sometimes recorded as quite different variants because placed at
different locations of a short tandem repeat. For example, imagine a large 100 bp
tandem repeat in the reference genome. An expansion of 50 bp might be represented
as a 50 bp insertion at the beginning of the repeat in the callset but at the end
of the repeat in the truth set. Because they are distant by 100 bp they might not
match. Instead of increasing the distance threshold too much, passing an annotation of
known simple repeats in the simprep=
parameter provides
a more flexible way of matching variants by first extending them with nearby simple
repeats. In this example, because we know of this tandem repeat, both insertions will
be extended to span the full annotated reference repeat, hence ensuring that they are
matched and compared (e.g. by reciprocal size or sequence alignment distance)
short tandem repeat.
a list with
eval |
a data.frame with TP, FP and FN for each SV type when including all variants |
curve |
a data.frame with TP, FP and FN for each SV type when using different quality thesholds |
svs |
a list of GRanges object with FP, TP and FN for each SV type (using quality threshold with best F1). |
mqual.bestf1 |
the quality threshold that produces best F1 (and corresponding to 'svs' GRanges). |
Jean Monlong
## Not run:
## From VCF files
eval = svevalOl('calls.vcf', 'truth.vcf')
## From GRanges
calls.gr = readSVvcf('calls.vcf')
truth.gr = readSVvcf('truth.vcf')
eval = svevalOl(calls.gr, truth.gr)
## Genotype evaluation
eval = svevalOl(calls.gr, truth.gr, geno.eval=TRUE, merge.hets=TRUE,
stitch.hets=TRUE, method='bipartite')
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.