compareSVs: Compare the simulation with a set of SVs

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

A typical use case of structural variant (SV) simulation with simulateSV is the evaluation of SV detection algorithms. The function compareSV looks for breakpoint overlaps between the output of the simulation (ground truth) and the output of an SV detection program and computes the sensitivity and precision. There is currently no common standard format for SVs. Because the main information about SVs is their position in the genome and, sometimes, the breakpoint sequence (depending on the SV detection algorithm), compareSV expects the SV detections in a simple BED- or BEDPE format.

Usage

1
compareSV(querySVs, simSVs, tol=200)

Arguments

querySVs

The set detected of SVs. Either a filename for a table of SVs on disk or a data.frame. The table has to be in BED- or BEDPE-format (see details below). The querySVs may only contain SVs of one type and not a mixture of several kinds of SVs.

simSVs

The set of simulated SVs as returned by the function simulateSV. It can be either a filename for a table of SVs on disk or a data.frame.

tol

The tolerance in bp, when comparing the (approximate) positions in querySVs with the exact, simulated breakpoints simSVs. Positions in querySVs may only differ by at most tol base pairs from the simulated breakpoints.

Details

An overlap is defined as the overlap between the breakpoints / breakpoint regions up to the given tolerance in bp. Overlap does not mean the whole affected region between the start and end of the SV.
The comparison has to be done for each type of SV separately. It is required to use the returned tables from simulateSV for the argument simSVs. For querySVs, the tables have to be in BED- or BEDPE-format, a simple tab-separated table with genome coordinates and an optional column with the breakpoint sequence (bpSeq):

The BEDPE-format is required for insertions and translocations, since they involve two regions on the genome. For other SVs, the BEDPE allows to specify approximate regions for each breakpoint. For example a deletion "chr:start-end" can be given as BED-file with columns chr, start, end or (a little redundant) as BEDPE-file with columns chr, start, start, chr, end, end or (with some tolerance) like chr, start-tol, start+tol, chr, end-tol, end+tol. The tolerance can also be regulated by the function argument tol.

Value

The table of simulated SVs, as given in the function argument querySVs, but with additional columns for the overlapping region in querySVs and the percentage overlap between the breakpoint sequences (if they were provided as a column in querySVs).
Furthermore, the function prints the sensitivity and precision in the R console.

Author(s)

Christoph Bartenhagen

References

More information about the BED-format can be found on the BEDTools homepage: http://code.google.com/p/bedtools

See Also

simulateSV

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
  ## Toy example: Artificial genome with two chromosomes
  genome = DNAStringSet(c("AAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTT", "GGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC"))
  names(genome) = c("chr1","chr2")

  #######################
  ## Example 1: Deletions
  ## Simulation of 5 deletions of 5bp each
  sim = simulateSV(output=NA, genome=genome, dels=5, sizeDels=5, bpSeqSize=10, seed=246)
  simSVs = metadata(sim)$deletions

  ## An SV detection in BED format may look like this:
  ## Four of five deletions were detected; two with exact and two with an approximate breakpoint
  ## Two additional deletions were detected, which were not part of the simulation
  ## The column with the breakpoint sequence is optional, the column names not important (BED-files have no header)
  querySVs = data.frame(
    chr=c("chr1","chr1","chr1","chr2","chr2","chr2"), 
    start=c(4,12,20,10,21,34), 
    end=c(8,16,28,14,31,38), 
    bpSeq=c("AAAAAAAAAA", "AAAAAAAAAT", "AAAATTTTTT", "GGGGGGGGGG", "GGGGGGGCCC", "CCCCCCCCCC")
  )
  
  ## Compare the SVs with 0bp tolerance:
  ## Only the two exact detections have an overlap
  simSVs_overlap1 = compareSV(querySVs, simSVs, tol=0)
  simSVs_overlap1

  ## Increasing the breakpoint tolerance to +/- 3bp :
  ## Now, the overlap also includes the more imprecise detections
  ## And the sensitivity and precision are better
  ## Note that for deletion2, the breakpoint sequence matches only by 50%
  simSVs_overlap2 = compareSV(querySVs, simSVs, tol=3)
  simSVs_overlap2

  ############################
  ## Example 2: Translocations
  ## Simulation of 2 translocations (only one of them is balanced):
  sim = simulateSV(output=NA, genome=genome, trans=2, percBalancedTrans=0.5, bpSeqSize=10, seed=246)
  simSVs = metadata(sim)$translocations
  
  ## Detected translocations have to be given in BEDPE-format (i.e. at least six columns with chr,start,end for breakpoints on both chromosomes)
  ## In this example, the breakpoints were approximated up to 1 or 2bp
  ## Optional breakpoint sequences are missing
  querySVs = data.frame(
    chr=c("chr2", "chr1", "chr2"), 
    start1=c(25,3,9), 
    end1=c(29,7,12), 
    chr2=c("chr1","chr2","chr1"),
    start2=c(22,10,3),
    end2=c(25,13,4)
  )

  simSVs_overlap = compareSV(querySVs, simSVs, tol=0)
  simSVs_overlap

Budheimer/RSVSim documentation built on May 30, 2019, 1:26 p.m.