Given a set of chimeric reads, this methods computes all putative breakpoints. First, chimeric reads are clustered such that all reads spanning the same breakpoint form a cluster. Then, a consensus breakpoint sequence and breakpoint position is computed for each cluster.
A list storing chimeric reads as returned by
The maximum distance in base pairs between the breakpoints of two chimeric reads at which the reads are merge to a cluster.
Cluster whose size is below minClusterSize are be excluded from breakpoint detection.
If true, soft-clipped bases at the beginning or the end of a sequence are removed (see details below).
This method is usually invoked after calling
filterChimericReads and before calling
mergeBreakpoints. It first forms clusters of chimeric
reads (reads with exactly two local alignments) that span the same
breakpoint and than computes a consensus breakpoint sequence for each
To carry out a hierarchical clustering, a measure for the distance
between two chimeric reads must be defined. If reads span different
chromosomes, their distance is set to infinity. The strand
information of the local alignments may also indicate that two chimeric
reads do not span the same breakpoint even if they span the same
chromosomes. For example, the first reads has two local alignments on
the positive strand whereas the second read has one local alignment on
the positive strand and the other on the negative strand. In this case,
the distance is set to inifinty, too. Finally, the distance measure
distinguishes between the two breakpoints (sometimes
called the pathogenic and the reciproce breakpoint) that originate from
the same structual variant. The distance between a read from the
pathogenic and a read from the reciproce breakpoint is infinity so that
two different clusters will emerge. These two related breakpoints can be
merge later using the
mergeBreakpoints method. We observed
that the breakpoints of these two cases often differ by a few ten or even
a few hundred basepairs.
If the chromosome and strand information between two reads x and y are coherent, the Euclidian distance is used:
d(x,y) = (bp(x, ChrA) - bp(y, ChrA))^2 + (bp(x, ChrB) - bp(y, ChrB))^2
where bp gives the coordinates of the breakpoint for the given
read and chromosome. Hierarchical clustering is applied with complete
linkage and the dendrogram is cutted at a height of
obtain the final clusters. The
bpDist argument does usually not
influence the result, because we observed that reads spanning the same
breakpoint have very little variation (only a few base pairs) in their
local alignments due to sequencing errors or due to ambiguity caused by
same/similar sequence of both chromosome near the breakpoint.
Although the given set of reads may belong to the same chimeric DNA,
their individual breakpoints may differ in a few base pairs.
Furthermore, a single read may have more than one possible breakpoint
if a (small) part of the read was aligned to both parts.
The following step determines a consensus breakpoint for each cluster. It uses the supplied
bsGenome to construct a
chimeric reference sequence for all possible breakpoints over all reads within
each cluster. After the reads were realigned to the chimeric reference
sequences, the one that yields the highest alignment score is taken to
represent best the chimeric DNA and its breakpoints.
As a preprocessing step,
detectBreakpoints offers to remove soft
clips occuring after the alignment:
Some reads may contain soft-clipped bases (e.g. linker sequences) at the beginning of the first part of the read or at the end of the second part. By default,
detectBreakpoints removes these unaligned subsequences
and adjusts the cigar string, the sequence, the sequence width
(qwidth) and the local start/end coordinates.
detectBreakpoints returns an object of class
breakpoints, which is a list of breakpoint clusters, which gives
access to all alignments and consensus breakpoints:
A dataframe listing the breakpoints for both parts of the chimeric reference, the associated chromosome, strand and the reference sequence itself, including positions "localStart"/"localEnd" indicating which part of the reference belongs to which breakpoint.
An object of class
On the basis of
Hans-Ulrich Klein, Christoph Bartenhagen
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
# Construct a small example with three chimeric reads # (=6 local alignments) in bam format as given by # aligners such as BWA-SW. # The first two reads originate from the same case but # from different strands. The third read originate from # the reciprocal breakpoint. library("BSgenome.Scerevisiae.UCSC.sacCer2") bamReads = list() bamReads[] = list( qname=c("seq1", "seq1", "seq2", "seq2", "seq3", "seq3"), flag = as.integer(c(0, 0, 16, 16, 0, 0)), rname = factor(c("II", "III", "III", "II", "III", "II")), strand = factor(c("+", "+", "-", "-", "+", "+")), pos = as.integer(c(99951, 200000, 200000, 99951, 199950, 100001)), qwidth = as.integer(c(100, 100, 100, 100, 100, 100)), cigar = c("50M50S","50S50M","50S50M","50M50S","50M50S", "50S50M"), seq = DNAStringSet(c( paste(substr(Scerevisiae$chrII, start=99951, stop=100000), substr(Scerevisiae$chrIII, start=200000, stop=200049), sep=""), paste(substr(Scerevisiae$chrII, start=99951, stop=100000), substr(Scerevisiae$chrIII, start=200000, stop=200049), sep=""), paste(substr(Scerevisiae$chrIII, start=200000, stop=200049), substr(Scerevisiae$chrII, start=99951, stop=100000), sep=""), paste(substr(Scerevisiae$chrIII, start=200000, stop=200049), substr(Scerevisiae$chrII, start=99951, stop=100000), sep=""), paste(substr(Scerevisiae$chrIII, start=199950, stop=199999), substr(Scerevisiae$chrII, start=100001, stop=100050), sep=""), paste(substr(Scerevisiae$chrIII, start=199950, stop=199999), substr(Scerevisiae$chrII, start=100001, stop=100050), sep=""))) ) bps = detectBreakpoints(bamReads, minClusterSize=1, bsGenome=Scerevisiae) summary(bps) table(bps) mergeBreakpoints(bps)