Plot K-L Divergence Components for a Subset of k-mers to Inspect for Contamination
kmerKLPlot uses calcKL, which calculates the
Kullback-Leibler divergence between the k-mer distribution at each
position and the k-mer distribution across all
positions. kmerKLPlot then plots each k-mer's contribution to
the total K-L divergence as stacked bars, for a subset of the
k-mers. Since there are 4^k possible k-mers for a given k,
plotting all of them often dilutes the interpretation; however, one can
set n.kmers to a number greater than the possible number
of k-mers to force
kmerKLPlot to plot the entire K-L divergence
and all terms (one per k-mer) in the sum.
If x is a
list, the K-L k-mer plots are faceted by
sample; this allows comparison to a FASTA file of random reads.
Again, please note that this is not the total K-L divergence,
but rather the K-L divergence calculated on a subset of the sample
space (those of the top
n.kmers k-mers selected).
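
The difference between the total K-L divergence and the sum over a subset of k-mers can be sketched in base R. This is a minimal illustration on made-up counts, not the code inside calcKL; the names counts, p, q, kl.terms and n.top are hypothetical, and ranking k-mers by their summed contribution is an assumption about how a "top" subset might be chosen.

```r
# Toy counts (made-up): rows are k-mers, columns are read positions.
counts <- matrix(c(40, 10, 10,
                   20, 30, 30,
                   20, 30, 30,
                   20, 30, 30),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(kmer = c("AAA", "ACG", "GTC", "TTT"),
                                 position = c("1", "2", "3")))

# p: k-mer distribution at each position (each column sums to 1).
p <- sweep(counts, 2, colSums(counts), "/")

# q: k-mer distribution across all positions.
q <- rowSums(counts) / sum(counts)

# One K-L term per k-mer and position, in bits (log base 2).
# q is recycled down the rows, so q[i] is paired with k-mer i.
kl.terms <- p * log2(p / q)

# The total K-L divergence at a position is the sum over all k-mers.
kl.total <- colSums(kl.terms)

# Summing only the top n.top k-mers gives a partial sum, not the total.
# ASSUMPTION: "top" is taken here as largest summed contribution.
n.top <- 2
top <- names(sort(rowSums(kl.terms), decreasing = TRUE))[seq_len(n.top)]
kl.subset <- colSums(kl.terms[top, , drop = FALSE])
```

Comparing kl.subset with kl.total shows that the partial sum can differ from the total at every position, which is why the plotted bars should not be read as the full K-L divergence.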
x: an S4 object of a class that inherits from SequenceSummary, or a list of such objects.
n.kmers: an integer value indicating the number of top k-mers to include.
signature(x = "SequenceSummary")
kmerKLPlot will plot the K-L divergence for a subset of k-mers for a single object that inherits from SequenceSummary.
signature(x = "list")
kmerKLPlot will plot the K-L divergence for a subset of k-mers for each of the objects that inherit from
SequenceSummary in the list and display them in a series of panels.
The K-L divergence calculation in
calcKL uses base 2 in the
log; the units are in bits.
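
As a small illustration of the units, the sketch below computes the same divergence with a base-2 and a natural logarithm. The two distributions are made up for the example and the variable names are hypothetical; only base R functions are used.

```r
# Two made-up distributions over three outcomes.
p <- c(0.5, 0.25, 0.25)
q <- c(0.25, 0.25, 0.5)

# Base-2 logarithm gives the divergence in bits.
kl.bits <- sum(p * log2(p / q))

# Natural logarithm gives the divergence in nats.
kl.nats <- sum(p * log(p / q))

# Converting nats to bits: divide by log(2).
kl.bits.from.nats <- kl.nats / log(2)
```

A value read from the plot's y-axis is therefore smaller by a factor of log(2) when expressed in nats.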
Also, note that
ggplot2 warns that "Stacking is not well defined when ymin
!= 0". This occurs when some k-mers are less frequent in the positional
distribution than the distribution across all positions, and the term of
the K-L sum is negative (producing a bar below zero). This does not
appear to affect the plot much. In the examples below, warnings are
suppressed, but given that this is a valid concern from ggplot2,
warnings are not suppressed in the function itself.
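
The sign of each term can be checked with a short base R sketch. The distributions are made up and the names are hypothetical; this only illustrates the arithmetic described above, not how kmerKLPlot builds its bars.

```r
# Made-up distributions over three k-mers.
p <- c(AAA = 0.6, ACG = 0.3, TTT = 0.1)  # at one position
q <- c(AAA = 0.4, ACG = 0.3, TTT = 0.3)  # across all positions

# One K-L term per k-mer, in bits.
terms <- p * log2(p / q)

# A term is negative exactly where the k-mer is rarer at this
# position than across all positions.
negative <- terms < 0

# The full sum is still non-negative, even with a negative term.
total <- sum(terms)
```

Here only TTT produces a bar below zero, while the sum over all three k-mers remains positive.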
Vince Buffalo <email@example.com>
library(qrqc)

## Load a somewhat contaminated FASTQ file
s.fastq <- readSeqFile(system.file('extdata', 'test.fastq',
    package='qrqc'), hash.prop=1)

## Load a really contaminated FASTQ file
s.contam.fastq <- readSeqFile(system.file('extdata',
    'test-contam.fastq', package='qrqc'), hash.prop=1)

## Load a random (equal base frequency) FASTA file
s.random.fasta <- readSeqFile(system.file('extdata',
    'random.fasta', package='qrqc'), type="fasta", hash.prop=1)

## Make K-L divergence plot - shows slight 5'-end bias. Note units
## (bits)
suppressWarnings(kmerKLPlot(s.fastq))

## Plot multiple K-L divergence plots
suppressWarnings(kmerKLPlot(list("highly contaminated"=s.contam.fastq,
    "less contaminated"=s.fastq, "random"=s.random.fasta)))