kmerKLPlot calls calcKL, which calculates the Kullback-Leibler
divergence between the kmer distribution at each position and the
kmer distribution across all positions. kmerKLPlot then plots each
kmer's contribution to the total KL divergence as stacked bars, for
a subset of the kmers. Since there are 4^k possible kmers for a given
value of k, plotting all of them often makes the plot hard to
interpret; however, one can increase n.kmers to a number greater than
the number of possible kmers to force kmerKLPlot to plot the entire
KL divergence and all terms (which are kmers) in the sum.
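For reference, the quantity described above can be written out as follows. This is a sketch in notation introduced here (p_i, q and w are not names used by the package), and it assumes the divergence is taken from the positional distribution to the overall distribution, with the base-2 logarithm:

```latex
% Notation introduced here for illustration; not taken from the package.
% p_i(w): frequency of kmer w at position i
% q(w):   frequency of kmer w across all positions
D_{\mathrm{KL}}(p_i \,\|\, q) = \sum_{w} p_i(w) \log_2 \frac{p_i(w)}{q(w)}
```

The sum runs over all 4^k kmers w, and each stacked bar segment corresponds to one term of this sum at one position.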
If x is a list, the KL kmer plots are faceted by sample; this allows
comparison to a FASTA file of random reads.
Again, please note that this is not the total KL divergence,
but rather the KL divergence calculated on a subset of the sample
space (the top n.kmers kmers selected).
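To see why the plotted value differs from the total KL divergence, here is a minimal base R sketch using made-up frequencies for four hypothetical kmers. It does not call qrqc, and ranking kmers by the size of their term is an assumption made for illustration, not necessarily the selection rule kmerKLPlot uses:

```r
# Toy distributions over four hypothetical kmers (made-up numbers, not qrqc output).
p <- c(AAA = 0.50, CCC = 0.20, GGG = 0.20, TTT = 0.10)  # one position
q <- c(AAA = 0.25, CCC = 0.25, GGG = 0.25, TTT = 0.25)  # all positions

# Each kmer's term of the KL sum, in bits.
kl.terms <- p * log2(p / q)

# Full divergence: the sum over every kmer.
kl.total <- sum(kl.terms)

# Partial sum over the top n kmers (ranking by term size is an assumption).
n.kmers <- 2
top <- sort(kl.terms, decreasing = TRUE)[seq_len(n.kmers)]
kl.partial <- sum(top)

print(round(c(total = kl.total, partial = kl.partial), 3))
```

In this toy case the partial sum is larger than the total, because the omitted terms are negative; the two values agree only when every kmer is included.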
kmerKLPlot(x, n.kmers=20)

x 
an S4 object of a class that inherits from 
n.kmers 
an integer value indicating the number of top kmers to include.
signature(x = "SequenceSummary")
kmerKLPlot will plot the KL divergence for a subset of kmers for a
single object that inherits from SequenceSummary.
signature(x = "list")
kmerKLPlot will plot the KL divergence for a subset of kmers for each
of the objects that inherit from SequenceSummary in the list and
display them in a series of panels.
The KL divergence calculation in calcKL uses base 2 in the log; the
units are in bits.
Also, note that ggplot2 warns that "Stacking is not well defined when
ymin != 0". This occurs when some kmers are less frequent in the
positional distribution than in the distribution across all positions,
so the corresponding term of the KL sum is negative (producing a bar
below zero). This does not appear to affect the plot much. In the
examples below, warnings are suppressed, but given that this is a valid
concern from ggplot2, warnings are not suppressed in the function
itself.
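The sign behaviour described in this note can be checked with a small base R sketch. The frequencies are hypothetical, and taking the overall distribution as the mean over positions is a simplifying assumption (it holds only if every position has the same number of kmers); none of this calls qrqc or ggplot2:

```r
# Hypothetical kmer frequencies: rows are positions, columns are kmers.
pos.freq <- rbind(
  pos1 = c(AC = 0.70, GT = 0.20, TT = 0.10),
  pos2 = c(AC = 0.30, GT = 0.30, TT = 0.40)
)

# Distribution across all positions (assumed here to be the mean over positions).
all.freq <- colMeans(pos.freq)

# Term of the KL sum for each position and kmer, in bits.
# sweep() divides each column by that kmer's overall frequency.
terms <- pos.freq * log2(sweep(pos.freq, 2, all.freq, "/"))

# Terms are negative exactly where a kmer is rarer at a position than overall.
negative <- terms < 0
print(negative)

# The sum over all kmers at each position is still non-negative.
print(rowSums(terms))
```

Individual segments can therefore fall below zero even though the complete divergence at each position cannot, which is what triggers the stacking warning.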
Vince Buffalo <vsbuffalo@ucdavis.edu>
getKmer, calcKL, kmerEntropyPlot
## Load the qrqc package, which provides readSeqFile and kmerKLPlot
library(qrqc)
## Load a somewhat contaminated FASTQ file
s.fastq <- readSeqFile(system.file('extdata', 'test.fastq',
  package='qrqc'), hash.prop=1)
## Load a really contaminated FASTQ file
s.contam.fastq <- readSeqFile(system.file('extdata',
  'testcontam.fastq', package='qrqc'), hash.prop=1)
## Load a random (equal base frequency) FASTA file
s.random.fasta <- readSeqFile(system.file('extdata',
  'random.fasta', package='qrqc'), type="fasta", hash.prop=1)
## Make KL divergence plot; shows slight 5'-end bias. Note units
## (bits)
suppressWarnings(kmerKLPlot(s.fastq))
## Plot multiple KL divergence plots
suppressWarnings(kmerKLPlot(list("highly contaminated"=s.contam.fastq,
  "less contaminated"=s.fastq, "random"=s.random.fasta)))
