Plot K-L Divergence Components for a Subset of k-mers to Inspect for Contamination


Description

kmerKLPlot calls calcKL, which calculates the Kullback-Leibler divergence between the k-mer distribution at each position and the k-mer distribution across all positions. kmerKLPlot then plots each k-mer's contribution to the total K-L divergence as stacked bars, for a subset of the k-mers. Since there are 4^k possible k-mers for a given k, plotting all of them often obscures the interpretation; however, one can set n.kmers to a number greater than the number of possible k-mers to force kmerKLPlot to plot the entire K-L divergence and all terms (one per k-mer) in the sum.

If x is a list, the K-L k-mer plots are faceted by sample; this allows comparison to a FASTA file of random reads.

Please note that what is plotted is not the total K-L divergence, but rather the K-L divergence calculated on a subset of the sample space (the top n.kmers k-mers selected).
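The per-k-mer terms and the effect of taking a subset can be illustrated in base R. The following is a minimal sketch on made-up counts, not the calcKL implementation: the matrix counts and the helper kl_terms are hypothetical names introduced here. It assumes each term has the form p * log2(p / q), where p is a k-mer's frequency at one position and q is its frequency across all positions, and it assumes k-mers are ranked by their largest term.

```r
## Toy counts: rows are k-mers, columns are positions (made-up data)
counts <- matrix(c(8, 1, 1, 2,
                   2, 3, 3, 4,
                   1, 3, 3, 2,
                   1, 5, 5, 4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("AAA", "ACG", "GTT", "TTT"),
                                 paste0("pos", 1:4)))

## q: k-mer distribution across all positions
q <- rowSums(counts) / sum(counts)

## Hypothetical helper: K-L terms (in bits) for one position's counts
kl_terms <- function(pos.counts, q) {
  p <- pos.counts / sum(pos.counts)
  ifelse(p > 0, p * log2(p / q), 0)
}

## One column of terms per position, one row per k-mer
terms <- apply(counts, 2, kl_terms, q = q)

## Keep the top n.kmers k-mers, ranked here by their largest term
## (this ranking rule is an assumption of the sketch)
n.kmers <- 2
top <- names(sort(apply(terms, 1, max), decreasing = TRUE))[seq_len(n.kmers)]

total.kl  <- colSums(terms)                        # full K-L divergence per position
subset.kl <- colSums(terms[top, , drop = FALSE])   # sum over the selected k-mers only
```

Comparing total.kl with subset.kl shows why the stacked bars for a subset generally do not add up to the full divergence at a position.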

Usage

  kmerKLPlot(x, n.kmers=20)

Arguments

x

an S4 object of a class that inherits from SequenceSummary (as returned by readSeqFile), or a named list of objects that inherit from SequenceSummary.

n.kmers

an integer value indicating the number of top k-mers to include.

Methods

signature(x = "SequenceSummary")

kmerKLPlot will plot the K-L divergence for a subset of k-mers for a single object that inherits from SequenceSummary.

signature(x = "list")

kmerKLPlot will plot the K-L divergence for a subset of k-mers for each of the objects that inherit from SequenceSummary in the list and display them in a series of panels.

Note

The K-L divergence calculation in calcKL uses base 2 in the log; the units are in bits.

Also, note that ggplot2 warns that "Stacking is not well defined when ymin != 0". This occurs when some k-mers are less frequent in the positional distribution than in the distribution across all positions, so that the corresponding term of the K-L sum is negative (producing a bar below zero). This does not appear to affect the plot much. In the examples below, warnings are suppressed, but given that this is a valid concern from ggplot2, warnings are not suppressed in the function itself.
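The sign behaviour and the units can be checked with a small base R sketch. The two distributions below are made up for illustration, and the term formula p * log2(p / q) is the standard K-L definition rather than code taken from calcKL.

```r
## Made-up distributions over four k-mers
p <- c(AA = 0.70, AC = 0.10, CA = 0.10, CC = 0.10)  # at one position
q <- c(AA = 0.25, AC = 0.25, CA = 0.25, CC = 0.25)  # across all positions

## Individual K-L terms, in bits (log base 2)
terms <- p * log2(p / q)

## k-mers with p < q contribute negative terms (bars below zero)
negative <- names(terms)[terms < 0]

## The full sum is still non-negative
kl.bits <- sum(terms)

## The same divergence in nats, converted back to bits
kl.nats <- sum(p * log(p / q))
kl.from.nats <- kl.nats / log(2)
```

Here only AA is over-represented at the position, so it carries the single positive term, while the other three k-mers each pull the stack slightly below zero.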

Author(s)

Vince Buffalo <vsbuffalo@ucdavis.edu>

See Also

getKmer, calcKL, kmerEntropyPlot

Examples

  ## Load a somewhat contaminated FASTQ file
  s.fastq <- readSeqFile(system.file('extdata', 'test.fastq',
    package='qrqc'), hash.prop=1)

  ## Load a really contaminated FASTQ file
  s.contam.fastq <- readSeqFile(system.file('extdata',
    'test-contam.fastq', package='qrqc'), hash.prop=1)

  ## Load a random (equal base frequency) FASTA file
  s.random.fasta <- readSeqFile(system.file('extdata',
    'random.fasta', package='qrqc'), type="fasta", hash.prop=1)

  ## Make K-L divergence plot - shows slight 5'-end bias. Note units
  ## (bits)
  suppressWarnings(kmerKLPlot(s.fastq))

  ## Plot multiple K-L divergence plots
  suppressWarnings(kmerKLPlot(list("highly contaminated"=s.contam.fastq,
    "less contaminated"=s.fastq, "random"=s.random.fasta)))
