bamReadStats: Quality Control Statistics of Raw Sequencing Reads

bamReadStats    R Documentation



Assess quality metrics of a file of raw sequencing reads, including read length, Phred scores, nucleotide distributions, and flow cell tile variation.


bamReadStats(filein, sampleID, statsPath = "BamReadStats", 
	calcStats = TRUE, plotAllTiles = FALSE, baseOrder = "ACGTN", 
	chunkSize = 1e+05, maxReads = NULL, pause = 0)

fastqReadStats(filein, sampleID, statsPath = "FastqReadStats", 
	calcStats = TRUE, plotAllTiles = FALSE, baseOrder = "ACGTN", 
	chunkSize = 1e+05, maxReads = NULL, pause = 0)

readStats(filein, sampleID, ...)



filein: Full pathname of an existing file of raw sequencing reads. FASTQ files may be gzip-compressed.


sampleID: SampleID for this file, used as a prefix on the names of created files and plots.


statsPath: Destination folder to receive the created plots and summary data.


calcStats: Logical. If TRUE, calculate the statistics; if FALSE, just replot using a previously saved file of statistics.


plotAllTiles: Logical. If TRUE, create a separate plot for each tile in the file, in addition to the one overall plot per flow cell lane.


baseOrder: Order in which to display the nucleotides in the base call bar plots.


chunkSize: Buffer size, in reads, for processing the file. A smaller buffer gives more frequent plot and progress updates and uses less memory, but slows overall processing.


maxReads: Maximum number of reads to process; NULL means use all reads. Note that sorted BAM files place unaligned reads at the end, so a limited sample from such a file may contain few or no unaligned reads.


pause: Delay, in seconds, for viewing each plot.
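
To make the interplay of the buffer size and the read limit concrete, here is a minimal sketch of chunked FASTQ reading in base R. It is not the DuffyNGS implementation: the helper name countReadsInChunks and the demo file are hypothetical, and it assumes plain 4-line FASTQ records.

```r
# Sketch only, not the package's code. Assumes 4 lines per FASTQ record.
countReadsInChunks <- function(filein, chunkSize = 1e5, maxReads = NULL) {
  con <- gzfile(filein, open = "r")   # gzfile() also reads uncompressed files
  on.exit(close(con))
  nReads <- 0
  nChunks <- 0
  repeat {
    # Never ask for more reads than the remaining maxReads budget
    want <- chunkSize
    if (!is.null(maxReads)) want <- min(want, maxReads - nReads)
    if (want <= 0) break
    lines <- readLines(con, n = want * 4)
    if (length(lines) == 0) break
    nReads <- nReads + length(lines) %/% 4
    nChunks <- nChunks + 1
    # Per-chunk statistics and plot updates would go here
  }
  list(reads = nReads, chunks = nChunks)
}

# Hypothetical demo file holding 10 identical-length reads
demoFile <- tempfile(fileext = ".fastq")
records <- unlist(lapply(1:10, function(i) {
  c(paste0("@read", i), "ACGTN", "+", "IIII5")
}))
writeLines(records, demoFile)

resAll <- countReadsInChunks(demoFile, chunkSize = 4)
resCapped <- countReadsInChunks(demoFile, chunkSize = 4, maxReads = 6)
```

With a chunk size of 4, the ten demo reads arrive in three chunks; capping at six reads stops after two chunks.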


This function analyzes several metrics of the raw sequencing file to assess the quality of the sequencing run and to help select suitable parameters for the alignment pipeline. With default arguments, it generates about 4 plots per sample, focusing on Phred scores, nucleotide distributions, and the variance between tiles.
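
As an illustration of two of the metrics named above, the following base R sketch decodes a FASTQ quality string into Phred scores and tallies nucleotides per cycle. It is not taken from the package: the helper names are hypothetical, the Phred+33 encoding is an assumption about the input file, and the three reads are made-up demo data.

```r
# Sketch only. Assumes Phred+33 ("Sanger") quality encoding.
phredScores <- function(qualityString, offset = 33) {
  utf8ToInt(qualityString) - offset
}

# A Phred score Q corresponds to an error probability of 10^(-Q/10)
errorProbability <- function(q) 10^(-q / 10)

q <- phredScores("II5!")
p <- errorProbability(q)

# Per-cycle nucleotide counts for equal-length reads, rows ordered by baseOrder
reads <- c("ACGTN", "ACGTA", "TCGTA")   # made-up demo reads
baseOrder <- strsplit("ACGTN", "")[[1]]
bases <- do.call(rbind, strsplit(reads, ""))
counts <- sapply(seq_len(ncol(bases)), function(j) {
  as.vector(table(factor(bases[, j], levels = baseOrder)))
})
rownames(counts) <- baseOrder
```

Each column of counts is one sequencing cycle, which is the shape a base call bar plot would need.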

There are separate functions for BAM and FASTQ files, and a wrapper function readStats that uses the file extension to dispatch based on the given file type.
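
Extension-based dispatch of this kind could look roughly like the sketch below. The helper chooseStatsFunction is hypothetical, and the set of recognized extensions (.bam, .fastq, .fq, each optionally followed by .gz) is an assumption; the actual readStats may accept a different set.

```r
# Sketch only, not the package's readStats.
chooseStatsFunction <- function(filein) {
  # Drop an optional compression suffix, then inspect the remaining extension
  base <- sub("\\.gz$", "", basename(filein), ignore.case = TRUE)
  ext <- tolower(tools::file_ext(base))
  if (ext == "bam") return("bamReadStats")
  if (ext %in% c("fastq", "fq")) return("fastqReadStats")
  stop("Unrecognized file extension: ", filein)
}

bamChoice <- chooseStatsFunction("sample1.bam")
fastqChoice <- chooseStatsFunction("sample1.fastq.gz")
```

Under these assumptions, a call such as readStats("sample1.fastq.gz", sampleID = "sample1") would end up in fastqReadStats; the file name here is only a placeholder.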


In addition to the plots created, one rather complex data object is written to disk, containing the details that are used to generate the plots. It can be loaded to extract those numeric details.
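
The name and format of that file are not stated here. Assuming it is written with save(), one cautious way to inspect it is to load it into a private environment, as sketched below with a stand-in file; the file name, object name, and contents are all invented for the demo.

```r
# Stand-in for a statistics file. The real file name and object layout
# are not documented here, so everything in this demo object is made up.
statsFile <- file.path(tempdir(), "sample1.demoStats.rda")
demoStats <- list(nReads = 1000, meanPhred = c(38, 37.5, 36))
save(demoStats, file = statsFile)

# Load into a private environment so nothing in the workspace is overwritten
statsEnv <- new.env()
loadedNames <- load(statsFile, envir = statsEnv)

# Show the structure of whatever objects the file contained
str(mget(loadedNames, envir = statsEnv))
```

Because load() returns the names of the objects it restored, this approach works without knowing the object name in advance.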


With each new release of sequencing machine software, the format of the read coordinates inside the ReadID changes. Parsing out the "lane:tile:X:Y" terms is a perpetual work in progress, and this tool may fail on new ReadID formats.
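
To show why this parsing is fragile, here is a minimal sketch that takes the last four colon-separated terms of a ReadID as lane, tile, X and Y. That rule is an assumption that happens to fit two common Illumina layouts; it is not the parser used by this package, and the helper name and the read IDs in the demo are illustrative.

```r
# Sketch only. Assumes the ReadID ends with lane:tile:X:Y before any
# trailing comment, index, or pair suffix.
parseReadCoordinates <- function(readID) {
  id <- sub("^@", "", readID)
  id <- sub("[ #/].*$", "", id)   # drop " 1:N:0:1", "#0/1", "/1", etc.
  terms <- strsplit(id, ":", fixed = TRUE)[[1]]
  n <- length(terms)
  if (n < 4) return(NULL)         # format not recognized
  coords <- suppressWarnings(as.integer(terms[(n - 3):n]))
  if (anyNA(coords)) return(NULL) # non-numeric terms: format not recognized
  names(coords) <- c("lane", "tile", "x", "y")
  coords
}

# Older 5-term layout: instrument:lane:tile:x:y
oldStyle <- parseReadCoordinates("@HWUSI-EAS100R:6:73:941:1973#0/1")

# Newer 7-term layout: instrument:run:flowcell:lane:tile:x:y
newStyle <- parseReadCoordinates("@M00123:45:000000000-A1B2C:1:1101:15589:1332 1:N:0:1")

# IDs without coordinates return NULL rather than raising an error
noCoords <- parseReadCoordinates("@SRR001666.1")
```

Returning NULL for unrecognized IDs lets a caller skip the per-tile plots instead of stopping the whole run.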


Bob Morrison

robertdouglasmorrison/DuffyNGS documentation built on March 24, 2024, 4:16 p.m.