readSeqFile: Read and Summarize a Sequence (FASTA or FASTQ) File

readSeqFile reads a FASTQ or FASTA file, summarizing the nucleotide distribution by position (cycle) and the distribution of sequence lengths. If type is ‘fastq’, the distribution of qualities by position is also recorded. If hash is TRUE, unique sequences are hashed along with counts of how often each occurs. By default, only 10% of the reads are hashed; this proportion can be controlled with hash.prop. If kmer is TRUE, k-mers of length k are hashed by position, with the sampling proportion also controlled by hash.prop.
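As a rough illustration of the kinds of summaries described above, the following base R sketch tabulates bases by position, sequence lengths, and sampled sequence counts for a few made-up reads. It is not the package's C implementation; the names `reads`, `base.by.pos`, `prop`, and `seq.counts` are hypothetical and used only for this example.

```r
## Toy reads standing in for the contents of a sequence file (made-up data).
reads <- c("ACGT", "ACGA", "ACGT", "TTG")

## Sequence length distribution.
seq.lengths <- table(nchar(reads))

## Nucleotide counts by position: rows are bases, columns are positions.
bases <- c("A", "C", "G", "T", "N")
max.len <- max(nchar(reads))
base.by.pos <- vapply(seq_len(max.len), function(i) {
  ## Reads shorter than i give "", which matches no base and is ignored.
  tabulate(match(substr(reads, i, i), bases), nbins = length(bases))
}, integer(length(bases)))
dimnames(base.by.pos) <- list(bases, seq_len(max.len))

## Count unique sequences on a random subset of reads; `prop` plays the
## role that hash.prop plays in readSeqFile (assumed analogy).
set.seed(1)
prop <- 0.5
sampled <- reads[runif(length(reads)) <= prop]
seq.counts <- table(sampled)
```

Because only a sample of reads is counted, `seq.counts` estimates relative frequencies rather than giving exact totals for the whole file.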


  readSeqFile(filename, type=c("fastq", "fasta"), max.length=1000,
              quality=c("sanger", "solexa", "illumina"), hash=TRUE,
              hash.prop=0.1, kmer=TRUE, k=6L, verbose=FALSE)



filename: the name of the file from which the sequences are to be read.


type: either ‘fastq’ or ‘fasta’, representing the type of the file. FASTQ files will have the quality distribution by position summarized.


max.length: the largest sequence length likely to be encountered. For efficiency, a matrix of this size (which should exceed the longest sequence) is allocated in C, populated, and then trimmed in R. Specifying a value that is too small will cause an error, and the function will need to be re-run.


quality: one of ‘illumina’, ‘sanger’, or ‘solexa’; this determines the quality offsets and range. See the values of QUALITY.CONSTANTS for more information.


hash: a logical value indicating whether to hash sequences.


hash.prop: a numeric value in (0, 1] giving the proportion of reads to hash.


kmer: a logical value indicating whether to hash k-mers by position.


k: an integer value indicating the k-mer size.


verbose: a logical value indicating whether to be verbose (in the C backend).
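To make "hashing k-mers by position" concrete, here is a minimal base R sketch that counts each k-mer at each starting position. It only illustrates the idea behind the kmer and k arguments; the helper `count.kmers.by.pos` is hypothetical, it omits the sampling controlled by hash.prop, and it does not reflect how the C backend stores its counts.

```r
## Count k-mers by starting position (hypothetical helper, base R only).
count.kmers.by.pos <- function(reads, k = 6L) {
  positions <- integer(0)
  kmers <- character(0)
  for (read in reads) {
    n.starts <- nchar(read) - k + 1L
    if (n.starts < 1L) next  # read shorter than k: contributes no k-mers
    starts <- seq_len(n.starts)
    positions <- c(positions, starts)
    kmers <- c(kmers, substring(read, starts, starts + k - 1L))
  }
  ## Cross-tabulate, then keep only (position, k-mer) pairs that occur.
  counts <- as.data.frame(table(position = positions, kmer = kmers),
                          responseName = "count", stringsAsFactors = FALSE)
  counts$position <- as.integer(counts$position)
  counts[counts$count > 0, ]
}

## Toy reads (made-up data); the last read has exactly one 3-mer.
reads <- c("ACGTAC", "ACGTTT", "CGT")
kc <- count.kmers.by.pos(reads, k = 3L)
```

Each row of `kc` is one (position, k-mer) pair with its count, which is the kind of table a positional k-mer comparison would start from.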


An S4 object of class FASTQSummary or FASTASummary containing the summary statistics.


Identifying the correct quality type can be difficult. readSeqFile will raise an error if it encounters a base quality outside the range of the specified quality type, but reads encoded with one quality type may have values that all fall within the range of another type, in which case the mismatch goes undetected.
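The ambiguity can be seen with a small base R check. This sketch uses the offsets and ranges listed on this page; the `quality.types` list and the `compatible.types` helper are hypothetical names for illustration and are not the package's QUALITY.CONSTANTS object or its actual range check.

```r
## Offsets and quality ranges as listed on this page (illustrative copy).
quality.types <- list(
  sanger   = list(offset = 33L, min = 0L,  max = 93L),
  solexa   = list(offset = 64L, min = -5L, max = 62L),
  illumina = list(offset = 64L, min = 0L,  max = 62L)
)

## Names of all quality types whose range contains every decoded
## quality in `qual.string`.
compatible.types <- function(qual.string) {
  codes <- utf8ToInt(qual.string)
  ok <- vapply(quality.types, function(qt) {
    q <- codes - qt$offset
    all(q >= qt$min & q <= qt$max)
  }, logical(1))
  names(ok)[ok]
}

compatible.types("IIIIHHGG")  # "sanger" "solexa" "illumina"
compatible.types("!!5<IIII")  # "sanger"
```

The first string contains only high ASCII codes, so it is valid under all three types and a range check alone cannot tell them apart; the second contains '!' (ASCII 33), which is only valid with an offset of 33.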

Here is a bit more about quality:


PHRED quality scores (e.g. from Roche 454). ASCII with no offset, range: [4, 60]. This has been removed as an option since sequence reads with this type are very, very uncommon.


Sanger qualities are PHRED ASCII qualities with an offset of 33, range: [0, 93]. Used by NCBI SRA and by Illumina pipeline 1.8+.


Solexa (also very early Illumina - pipeline < 1.3). ASCII offset of 64, range: [-5, 62]. Uses a different quality-to-probabilities conversion than other schemes.


Illumina output from pipeline versions between 1.3 and 1.7. ASCII offset of 64, range: [0, 62].
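The offsets and the differing Solexa conversion can be sketched in base R as follows. This assumes the standard definitions, Q = -10 * log10(p) for PHRED-scaled scores and Q = -10 * log10(p / (1 - p)) for Solexa-scaled scores; the helper names are hypothetical and are not functions exported by the package.

```r
## Decode an ASCII quality string to integer scores given an offset.
decode.quality <- function(qual.string, offset) {
  utf8ToInt(qual.string) - offset
}

## PHRED-scaled scores (sanger, illumina): Q = -10 * log10(p)
phred.to.prob <- function(q) 10^(-q / 10)

## Solexa-scaled scores: Q = -10 * log10(p / (1 - p))
solexa.to.prob <- function(q) 1 / (1 + 10^(q / 10))

q.sanger <- decode.quality("5I", offset = 33L)  # scores 20 and 40
p.sanger <- phred.to.prob(q.sanger)             # 0.01 and 0.0001

q.solexa <- decode.quality(";h", offset = 64L)  # scores -5 and 40
p.solexa <- solexa.to.prob(q.solexa)
```

The two conversions nearly agree for high scores but diverge for low ones, which is why Solexa scores can be negative while still mapping to an error probability below 1.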


Vince Buffalo

See Also

FASTQSummary and FASTASummary are the classes of the objects returned by readSeqFile.

basePlot is a function that plots the distribution of bases over sequence length for a particular FASTASummary or FASTQSummary object. gcPlot combines and plots the GC proportion.

qualPlot is a function that plots the distribution of qualities over sequence length for a particular FASTASummary or FASTQSummary object.

seqlenPlot is a function that plots a histogram of sequence lengths for a particular FASTASummary or FASTQSummary object.

kmerKLPlot is a function that plots the K-L divergence of k-mers to look for possible biases in reads.


  ## Load a FASTQ file, with sequence hashing.
  s.fastq <- readSeqFile(system.file('extdata', 'test.fastq', package='qrqc'))

  ## Load a FASTA file, without sequence hashing.
  s.fasta <- readSeqFile(system.file('extdata', 'test.fasta', package='qrqc'),
                         type='fasta', hash=FALSE)

