Read and Summarize a Sequence (FASTA or FASTQ) File
readSeqFile reads a FASTQ or FASTA file, summarizing the
nucleotide distribution across positions (cycles) and the sequence
length distribution. If type is ‘fastq’, the distribution
of qualities across positions will also be recorded. If hash is
TRUE, the unique sequences will be hashed with counts of their
frequency. By default, only 10% of the reads will be hashed; this
proportion can be controlled with hash.prop. If
kmer=TRUE, k-mers of length k will be hashed by
position, also with the sampling proportion controlled by hash.prop.
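The hashing of sampled reads and of k-mers by position can be pictured with a small base R sketch. This is not the package's C implementation: hash_reads and count_kmers_by_position are hypothetical helpers, table() stands in for the hash, and keeping each read independently with probability prop is an assumption about how the sampling proportion is applied.

```r
# Hypothetical helper: count unique sequences among a random subset of reads.
hash_reads <- function(reads, prop = 0.1) {
  stopifnot(prop > 0, prop <= 1)
  # Assumption: each read is kept independently with probability `prop`.
  keep <- runif(length(reads)) < prop
  # table() stands in for the hash: sequence -> frequency.
  table(reads[keep])
}

# Hypothetical helper: count k-mers separately at each start position.
count_kmers_by_position <- function(reads, k = 6L) {
  pieces <- lapply(reads, function(read) {
    n <- nchar(read) - k + 1L          # number of k-mers in this read
    if (n < 1L) return(NULL)           # read is shorter than k
    starts <- seq_len(n)
    data.frame(kmer = substring(read, starts, starts + k - 1L),
               position = starts,
               stringsAsFactors = FALSE)
  })
  all_kmers <- do.call(rbind, pieces)
  if (is.null(all_kmers)) return(NULL)
  # Rows are k-mers, columns are start positions.
  table(all_kmers$kmer, all_kmers$position)
}

reads <- c("ACGT", "ACGT", "TTGA", "ACGT", "GGCC", "TTGA")
hash_reads(reads, prop = 1)                 # prop = 1 counts every read
count_kmers_by_position(reads, k = 2L)
```

With prop below 1 the counts are estimates from a subset of the reads, which is the trade-off the sampling proportion controls.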
filename: the name of the file from which the sequences are to be read.
type: either ‘fastq’ or ‘fasta’, representing the type of the file. FASTQ files will have the quality distribution by position summarized.
max.length: the largest sequence length likely to be encountered. For efficiency, a matrix of this size (larger than the longest sequence) is allocated in C, populated, and then trimmed in R. Specifying a value that is too small will cause an error, and the function will need to be re-run.
quality: either ‘illumina’, ‘sanger’, or ‘solexa’; this determines the quality offsets and range. See the values of QUALITY.CONSTANTS for more information.
hash: a logical value indicating whether to hash sequences.
hash.prop: a numeric value in (0, 1] that functions as the proportion of reads to hash.
kmer: a logical value indicating whether to hash k-mers by position.
k: an integer value indicating the k-mer size.
verbose: a logical value indicating whether to be verbose (in the C backend).
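The allocate, populate, and trim strategy described for the maximum sequence length can be sketched in base R. This only illustrates the idea and is not the package's C code: base_counts is a hypothetical helper, its max.length parameter plays the role of the maximum-length argument, and counting any non-ACGT character as N is an assumption.

```r
# Hypothetical helper: per-position base counts using a preallocated matrix.
base_counts <- function(reads, max.length = 1000L) {
  bases <- c("A", "C", "G", "T", "N")
  # Allocate once, sized by max.length rather than by the data.
  counts <- matrix(0L, nrow = max.length, ncol = length(bases),
                   dimnames = list(NULL, bases))
  longest <- 0L
  for (read in reads) {
    len <- nchar(read)
    if (len > max.length)
      stop("read of length ", len, " exceeds max.length = ", max.length)
    chars <- strsplit(read, "")[[1]]
    # Assumption: anything that is not A, C, G, or T is counted as N.
    cols <- match(chars, bases, nomatch = 5L)
    idx <- cbind(seq_len(len), cols)
    counts[idx] <- counts[idx] + 1L
    longest <- max(longest, len)
  }
  # Trim the unused rows beyond the longest read actually seen.
  counts[seq_len(longest), , drop = FALSE]
}

base_counts(c("ACGT", "AC", "NNA"), max.length = 10L)
```

A max.length that is too small stops with an error instead of silently truncating reads, which mirrors the behavior described for the argument.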
An S4 object of class FASTQSummary or
FASTASummary containing the summary statistics.
Identifying the correct quality type can be difficult. readSeqFile
will error out if it encounters a base quality outside of the range of a
known quality type, but it is possible to have reads of one quality type
whose values all fall within the range of another type.
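The ambiguity can be seen by checking which quality types are consistent with a set of quality strings. The sketch below uses the offsets and ranges documented on this page for ‘sanger’, ‘solexa’, and ‘illumina’; compatible_types and quality_types are hypothetical names, not part of the package, and the check is only a range test.

```r
# Offsets and score ranges as documented for each quality type.
quality_types <- list(
  sanger   = list(offset = 33L, min = 0L,  max = 93L),
  solexa   = list(offset = 64L, min = -5L, max = 62L),
  illumina = list(offset = 64L, min = 0L,  max = 62L)
)

# Hypothetical helper: which quality types could have produced these
# quality strings, judging only by the range of decoded scores?
compatible_types <- function(quals) {
  codes <- utf8ToInt(paste(quals, collapse = ""))
  ok <- vapply(quality_types, function(qt) {
    scores <- codes - qt$offset
    all(scores >= qt$min & scores <= qt$max)
  }, logical(1))
  names(ok)[ok]
}

compatible_types("II5#")   # only "sanger" fits
compatible_types("hhhh")   # all three fit, so the type is ambiguous
```

A file whose qualities all sit in the overlapping part of the ranges passes the check for more than one type, so the quality type should be set from knowledge of the sequencing pipeline when possible.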
Here is a bit more about quality:
PHRED: PHRED quality scores (e.g. from Roche 454). ASCII with no offset, range: [4, 60]. This has been removed as an option since sequence reads of this type are very uncommon.
Sanger: PHRED qualities encoded as ASCII with an offset of 33, range: [0, 93]. From NCBI SRA, or Illumina pipeline 1.8+.
Solexa: also very early Illumina (pipeline < 1.3). ASCII offset of 64, range: [-5, 62]. Uses a different quality-to-probability conversion than the other schemes.
Illumina: output from pipeline versions between 1.3 and 1.7. ASCII offset of 64, range: [0, 62].
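To make the offsets and the two conversions concrete, here is a small base R sketch. The function names are hypothetical, and the formulas are the standard PHRED and Solexa definitions (Q = -10 log10(p) and Q = -10 log10(p / (1 - p))), which are assumed here and not taken from the package source.

```r
# Decode one quality string into integer scores for a given ASCII offset.
decode_quality <- function(qual, offset) utf8ToInt(qual) - offset

# PHRED scale (sanger, illumina): Q = -10 * log10(p)
phred_to_prob <- function(q) 10^(-q / 10)

# Solexa scale: Q = -10 * log10(p / (1 - p))
solexa_to_prob <- function(q) 1 / (1 + 10^(q / 10))

decode_quality("I5", offset = 33L)   # sanger: scores 40 and 20
decode_quality("h", offset = 64L)    # illumina or solexa: score 40

# The two conversions nearly agree at high quality and diverge at low quality.
q <- c(0, 10, 30)
rbind(phred = phred_to_prob(q), solexa = solexa_to_prob(q))
```

The same character therefore maps to different scores under different offsets, and the same score maps to different error probabilities on the Solexa scale when quality is low.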
Vince Buffalo <firstname.lastname@example.org>
FASTQSummary and FASTASummary are the classes of the
objects returned by readSeqFile.
basePlot is a function that plots the distribution of
bases over sequence length for a particular FASTQSummary or FASTASummary object.
gcPlot combines and plots
the GC proportion.
qualPlot is a function that plots the distribution of
qualities over sequence length for a particular FASTQSummary object.
seqlenPlot is a function that plots a histogram of
sequence lengths for a particular FASTQSummary or FASTASummary object.
kmerKLPlot is a function that plots the K-L divergence
of k-mers to look for possible bias in reads.
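For readers unfamiliar with K-L divergence, the textbook definition can be sketched in base R as below. This page does not state how kmerKLPlot computes its values, so kl_divergence is a hypothetical helper, and comparing the k-mer distribution at one position against the distribution over all positions is an assumption made for illustration.

```r
# Hypothetical helper: Kullback-Leibler divergence D(P || Q) in bits.
# `p` and `q` are counts or proportions over the same set of k-mers.
kl_divergence <- function(p, q) {
  stopifnot(length(p) == length(q))
  p <- p / sum(p)
  q <- q / sum(q)
  nz <- p > 0                      # terms with p = 0 contribute nothing
  sum(p[nz] * log2(p[nz] / q[nz]))
}

# Made-up counts: k-mers at one position versus across all positions.
at_position <- c(AAA = 8, ACG = 1, TTT = 1)
overall     <- c(AAA = 40, ACG = 30, TTT = 30)
kl_divergence(at_position, overall)
```

A value of zero means the position's k-mer distribution matches the overall one; larger values point to position-specific enrichment of some k-mers.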