read_fastq: Load a fastq file.

read_fastq    R Documentation

Load a fastq file.

Description

Efficiently load headers, sequences & qualities from a fastq file.

Usage

read_fastq(file,
           include_headers      = TRUE,
           include_sequences    = TRUE,
           include_qualities    = TRUE,
           include_phred_scores = FALSE,
           include_error_probs  = FALSE,
           truncate_headers_at  = NULL,
           phred_offset         = NULL,
           max_sequences        = Inf,
           max_lines            = Inf)

Arguments

file

Character, path to the input fastq file. The file may be gzipped, in which case it should have the extension ".gz".

include_headers

Logical, whether to load the headers. If you don't need the headers you can set this to FALSE for efficiency.

include_sequences

Logical, whether to load the sequences. If you don't need the sequences you can set this to FALSE for efficiency.

include_qualities

Logical, whether to load the raw qualities, encoded as ASCII characters. If you don't need the raw qualities you can set this to FALSE for efficiency.

include_phred_scores

Logical, whether to compute and return the Phred quality scores, in the form of integers. These contain the same information as the raw qualities, but converted from character representation to integer scores. Also see option phred_offset.

include_error_probs

Logical, whether to compute and return the nominal error probabilities, based on the qualities. The nominal error probability of each nucleobase is computed as 10^{-Q/10}, where Q is the Phred score.

truncate_headers_at

Optional character, needle at which to truncate headers. Everything at and after the first instance of the needle will be removed from the headers.

phred_offset

Optional integer, Phred offset to assume when converting raw quality characters to Phred scores. If NULL, the offset is chosen automatically between 33 and 64.

max_sequences

Optional integer, maximum number of sequences to load. Note that in the case of a gzipped input file the whole file is temporarily decompressed (up to max_lines lines) regardless of max_sequences.

max_lines

Optional integer, maximum number of lines to load. Any trailing sequence truncated due to this limit will be discarded. In contrast to max_sequences, for gzipped inputs this limit is already applied at the decompression stage, so it is more effective at reducing computing time. Keep in mind that each fastq record (header+sequence+qualities) typically spans 4 lines; in rare cases, however, sequences and/or qualities may be split across multiple lines.
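The relationship between raw qualities, Phred scores and nominal error probabilities described under include_phred_scores, include_error_probs and phred_offset, as well as the truncation described under truncate_headers_at, can be illustrated with a minimal base-R sketch. The helpers decode_qualities and truncate_header are hypothetical names introduced here for illustration; they are not part of castor and do not necessarily reflect how read_fastq is implemented internally.

```r
# Hypothetical helper (not part of castor): decode one raw quality string,
# assuming a known Phred offset (33 or 64).
decode_qualities = function(quality_string, phred_offset = 33) {
  # ASCII code of each quality character
  ascii_codes = utf8ToInt(quality_string)
  # Phred score Q = ASCII code minus the offset
  phred_scores = ascii_codes - phred_offset
  # nominal error probability p = 10^(-Q/10)
  error_probs = 10^(-phred_scores / 10)
  list(phred_scores = phred_scores, error_probs = error_probs)
}

# Hypothetical helper (not part of castor): remove everything at and after
# the first instance of the needle, treating the needle as a literal string.
truncate_header = function(header, needle) {
  pos = regexpr(needle, header, fixed = TRUE)
  if (pos < 0) return(header)  # needle not found, keep the header as-is
  substr(header, 1, pos - 1)
}

decoded = decode_qualities("I5+!", phred_offset = 33)
print(decoded$phred_scores)  # scores 40, 20, 10, 0
print(decoded$error_probs)   # probabilities 0.0001, 0.01, 0.1, 1

short_header = truncate_header("read1 length=150", " ")
print(short_header)          # "read1"
```

With an offset of 33, the character "I" (ASCII code 73) corresponds to a Phred score of 40, that is, a nominal error probability of 1 in 10,000.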

Details

This function is a fast and simple fastq loader. It can be used to load entire files into memory, or to only sample a small portion of sequences without reading the entire file (using max_lines).
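As a rough sketch of sampling only the start of a file, the snippet below writes a small made-up fastq file to a temporary location and requests the first two records through max_lines. It assumes that every record spans exactly 4 lines, which holds for this toy file but not necessarily for all fastq files. The file contents and variable names are invented for illustration, and the call to read_fastq is only made if castor is installed.

```r
# write a tiny 3-record fastq file to a temporary location (made-up data)
fastq_lines = c("@read1", "ACGT", "+", "IIII",
                "@read2", "GGCA", "+", "II5+",
                "@read3", "TTAC", "+", "5555")
tmp_file = tempfile(fileext = ".fastq")
writeLines(fastq_lines, tmp_file)

# assuming 4 lines per record, the first 2 records span 8 lines
records_wanted = 2
line_limit = 4 * records_wanted

if (requireNamespace("castor", quietly = TRUE)) {
  fastq = castor::read_fastq(file = tmp_file, max_lines = line_limit)
  # check the success flag before using any other element
  if (!fastq$success) stop(fastq$error)
  print(fastq$Nsequences)
  print(fastq$sequences)
}

# clean up the temporary file
unlink(tmp_file)
```

For gzipped inputs, limiting by max_lines rather than max_sequences should be the cheaper option, since the limit is then applied at the decompression stage.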

Value

A named list with the following elements:

success

Logical, indicating whether the file was loaded successfully. If FALSE, then an error message will be specified by the element error, and all other elements may be undefined.

headers

Character vector, listing the loaded headers in the order encountered. Only included if include_headers was TRUE.

sequences

Character vector, listing the loaded sequences in the order encountered. Only included if include_sequences was TRUE.

qualities

Character vector, listing the loaded raw qualities in the order encountered. Only included if include_qualities was TRUE.

phred_scores

List of integer vectors, listing the loaded Phred scores in the order encountered. Hence, phred_scores[[k]] is an integer vector specifying the Phred scores for the k-th sequence. Only included if include_phred_scores was TRUE.

error_probs

List of numeric vectors, listing the loaded error probabilities in the order encountered. Hence, error_probs[[k]] is a numeric vector specifying the error probabilities for the k-th loaded sequence. Only included if include_error_probs was TRUE.

Nlines

Integer, number of lines encountered.

Nsequences

Integer, number of sequences loaded.
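One possible way to work with the returned list is sketched below: check success first, then combine the per-sequence elements into a data frame. The function summarize_fastq is a hypothetical helper, not part of castor, and the mock_fastq list is a hand-built stand-in that only mimics the structure described above. The sketch assumes the file was loaded with include_headers, include_sequences and include_error_probs all set to TRUE.

```r
# Hypothetical helper (not part of castor): summarize a list shaped like
# the return value of read_fastq, one row per loaded sequence.
summarize_fastq = function(fastq) {
  # if loading failed, other elements may be undefined, so stop early
  if (!isTRUE(fastq$success)) {
    stop("Could not load fastq file: ", fastq$error)
  }
  data.frame(
    header          = fastq$headers,
    length          = nchar(fastq$sequences),
    mean_error_prob = sapply(fastq$error_probs, mean),
    stringsAsFactors = FALSE
  )
}

# mock return value with made-up data, for illustration only
mock_fastq = list(success     = TRUE,
                  headers     = c("read1", "read2"),
                  sequences   = c("ACGT", "GG"),
                  error_probs = list(c(0.1, 0.1, 0.01, 0.01),
                                     c(0.001, 0.001)),
                  Nlines      = 8,
                  Nsequences  = 2)

read_summary = summarize_fastq(mock_fastq)
print(read_summary)
```

Because error_probs is a list with one numeric vector per sequence, per-read statistics such as the mean error probability are naturally computed with sapply.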

Author(s)

Stilianos Louca

See Also

read_fasta, read_tree

Examples

## Not run:
# load a gzipped fastq file, considering only the first 1000 lines
# (error probabilities are only returned if include_error_probs=TRUE)
fastq = read_fastq(file="mysequences.fastq.gz",
                   include_error_probs=TRUE,
                   max_lines=1000)

# print the first sequence and its error probabilities
cat(fastq$sequences[1])
print(fastq$error_probs[[1]])

## End(Not run)

castor documentation built on Aug. 25, 2025, 1:10 a.m.