preprocessReads: Preprocess Short Reads
In fmicompbio/QuasR: Quantify and Annotate Short Reads in R

View source: R/preprocessReads.R

preprocessReads

R Documentation

Preprocess Short Reads

Description

Truncate sequences, remove parts matching to adapters and filter out low quality or low complexity sequences from (compressed) 'fasta' or 'fastq' files.

Usage

preprocessReads(
  filename,
  outputFilename = NULL,
  filenameMate = NULL,
  outputFilenameMate = NULL,
  truncateStartBases = NULL,
  truncateEndBases = NULL,
  Lpattern = "",
  Rpattern = "",
  max.Lmismatch = rep(0:2, c(6, 3, 100)),
  max.Rmismatch = rep(0:2, c(6, 3, 100)),
  with.Lindels = FALSE,
  with.Rindels = FALSE,
  minLength = 14L,
  nBases = 2L,
  complexity = NULL,
  nrec = 1000000L,
  clObj = NULL
)

Arguments

`filename`	the name(s) of the input sequence file(s).
`outputFilename`	the name(s) of the output sequence file(s).
`filenameMate`	for paired-end experiments, the name(s) of the input sequence file(s) containing the second read (mate) of each pair.
`outputFilenameMate`	for paired-end experiments, the name(s) of the output sequence file(s) containing the second read (mate) of each pair.
`truncateStartBases`	integer(1): the number of bases to be truncated (removed) from the beginning of each sequence.
`truncateEndBases`	integer(1): the number of bases to be truncated (removed) from the end of each sequence.
`Lpattern`	character(1): the left (5'-end) adapter sequence.
`Rpattern`	character(1): the right (3'-end) adapter sequence.
`max.Lmismatch`	mismatch tolerance when searching for matches of `Lpattern` (see ‘Details’).
`max.Rmismatch`	mismatch tolerance when searching for matches of `Rpattern` (see ‘Details’).
`with.Lindels`	if `TRUE`, indels are allowed in the alignments of the suffixes of `Lpattern` with the subject, at its beginning (see ‘Details’).
`with.Rindels`	same as `with.Lindels` but for alignments of the prefixes of `Rpattern` with the subject, at its end (see ‘Details’).
`minLength`	integer(1): the minimal allowed sequence length.
`nBases`	integer(1): the maximal number of Ns allowed per sequence.
`complexity`	`NULL` (default) or numeric(1): If not `NULL`, the minimal sequence complexity, as a fraction of the average complexity in the human genome (~3.9bits). For example, `complexity = 0.5` will filter out sequences that do not have at least half the complexity of the human genome. See ‘Details’ on how the complexity is calculated.
`nrec`	integer(1): the number of sequence records to read at a time.
`clObj`	a cluster object to be used for parallel processing of multiple files (see ‘Details’).

Details

Sequence files can be in fasta or fastq format, and can be compressed by either gzip, bzip2 or xz (extensions .gz, .bz2 or .xz). Multiple files can be processed by a single call to preprocessReads; in that case all sequence file vectors must have identical lengths.

nrec can be used to limit the memory usage when processing large input files. preprocessReads iteratively loads chunks of nrec sequences from the input until all data been processed.

Sequence pairs from paired-end experiments can be processed by specifying pairs of input and output files (filenameMate and outputFilenameMate arguments). In that case, it is assumed that pairs appear in the same order in the two input files, and only pairs in which both reads pass all filtering criteria are written to the output files, maintaining the consistent ordering.

If output files are compressed, the processed sequences are first written to temporary files (created in the same directory as the final output file), and the output files are generated at the end by compressing the temporary files.

For the trimming of left and/or right flanking sequences (adapters) from sequence reads, the trimLRPatterns function from package Biostrings is used, and the arguments Lpattern, Rpattern, max.Lmismatch, max.Rmismatch, with.Lindels and with.Rindels are used in the call to trimLRPatterns. Lfixed and Rfixed arguments of trimLRPatterns are set to TRUE, thus only fixed patterns (without IUPAC codes for ambigous bases) can be used. Currently, trimming of adapters is only supported for single read experiments.

Sequence complexity (H) is calculated based on the dinucleotide composition using the formula (Shannon entropy):

H = -\sum_i {f_i \log_2 f_i},

where f_i is the fraction of dinucleotide i from all dinucleotides in the sequence. Sequence reads that fulfill the condition H/H_r \ge c are retained (not filtered out), where H_r = 3.908 is the reference complexity in bits obtained from the human genome, and c is the value given to the argument complexity.

If an object that inherits from class cluster is provided to the clObj argument, for example an object returned by makeCluster from package parallel, multiple files will be processed in parallel using clusterMap from package parallel.

Value

A matrix with summary statistics on the processed sequences, containing:

One column per input file (or pair of input files for paired-end experiments).
The number of sequences or sequence pairs in rows:

totalSequences
- the total number in the input

matchTo5pAdaptor
- matching to Lpattern

matchTo3pAdaptor
- matching to Rpattern

tooShort
- shorter than minLength

tooManyN
- more than nBases Ns

lowComplexity
- relative complexity below complexity

totalPassed
- the number of sequences/sequence pairs that pass all filtering criteria and were written to the output file(s).

Author(s)

Anita Lerch, Dimos Gaidatzis and Michael Stadler

Examples

# sample files
infiles <- system.file(package="QuasR", "extdata",
                       c("rna_1_1.fq.bz2","rna_1_2.fq.bz2"))
outfiles <- paste(tempfile(pattern=c("output_1_","output_2_")),".fastq",sep="")
# single read example
preprocessReads(infiles, outfiles, nBases=0, complexity=0.6)
unlink(outfiles)
# paired-end example
preprocessReads(filename=infiles[1],
                outputFilename=outfiles[1],
                filenameMate=infiles[2],
                outputFilenameMate=outfiles[2],
                nBases=0, complexity=0.6)
unlink(outfiles)

fmicompbio/QuasR documentation built on June 13, 2025, 12:46 p.m.