filterBadSeqs: Performs quality checks, then filters reads for quality


View source: R/qc.R

Description

The function trims poor-quality bases and unknown bases from the ends of the sequences. Any reads which are too short, or contain any unknown bases (N), are removed from the file.

Usage

filterBadSeqs(dataFile, minlength = 30, Phred = 25, blockSize = 1e+08,
  readerBlockSize = 1e+05, mc.cores = 1)

Arguments

dataFile

An R data frame with the data to be processed. The R object is a standard format, and must contain the following headings: File, PE, Sample, Replicate, FilteredFile. More information about the file is available at datafileTemplate.

Phred

An integer which specifies the Phred (ASCII) quality score threshold. Any two consecutive nucleotides with a quality score lower than this threshold will be discarded. Default score is 25.

blockSize

An integer which specifies the number of reads to be read at a time when processing. Default is 1e8.

mc.cores

The number of cores to use when parallelising. Default is 1 (i.e. no parallelisation).

minlength

An integer which specifies the minimum length for a read. Reads shorter than this length will be discarded. Default is 30 nucleotides.

readerBlockSize

An integer which specifies the number of bytes (characters) to be read at one time. A smaller readerBlockSize reduces memory requirements, but is less efficient. Default is 1e5.
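As a rough illustration of what reading a fixed number of reads at a time means, the sketch below walks through a small fastq file in chunks using base R only. It is not the package's implementation (which relies on ShortRead); the variable `chunkSize` is a hypothetical stand-in for blockSize, and the three reads are made up.

```r
# Write a tiny fastq file with three made-up reads (4 lines per read).
fq <- tempfile(fileext = ".fastq")
writeLines(c(
  "@r1", "ACGT", "+", "IIII",
  "@r2", "GGCC", "+", "IIII",
  "@r3", "TTAA", "+", "IIII"
), fq)

chunkSize <- 2                 # reads per chunk; hypothetical stand-in for blockSize
con <- file(fq, open = "r")    # an open connection keeps its position between reads
chunkReads <- numeric(0)
repeat {
  lines <- readLines(con, n = 4 * chunkSize)
  if (length(lines) == 0) break              # end of file reached
  chunkReads <- c(chunkReads, length(lines) / 4)
}
close(con)
unlink(fq)

chunkReads   # number of reads seen in each chunk
```

Only one chunk is held in memory at a time, which is why smaller chunks lower memory use at the cost of more read operations.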

Details

The function should be run from the working directory that contains all the fastq files.

filterBadSeqs iterates over each file specified in "dataFile", and filters and trims the reads for quality. The fastq files are processed in chunks of reads rather than all at once. The size of the chunks is determined by the "blockSize" and "readerBlockSize" parameters. More information about how this is done is available in the ShortRead package. For each chunk, the following steps are applied:

* it removes any trailing or leading N's from each sequence,

* it removes any reads which still contain N's,

* it trims the trailing end when it finds a minimum of 2 poor-quality bases in a window of 5. The threshold for poor quality is determined by the parameter "Phred", where the Phred score is logarithmically related to the probability of errors at each base,

* it removes any reads shorter than a minimum length (this is specified by the "minlength" parameter).
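The four steps above can be sketched for a single read in base R. This is a simplified illustration, not the package's code: the function names `filterRead` and `phredToErrorProb`, the argument names, and the exact trimming rule (cut at the first poor-quality base of the first window of 5 that holds at least 2 poor-quality bases) are assumptions made for the example.

```r
# Phred score Q corresponds to an error probability of 10^(-Q/10).
phredToErrorProb <- function(q) 10^(-q / 10)

# read: a single sequence string; qual: integer Phred scores, one per base.
# Returns NULL when the read is discarded. All names here are hypothetical.
filterRead <- function(read, qual, minLen = 30, phred = 25,
                       window = 5, maxBad = 2) {
  stopifnot(nchar(read) == length(qual))
  bases <- strsplit(read, "")[[1]]

  # Step 1: strip leading and trailing N's.
  keep <- which(bases != "N")
  if (length(keep) == 0) return(NULL)
  idx <- min(keep):max(keep)
  bases <- bases[idx]
  qual <- qual[idx]

  # Step 2: discard reads that still contain an N.
  if (any(bases == "N")) return(NULL)

  # Step 3: trim the tail at the first window with >= maxBad poor bases.
  bad <- qual < phred
  n <- length(bad)
  if (n >= window) {
    for (start in seq_len(n - window + 1)) {
      inWindow <- start:(start + window - 1)
      if (sum(bad[inWindow]) >= maxBad) {
        cut <- inWindow[which(bad[inWindow])[1]]  # first poor base in window
        bases <- bases[seq_len(cut - 1)]
        qual <- qual[seq_len(cut - 1)]
        break
      }
    }
  }

  # Step 4: discard reads that are now too short.
  if (length(bases) < minLen) return(NULL)
  list(read = paste(bases, collapse = ""), qual = qual)
}

# Made-up reads: N's at the ends are stripped, an internal N discards the read.
ends     <- filterRead("NNACGTACGTN", rep(40L, 11), minLen = 5)
internal <- filterRead("ACNGT", rep(40L, 5), minLen = 1)
tail     <- filterRead("ACGTACGTAC",
                       c(40L, 40L, 40L, 40L, 40L, 40L, 10L, 40L, 10L, 40L),
                       minLen = 4)
```

Here `ends$read` is "ACGTACGT", `internal` is NULL, and `tail$read` is "ACGTAC", since the window covering bases 5 to 9 holds two scores below 25. A threshold of 25 corresponds to an error probability of about 0.003 per base.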

The function produces a new set of filtered fastq files. The user must specify the output file for each input file in the "FilteredFile" column of the data file. The same output file may be specified for multiple input files, in which case new output is appended to the existing file, thereby allowing de-multiplexing of samples which have been run on different lanes. A new R object (QualityFilterResults) is created, which contains pointers to the input and output fastq files, as well as a summary of how many reads have been trimmed or removed.
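A minimal sketch of a data frame with the required headings is shown below, with two lanes of one sample pointing at the same output file. The file names, sample labels and PE values are placeholders invented for the example; the expected coding of each column is described in datafileTemplate.

```r
# Hypothetical sample sheet: all values below are placeholders,
# not taken from the package.
dataFile <- data.frame(
  File         = c("sampleA_lane1.fastq", "sampleA_lane2.fastq"),
  PE           = c(1, 1),
  Sample       = c("A", "A"),
  Replicate    = c(1, 1),
  FilteredFile = c("sampleA_filtered.fastq", "sampleA_filtered.fastq"),
  stringsAsFactors = FALSE
)

# Check that every required heading is present before calling the function.
required <- c("File", "PE", "Sample", "Replicate", "FilteredFile")
missingCols <- setdiff(required, names(dataFile))
if (length(missingCols) > 0) {
  stop("Missing columns: ", paste(missingCols, collapse = ", "))
}

# Count how many input files share each output file.
inputsPerOutput <- table(dataFile$FilteredFile)

# With the package loaded, the call would then follow the Usage section:
# results <- filterBadSeqs(dataFile, minlength = 30, Phred = 25)
```

Because both rows name the same FilteredFile, the filtered reads from the second lane would be appended to the output of the first.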

Value

A data frame summarising for each file how many sequences have been trimmed or removed.

See Also

https://en.wikipedia.org/wiki/Phred_quality_score for more about quality scores.

ShortRead for more information about blockSize (n) and readerBlockSize.


nixstix/RNASeqAnalysis documentation built on May 23, 2019, 7:06 p.m.