qQCReport: QuasR Quality Control Report

qQCReport R Documentation

QuasR Quality Control Report

Description

Generate quality control plots for a qProject object or a vector of fasta/fastq/bam files. The available plots vary depending on the types of available input (fasta, fastq, bam files or qProject object; paired-end or single-end).

Usage

qQCReport(
  input,
  pdfFilename = NULL,
  chunkSize = 1000000L,
  useSampleNames = FALSE,
  clObj = NULL,
  a4layout = TRUE,
  ...
)

Arguments

input

A vector of files or a qProject object as returned by qAlign.

pdfFilename

The path and name of a pdf file to store the report. If NULL, the quality control plots will be generated in separate plotting windows on the standard graphical device.

chunkSize

The number of sequences, sequence pairs (for paired-end data) or alignments that will be sampled from each data file to collect quality statistics.

useSampleNames

If TRUE, the plots will be labelled using the sample names instead of the file names. Sample names are obtained from the qProject object, or from names(input) if input is a named vector of file names. Please note that if there are multiple files for the same sample, the sample names will not be unique.

clObj

A cluster object to be used for parallel processing of multiple input files.

a4layout

A logical scalar. If TRUE, the output of the mapping rate and uniqueness plots will be adjusted for A4 format devices.

...

Additional arguments that will be passed to the functions generating the individual quality control plots, see ‘Details’.
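As a minimal sketch of how the arguments above might be combined, the following call passes a named vector of fastq files and a two-node cluster. The file names, the output name and the number of nodes are hypothetical, and the call is guarded so that it only runs when QuasR is installed and the files exist.

```r
# Hypothetical file names; replace them with paths to your own fastq files.
# The names of the vector act as sample names when useSampleNames = TRUE.
fastqFiles <- c(Sample1 = "sample1.fq.gz", Sample2 = "sample2.fq.gz")

if (requireNamespace("QuasR", quietly = TRUE) && all(file.exists(fastqFiles))) {
    # Two worker processes, one per input file (an arbitrary choice).
    cl <- parallel::makeCluster(2)
    QuasR::qQCReport(fastqFiles,
                     pdfFilename = "qc_report_fastq.pdf", # hypothetical output name
                     chunkSize = 100000L,                 # sample fewer reads than the default
                     useSampleNames = TRUE,               # label plots by names(fastqFiles)
                     clObj = cl)
    parallel::stopCluster(cl)
}
```

Because the sample names come from names(fastqFiles), giving two files the same name would produce plots with identical labels.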

Details

This function generates quality control plots for all input files, or for the sequence and alignment files contained in a qProject object, allowing assessment of the quality of a sequencing experiment. qQCReport uses functionality from the ShortRead package to collect quality data, and visualizes the results in a style similar to the ‘FastQC’ quality control tool from Simon Andrews (see ‘References’ below). It is recommended to create PDF reports (pdfFilename argument), for which the plot layouts have been optimised.

Some plots will only be generated if the necessary information is available (e.g. base qualities in fastq sequence files).

The currently available plot types are:

Quality score boxplot

shows the distribution of base quality values as a box plot for each position in the input sequence. The background color (green, orange or red) indicates ranges of high, intermediate and low qualities.

Nucleotide frequency

plot shows the frequency of A, C, G, T and N bases by position in the read.

Duplication level

plot shows for each sample the fraction of reads observed at different duplication levels (e.g. once, twice, three times, etc.). In addition, the most frequent sequences are listed.

Mapping statistics

shows fractions of reads that were (un)mappable to the reference genome.

Library complexity

shows fractions of unique read(-pair) alignment positions, as a measure of the complexity in the sequencing library. Please note that this measure is not independent of the total number of reads in a library, and is best compared between libraries of similar sizes.

Mismatch frequency

shows the frequency and position (relative to the read sequence) of mismatches in the alignments against the reference genome.

Mismatch types

shows the frequency of read bases that caused mismatches in the alignments to the reference genome, separately for each genome base.

Fragment size

shows the distribution of fragment sizes inferred from aligned read pairs.

One approach to assessing the quality of a sample is to compare its control plots to those from other samples and look for relative differences. The expected quality measures depend on the type of experiment: for example, a genomic re-sequencing sample with an overrepresentation of T bases may be suspicious, while such a nucleotide bias is normal for a directed bisulfite-sequencing sample.

Additional arguments can be passed to the internal functions that generate the individual quality control plots using ...:

lmat:

a matrix (e.g. matrix(1:12, ncol=2)) used by an internal call to the layout function to specify the positioning of multiple plot panels on a device page. Individual panels correspond to different samples.

breaks:

a numerical vector (e.g. c(1:10)) defining the bins used by the ‘Duplication level’ plot.
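To illustrate what such values look like, the sketch below builds the two example objects mentioned above and uses base R's layout() on a null PDF device to show how a layout matrix is interpreted. The variable names and the row-wise alternative are illustrative assumptions, not part of QuasR.

```r
# Layout matrix from the text: 12 panels in 6 rows and 2 columns, filled
# column-wise (panels 1-6 in the first column, 7-12 in the second).
lmat <- matrix(1:12, ncol = 2)

# An alternative that fills the same grid row by row instead.
lmatByRow <- matrix(1:12, ncol = 2, byrow = TRUE)

# Bin boundaries as in the example for the 'Duplication level' plot.
breaks <- c(1:10)

# layout() returns the number of panels defined by the matrix.
# pdf(NULL) opens a device that does not write any file.
grDevices::pdf(NULL)
nPanels <- graphics::layout(lmat)
grDevices::dev.off()

print(dim(lmat))
print(nPanels)
```

Either object could then be supplied through ..., for example qQCReport(proj, lmat = lmatByRow) or qQCReport(proj, breaks = breaks), where proj is a qProject object.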

Value

The function is called for its side effect of generating quality control plots. It invisibly returns a list with components that contain the data used to generate each of the QC plots. Available components are (depending on input data, see ‘Details’):

qualByCycle: quality score boxplot

nuclByCycle: nucleotide frequency plot

duplicated: duplication level plot

mappings: mapping statistics barplot

uniqueness: library complexity barplot

errorsByCycle: mismatch frequency plot

mismatchTypes: mismatch type plot

fragDistribution: fragment size distribution plot
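Since the set of returned components depends on the input, it can help to check which ones are present before using them. The following is a minimal sketch; availableComponents is a hypothetical helper (not part of QuasR), and proj is assumed to be a qProject object created beforehand with qAlign.

```r
# Component names as listed above.
qcComponents <- c("qualByCycle", "nuclByCycle", "duplicated", "mappings",
                  "uniqueness", "errorsByCycle", "mismatchTypes",
                  "fragDistribution")

# Hypothetical helper: names of the documented components found in 'qc'.
availableComponents <- function(qc) {
    intersect(qcComponents, names(qc))
}

# Guarded call: runs only if QuasR is installed and 'proj' is a qProject.
if (requireNamespace("QuasR", quietly = TRUE) &&
    exists("proj") && inherits(proj, "qProject")) {
    # The list is returned invisibly, so it has to be assigned to be kept.
    qc <- QuasR::qQCReport(proj, pdfFilename = "qc_report.pdf")
    print(availableComponents(qc))
    # Inspect the data behind one plot without assuming its structure.
    str(qc[[availableComponents(qc)[1]]])
}
```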

Author(s)

Anita Lerch, Dimos Gaidatzis and Michael Stadler

References

FastQC quality control tool at http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

See Also

qProject, qAlign, ShortRead package

Examples

# copy example data to current working directory
file.copy(system.file(package="QuasR", "extdata"), ".", recursive=TRUE)

# create alignments
sampleFile <- "extdata/samples_chip_single.txt"
genomeFile <- "extdata/hg19sub.fa"

proj <- qAlign(sampleFile, genomeFile)

# create quality control report
qQCReport(proj, pdfFilename="qc_report.pdf")


fmicompbio/QuasR documentation built on Dec. 11, 2024, 11:22 p.m.