## Load the necessary packages.
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(data.table))

## Global options.
opts_chunk$set(echo = FALSE,
               cache = FALSE,
               prompt = FALSE,
               tidy = TRUE,
               comment = NA,
               message = FALSE,
               warning = FALSE)
if (is.null(experiment)) experiment <- "Sequencing data"
if (is.null(author)) author <- Sys.info()['user']
qc <- read_multifastqc(unique(dirname(qc.files)))
ag <- aggregate_fqc(qc)

Introduction

This document aggregates the quality control metrics for the r length(qc.files) FASTQC files into a single report. Quality control of the raw reads was performed using FASTQC and this report has been generated using the fqcr package.

Basic Statistics

Basic statistics shows basic data metrics including the total number of sequences, the number of sequences flagged as poor quality, the mean sequence lengths, the overall percentage GC content, and the percentage of duplicated reads.

tb <- ag[[1]]
colnames(tb) <- c("sample", "total seq", "poor qual seq", "mean seq length",
               "GC %", "Dupl %")
as.data.frame(tb)

Summary

Summary shows for each sample, the FASTQC modules tested and which were passed, gave rise to warnings or were failed. It is important to stress that although the analysis results give a pass/fail result, these evaluations must be taken in the context of what is expected from the library. Some experiments may be expected to produce libraries which are biased in particular ways. The summary evaluations should therefore be considered as pointers as to where attention should be concentrated rather than absolute indicators of quality.

md <- module_fqc(qc)
as.data.frame(md$all)

The tables below provide the same information as above, but in more compact formats.
The first table shows, for each FastQC module, the number of samples that passed, failed or warned.

as.data.frame(md$by_module$summary)

The next table shows for each sample, the number of modules that were passed, failed or warned.

as.data.frame(md$by_sample$summary)

Per base sequence quality {.tabset .tabset-fade}

The Per base sequence quality plot gives an overview of the range of quality scores across all bases at each position in the FastQ file using box-and-whisker plots. The per-base sequence quality is determined by the Phred score. The Phred score ($Q$) is defined as $Q = -10 \times log_{10}P$, where $P$ is the base-calling error probability. Therefore, if a base has a quality score of 10, this means there is a 1 in 10 chance it is wrong, and if it has a Phred score of 30, it has a 1 in 1000 chance of being wrong. Thus, the higher the Phred score, the more reliable the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.

Problems:

Common reasons for problems:

for (i in 1:length(qc)) {
  cat("\n##", gsub("_fastqc.zip", "", basename(qc.files[i])), "\n")

  p <- .plot_bq(qc[[i]])
  print(p)

  cat("\n")
}

Per sequence quality scores

The Per sequence quality scores plot shows the frequencies of quality scores in each sample and identifies whether a specific sample has universally low quality values. It is not unusual for some sequences to be poorly imaged (for example, if they are on the edge of the field of view), but these should represent only a small percentage of the total sequences. If a significant proportion of the sequences in a run have an overall low quality then this could indicate a systematic problem that needs addressing.

Problems:

Common reasons for problems:

names(qc) <- gsub("_fastqc.zip", "", names(qc))
d <- lapply(qc, function(x) x$per_sequence_quality_scores)
d <- lapply(d, function(x) y <- rle(rep(x$Quality, x$Count)))
d <- lapply(d, function(x) {
  x$lengths <- round(x$lengths/100)
  x <- unclass(x)
  data.frame(x)
})
names(d) <- paste0(names(d), "     ")

count <- lapply(d, function(x) {
  cnt <- data.table::as.data.table(rep(x$values, x$lengths), ncol = 1, byrow = TRUE)
})
count <- data.table::rbindlist(count, idcol = TRUE)


ggplot(count, aes_string(x = ".id", y = "V1", fill = ".id")) +
    geom_boxplot(outlier.shape = NA) +
    labs(title = "Per sample per sequence quality scores",
                  subtitle = "Quality score distribution over all sequences for each sample",
                  y = "Sequence Quality (Phred Score)", x = "") +
    .theme_Publication() +
  theme(plot.title = element_text(hjust = 0),
                 legend.position = "bottom", legend.direction = "horizontal", 
                 axis.text.x = element_blank()) + 
  guides(fill = guide_legend(ncol = 4, title = NULL))

Per base sequence content {.tabset .tabset-fade}

The Per base sequence content displays the per-base nucleotide frequencies at each position along the reads. Since the reads are random fragments, it is expected that the contribution of A and T, and C and G should be similar at each position, and the plot should show a parallel straight lines for each of the four nucleotides. In reality, this is often not the case for the first positions. Libraries produced by random hexamer priming or those which were fragmented using transposases almost always show a bias in the first positions of the reads. This bias is not due to a single sequence, but results from enrichement of a number of different K-mers at the 5' end of the reads (this usually has very little effect on downstream analysis). A bias which is consistent across all bases either indicates that the original library was sequence biased, or that there was a systematic problem during the sequencing of the library.

Problems:

Common reasons for problems:

for (i in 1:length(qc)) {
  cat("\n##", gsub("_fastqc.zip", "", basename(qc.files[i])), "\n")

  p <- .plot_sc(qc[[i]])
  print(p)

  cat("\n")
}

Per sequence GC content

The Per sequence GC content plot displays the GC content across the whole length of each sequence in the library. For a normal random library a roughly normal distribution of GC content is expected. The peak should correspond to the overall GC content of the underlying genome. An unusually shaped distribution may indicate a contaminated library or the presence of a biased subset of sequences (e.g. promoters, CpG islands).

d <- lapply(qc, function(x) x$per_sequence_gc_content)
d <- data.table::rbindlist(d, idcol = TRUE)
ggplot(d, aes_string(x = d$`GC Content`, y = "Count", 
                                       colour = ".id")) +
  geom_line() +
  labs(title = "Per sequence GC content", x = "Mean GC Content (%)") +
  .theme_Publication() +
  theme(plot.title = element_text(hjust = 0)) +
  theme(plot.title = element_text(hjust = 0),
                 legend.position = "right", legend.direction = "vertical", 
                 axis.text.x = element_blank()) + 
  guides(colour = guide_legend(ncol = 1, title = NULL))

Sequence duplication levels {.tabset .tabset-fade}

In a diverse library most sequences will occur only once in the final set. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (e.g., PCR over-amplification). The Sequence duplication levels module counts the degree of duplication of every sequence in the library this is used to creat a plot showing the relative number of sequences with different degrees of duplication.

Problems:

Common reasons for problems:

for (i in 1:length(qc)) {
  cat("\n##", gsub("_fastqc.zip", "", basename(qc.files[i])), "\n")

  p <- .plot_dl(qc[[i]])
  print(p)

  cat("\n")
}

Useful Links

Bibliography

R session info and parameters

si <- as.character(toLatex(sessionInfo()))
si <- si[-c(1, length(si))]
si <- gsub("(\\\\verb)|(\\|)", "", si)
si <- gsub("~", " ", si)
si <- paste(si, collapse = " ")
si <- unlist(strsplit(si, "\\\\item"))
cat(paste(si, collapse = "\n -"), "\n")


anilchalisey/parseR documentation built on May 7, 2019, 7:45 a.m.