Introduction to qckitfastq

library(knitr)
library(kableExtra)

Introduction

qckitfastq is part of the planned qckit suite of packages from the Computational Biology Core at Brown University. The long-term goal of the qckit suite is to not only provide comprehensive quality control metrics for common genomics sequencing workflows, but to 1) also provide quality control visualizations for multiple samples through updated modules in multiQC [@Ewels2016], and 2) provide a public quality control database which allows benchmarking of QC metrics for an experiment against other similar experiments. Users of qckitfastq will have the option to enter SRA metadata and create visualizations from our database as well as submit their qc results (if submitting an SRA archive) to our database once these features have been implemented. The purpose of this particular package is to run quality control on FASTQ files from genome sequencing.

Why use qckitfastq?

Indeed there are many other quality control packages for FASTQ files existing already, including ShortReads [@shortread] and seqTools [@seqtools] for R and the popular FASTQC [@Andrews2010] Java-based program. qckitfastq offers a few advantages compared to these 3 programs for users who need such features:

  1. access to raw sequence and quality data
  2. access to data frames of the quality control results in addition to plots
  3. quality control analyses of the entire FASTQ file
  4. fast file processing

To break it down further, seqTools and ShortReads do not offer as comprehensive set of quality control metrics as qckitfastq and FASTQC. seqTools further provides limited access to raw data and intermediate analysis results. ShortRead provides users with access to the raw sequencing data and intermedite analysis results, but is inefficient on datasets exceeding 10 million reads. FASTQC meanwhile truncates any reads longer than 75bp as well as estimates overall quality only based on the first 100,000 reads of any FASTQ file. qckitfastq does not contain any of these limitations.

qckitfastq <- c("yes","yes","yes+","yes")
seqTools <- c("no","yes","yes","yes")
ShortRead <- c("no","no","no","yes")
FASTQC <- c("yes","yes*","yes*","yes*")
metrics <- data.frame(qckitfastq,seqTools,ShortRead,FASTQC)
rownames(metrics) <- c("Read Length Distribution",
                       "Per Base Read Quality", "Nucleotide Read Content", "GC Content")
kable(metrics)
# need to do per read sequence quality
# + indicates that the program...
# Here, '*' indicates that the program truncates the file or computes on only the first x samples

Running qckitfastq

qckitfastq provides the following metrics:

The simplest way to run qckitfastq is by executing run_all, a single command that will produce a report of all of the included metrics in a user-provided directory with some default parameters. However, each metric can also be run separately for closer examination.

Individual metrics

For each individual metric that qckitfastq provides, the user can choose to save either the data frame for the metric, the associated plot, or both.

Our example in this vignette has 25,000 reads, each 100bp long. The majority of metrics are run on the path to the FASTQ file. Some functions for quality control in this package are simply wrappers around seqTools due to the fact that their functions are fast. We provide these wrappers for the sake of completeness in quality control metrics. These wrappers require processing the FASTQ data through the seqTools::fastqq command first:

library(qckitfastq)
infile <- system.file("extdata", "10^5_reads_test.fq.gz", package = "qckitfastq")
fseq <- seqTools::fastqq(infile)

Read length distribution {#read-length-distribution}

read_length generates a data frame of read lengths and the number of reads with that particular read length. plot_read_length generates a distribution plot of the length of all reads. The generated plot would show the sequence length of all the sequences throughout the file. The plot is considered an indication of good data quality is all sequences have roughly the same sequence length with minimal deviations. The following plot shows that all reads in the file have sequence length of 100.

read_len <- read_length(fseq)
kable(head(read_len)) %>% kable_styling()
plot_read_length(read_len)

Per base quality {#per-base-quality}

The per_base_quality function calculates quality score per base summary statistics of 10th, 25th, median, 75th and 90th quantiles across sequences. Currently it treats all quality score encodings as Phred+33. We can use the result to create a quality score distribution per position plot using plot_per_base_quality. As a basic heuristic, quality scores above 28 can be categorized as good (green), those from 20 to 28 can be categorized as medium (yellow), and under 20 is bad (red).

bs <- per_base_quality(infile)
kable(head(bs)) %>% kable_styling()
plot_per_base_quality(bs)

Per read quality score {#per-read-quality}

The per_read_quality function compute the mean quality score per read. We can then use plot_per_read_quality to generate a histogram of this statistics. The histogram is considered an indication of good data quality if the majority of reads have mean quality scores greater than 30. If a significant portion of reads have quality scores less than 30, then the data most likely has issues that need to be examined.

prq <- per_read_quality(infile)
kable(head(prq)) %>% kable_styling()
plot_per_read_quality(prq)

Per read GC content {#gc-content}

The GC_content function computes the GC nucleotide content percentage per read, and plot_GC_content plots the distribution of GC content. As a general rule, an indication of good data quality is when the GC content percentage in each read is between 30 and 50% and roughly follows a normal distribution.

gc_df <- GC_content(infile)
kable(head(gc_df)) %>% kable_styling()
plot_GC_content(gc_df)

Per position nucleotide read content {#nucleotide-read-content}

read_content calculates the total numbers across reads of each nucleotide base by position. plot_read_content plots the percentage of nucleotide content per position. We also provide an additional function read_base_content that allows the user to get the nucleotide base content by position across reads for a specific nucleotide (options are a,c,t,g,n). As a general rule, the plot would be considered an indication of good data quality when the percentage of each nucleotide sequence content is about evenly distributed across all bases. However, there are some types of analyses for which this will not be the case. For example, RNA-Seq will have an uneven sequence content distribution in the first 10 bases, and RRBS will have almost no cytosines and very high thymine content because the library prep protocol converts most C to T [@Meissner2005]. Knowledge of the library prep protocol is thus important for evaluating quality in terms of nucleotide sequence content.

scA <- read_base_content(fseq, content = "A")
kable(head(scA)) %>% kable_styling()
rc <- read_content(fseq)
kable(head(rc)) %>% kable_styling()
plot_read_content(rc)

Per position kmer count {#kmer-count}

kmer_count function produces the per position kmer count with given path to the FASTQ file and the kmer length specified.

km <- kmer_count(infile,k=6)
kable(head(km)) %>% kable_styling()

Overrepresented reads {#overrep-reads}

overrep_reads produces a data frame consisting of overrepresented reads and their counts in decreasing order of counts. Here overrepresented is defined as unique reads that have counts larger than 0.1% of the total reads in the file. plot_overrep_reads produces a density plot of the counts and marks the top 5 overrepresented reads in red.

overrep_reads<-overrep_reads(infile)
knitr::kable(head(overrep_reads,n = 5)) %>% kable_styling()
plot_overrep_reads(overrep_reads)

Overrepresented kmers {#overrep-kmers}

overrep_kmer generates a data frame of overrepresented kmers with its maximum log2(observed/expected) ratio and the position of the maximum obs/exp ratio in descending order. Only those kmers with a ratio greater than 2 are returned in the data frame. We can also create a boxplot of the obs/exp ratio thatn includes plotting the top 2 (or n) kmer outliers.

overkm <-overrep_kmer(infile,7)
knitr::kable(head(overkm,n=10)) %>% kable_styling()
plot_overrep_kmer(overkm)

Adapter content {#adapter-content}

adapter_content computes the counts of a pre-determined set of adapter sequences and reports back any exceeding 0.1% of the total reads, sorted from most abundant to least abundant. plot_adapter_content creates a bar plot of the top 5 most common adapters. In this instance we will use a different file to compute adapter content, as the first example has no adapter contamination. This function is only available for macOS/Linux due to a dependency on RSeqAn/C++14, which is not supported on current Bioconductor Windows build machines.

if(.Platform$OS.type != "windows") {
    infile2 <- system.file("extdata", "test.fq.gz", package = "qckitfastq")
    ac_sorted <- adapter_content(infile2)
    kable(head(ac_sorted)) %>% kable_styling()
    plot_adapter_content(ac_sorted)
}
### Benchmarking

#To demonstrate the utility of our functions on large datasets...
#(need to benchmark against ShortRead)
#library(seqTools)
#library(ShortRead)
#library(rbenchmark)
#sampler <- FastqSampler('E-MTAB-1147/fastq/ERR127302_1.fastq.gz', 20000)

References



Try the qckitfastq package in your browser

Any scripts or data that you put into this service are public.

qckitfastq documentation built on Nov. 8, 2020, 5:24 p.m.