In MahShaaban/Rfastqc: Read FastQC output

knitr::opts_chunk$set(error = FALSE, message = FALSE, fig.align = 'center')

title: "Using Rfastqc" author: "Mahmoud Ahmed" date: "r Sys.Date()" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using Rfastqc} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}

Overview

Getting Started

To get started, we assume you ran the command line utility FastQC on a number of raw reads fastq or alignment sam files. The output includes a an html report and a zipped file. The latter contains among others the output text files txt, we will be working with.
Rfastqc comes with sample text output files genrated from real runnin the FastQC utility on 3 fastq files. We will explain the workflow for generating these files and loading them into R in a latter section. For now, we will start by installing the Rfastqc package and load the sample files in R.
Rfastqc is available on github and can be installed using devtools as following.

# installing Rfastqc
devtools::install_github('MahShaaban/Rfastqc')

These are the required libraries to access the full functionality of Rfastqc.

# loading required libraries
library(Rfastqc)
library(stringr)
library(dplyr)
library(tidyr)

To access the sample text output files we use system.file to get the path to one of them.

# get a sample text output file
fl <- system.file('extdata', 'sample_1.txt', package = 'Rfastqc')
file.exists(fl)

Reading such files is as easy as running a single function parse_fqc followed by construction and object of class FastQC using a function with the same name.

# read file and make object
fqc <- parse_fqc(fl)
fqc <- FastQC(fqc)

To have a general overview of the content of the file, we use the summary method. The output is a simple data.frame of two columns; the first is the test/module name and the other is the evaluation of the FastQC utility for whether the module passes the pre-defined cutt-off (more on that later).

# use the summary method
summary(fqc)

Finally, we use the plot method to generate quick graphs of two of the modules. The plot method requires only to argumnets; fqc for the FastQC object and test for the name of the test as displyed in the the output of the previous function call.

# use the plot method
# Per base sequence quality module
par(mfrow=c(1,2))
plot(fqc,
     test = 'Per base sequence quality',
     xlab = 'Base',
     ylab = 'Quality Score')

# Per sequence GC content module
plot(fqc,
     test = 'Per sequence GC content',
     xlab = 'Base',
     ylab = 'Content')

Rfastqc workflow

Running FastQC within an R session

Reading FastQC output in R

The raw reads and alignment files are usually large. It takes longer to download and run the FastQC on them. To side step this delay, we will leave it to the user to obtain the fastq files, intall and run FastQC to get the text output that we will be working with in the rest of this workflow.
If everything works fine, the user should be able to get 3 txt files containing the data of all module results. Rfastqc comes with 3 of these files generated by running the same code in the previous section. To load them in the R session, we first need to construct the paths to them from the package.

# get text output files
path <- system.file('extdata', package = 'Rfastqc')
fls <- list.files(path, full.names = TRUE)[1:3]
all(sapply(fls, file.exists))

To read a singl file, we use parse_fqc followed by the object constructor FastQC.

# read file and make object
fqc <- parse_fqc(fls[1])
fqc <- FastQC(fqc)

class(fqc)

To read multiple files, 3 in this case, we use a similar process; a parse function parse_collection then a constructor multiFastQC.

# read multiple text output files and make object
mfqc <- parse_collection(fls,
                         names = paste('sample', 1:3, sep = '_'))
mfqc <- multiFastQC(mfqc)

class(mfqc)

Exploring FastQC and multiFastQC objects

The S3 objects FastQC and multiFastQC for single and mulstiple quality control files are of very simple structure and used mainly to dispatch the different methods accordingly. Both objects conform to a similar structure; a named list of data.frame. The names of the list correspond to the names of the modules/tests in the files. These names should be identical to the ones shown by calling summary in the object.
Calling str on the object should provide a clear view of its content. Moreover, this should serve as to make sure the parse and construct steps havn't faild silently.

# structure of FastQC object
str(fqc)

To show the names of the methods that can be applied to the FastQC object, we call methods.

# methods applied to FastQC object
methods(class = 'FastQC')

plot is currently the most important method. We dedicated a large portion of the remaining section to discuss it. Mainly, we will discuss its required arguments, optional arguments and how it's used to gain insights in the quality metircs and comparing qualities of multiple files.

The plot function requires to argumnets; fqc and test otherwise would fail with an error. Optionally, the user can provide other plotting arguments to modify the plot. Also, it can be used with other ploting functions to customize the plot and add data to it.
The function test_get takes the two required arguments of plot and returns the data.frame of a particular test and can be used consistently to modify the corresponding plots.
Here we show the call to the plot method with only required argument on the left and with customization and adding data using other graphics functions.

par(mfrow = c(1,2))

# default plot
plot(fqc, test = 'Per base sequence quality')

# use test_get to get data of a particular tes
# use the data to modify the plot
plot(fqc,
     test = 'Per base sequence quality',
     xlab = 'Base',
     ylab = 'Quality Score',
     ylim = c(0, 40))

# get module data and modify the plot
# add median and cut-off mark at 30
dat <- test_get(fqc, test = 'Per base sequence quality')
lines(x = dat$Base, y = dat$Median, col = 'red')
abline(h = 30, lty = 2, col = 'gray')

multiFastQC is of very similar structure. A named list of data.frames. The only differnce is that each data.frame contains module data from multiple files, this is denoted by adding a column file to each of the data.frames. By default, the entries in the file columns will be the path to the files, this can be changed when calling parse_collection using the argument names.
Similarly, we use str and methods to inspect the structure and the available methods for the object multiFastQC.

str(mfqc)

methods(class = 'multiFastQC')

The plot method behaves differently when used with multiFastQC objects due to the fact that it represents quality metrics of different files. It still requires two arguments; fqc for a multiFastQC object and test for the module/test name. Optionally, users can provide additional arguments to modify the plot. The function in that case returns a series of plots, one for each file.

par(mfrow = c(1,3))
plot(mfqc,
     test = 'Per base sequence quality',
     xlab = 'Base',
     ylab = 'Quality Score')

Interpreting FastQC analysis modules results

The creators of FastQC has written a through documentation of the different analysis modules and their possible outcomes. Documenations and examples are available here. In this section, we will use the functions available in the Rfastqc package to explore the different aspects of the analysis modules.
As expained in the original documentation, for each module three possible outcomes are possible.
- Pass
- Warning
- Fail
To get a summary of which modules has passed or failed, we use the method summary. The resulting data.frame has the name and output of each module.

summary(fqc)

The first module in the object is Basic Statistics. It's rather a self explaintory text output of general information about the files the was tested using FastQC. There is no plotting method for that module, however the data in it can be extracted using the function test_get.

test_get(fqc, test = 'Basic Statistics')

Next,

plot(fqc,
     test = 'Per base sequence quality',
     ylim = c(0,40),
     xlab = 'Base',
     ylab = 'Quality Score')
dat <- test_get(fqc, 'Per base sequence quality')
lines(dat$Base, dat$`Lower Quartile`, col = 'red')
abline(h = c(20, 30), lty = 2, col = 'gray')

plot(fqc, 
     test = 'Per sequence quality scores',
     xlab = 'Quality Score',
     ylab = 'Count')
abline(v = 27, lty = 2, col = 'gray')

cols = c('red', 'blue', 'green', 'orange')
plot(fqc, 
     test = 'Per base sequence content',
     ylim = c(0, 100),
     xlab = 'Base',
     ylab = 'Content (%)',
     color = cols)
dat <- test_get(fqc, test = 'Per base sequence content')
legend('topright', legend = names(dat)[-1], fill = cols)

plot(fqc,
     test = 'Per sequence GC content',
     xlab = 'Base',
     ylab = 'Content')

plot(fqc,
     test = 'Per base N content',
     ylim = c(0, 1),
     xlab = 'Base',
     ylab = 'Content (%)')
abline(h = .05, lty = 2, col = 'gray')

# plot(fqc,
#      test = 'Sequence Length Distribution')

cols <- c('red', 'blue')
plot(fqc,
     test = 'Sequence Duplication Levels',
     ylim = c(0, 100),
     xlab = 'Duplication level',
     ylab = 'Percentage',
     color = cols)
abline(h = c(20, 50), lty = 2, col = 'gray')
dat <- test_get(fqc, test = 'Sequence Duplication Levels')
legend('topright', 
       legend = names(dat)[-1],
       fill = cols)

# plot(fqc,
#      test = 'Overrepresented sequences')

cols <- c('red', 'blue', 'green')
plot(fqc,
     test = 'Adapter Content',
     ylim = c(0, .001),
     xlab = 'Base',
     ylab = 'Content (%)')
dat <- test_get(fqc, 'Adapter Content')
legend('topright',
       legend = names(dat)[-1],
       fill = cols)