knitr::opts_chunk$set(error = FALSE, message = FALSE, fig.align = 'center')
title: "Using Rfastqc"
author: "Mahmoud Ahmed"
date: "r Sys.Date()
"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Using Rfastqc}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
To get started, we assume you ran the command line utility FastQC
on a number of raw reads fastq
or alignment sam
files. The output includes a an html
report and a zipped file. The latter contains among others the output text files txt
, we will be working with.
Rfastqc
comes with sample text output files genrated from real runnin the FastQC
utility on 3 fastq
files. We will explain the workflow for generating these files and loading them into R in a latter section. For now, we will start by installing the Rfastqc
package and load the sample files in R.
Rfastqc
is available on github and can be installed using devtools
as following.
# installing Rfastqc devtools::install_github('MahShaaban/Rfastqc')
These are the required libraries to access the full functionality of Rfastqc
.
# loading required libraries library(Rfastqc) library(stringr) library(dplyr) library(tidyr)
To access the sample text output files we use system.file
to get the path to one of them.
# get a sample text output file fl <- system.file('extdata', 'sample_1.txt', package = 'Rfastqc') file.exists(fl)
Reading such files is as easy as running a single function parse_fqc
followed by construction and object of class FastQC
using a function with the same name.
# read file and make object fqc <- parse_fqc(fl) fqc <- FastQC(fqc)
To have a general overview of the content of the file, we use the summary
method. The output is a simple data.frame
of two columns; the first is the test/module name and the other is the evaluation of the FastQC
utility for whether the module passes the pre-defined cutt-off (more on that later).
# use the summary method summary(fqc)
Finally, we use the plot
method to generate quick graphs of two of the modules. The plot
method requires only to argumnets; fqc
for the FastQC
object and test
for the name of the test as displyed in the the output of the previous function call.
# use the plot method # Per base sequence quality module par(mfrow=c(1,2)) plot(fqc, test = 'Per base sequence quality', xlab = 'Base', ylab = 'Quality Score') # Per sequence GC content module plot(fqc, test = 'Per sequence GC content', xlab = 'Base', ylab = 'Content')
The raw reads and alignment files are usually large. It takes longer to download and run the FastQC
on them. To side step this delay, we will leave it to the user to obtain the fastq
files, intall and run FastQC
to get the text output that we will be working with in the rest of this workflow.
If everything works fine, the user should be able to get 3 txt
files containing the data of all module results. Rfastqc
comes with 3 of these files generated by running the same code in the previous section. To load them in the R session, we first need to construct the paths to them from the package.
# get text output files path <- system.file('extdata', package = 'Rfastqc') fls <- list.files(path, full.names = TRUE)[1:3] all(sapply(fls, file.exists))
To read a singl file, we use parse_fqc
followed by the object constructor FastQC
.
# read file and make object fqc <- parse_fqc(fls[1]) fqc <- FastQC(fqc)
class(fqc)
To read multiple files, 3 in this case, we use a similar process; a parse function parse_collection
then a constructor multiFastQC
.
# read multiple text output files and make object mfqc <- parse_collection(fls, names = paste('sample', 1:3, sep = '_')) mfqc <- multiFastQC(mfqc)
class(mfqc)
The S3 objects FastQC
and multiFastQC
for single and mulstiple quality control files are of very simple structure and used mainly to dispatch the different methods accordingly. Both objects conform to a similar structure; a named list
of data.frame
. The names
of the list
correspond to the names of the modules/tests in the files. These names
should be identical to the ones shown by calling summary
in the object.
Calling str
on the object should provide a clear view of its content. Moreover, this should serve as to make sure the parse and construct steps havn't faild silently.
# structure of FastQC object str(fqc)
To show the names of the methods that can be applied to the FastQC
object, we call methods
.
# methods applied to FastQC object methods(class = 'FastQC')
plot
is currently the most important method. We dedicated a large portion of the remaining section to discuss it. Mainly, we will discuss its required arguments, optional arguments and how it's used to gain insights in the quality metircs and comparing qualities of multiple files.
The plot
function requires to argumnets; fqc
and test
otherwise would fail with an error
. Optionally, the user can provide other plotting arguments to modify the plot. Also, it can be used with other ploting functions to customize the plot and add data to it.
The function test_get
takes the two required arguments of plot
and returns the data.frame
of a particular test and can be used consistently to modify the corresponding plots.
Here we show the call to the plot
method with only required argument on the left and with customization and adding data using other graphics
functions.
par(mfrow = c(1,2)) # default plot plot(fqc, test = 'Per base sequence quality') # use test_get to get data of a particular tes # use the data to modify the plot plot(fqc, test = 'Per base sequence quality', xlab = 'Base', ylab = 'Quality Score', ylim = c(0, 40)) # get module data and modify the plot # add median and cut-off mark at 30 dat <- test_get(fqc, test = 'Per base sequence quality') lines(x = dat$Base, y = dat$Median, col = 'red') abline(h = 30, lty = 2, col = 'gray')
multiFastQC
is of very similar structure. A named list
of data.frame
s. The only differnce is that each data.frame
contains module data from multiple files, this is denoted by adding a column file
to each of the data.frame
s. By default, the entries in the file
columns will be the path to the files, this can be changed when calling parse_collection
using the argument names
.
Similarly, we use str
and methods
to inspect the structure and the available methods for the object multiFastQC
.
str(mfqc)
methods(class = 'multiFastQC')
The plot
method behaves differently when used with multiFastQC
objects due to the fact that it represents quality metrics of different files. It still requires two arguments; fqc
for a multiFastQC
object and test
for the module/test name. Optionally, users can provide additional arguments to modify the plot. The function in that case returns a series of plots, one for each file.
par(mfrow = c(1,3)) plot(mfqc, test = 'Per base sequence quality', xlab = 'Base', ylab = 'Quality Score')
The creators of FastQC has written a through documentation of the different analysis modules and their possible outcomes. Documenations and examples are available here. In this section, we will use the functions available in the Rfastqc
package to explore the different aspects of the analysis modules.
As expained in the original documentation, for each module three possible outcomes are possible.
- Pass
- Warning
- Fail
To get a summary of which modules has passed or failed, we use the method summary
. The resulting data.frame
has the name
and output
of each module.
summary(fqc)
The first module in the object is Basic Statistics
. It's rather a self explaintory text output of general information about the files the was tested using FastQC
. There is no plotting method for that module, however the data in it can be extracted using the function test_get
.
test_get(fqc, test = 'Basic Statistics')
Next,
plot(fqc, test = 'Per base sequence quality', ylim = c(0,40), xlab = 'Base', ylab = 'Quality Score') dat <- test_get(fqc, 'Per base sequence quality') lines(dat$Base, dat$`Lower Quartile`, col = 'red') abline(h = c(20, 30), lty = 2, col = 'gray')
plot(fqc, test = 'Per sequence quality scores', xlab = 'Quality Score', ylab = 'Count') abline(v = 27, lty = 2, col = 'gray')
cols = c('red', 'blue', 'green', 'orange') plot(fqc, test = 'Per base sequence content', ylim = c(0, 100), xlab = 'Base', ylab = 'Content (%)', color = cols) dat <- test_get(fqc, test = 'Per base sequence content') legend('topright', legend = names(dat)[-1], fill = cols)
plot(fqc, test = 'Per sequence GC content', xlab = 'Base', ylab = 'Content')
plot(fqc, test = 'Per base N content', ylim = c(0, 1), xlab = 'Base', ylab = 'Content (%)') abline(h = .05, lty = 2, col = 'gray')
# plot(fqc, # test = 'Sequence Length Distribution')
cols <- c('red', 'blue') plot(fqc, test = 'Sequence Duplication Levels', ylim = c(0, 100), xlab = 'Duplication level', ylab = 'Percentage', color = cols) abline(h = c(20, 50), lty = 2, col = 'gray') dat <- test_get(fqc, test = 'Sequence Duplication Levels') legend('topright', legend = names(dat)[-1], fill = cols)
# plot(fqc, # test = 'Overrepresented sequences')
cols <- c('red', 'blue', 'green') plot(fqc, test = 'Adapter Content', ylim = c(0, .001), xlab = 'Base', ylab = 'Content (%)') dat <- test_get(fqc, 'Adapter Content') legend('topright', legend = names(dat)[-1], fill = cols)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.