knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This fastq_processing
vignette re-introduces some of the content from the
BasicQC
tutorial concept but derives the primary information from the FASTQ
file for users who have not maintained their sequencing_summary file. This is
intended to facilitate the development of workflows and reports that are
decoupled from the requirement for the sequencing_summary
file.
The floundeR
package is distributed with a collection of canned datasets.
These include an example FASTQ
file that has been gzip compressed and contains
a somewhat lacklustre historical dataset that is interesting only in its
compactness.
library(floundeR) canonical_fastq <- flnDr("example.fastq.gz") fastq <- Fastq$new(canonical_fastq) print(fastq) fastq$as_tibble()
So what have we done here? We have identified the packaged fastq file and we
have used this file to instantiate the Fastq
object - this can be displayed
using the print()
command and we can have a quick look at the data that has
been extracted using the $as_tibble()
function that is exported by the
package.
fastq %>% to_sequencing_set()
The SequencingSet
in turn has a collection of methods that can be used to
structure and visualise the data. The first that we'll have a look at is the
$enumerate
method that returns an Angenieux
object for data visualisation.
knitr::include_graphics( fastq$sequencingset$enumerate$to_file("figure_5.png")$plot())
There are a plethora of ways through which the Angenieux
object can be used
to style, colour and manipulate the graph - please do have a look at the methods
documentation.
The SequencingSet
object can also be used to access simple but primitive
summary statistics such as mean
sequence length, N50
length etc
fastq$sequencingset$N50 fastq$sequencingset$mean
The distribution of sequence lengths is an important metric that is impacted by
choice of library preparation, starting DNA isolation etc. A plot of length
distributions is prepared from the same SequencingSet
object that we
reviewed in the previous section.
knitr::include_graphics( fastq$sequencingset$read_length_bins(bins=35, outliers=0.001)$ to_file("figure_6.png")$ plot(style="stacked"))
The distribution of sequence lengths is an important metric that is impacted by
choice of library preparation, starting DNA isolation etc. A plot of length
distributions is prepared from the same SequencingSet
object that we
reviewed in the previous section.
knitr::include_graphics( fastq$sequencingset$quality_bins(bins=100)$ to_file("figure_7.png")$ plot(style="stacked"))
The Guppy
basecalling software converts the Nanopore format FAST5 raw sequence
files into the FASTQ files that we have reviewed in the previous section. The
FASTQ entries prepared by Guppy
contain additional information in their
header fields. These additional information contain metadata that relates
to the sequencing run and are, for example, used by the EPI2ME software for the
preparation of the rich real-time reports. The fishy_fastq
method used in the
previous section can also parse these sequence metadata facets from the FASTQ
file.
tibble
from these Guppy based FASTQ dataAdd the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.