knitr::opts_chunk$set(message = FALSE, warning = FALSE)
ngsReports is designed to bolt into data processing pipelines and
produce combined plots for multiple FastQC reports generated across an entire
set of libraries or samples.
The primary functionality of the package is parsing FastQC reports, with import
methods also implemented for log files produced by tools such as trimmomatic, bowtie, STAR and picard.
In addition to parsing files, default plotting methods are implemented.
Plots applied to a single file will replicate the default plots from FastQC,
whilst methods applied to multiple FastQC reports summarise these and produce
a series of custom plots.
Plots are produced as standard ggplot2 objects, with an
interactive option available using plotly.
As well as custom summary plots, tables of read counts and the like can also
be easily generated.
In addition to the usage demonstrated below, a
shiny app has been developed
for interactive viewing of FastQC reports.
This can be installed using:
A vignette for this app will be installed with the app.
In its simplest form, a default summary report can be generated by
specifying a directory containing the output from FastQC and calling the
function writeHtmlReport().
fileDir <- file.path("path", "to", "your", "FastQC", "Reports")
writeHtmlReport(fileDir)
This function will transfer the default template to the provided directory and
produce a single
.html file containing interactive summary plots of any
FastQC output found in the directory.
FastQC output can be *fastqc.zip files or the same files extracted as individual directories.
The default template is provided as ngsReports_Fastqc.Rmd in the package.
This template can be easily modified, and your modified file supplied to the
above function as an alternate RMarkdown template using the template argument.
altTemplate <- file.path("path", "to", "your", "new", "template.Rmd")
writeHtmlReport(fileDir, template = altTemplate)
ngsReports introduces two main classes, FastqcData and FastqcDataList.
FastqcData objects hold the parsed data from a single report as
generated by the stand-alone tool FastQC.
These are then extended into lists for more than one file as a FastqcDataList.
For most users, the primary class of interest will be the FastqcDataList.
To load a set of FastQC reports into R as a FastqcDataList, specify the
vector of file paths, then call the function FastqcDataList().
In the rare case you'd like an individual file, this can be performed by
calling FastqcData() on an individual file, or by subsetting the output from
FastqcDataList() using the [[]] operator, as with any list object.
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)
From here, all FastQC modules can be obtained as a tibble
using the function getModule() and choosing one of the following modules:
Summary (The PASS/WARN/FAIL status for each module)
Capitalisation and spelling of these module names follows the default patterns from FastQC reports with spaces replaced by underscores. One additional module is available and taken directly from the text within the supplied reports
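As a minimal sketch of the step described above, the Summary module could be extracted from the example reports shipped with the package. The object name statuses is introduced here purely for illustration, and the exact columns returned may differ between package versions.

```r
library(ngsReports)

# Load the example FastQC reports bundled with ngsReports
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Extract the PASS/WARN/FAIL status of every module for every file
# ('statuses' is an illustrative name only)
statuses <- getModule(fdl, "Summary")
head(statuses)
```

Other modules would be requested the same way, by passing the module name with spaces replaced by underscores.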
In addition, the read totals for each file in the library can be obtained
using readTotals(), which can be easily used to make a table of read totals.
This essentially just returns the first two columns from the Basic_Statistics module.
reads <- readTotals(fdl)
The packages dplyr and pander can also be extremely useful for manipulating
and displaying imported data.
To show only the R1 read totals, you could do the following:
library(dplyr)
library(pander)
# Keep only the R1 files, then print a formatted table
reads %>%
    dplyr::filter(grepl("R1", Filename)) %>%
    pander(
        big.mark = ",",
        caption = "Read totals from R1 libraries",
        justify = "lr"
    )
Plots created from a single FastqcData object will resemble those generated
by the FastQC tool, whilst those created from a FastqcDataList will be
combined summaries across a library of files.
In addition, all plots can be generated as interactive plots by setting the
argument usePlotly = TRUE.
All FastQC modules have been enabled for plotting using default
with the exception of
The simplest of the plots is to summarise the PASS/WARN/FAIL flags as
produced by FastQC for each module.
This plot can be simply generated using plotSummary().
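A minimal sketch of the summary plot, assuming the example reports shipped with the package are used as input. The object name p is illustrative only.

```r
library(ngsReports)

# Load the example FastQC reports bundled with ngsReports
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Summarise the PASS/WARN/FAIL flags for every module across all files
# ('p' is an illustrative name only)
p <- plotSummary(fdl)
p
```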
The next most informative plot may be to summarise the total numbers of reads
in each associated Fastq file.
By default, the number of duplicated sequences from the
Total_Duplicated_Percentage module is shown, but this can be disabled by
setting duplicated = FALSE.
As these are ggplot2 objects, the output can be modified easily using
standard ggplot2 syntax.
Here we'll move the legend to the top right as an example.
plotReadTotals(fdl) +
    theme(
        legend.position = c(1, 1),
        legend.justification = c(1, 1),
        legend.background = element_rect(colour = "black")
    )
Turning to the Per base sequence quality scores is the next most common step
for most researchers, and these can be obtained for an individual file by
selecting this as an element (i.e. a FastqcData object) of the main FastqcDataList.
This plot replicates the default plots from a FastQC report.
When working with multiple FastQC reports, these are summarised as a heatmap using the mean quality score at each position.
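The two views described above can be sketched as follows, assuming the example reports shipped with the package. The object names pSingle and pAll are illustrative only.

```r
library(ngsReports)

# Load the example FastQC reports bundled with ngsReports
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# A single FastqcData object: resembles the FastQC per-base quality plot
pSingle <- plotBaseQuals(fdl[[1]])

# The complete FastqcDataList: heatmap of mean quality at each position
pAll <- plotBaseQuals(fdl)

pSingle
pAll
```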
Boxplots of any combinations can also be drawn from a FastqcDataList by
setting the argument plotType = "boxplot".
However, this may not be suitable for datasets with a large number of libraries.
plotBaseQuals(fdl[1:4], plotType = "boxplot")
Similarly, the Mean Sequence Quality Per Read plot can be generated to
replicate plots from a FastQC report by selecting the individual file from the
FastqcDataList.
A heatmap of mean sequence qualities can be generated when inspecting multiple reports.
An alternative view may be to plot these as overlaid lines, which can be simply
done by setting
plotType = "line".
Again, discretion should be shown when choosing this option for many samples.
# Select only the R2 files by name
r2 <- grepl("R2", names(fdl))
plotSeqQuals(fdl[r2], plotType = "line")
The Per_base_sequence_content module can also be plotted for an individual
report, with the layout being identical to that from FastQC.
These are then combined across multiple files as a heatmap showing a composite
colour for each position.
Colours are combined with A, C, G and T being represented
by green, blue, black and red respectively.
NB: These plots can be very informative when setting the argument
usePlotly = TRUE; however, they can be slow to render given the nature of how
plotly renders graphics.
Again, supplying multiple files and setting plotType = "line" will replicate
multiple individual plots from the original FastQC reports.
The argument nc is passed to facet_wrap() from the package ggplot2 to
determine the number of columns in the final plot.
plotSeqContent(fdl[1:2], plotType = "line", nc = 1)
Adapter content as identified by FastQC is also able to be plotted for an individual file.
When producing a heatmap across a set of FastQC reports, this will default to Total Adapter Content, instead of showing the individual adapter types.
As with all modules, the Sequence Duplication Levels plot is able to be replicated for an individual file.
When plotting across multiple FastQC reports, duplication levels are shown as a
heatmap based on each default bin included in the initial FastQC reports.
By default, the plotted values are the "Pre" de-duplication values.
Note that values are converted to percentages instead of read numbers to ensure
comparability across files.
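A minimal sketch of the duplication heatmap described above, assuming the example reports shipped with the package. The object name pDup is illustrative only.

```r
library(ngsReports)

# Load the example FastQC reports bundled with ngsReports
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Heatmap of duplication levels across all reports, shown as percentages
# ('pDup' is an illustrative name only)
pDup <- plotDupLevels(fdl)
pDup
```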
In the plot below,
CCGC_R2 shows very low duplication levels, whilst
shows high levels of duplication.
The commonly observed 'spikes' around
>10 are also evident as the larger red
A selection of Theoretical GC content is supplied with the package in the
object gcTheoretical, which has been defined using an additional class.
GC content was calculated using scripts obtained from
https://github.com/mikelove/fastqcTheoreticalGC.
Available genomes and transcriptomes can be obtained using the function
gcAvail() on the object gcTheoretical and specifying the type.
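As a minimal sketch, the available genomes could be listed as below. This assumes "Genome" is an accepted value for the type, mirroring the "Transcriptome" value used with plotGcContent() elsewhere in this document. The object name avail is illustrative only.

```r
library(ngsReports)

# List the genomes with Theoretical GC content supplied in the package
# ('avail' is an illustrative name; "Genome" is assumed to be a valid type)
avail <- gcAvail(gcTheoretical, "Genome")
avail
```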
As with all modules, data for an individual file replicates the default plot
from a FastQC report, but with one key difference.
This is that the Theoretical GC content has been provided in the object
gcTheoretical based on 100bp reads.
This empirically determined content is shown as the Theoretical GC content
plotGcContent(fdl[[1]], species = "Hsapiens", gcType = "Transcriptome")
Again, data is summarised as a heatmap when plotting across multiple reports, with the default value being the difference between the observed and the theoretical GC content.
Line plots can also be produced as an alternative viewpoint, with read totals displayed as percentages instead of raw counts.
plotGcContent(fdl, plotType = "line", gcType = "Transcriptome")
Customized theoretical GC content can be generated using input DNA sequences from a supplied fasta file.
faFile <- system.file(
    "extdata", "Athaliana.TAIR10.tRNA.fasta",
    package = "ngsReports"
)
plotGcContent(fdl, Fastafile = faFile, n = 1000)
When inspecting the Overrepresented Sequence module, the top
n can be plotted
for an individual file, again broken down by their possible source, and
coloured based on their
When applying this across multiple files, instead of identifying common sequences across a set of libraries, overrepresented sequences are summarised by their possible source as defined by FastQC.
In addition to the above, the most abundant n overrepresented sequences can
be exported as a FASTA file for easy submission to a sequence search tool such as BLAST.
overRep2Fasta(fdl, n = 10)
A selection of log files, as produced by tools such as trimmomatic, bowtie, STAR and
picard duplicationMetrics, can be easily imported using importNgsLogs().
The tool can be specified by the user using the argument type; however, if no
type is provided, we will attempt to auto-detect it from the file's structure.
Note: only a single log file type can be imported at any time.
The importNgsLogs() function currently supports log files from
the following tools.
Adapter removal and trimming (e.g. trimmomatic)
Mapping and alignment (e.g. bowtie, STAR, samtools flagstat and picard duplication metrics)
fl <- c("Sample1.trimmomaticPE.txt")
trimmomaticLogs <- system.file("extdata", fl, package = "ngsReports")
df <- importNgsLogs(trimmomaticLogs)
df %>%
    dplyr::select("Filename", contains("Surviving"), "Dropped") %>%
    pander(
        split.tables = Inf,
        style = "rmarkdown",
        big.mark = ",",
        caption = "Select columns as an example of output from trimmomatic."
    )
Bowtie log files can be parsed and imported:
fls <- c("bowtiePE.txt", "bowtieSE.txt")
bowtieLogs <- system.file("extdata", fls, package = "ngsReports")
df <- importNgsLogs(bowtieLogs, type = "bowtie")
df %>%
    dplyr::select("Filename", starts_with("Reads")) %>%
    pander(
        split.tables = Inf,
        style = "rmarkdown",
        big.mark = ",",
        caption = "Select columns as an example of output from bowtie."
    )
STAR log files can be parsed and imported:
starLog <- system.file("extdata", "log.final.out", package = "ngsReports")
df <- importNgsLogs(starLog, type = "star")
df %>%
    dplyr::select("Filename", contains("Unique")) %>%
    pander(
        split.tables = Inf,
        style = "rmarkdown",
        big.mark = ",",
        caption = "Select columns as output from STAR"
    )
The output of the samtools flagstat module can be parsed and imported:
flagstatLog <- system.file("extdata", "flagstat.txt", package = "ngsReports")
df <- importNgsLogs(flagstatLog, type = "flagstat")
df %>%
    pander(
        split.tables = Inf,
        style = "rmarkdown",
        big.mark = ",",
        caption = "Flagstat output for a single file"
    )
In addition to the files produced by the above alignment tools, the output
from Duplication Metrics (picard) can also be imported.
This is imported as a list with a tibble containing the detailed output in
the list element $metrics and the histogram data included as the second element.
sysDir <- system.file("extdata", package = "ngsReports")
fl <- list.files(sysDir, "Dedup_metrics.txt", full.names = TRUE)
dupMetrics <- importNgsLogs(fl, type = "duplicationMetrics", which = "metrics")
str(dupMetrics)
Summaries of log files from select mapping and alignment tools can be plotted
using the function plotAlignmentSummary().
plotAlignmentSummary(bowtieLogs, type = "bowtie")
plotAlignmentSummary(starLog, type = "star")
Assembly 'completeness' and summary statistic information from the tools BUSCO
and quast can also be plotted using the function plotAssemblyStats().
buscoLog <- system.file(
    "extdata", "short_summary_Dmelanogaster_Busco.txt",
    package = "ngsReports"
)
plotAssemblyStats(buscoLog, type = "busco")
fls <- c("quast1.tsv", "quast2.tsv")
quastLog <- system.file("extdata", fls, package = "ngsReports")
plotAssemblyStats(quastLog, type = "quast")
plotAssemblyStats(quastLog, type = "quast", plotType = "paracoord")