Tutorial Objectives

`r newthought("The Summary Statistics and QC tutorial")` is intended as a guide to help assess and question the quality characteristics of a Nanopore sequence collection. Many tools exist for the QC analysis of short-sequence reads; this tutorial aims to enable an objective assessment of the relative performance of a Nanopore flowcell run and to benchmark the quality of the sequences it produced.

`r newthought("Sufficient information")` is provided in the tutorial so that the workflow can be tested, validated, and replicated. The tutorial is provided with an example dataset from a barcoded sequence library. The tutorial is intended to address important questions;

`r newthought("Methods utilised")` within this tutorial include

`r newthought("Computational requirements")` for this tutorial include


Quick start

This tutorial aims to summarise the data characteristics from an Oxford Nanopore Technologies sequencing run. Observations from basecalled reads and their quality characteristics, temporal performance, and barcoded content are presented. The information presented is derived from the sequencing_summary.txt file produced during basecalling by MinKNOW (soon), Guppy and Albacore.

This report has been produced from an Rmarkdown template. It is intended both as a starting template for the QC review of data produced during a run and as a tutorial for the exploration and assessment of Oxford Nanopore DNA sequence data. The goals of this tutorial are

  1. To introduce a literate framework for analysing base-calling summary statistics to evaluate relative performance of runs
  2. To provide basic QC metrics such that a review and consideration of experimental data can be undertaken
  3. To provide training as to which QC metrics are of greatest value and to encourage an understanding as to how different aspects of sequencing quality can be attributed to various characteristics from DNA isolation to library preparation.

Several of the plots included in this report have been replicated from publicly available projects such as POREquality ^1, minion_qc ^2, and pycoQC ^3.

`r newthought('The sequencing\\_summary.txt')` file is produced by either the Albacore or Guppy base-calling software. The file contains rich metadata for each sequence read produced during a run. These data include timestamp, quality, and channel information, in addition to the characteristics of the called DNA sequence. This tutorial uses this file as its starting point for performance reasons.

Other QC tools such as wub ^4 utilise the fastq file for quality metrics, and other tools make extensive use of the fast5 files. Parsing the fast5 files provides additional analytical context but is much more demanding in terms of compute resource and time. This tutorial is lightweight and is intended to run within a few minutes on a desktop computer.

Customize the tutorial for your data

Using the sequencing_summary.txt file makes this analysis quick to perform and report. Fastq / Fast5 files are much larger and parsing them to calculate the required metrics associated with base-calling takes a considerable amount of time.

The Rmarkdown file contains the narrative text and the code that performs the analysis, together with the parameters and a pointer to the file to be analysed. The project, fresh from GitHub, includes a bzip2-compressed file summarising base-calling for approximately 800,000 sequences. Please edit the line defining the inputFile variable. This should point to a sequencing_summary file, which may be compressed (.gz or .bz2). The file may either be monolithic, from a single base-calling process, or a concatenation of multiple sequencing_summary files from e.g. cluster-based base-calling.

Best practice is to place your sequencing_summary file in the RawData folder of the project created from the GitHub clone.

Only the **`inputFile`** is critical; the *`flowcellId`* and *`libraryPrepKit`* parameters are only used for legends at present. 
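
As a minimal sketch of how the file named by `inputFile` could be loaded: base R's `read.delim()` opens a path in text mode, and R detects gzip or bzip2 compression when the file is opened, so the same call should serve for plain and compressed summaries. The tiny file written here is a stand-in for real data. The column names follow the usual sequencing_summary.txt layout, but treat them as assumptions and check them against your own file.

```r
# Build a tiny stand-in for a sequencing_summary.txt file (hypothetical values).
toy <- data.frame(
  read_id = c("read_a", "read_b", "read_c"),
  channel = c(1L, 2L, 2L),
  sequence_length_template = c(1200L, 560L, 8300L),
  mean_qscore_template = c(9.8, 5.2, 11.4),
  passes_filtering = c(TRUE, FALSE, TRUE)
)

# Write it gzip-compressed, as a summary file might be stored.
inputFile <- tempfile(fileext = ".txt.gz")
con <- gzfile(inputFile, "wt")
write.table(toy, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# read.delim() reads tab-separated text; compression is detected on open.
sequencedata <- read.delim(inputFile, stringsAsFactors = FALSE)
str(sequencedata)
```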


Run the analysis

The analysis of the sequences specified within the Rmarkdown file is performed as part of the knit process. This loads the sequencing_summary file, prepares the sequence analysis, renders the figures, and prepares the report. To start the analysis, it is only necessary to click the knit button in RStudio - please see figure \ref{fig:KnitIt}.



Executive Summary

Basecalling was performed using the Albacore software. Called reads were classified as either pass or fail depending on their overall quality score. For this analysis, a total of `r formatC(nrow(sequencedata), big.mark=",")` reads were basecalled and of these `r formatC(nrow(passedSeqs), big.mark=",")` (`r round(nrow(passedSeqs) / nrow(sequencedata) * 100, 1)`%) passed the quality filter. The passed reads contain a total of `r round(passedBases, 2)` Gbp of DNA sequence. This passed fraction amounts to `r round(passedBases / totalBases * 100, 1)`% of the total DNA sequenced.
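
The figures quoted in this summary can be derived from the per-read table with a few lines of R. The sketch below is illustrative only: the data frame is made up, and the column names `sequence_length_template` and `passes_filtering` are assumptions about the summary file layout. The variable names mirror those used in the inline expressions above.

```r
# Hypothetical per-read summary; column names are assumptions.
sequencedata <- data.frame(
  sequence_length_template = c(1500, 300, 7200, 4100, 950),
  passes_filtering = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

passedSeqs <- sequencedata[sequencedata$passes_filtering, ]

# Express base counts in gigabases (1 Gbp = 1e9 bases).
totalBases  <- sum(sequencedata$sequence_length_template) / 1e9
passedBases <- sum(passedSeqs$sequence_length_template) / 1e9

pctReads <- round(nrow(passedSeqs) / nrow(sequencedata) * 100, 1)
pctBases <- round(passedBases / totalBases * 100, 1)

cat(formatC(nrow(sequencedata), big.mark = ","), "reads,",
    pctReads, "% passed, carrying", pctBases, "% of bases\n")
```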

Sequencing channel activity plot

The nanopores through which DNA is passed, and signal collected, are arrayed on a 2-dimensional matrix. A heatmap can be plotted such that channel productivity can be shown against spatial position on the matrix. Such a plot enables the identification of spatial artifacts that could result from membrane damage through e.g. the introduction of an air bubble. This heatmap representation of spatial activity shows only gross spatial aberrations; the activity plot shows the number of sequences produced per channel, not per pore.
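
The per-channel counts behind such a heatmap can be sketched as follows. This only tabulates reads per channel; placing each channel at its physical position needs the flow cell's channel layout, which is not reproduced here. The channel values are hypothetical, and the count of 512 channels applies to MinION flow cells.

```r
# Hypothetical channel assignments for a handful of reads.
channel <- c(1, 1, 2, 5, 5, 5, 512)

# MinION flow cells expose 512 channels; other devices differ.
nChannels <- 512
readsPerChannel <- tabulate(channel, nbins = nChannels)

# Channels that produced no reads are candidates for closer inspection.
silent <- which(readsPerChannel == 0)
cat(length(silent), "of", nChannels, "channels produced no reads\n")
```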

Quality and length

The distribution of base-called DNA sequence lengths and their accompanying qualities are key metrics for the review of a sequencing library. This section of the QC review tutorial assesses the length and quality distributions for reads from this flowcell. It reviews the total collection of sequences, including both those that pass and those that fail the mean quality filter.

The distribution of sequence lengths will be dependent on the protocols that have been used to extract the starting DNA. Sequences from amplicon DNA will have a tight distribution of read lengths, while sequences from genomic DNA will have a broader distribution of sheared DNA product. The distribution will be further influenced if a size-selection step has been used, and will also be dependent on the choice of sequencing library preparation kits. The read length distribution should be assessed to see if the distribution is concordant with that expected.


The distribution of read lengths has been coloured according to whether reads pass or fail the mean QV filter. The mean and N50 values for the QV-passing sequences are superimposed on the histogram.  
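
For readers unfamiliar with N50, the sketch below shows one common way to compute it: sort the read lengths from longest to shortest and report the length at which the running total first reaches half of all bases. The function name `calcN50` and the read lengths are hypothetical and not taken from the tutorial's code.

```r
# N50: the length L such that reads of length >= L hold at least half of all bases.
calcN50 <- function(lengths) {
  sorted <- sort(as.numeric(lengths), decreasing = TRUE)
  sorted[which(cumsum(sorted) >= sum(sorted) / 2)[1]]
}

passedLengths <- c(2000, 3000, 4000, 5000, 6000)  # hypothetical read lengths
meanLength <- mean(passedLengths)
n50 <- calcN50(passedLengths)
c(mean = meanLength, N50 = n50)
```

Because long reads carry more bases, the N50 usually sits above the mean read length, as it does in this toy example.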

\hfill\break A histogram of mean QV scores reveals the relative abundance of sequences of different quality.

The distribution of sequence qualities is coloured by QV filter pass status. This QV filter is applied in the base-calling workflow as a modifiable parameter.  
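
To make the QV filter concrete, the sketch below converts Phred-scaled quality values to error probabilities and applies a cut-off. The threshold of 7 is an assumed value for illustration; use whatever value was set in your base-calling run. The scores themselves are hypothetical.

```r
# A Phred-scaled quality Q corresponds to an error probability of 10^(-Q/10).
qToErrorProb <- function(q) 10^(-q / 10)

meanQ <- c(5.5, 7.0, 9.3, 12.1)   # hypothetical per-read mean Q scores
qvThreshold <- 7                  # assumed cut-off, not read from any config

passes <- meanQ >= qvThreshold
data.frame(meanQ, errorProb = round(qToErrorProb(meanQ), 3), passes)
```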

\hfill\break The density plot of mean sequence quality plotted against log10 sequence length appears as a favourite in the various publicly available QC tools. I am not sure of the overall point of this plot ... It reinforces the point that the shortest sequences tend to have low quality values. In terms of per-flowcell QC, and as a plot used for the systematic review of many libraries, I feel that this is superfluous ... I would welcome some feedback. \hfill\break

The density plot of log sequence length against mean read quality has been sharpened for presentation. Bins containing five or fewer sequences have been removed; this removes background speckle from the graph while the regions of increased density remain.
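
The speckle removal described above can be sketched with base R alone. The simulated reads, the choice of 30 bins per axis, and the variable names are all assumptions made for illustration; only the rule of dropping bins with five or fewer sequences comes from the text.

```r
set.seed(1)
# Hypothetical reads: log-normal lengths and roughly normal mean Q scores.
len  <- rlnorm(5000, meanlog = 8, sdlog = 1)
qual <- rnorm(5000, mean = 9, sd = 2)

# Assign each read to a 2-D bin of log10(length) by mean quality.
lenBin  <- cut(log10(len), breaks = 30)
qualBin <- cut(qual, breaks = 30)
binCounts <- as.data.frame(table(lenBin, qualBin))

# Keep only bins holding more than five reads, removing background speckle.
dense <- binCounts[binCounts$Freq > 5, ]
nrow(dense)
```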


Time/Duty Performance

Another key metric in the quality review of a sequencing run is an analysis of the temporal performance of the run. During a run each sequencing channel may address a number of different pores (mux) and the individual pores may become temporarily or permanently blocked. It is therefore expected that sequencing productivity will decrease during a run; it is useful to review whether the observed decline in productivity is normal or whether it happens more rapidly than expected. A rapid pore decline could be indicative of contaminants within the sequencing library.


This cumulative plot has been augmented with information that shows the points within the run that account for 50% and 90% of the total bases sequenced.
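
A minimal sketch of how the 50% and 90% time points might be found is shown below. The reads are hypothetical, `start_time` is assumed to be in seconds from the start of the run, and the helper `timeAtFraction` is a name invented for this example.

```r
# Hypothetical reads: start time in seconds since run start, and length in bases.
reads <- data.frame(
  start_time = c(7200, 600, 3600, 10800, 1800),
  sequence_length_template = c(2000, 4000, 1000, 500, 2500)
)

# Order reads by start time and accumulate the bases sequenced.
reads <- reads[order(reads$start_time), ]
cumBases <- cumsum(as.numeric(reads$sequence_length_template))

# First time point at which a given fraction of the total has been reached.
timeAtFraction <- function(frac) {
  reads$start_time[which(cumBases >= frac * sum(reads$sequence_length_template))[1]]
}

t50 <- timeAtFraction(0.5) / 3600   # hours
t90 <- timeAtFraction(0.9) / 3600
c(t50 = t50, t90 = t90)
```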


In addition to the cumulative plot of sequenced bases, an equivalent plot for the sequenced reads can be plotted. This is not too different in structure or morphology from the cumulative base plot. We recommend considering either a cumulative base plot or a cumulative read plot - the information overlap is sufficient that both are unlikely to be required.

The speed/time plot is a valuable metric in that changes in sequencing speed can be measured. A slow-down in sequencing speed can indicate a shortage of the sequencing fuel required by the motor protein.
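
Sequencing speed can be approximated per read as bases called divided by the time the read spent in the pore, then summarised over time windows. The sketch below assumes columns named `start_time`, `duration` (both in seconds), and `sequence_length_template`; the values are made up, and the one-hour window is an arbitrary choice.

```r
# Hypothetical reads: start time (s), time spent in the pore (s), and length (bases).
reads <- data.frame(
  start_time = c(100, 2000, 4000, 5000, 7500, 9000),
  duration   = c(10, 20, 10, 25, 20, 40),
  sequence_length_template = c(4500, 9000, 4200, 10000, 7000, 12000)
)

# Per-read translocation speed in bases per second.
reads$speed <- reads$sequence_length_template / reads$duration

# Group reads into one-hour windows and take the median speed in each.
reads$hour <- floor(reads$start_time / 3600)
speedByHour <- aggregate(speed ~ hour, data = reads, FUN = median)
speedByHour
```

In this toy data the median speed falls from one hour to the next, which is the kind of pattern the speed/time plot is meant to expose.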


Reproducible Research - Produce your own report

This report has been created using Rmarkdown, publicly available R packages, and the \LaTeX document typesetting software for reproducibility. For clarity, the R packages used, and their versions, are listed below.
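
One way such a listing could be produced is with base R's `sessionInfo()`, which reports the R version together with the attached and loaded packages. This is a sketch of the general approach and not necessarily the code used in the template.

```r
# Report the R version and the packages attached in this session.
si <- sessionInfo()
print(si$R.version$version.string)

# Version of a single package, here one that ships with R.
statsVersion <- packageVersion("stats")
print(statsVersion)
```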


`r newthought("Final thoughts.")` Behind this Rmarkdown file (and its glossy PDF) is a modest amount of R code - please explore the Rmarkdown template, modify it, and run it with your own samples.

To extract the whole set of R code from the Rmarkdown, use the `purl()` function from the knitr package - this will extract the R code into its own file.
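
A minimal sketch of the extraction step, assuming the knitr package is installed. The small Rmarkdown document written here is a hypothetical stand-in; in practice the input would be the path to the tutorial's own .Rmd file.

```r
library(knitr)

# A tiny stand-in Rmarkdown document (hypothetical content).
fence <- strrep("`", 3)
rmdFile <- tempfile(fileext = ".Rmd")
writeLines(c(
  "Some narrative text.",
  "",
  paste0(fence, "{r}"),
  "x <- 1 + 1",
  fence
), rmdFile)

# purl() writes the code chunks to an R script; documentation = 0 keeps code only.
rFile <- purl(rmdFile, output = tempfile(fileext = ".R"),
              documentation = 0, quiet = TRUE)
extracted <- readLines(rFile)
extracted
```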

Glossary of Terms


Appendix - Installation on macOS

The conda package management system is available for Linux and macOS. Some R-package dependencies for the tutorial conflict with the system files distributed by conda. For macOS, we therefore recommend that the R, RStudio, and \LaTeX installations are managed at the system level; other bioinformatics software will be installed through conda.

  1. Download and install the R statistical software from
  2. Install the gfortran and clang compilers provided at
  3. Download and install RStudio software from
  4. Download and install MacTeX from
  5. Install and update required macOS R packages
R --slave -e 'install.packages(c("BiocManager", "devtools"), ask=F, update=F)'
R --slave -e 'BiocManager::install(c("sysfonts", "showtext"), ask=F, update=F, type="source")'
R --slave -e 'BiocManager::install(c("kableExtra", "roxygen2", "emojifont", "extrafont", "R.utils"), ask=F, update=F)'
R --slave -e 'BiocManager::install(c("DESeq2", "caTools", "pcaMethods", "dplyr", "ggplot2", "plyr", "RColorBrewer", "rmarkdown", "tufte", "xfun", "xlsx", "yaml", "Rsubread", "ShortRead"), ask=F, update=F)'
References and Citations

