knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(fig.width=9, fig.height=6, warning=FALSE, message=FALSE)

library(TuPoreIal) 
library(data.table)
library(dplyr)
library(plyr)
library(emojifont)
library(extrafont)
library(fontawesome) # devtools::install_github(rstudio/fontawesome)
library(ggplot2)
library(knitr)
library(RColorBrewer)
library(tufte)
library(caTools)
library(yaml)

config <- yaml.load_file("config.yaml")
inputFile <- config$inputFile
flowcellId <- config$flowcellId
libraryPrepKit <- config$libraryPrepKit
scaling <- 1
reportDPI <- 90

Tutorial Objectives

r newthought("Summary Statistics and QC tutorial.") is intended as a guide to help assess and question the quality characteristics of a Nanopore sequence collection. There are a number of tools for the QC analysis of short-sequence reads, this tutorial aims to enable an objective assessment as to the relative performance of a Nanopore flowcell run and to assess the sequence characteristics to benchmark quality.

r newthought("Sufficient information") is provided in the tutorial such that the workflow can be tested, validated, and replicated. The tutorial is provided with an example dataset from a barcoded sequence library. The tutorial is intended to address important questions;

r newthought("Methods utilised") within this tutorial include

r newthought("Computational requirements") for this tutorial include

\pagebreak

Quick start

  1. Most software dependecies are managed though conda, install as described at
    https://conda.io/docs/install/quick.html#linux-miniconda-install.
  2. create a working directory for the project and cd into it
  3. Download github gist for environment.yaml
    wget https://gist.githubusercontent.com/sagrudd/609b9124b96f6d388951dad9e8bc4964/raw/environment.yaml
  4. Install conda software dependencies with
    conda env create --name sumStatTutorial --file environment.yaml
  5. Initialise conda environment with source activate sumStatTutorial
  6. Install the Nanopore tutorials Rpackage R --slave -e 'Sys.setenv(TAR = "/bin/tar"); devtools::install_github("sagrudd/tutorials")'
  7. Initialise the Nanopore QC tutorial R --slave -e 'TuPoreIal::InitTutorial("SeqStatsQC")'
  8. optional edit the provided config.yaml file to match your own study design
  9. Render the report using results from the analysis above R --slave -e 'rmarkdown::render("Nanopore_SumStatQC_Tutorial.Rmd", "tufte_handout")'

\pagebreak

Introduction

This tutorial aims to summarise the data characteristics from an Oxford Nanopore Technologies sequencing run. Observations from basecalled reads and their quality characteristics, temporal performance, and barcoded content are presented. The information presented is derived from the sequence_summary.txt file produced during basecalling by MinKNOW (soon), Guppy and Albacore.

This report has been produced from an Rmarkdown template and is intended as a starting template for QC review of data produced during a run and as a tutorial for the exploration and assessement of Oxford Nanopore DNA sequence data. The goals from this tutorial include

  1. To introduce a literate framework for analysing base-calling summary statistics to evaluate relative performance of runs
  2. To provide basic QC metrics such that a review and consideration of experimental data can be undertaken
  3. To provide training as to which QC metrics are of greatest value and to encourage an understanding as to how different aspects of sequencing quality can be attributed to various characteristics from DNA isolation to library preparation.

Several of the plots included in this report have been replicated from publicly available projects such as POREquality ^1, minion_qc ^2, and pycoQC ^3.

r newthought('The sequence\\_summary.txt') file is produced by either the Albacore or Guppy base-calling software. The file contains rich metadata for each sequence read produced during a run. These data include timestamp, quality, and channel information, in addition to the characteristics of the called DNA sequence. This tutorial uses this starting file for performance reasons.

Other QC tools such as wub ^4 utilise the fastq file for quality metrics, and other tools make extensive use of the fast5 files. Parsing the fast5 files provides additional analytical context but is much more demanding in terms of compute resource and time. This tutorial is lightweight and is intended to run within a few minutes on a desktop computer.

Customize the tutorial for your data

Using the sequencing_summary.txt file makes this analysis quick to perform and report. Fastq / Fast5 files are much larger and parsing them to calculate the required metrics associated with base-calling takes a considerable amount of time.

The Rmarkdown script contains both the script and code to perform the analysis, but also contains the parameters and pointer to the file that will be analysed. The script, fresh from github contains a bzip2 compressed file summarising base-calling for approximately 800,000 sequences. Please edit the line defining the inputFile variable. This should point to a sequencing_summary file - this file may be compressed (.gz or .bz2). The file may either be monolithic from a single base-calling process or a concatenation of multiple summary_statistic files from e.g. a cluster based base-calling.

Best practices recommend that you place your summary_statistic file into the RawData folder of the project created from the github clone.

Only the **`inputFile`** is critical; the *`flowcellId`* and *`libraryPrepKit`* parameters are only used for legends at present. 
knitr::include_graphics("Analysis/StaticImages/sumstatEditParams.png")

\pagebreak

Run the analysis

The analysis of the sequences specified within the Rmarkdown file will be performed as part of the knit process. This will load the summary_statistics file, will prepare a sequence analysis, render figures and prepare the report. To start the analysis, it is only necessary to click the knit button in the Rstudio software - please see figure \ref{fig:KnitIt}.

knitr::include_graphics("Analysis/StaticImages/KnitIt.png")

\pagebreak

Executive Summary

# Using fread for fast and friendly import of sequence_summary file
# no definition of column types to speed import and allow for merged files
# could be worthwhile to fread(select=) to select only a subset of columns - this could
# preclude e.g. barcode data or different versions?
sequencedata <- data.table::fread(inputFile, stringsAsFactors=FALSE)

# remove the redundant headers from merged files
if (length(which(sequencedata[,1]=="filename")) > 0) {
  sequencedata <- sequencedata[-which(sequencedata[,1]=="filename"),]
}

# coerce the columns used in analytics into more appropriate data-types 
sequencedata$channel<-as.numeric(sequencedata$channel)
sequencedata$start_time<-as.numeric(sequencedata$start_time)
sequencedata$duration<-as.numeric(sequencedata$duration)
sequencedata$num_events<-as.numeric(sequencedata$num_events)
sequencedata$sequence_length_template<-as.numeric(sequencedata$sequence_length_template)
sequencedata$mean_qscore_template<-as.numeric(sequencedata$mean_qscore_template)

# passes_filtering is a useful flag; but there are examples of sequencing_summary.txt where this 
# is not present - https://github.com/a-slide/pycoQC/blob/master/pycoQC/data/sequencing_summary_1D_DNA_Albacore_1.2.1.txt
if (! "passes_filtering" %in% colnames(sequencedata)) {
  # set all of the reads to pass? apply a cutoff?
  sequencedata$passes_filtering <- TRUE
} else {
  sequencedata$passes_filtering <- as.logical(sequencedata$passes_filtering)
}
# create a convenient separation of pass and fail ...
passedSeqs <- sequencedata[which(sequencedata$passes_filtering), ]
failedSeqs <- sequencedata[which(!sequencedata$passes_filtering), ]
# calculate some basec, but key, metrics
flowcellId <- "undefined flowId"
readCount <- formatC(nrow(sequencedata), big.mark=",")
totalBases = sum(sequencedata$sequence_length_template,na.rm=T)/10^9
passedBases = sum(passedSeqs$sequence_length_template,na.rm=T)/10^9
gigabases <- round(totalBases,2)

# prepare a data object to render a summary graphic
figures <- 3
df <- data.frame(
    x = cumsum(c(2, rep(6.5, figures-1))),
    y = rep(2, figures),
    h = rep(4, figures),
    w = rep(6, figures))

df$info <- c("flowcell", readCount, gigabases)
df$key <- c(flowcellId,"Reads produced", "gigabases called")
df$icon <- fontawesome(c('fa-qrcode', 'fa-filter', 'fa-file-text-o'))
df$colour <- rep("steelblue", figures)


# and display the plot
ExecutiveSummaryValueBoxes <- ggplot(df, aes(x, y, height = h, width = w, label = key, fill = colour)) +
    geom_tile(fill = brewer.pal(9,"Blues")[7]) +
    geom_text(color = brewer.pal(9,"Blues")[3], hjust="left", nudge_y=-1.5, nudge_x=-2.6, size=5) +
    geom_text(label=df$info, size=10, color = brewer.pal(9,"Blues")[2], fontface = "bold", nudge_x=-2.6, hjust="left") +
    geom_text(label=df$icon, family='fontawesome-webfont', colour=brewer.pal(9,"Blues")[5], size=23, hjust="right", nudge_x=2.85, nudge_y=0.8) +
    coord_fixed() + 
    theme_void() +
    guides(fill = F)

# some juggling here to prepare a plot that can be rendered across platforms = windows/linux/osx
ggsave("ExecutiveSummaryValueBoxes.png", plot=ExecutiveSummaryValueBoxes, device="png", units="cm", width=25, height=5, dpi=reportDPI)
knitr::include_graphics("ExecutiveSummaryValueBoxes.png")

Basecalling was performed using the Albacore software. Called reads were classified as either pass or fail depending on their overall quality score. For this analysis, a total of r formatC(nrow(sequencedata), big.mark=",") reads were basecalled and of these r formatC(nrow(passedSeqs), big.mark=",") (r round(nrow(passedSeqs) / nrow(sequencedata) * 100, 1)%) were passed as satsifying the quality metric. The passed reads contain a total of r round(passedBases, 2) Gbp of DNA sequence. This passed-fraction amounts to r round(passedBases / totalBases * 100, 1)% of the total DNA sequenced.

df <- data.frame(matrix(nrow=1, ncol = 3))

names(df) <- c("variable", "percentage","label")
df$variable <- c("pass")
df$percentage <- c(round(length(which(sequencedata$passes_filtering==TRUE)) / nrow(sequencedata), 3))

df <- df %>% mutate(group=ifelse(percentage <0.6, "red",
 ifelse(percentage>=0.6 & percentage<0.8, "orange","green")),
 label=paste0(df$percentage*100, "%"))

title="Percentage of reads\npassing QC filter"

ggplot(df, aes(fill = group, ymax = percentage, ymin = 0, xmax = 2, xmin = 1)) +
 geom_rect(aes(ymax=1, ymin=0, xmax=2, xmin=1), fill ="#ece8bd") +
 geom_rect() + 
 coord_polar(theta = "y",start=-pi/2) + xlim(c(0, 2)) + ylim(c(0,2)) +
  guides(fill=FALSE) +
  guides(colour=FALSE) +
  theme_void() +
  theme(strip.background = element_blank(), strip.text.x = element_blank()) +
  geom_text(aes(x = 0, y = 0, label = label), size=13) +
  geom_text(aes(x=1.5, y=1.5, label=title), size=11) +
  scale_fill_manual(values = c("red"="#C9146C", "orange"="#DA9112", "green"="#129188")) +
  scale_colour_manual(values = c("red"="#C9146C", "orange"="#DA9112", "green"="#129188"))

\newpage

Sequencing channel activity plot

The nanopores through which DNA is passed, and signal collected, are arrayed on a 2-dimensional matrix. A heatmap can be plotted such that channel productivity can be shown against spatial position on the matrix. Such a plot enables the identification of spatial artifacts that could result from membrane damage through e.g. the introduction of an air-bubble. This heatmap representation of spatial activity shows only gross spatial aberations; the activity plot shows the number of sequences produced per channel, not per pore.

# create an empty read count container ... MinION or PromethION??

# https://gist.github.com/roblanf/df47b9748c3aae00809cc675aca79989
# build the map for R9.5 flowcell, as a long-form dataframe that translates
# channels into rows and columns on the flowcell. Good for plotting in R.
p1 = data.frame(channel=33:64, row=rep(1:4, each=8), col=rep(1:8, 4))
p2 = data.frame(channel=481:512, row=rep(5:8, each=8), col=rep(1:8, 4))
p3 = data.frame(channel=417:448, row=rep(9:12, each=8), col=rep(1:8, 4))
p4 = data.frame(channel=353:384, row=rep(13:16, each=8), col=rep(1:8, 4))
p5 = data.frame(channel=289:320, row=rep(17:20, each=8), col=rep(1:8, 4))
p6 = data.frame(channel=225:256, row=rep(21:24, each=8), col=rep(1:8, 4))
p7 = data.frame(channel=161:192, row=rep(25:28, each=8), col=rep(1:8, 4))
p8 = data.frame(channel=97:128, row=rep(29:32, each=8), col=rep(1:8, 4))

q1 = data.frame(channel=1:32, row=rep(1:4, each=8), col=rep(16:9, 4))
q2 = data.frame(channel=449:480, row=rep(5:8, each=8), col=rep(16:9, 4))
q3 = data.frame(channel=385:416, row=rep(9:12, each=8), col=rep(16:9, 4))
q4 = data.frame(channel=321:352, row=rep(13:16, each=8), col=rep(16:9, 4))
q5 = data.frame(channel=257:288, row=rep(17:20, each=8), col=rep(16:9, 4))
q6 = data.frame(channel=193:224, row=rep(21:24, each=8), col=rep(16:9, 4))
q7 = data.frame(channel=129:160, row=rep(25:28, each=8), col=rep(16:9, 4))
q8 = data.frame(channel=65:96, row=rep(29:32, each=8), col=rep(16:9, 4))

# long form as a data frame, i.e. map$channel[[1]] returns 33
channelMap = rbind(p1, p2, p3, p4, p5, p6, p7, p8, q1, q2, q3, q4, q5, q6, q7, q8)

hm.palette <- colorRampPalette(brewer.pal(9, 'Blues'), space='Lab') #RdPu, Oranges, Greens, YlOrRd, Purples

channelCounts <- as.data.frame(matrix(rep(0, 512), ncol=1))
channelCountRaw <- as.data.frame(table(unlist(sequencedata[, "channel"])), row.names=1)
channelCounts[row.names(channelCountRaw),] <- channelCountRaw[,1]
#channelMap <- cbind(channelMap[channelMap$channel,], frequency=channelCounts[channelMap$channel,])
channelMap <- merge(channelMap, channelCounts, by.x="channel", by.y=0)
colnames(channelMap)[4]<-"count"
channelMapMatrix <- reshape2::acast(channelMap, col ~ row, value.var = "count")

theme_update(plot.title = element_text(hjust = 0.5))

ggplot(channelMap, aes(x = row, y = col, fill = count)) +
    geom_tile() +
    geom_text(data=channelMap,aes(x=row, y=col,label=count,color=count),show.legend = F, size=2.5) +
    scale_x_discrete(breaks=NULL) +
    scale_y_discrete(breaks=NULL) +
    coord_equal() +
    scale_fill_gradientn(colours = hm.palette(100)) +
    scale_color_gradient2(low = hm.palette(100), high = hm.palette(1)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(title="Channel activity plot showing number of reads per flowcell channel") +
    theme(panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position="bottom",
        legend.key.width=unit(5.6,"cm"))

\pagebreak

Quality and length

The distribution of base-called DNA sequence lengths and their accompanying qualities are key metrics for the review of a sequencing library. This section of the QC review tutorial assesses the length and quality distributions for reads from this flowcell. This section is reviewing the total collection of sequences that both pass and fail the mean quality filter.

#N50.length = seq_dt$sequence_length_template[min(which(seq_dt$cumulative_bases > (total.bases/2)))]
lenSorted <- rev(sort(passedSeqs$sequence_length_template))
passedMeanLength = round(mean(lenSorted), digits = 0)
N50 <- lenSorted[cumsum(lenSorted) >= sum(lenSorted)*0.5][1]
passedMeanQ = round(mean(passedSeqs$mean_qscore_template), digits = 1)
failedMeanQ = round(mean(failedSeqs$mean_qscore_template), digits = 1)

#N50 length is the length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs
figures <- 5

df <- data.frame(
    x = cumsum(c(2, rep(6.5, figures-1))),
    y = rep(2, figures),
    h = rep(4, figures),
    w = rep(6, figures))

    df$info <- c(passedMeanLength, N50, passedMeanQ, failedMeanQ, prettyNum(max(passedSeqs$sequence_length_template), big.mark=","))
    df$key <- c("Mean Read Length (nt)","N50","Mean Read Quality (QV)","Mean Failed QV","Longest Read")
    df$icon <- fontawesome(c("fa-bar-chart", "fa-play", "fa-area-chart", "fa-bug", "fa-sort"))

    df$colour <- rep("steelblue", figures)

ReadCharacteristicsValueBoxes <- ggplot(df, aes(x, y, height = h, width = w, label = key, fill = colour)) +
    geom_tile(fill = brewer.pal(9,"Blues")[7]) +
    geom_text(color = brewer.pal(9,"Blues")[3], hjust="left", nudge_y=-1.5, nudge_x=-2.6, size=3.5) +
    geom_text(label=df$info, size=5.5, color = brewer.pal(9,"Blues")[2], fontface = "bold", nudge_x=-2.6, hjust="left") +
    geom_text(label=df$icon, family='fontawesome-webfont', colour=brewer.pal(9,"Blues")[5], size=13.3, hjust="right", nudge_x=2.85, nudge_y=0.8) +
    coord_fixed() + 
    scale_fill_brewer(type = "qual",palette = "Dark2") +
    theme_void() +
    guides(fill = F)



ggsave("ReadCharacteristicsValueBoxes.png", plot=ReadCharacteristicsValueBoxes, device="png", units="cm", width=25, height=5, dpi=reportDPI)

knitr::include_graphics("ReadCharacteristicsValueBoxes.png")

The distribution of sequence lengths will be dependent on the protocols that have been used to extract the starting DNA. Sequences from amplicon DNA will have a tight distribution of read lengths, while sequences from genomic DNA will have a broader distribution of sheared DNA product. The distribution will be further influenced if a size-selection step has been used, and will also be dependent on the choice of sequencing library preparation kits. The read length distribution should be assessed to see if the distribution is concordant with that expected.

\hfill\break

# there is a challenge to scale the data intelligently - this would be different for e.g. 16S and WGS
# instead of hard coding this, perhaps we can place the logic into ggplot2?

# quantile(x=sequencedata$sequence_length_template, probs=c(0.975))

# https://stackoverflow.com/questions/6461209/how-to-round-up-to-the-nearest-10-or-100-or-x
roundUpNice <- function(x, nice=seq(from=1, to=10, by=0.25)) {
    if(length(x) != 1) stop("'x' must be of length 1")
    10^floor(log10(x)) * nice[[which(x <= 10^floor(log10(x)) * nice)[[1]]]]
}

# pick a friendly upper limit to render sequence lengths into a histogram
# here we're aiming for a robustly rounded up 97.5 quantile of the data (skip a few outliers ...)
upperLimit <- roundUpNice(as.numeric(quantile(x=sequencedata$sequence_length_template, probs=c(0.975))))

# an ideal histogram will have 40 or so bins
histogramBinCount <- 40
breakVal = roundUpNice(upperLimit / histogramBinCount)

breaks <- seq(0, to=upperLimit, by=breakVal)
assgnd.pass <- table(
  cut(subset(sequencedata, passes_filtering==TRUE & sequence_length_template > 1)$sequence_length_template, breaks, include.lowest = TRUE, right=FALSE))

assgnd.fail <- table(cut(subset(sequencedata, passes_filtering==FALSE & sequence_length_template > 1)$sequence_length_template, breaks, include.lowest = TRUE, right=FALSE))

lengthdist <- data.frame(length=head(breaks, -1), pass=as.vector(assgnd.pass), fail=as.vector(assgnd.fail))

lendistmlt <- reshape2::melt(lengthdist, id.vars=c("length"))


ggplot(lendistmlt, aes(x=length, fill=variable, y=value)) +
  geom_bar(stat="identity") +
  xlab("Read length\n") + ylab("Number of reads\n") +
  scale_fill_brewer(palette="Paired") +
  scale_x_continuous(limits=c(-breakVal,upperLimit), breaks=pretty(passedSeqs$sequence_length_template,n=40)) +
  labs(title="Histogram showing distribution of read lengths across quality passing sequences", fill="QV filter")+
    geom_vline(xintercept = N50, size = 1) +
    annotate("text", x=N50, y=max(assgnd.pass + assgnd.fail), label = " N50", hjust=0, colour="SteelBlue") +
    geom_vline(xintercept = passedMeanLength, size = 1) +
    annotate("text", x=passedMeanLength, y=max(assgnd.pass + assgnd.fail), label = " Mean", hjust=0, colour="SteelBlue")
The distribution of read lengths has been coloured by reads which either pass or fail the mean QV filter. The mean and N50 values for the QV passing sequences are super-imposed on the histogram.  

\hfill\break A histogram of mean QV scores reveals the relative abundance of sequences of different quality.
\hfill\break

ggplot(sequencedata, aes(x=mean_qscore_template, fill=passes_filtering)) + 
    geom_histogram(breaks=seq(from=0, to=15, by=0.1)) +
    scale_fill_brewer(palette="Paired") +
    labs(title="Plot showing distribution of quality scores across all reads") +
    xlab("Mean Q score of read") +
    ylab("Number of reads")
The distribution of sequence qualities is by QV filter pass status. This QV filter is applied in the base-calling workflow as a modifiable parameter.  

\hfill\break The density plot of mean sequence quality plotted against log10 sequence length appears as a favourite in the various publicly available QC tools. I am not sure on the overall point of this plot ... It reinforces the point that the short(est) sequences tend to have low quality values. In terms of per flowcell QC and as a plot used for the systematic review of many libraries I feel that this is superfluous ... would welcome some feedback. \hfill\break

binFilter <- 5
# background speckle from low density bins is meh - this can be prepared ...
# prepare the density plot, but do not render
lq_dens <- ggplot(sequencedata, aes(log10(sequence_length_template), mean_qscore_template)) + geom_bin2d(bins=100)
# extract the density map from the plot
lq_dens_counts <- ggplot_build(lq_dens)$data[[1]]
if (binFilter > 0) {
  # remove the bins from the density map that do not contain sequence count above threshold 
  lq_dens_counts <- lq_dens_counts[-which(lq_dens_counts$count <= binFilter),]
}
# directly plot this modified density map (stat=="identity")
ggplot(lq_dens_counts) + 
  geom_bin2d(aes(x,y,fill=count), stat="identity") +
  scale_fill_distiller(palette="Blues") + 
  geom_hline(yintercept = 7, size = 1) + 
  xlab("log10(read length)") + 
  ylab("read mean quality") + 
  scale_x_continuous(breaks = c(1,2,3,4,5), label = c("10", "100", "1000", "10,000", "100,000")) +
  annotation_logticks(base = 10, sides = "b", scaled = TRUE) +
  labs(title="Contour Plot showing distribution of quality scores against log10 read lengths (all reads)")
The density plot of log sequence length against mean read quality has been sharpened for presentation. The sequence bins containing less than or equal to 5 sequences have been removed; this removes background speckle from the graph - the regions of increased density remain.

\pagebreak

Time/Duty Performance

Another key metric in the quality review of a sequencing run is an analysis of the temporal performance of the run. During a run each sequencing channel may address a number of different pores (mux) and the individual pores may become temporarily or permanently blocked. It is therefore expected that during a run sequencing productivity will decrease; it is useful to review at to whether the observed productivity decline is normal or if it happens more rapidly than expected. A rapid pore decline could be indicative of contaminants with the sequencing library.

\hfill\break

Plotting the number of bases generated per unit time over the course of the run can reveal unexpected behaviours. In an ideal experiment there will not be many sudden decreases in performance.

# there is a fundamental change in the summary_statistics files generated by newer and older versions of
# the base callers - the scale of sequence_data$start_time has changed and this will produce some FUBAR
# graphs

sequencedata$start_time <- sequencedata$start_time - min(sequencedata$start_time)
sequencedata$start_time <- sequencedata$start_time / scaling

# assuming a 48 hour run, 5 minute intervals
sampleHours = 48
sampleIntervalMinutes = 60

breaks = seq(0, sampleHours*60*60, by=60*sampleIntervalMinutes)
binass <- findInterval(sequencedata$start_time, breaks)

mergeItPerHour <- function(interval, binnedAssignments, filter) {
  totalbases = 0
  if (length(which(binnedAssignments==interval))>0) {
    subset <- sequencedata[which(binnedAssignments==interval), ]
    if (length(which(subset$passes_filtering == filter)) > 0) {
      totalbases = sum(subset[which(subset$passes_filtering == filter), "sequence_length_template"])
    }
  }
  # need to scale what is being returned - totalbases value is total bases within an interval (sampleIntervalMinutes)
  return(totalbases / 1e9 / sampleIntervalMinutes * 60)
}

binnedTemporalDataPerHour <- data.frame(
  cbind(
    time=breaks,
    pass=unlist(lapply(seq(breaks), mergeItPerHour, binnedAssignments=binass,filter=TRUE)),
    fail=unlist(lapply(seq(breaks), mergeItPerHour, binnedAssignments=binass, filter=FALSE))
  )
)

binnedTemporalDataPerHour$time <- binnedTemporalDataPerHour$time / 60 / 60

ggplot(binnedTemporalDataPerHour, aes(time)) +
  geom_line(aes(y = fail, colour = "fail"), size=1) + 
  geom_line(aes(y = pass, colour = "pass"), size=1) +
  scale_color_brewer(name="Reads passing QV 7", palette="Paired") + 
  xlab("Time (hours)") + 
  ylab("Gigabases sequenced per hour (Gbp/hr)") + 
  labs(title="Plot showing sequence throughput against time")
The temporal data has been scaled to sequence collected per hour. For a finer resolution of performance the code can be modified for a more frequent sample. I observe a lot more noise and sampling effect when the graph is sampled at 5 or 10 minute intervals.
# binnedTemporalDataPerHour is scaled to Gbp per hour - rescale to raw for cumulative plotting
binnedTemporalDataPerHour$pass <- binnedTemporalDataPerHour$pass / 60 * sampleIntervalMinutes
binnedTemporalDataPerHour$fail <- binnedTemporalDataPerHour$fail / 60 * sampleIntervalMinutes

# https://stackoverflow.com/questions/31404679/can-ggplot2-find-the-intersections-or-is-there-any-other-neat-way
targetInterpolate <- approxfun(x=binnedTemporalDataPerHour$time, y=cumsum(binnedTemporalDataPerHour$pass))

base50 <- sum(passedSeqs$sequence_length_template)/1e9*0.5
base90 <- sum(passedSeqs$sequence_length_template)/1e9*0.9

T50 <- optimize(function(t0) abs(targetInterpolate(t0) - base50), interval = range(binnedTemporalDataPerHour$time))
T90 <- optimize(function(t0) abs(targetInterpolate(t0) - base90), interval = range(binnedTemporalDataPerHour$time))

In addition to plotting the temporal production of data, a cumulative plot shows how data is accumulated during the run. From this dataset, we have measured a total of r round(sum(passedSeqs$sequence_length_template)/1e9, 2) Gigabases of quality passing sequence. We can identify the timepoint T50 such that 50% of sequenced bases have been collected within this time - or r round(T50$minimum, 2) hours in this example. This is displayed on the graph.

ggplot(binnedTemporalDataPerHour, aes(time)) +
  geom_line(aes(y = cumsum(fail), colour = "fail"), size=1) + 
  geom_line(aes(y = cumsum(pass), colour = "pass"), size=1) +
  scale_color_brewer(name="Reads passing QV 7", palette="Paired") + 
  geom_segment(x=T50$minimum, y=0, xend=T50$minimum, yend=base50, colour="darkgray", size=1) +
  geom_segment(x=0, y=base50, xend=T50$minimum, yend=base50, colour="darkgray", size=1) +
  annotate("text", x=T50$minimum, y=base50, label=" T50", vjust=1, hjust=0, colour="SteelBlue") +
  geom_segment(x=T90$minimum, y=0, xend=T90$minimum, yend=base90, colour="darkgray", size=1) +
  geom_segment(x=0, y=base90, xend=T90$minimum, yend=base90, colour="darkgray", size=1) +
  annotate("text", x=T90$minimum, y=base90, label=" T90", vjust=1, hjust=0, colour="SteelBlue") +
  xlab("Time (hours)") + 
  ylab("Number of bases sequenced (Gigabases)") + 
  labs(title="Plot showing cumulative bases sequenced against time")
This cumulative plot has been augmented with information that shows the points within the run that account for 50% and 90% of the total bases sequenced.

\hfill\break

In addition to the cumulative plot of sequenced bases, an equivalent plot for the sequenced reads can be plotted. This is not too different in structure or morphology to the cumulative baseplot. It would be recommended to consider either a cumulative base plot or a cumulative read plot - the information overlap is sufficient that both are unlikely to be required.

mergeItReadsPerHour <- function(interval, binnedAssignments,filter) {
  totalreads = 0
  if (length(which(binnedAssignments==interval))>0) {
    subset <- sequencedata[which(binnedAssignments==interval), ]
    if (length(which(subset$passes_filtering == filter)) > 0) {
      totalreads = nrow(subset[which(subset$passes_filtering == filter),])
    }
  }
  # scale results to mean millions of reads per hour
  return(totalreads/ 1e6 / sampleIntervalMinutes * 60)
}

binnedTemporalDataReadsPerHour <- data.frame(
  cbind(time=breaks,
        pass=unlist(lapply(seq(breaks), mergeItReadsPerHour, binnedAssignments=binass, filter=TRUE)),
        fail=unlist(lapply(seq(breaks), mergeItReadsPerHour, binnedAssignments=binass, filter=FALSE))
    )
)

binnedTemporalDataReadsPerHour$time <- binnedTemporalDataReadsPerHour$time / 60 / 60
# binnedTemporalDataReadsPerHour is scaled to Gbp per hour - rescale to raw for cumulative plotting
binnedTemporalDataReadsPerHour$pass <- binnedTemporalDataReadsPerHour$pass / 60 * sampleIntervalMinutes
binnedTemporalDataReadsPerHour$fail <- binnedTemporalDataReadsPerHour$fail / 60 * sampleIntervalMinutes

ggplot(binnedTemporalDataReadsPerHour, aes(time)) +
  geom_line(aes(y = cumsum(fail), colour = "fail"), size=1) + 
  geom_line(aes(y = cumsum(pass), colour = "pass"), size=1) +
  scale_color_brewer(name="Reads passing QV 7", palette="Paired") + 
  xlab("Time (hours)") + 
  ylab("Number of reads sequenced (Millions)") + 
  labs(title="Plot showing cumulative reads sequenced against time")

\hfill\break

The speed/time plot is a valuable metric in that changed in sequencing speed can be measured. A slow-down in sequencing speed can indicate a shortage of the sequencing fuel required by the motor protein.

speedTime <- data.frame(segment=binass, rate=sequencedata$sequence_length_template / (sequencedata$duration/scaling))

ggplot(speedTime, aes(x=segment, y=rate, group=segment)) + geom_boxplot(fill="steelblue", outlier.shape=NA) +scale_x_continuous(name="Time (hours)") + ylab("Sequencing rate (bases per second)") + labs(title="boxplot showing distribution of sequencing speed against time")
This speed-time plot has been modified such that datapoints corresponding to outlying sequences have been masked. This does not change the box and whisker.
mergeActiveChannels <- function(interval, binnedAssignments) {
  totalChannels = 0
  if (length(which(binnedAssignments==interval))>0) {
    subset <- sequencedata[which(binnedAssignments==interval), ]
    totalChannels = length(unique(subset$channel))
  }
  return(totalChannels)
}


binnedTemporalChannels <- data.frame(time=breaks,
        channels=unlist(lapply(seq(breaks), mergeActiveChannels, binnedAssignments=binass)
  )
)

binnedTemporalChannels$time <- binnedTemporalChannels$time / 60 / 60

ggplot(binnedTemporalChannels, aes(time)) +
  geom_step(aes(y = channels), size=1, colour = "Steelblue") +
  xlab("Time (hours)") + 
  ylab("Number of channels producing data") + 
  labs(title="Plot showing number of functional channels against time")

\pagebreak

Demultiplexing

barcodes = 0
barcodeUnass = 0
barcodeRange <- c(0,0)
if ("barcode_arrangement" %in% names(passedSeqs)) {
  barcode=plyr::count(passedSeqs$barcode_arrangement)
  barcode=subset(barcode, freq > 150)
  if ("unclassified" %in% barcode$x) {
    barcodes <- nrow(barcode[-which(barcode$x=="unclassified"),])
    barcodeUnass <- sum(barcode[-which(barcode$x=="unclassified"),"freq"]) / sum(barcode$freq) * 100
    barcodeRange <- range(subset(barcode, x!="unclassified")$freq)
  } else {
    barcodes <- nrow(barcode)
    barcodeUnass = 100
    barcodeRange <- range(barcode$freq)
  }
}
figures <- 3

df <- data.frame(
    x = cumsum(c(2, rep(6.5, figures-1))),
    y = rep(2, figures),
    h = rep(4, figures),
    w = rep(6, figures))

    df$info <- c(round(barcodeUnass, digits = 1), barcodes, paste(barcodeRange,collapse="\n"))
    df$key <- c("Reads with barcode (%)","Barcoded libraries", "barcode variance")
    df$icon <- fontawesome(c('fa-pie-chart', 'fa-barcode', 'fa-sliders'))
    df$colour <- rep("steelblue", figures)
if (barcodes > 0) {

MultiplexCharacteristicsValueBoxes <- ggplot(df, aes(x, y, height = h, width = w, label = key, fill = colour)) +
    geom_tile(fill = brewer.pal(9,"Blues")[7]) +
    geom_text(color = brewer.pal(9,"Blues")[3], hjust="left", nudge_y=-1.5, nudge_x=-2.6, size=5) +
    geom_text(label=df$info, size=10, color = brewer.pal(9,"Blues")[2], fontface = "bold", nudge_x=-2.6, hjust="left") +
    geom_text(label=df$icon, family='fontawesome-webfont', colour=brewer.pal(9,"Blues")[5], size=23, hjust="right", nudge_x=2.85, nudge_y=0.8) +
    coord_fixed() + 
    scale_fill_brewer(type = "qual",palette = "Dark2") +
    theme_void() +
    guides(fill = F)



ggsave("MultiplexCharacteristicsValueBoxes.png", plot=MultiplexCharacteristicsValueBoxes, device="png", units="cm", width=25, height=5, dpi=reportDPI)

knitr::include_graphics("MultiplexCharacteristicsValueBoxes.png")

}
if (barcodes > 0) {

# https://www.ncbi.nlm.nih.gov/assembly/help/
ncalc <- function(len.vector, n) {
  # N50 - length such that scaffolds of this length or longer include half the bases of the assembly
  len.sorted <- rev(sort(len.vector))
  len.sorted[cumsum(len.sorted) >= sum(len.sorted)*n][1]
}

lcalc <- function(len.vector, n) {
  # L50 - number of scaffolds that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly
  len.sorted <- rev(sort(len.vector))
  which(cumsum(len.sorted) >= sum(len.sorted)*n)[1]
}


N50 <- ncalc(passedSeqs$sequence_length_template, 0.5)
N90 <- ncalc(passedSeqs$sequence_length_template, 0.9)
L50 <- lcalc(passedSeqs$sequence_length_template, 0.5)
L90 <- lcalc(passedSeqs$sequence_length_template, 0.9)

seqSummary <- function(barcodeId, myBarcode, myVector, myMethod, xlist=NA) {
  subVector <- myVector[which(myBarcode == barcodeId)]
  params <- list(subVector)
  if (!is.na(xlist)) {
    params <- append(params, xlist)
  }
  do.call(myMethod, params)
}

barcode <- cbind(barcode, mbps=round(unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="sum")) / 10e6, digits=0))

barcode <- cbind(barcode, min=unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="min")))

barcode <- cbind(barcode, max=unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="max")))

barcode <- cbind(barcode, mean=round(unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="mean")), digits=0))

barcode <- cbind(barcode, N50=unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="ncalc", xlist=list(n=0.5))))

barcode <- cbind(barcode, L50=unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="lcalc", xlist=list(n=0.5))))

kable(barcode, format="markdown", caption="Table summarising barcode content")
}
if (barcodes > 0) {
  if ("barcode_arrangement" %in% names(passedSeqs)) {
    # it's a run that used the --barcoding flag

    #library(extrafont)
    #loadfonts(device = "win")
    ggplot(barcode, aes(x, freq, fill=x)) +
        geom_bar(stat="identity", width=0.5, fill="#9ecae1") +
        xlab("\nDemultiplexed barcodes") + 
        ylab("\nFrequency") +
        scale_y_continuous(expand = c(0,0)) +
      labs(title="Histogram showing abundance of different barcodes") +
        theme(axis.text.x = element_text(angle=45, hjust=1))
  }
}

\pagebreak

df <- data.frame(
    x = seq(2, 21.5, 6.5),
    y = rep(2, 4),
    h = rep(4, 4),
    w = rep(6, 4),
    info = c("meaningless plots",
             "hours wasted",
             "zombies prefer brains", "useless metric"),
    data = c("78%","+10k","8/10", "ZZZ"),
    icon = emojifont::fontawesome(c("fa-qrcode", "fa-hashtag", "fa-file-text-o", "fa-hashtag")),
    color = factor(1:4)
)

ggplot(df, aes(x, y, height = h, width = w, label = info, fill = color)) +
    geom_tile() +
    geom_text(color = "white", hjust="left", nudge_y=-1.5, nudge_x=-2.6, size=4.5) +
    geom_text(label=df$data, size=7, color = "gray", fontface = "bold", nudge_x=-2.6, hjust="left") +
    geom_text(label=df$icon, family="fontawesome-webfont", colour="darkgray", size=16, hjust="right", nudge_x=2.85, nudge_y=0.8) +
    coord_fixed() + 
    scale_fill_brewer(type="qual", palette="Dark2") +
    theme_void() +
    guides(fill = F)

Reproducible Research - Produce your own report

This report has been created using Rmarkdown, publicly available R packages, and the \LaTeX document typesetting software for reproducibility. For clarity the R packages used, and their versions, is listed below.

\fontsize{8}{12}

options(width = 100)
utils:::print.sessionInfo(sessionInfo()[-7], locale=FALSE)

\fontsize{10}{14}

r newthought("Final thoughts.") Behind this Rmarkdown file (and its glossy pdf) is a modest amount of R code - please explore the Rmarkdown template; modify it, and run with your own samples.

To extract the whole set of R code from the Rmarkdown, use the purl command - this will extract the R code into its own file.

knitr::purl("Nanopore_cDNA_Tutorial.Rmd", quiet=TRUE)

\pagebreak

Glossary of Terms

\pagebreak

Appendix - Installation on macOS

The conda package management system is available for Linux and macOS. There are some R-package dependencies for the tutorial that are challenged by the system files distributed by conda. For macOS, it would be recommended that R, Rstudio, and \LaTeX installation is managed at the system level; other bioinformatics software will be installed through conda.

  1. Download and install the R statistical software from https://cran.r-project.org/bin/macosx/
  2. Install the gfortran and clang compilers provided at https://cran.r-project.org/bin/macosx/tools/
  3. Download and install Rstudio software from https://www.rstudio.com
  4. Download and install MacTex from, https://tug.org/mactex/mactex-download.html
  5. Install and update required macOS R packages
R --slave -e 'install.packages(c("BiocManager", "devtools"), ask=F, update=F)'
R --slave -e 'BiocManager::install(c("sysfonts", "showtext"), ask=F, update=F, type="source")'
R --slave -e 'BiocManager::install(c("kableExtra", "roxygen2", "emojifont", "extrafont", "R.utils"), ask=F, update=F)
R --slave -e 'BiocManager::install(c("DESeq2", "caTools", "pcaMethods", "dplyr", "ggplot2", "plyr", "RColorBrewer", "rmarkdown", "tufte", "xfun", "xlsx", "yaml", "Rsubread", "ShortRead"), ask=F, update=F)'
  1. Most software dependecies are managed though conda, install as described at
    https://conda.io/docs/user-guide/install/macos.html.
  2. create a working directory for the project and cd into it
  3. Download github gist for macOS macenv.yaml
    wget https://gist.github.com/sagrudd/63d7c76035555def83a336336a73365d/raw/macenv.yaml
  4. Install conda software dependencies with
    conda env create --name sumStatTutorial --file macenv.yaml
  5. Initialise conda environment with source activate sumStatTutorial
  6. Install the Nanopore tutorials Rpackage R --slave -e 'Sys.setenv(TAR = "/bin/tar"); devtools::install_github("sagrudd/tutorials")'
  7. Initialise the Nanopore QC tutorial R --slave -e 'TuPoreIal::InitTutorial("SeqStatsQC")'
  8. optional edit the provided config.yaml file to match your own study design
  9. Render the report using results from the analysis above R --slave -e 'rmarkdown::render("Nanopore_SumStatQC_Tutorial.Rmd", "tufte_handout")'

\pagebreak

References and Citations

# this command will nuke the existing bib file ...
#knitr::write_bib(c('base', 'rmarkdown', 'DESeq2', 'Rsubread', 'apeglm', 'pcaMethods', 'ShortRead'), file = 'skeleton.bib')
bibliography <- "QGFydGljbGV7bWluaW1hcDIyMDE4LAphdXRob3IgPSB7TGksIEhlbmd9LAp0aXRsZSA9IHtNaW5pbWFwMjogcGFpcndpc2UgYWxpZ25tZW50IGZvciBudWNsZW90aWRlIHNlcXVlbmNlc30sCmpvdXJuYWwgPSB7QmlvaW5mb3JtYXRpY3N9LAp2b2x1bWUgPSB7MzR9LApudW1iZXIgPSB7MTh9LApwYWdlcyA9IHszMDk0LTMxMDB9LAp5ZWFyID0gezIwMTh9LApkb2kgPSB7MTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHkxOTF9LApVUkwgPSB7aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHkxOTF9LAplcHJpbnQgPSB7L291cC9iYWNrZmlsZS9jb250ZW50X3B1YmxpYy9qb3VybmFsL2Jpb2luZm9ybWF0aWNzLzM0LzE4LzEwLjEwOTNfYmlvaW5mb3JtYXRpY3NfYnR5MTkxLzEvYnR5MTkxLnBkZn0KfQoKQGFydGljbGV7c2FtdG9vbHMyMDA5LAphdXRob3IgPSB7TGksIEhlbmcgYW5kIEhhbmRzYWtlciwgQm9iIGFuZCBXeXNva2VyLCBBbGVjIGFuZCBGZW5uZWxsLCBUaW0gYW5kIFJ1YW4sIEp1ZSBhbmQgSG9tZXIsIE5pbHMgYW5kIE1hcnRoLCBHYWJvciBhbmQgQWJlY2FzaXMsIEdvbmNhbG8gYW5kIER1cmJpbiwgUmljaGFyZCBhbmQgMTAwMCBHZW5vbWUgUHJvamVjdCBEYXRhIFByb2Nlc3NpbmcgU3ViZ3JvdXB9LAp0aXRsZSA9IHtUaGUgU2VxdWVuY2UgQWxpZ25tZW50L01hcCBmb3JtYXQgYW5kIFNBTXRvb2xzfSwKam91cm5hbCA9IHtCaW9pbmZvcm1hdGljc30sCnZvbHVtZSA9IHsyNX0sCm51bWJlciA9IHsxNn0sCnBhZ2VzID0gezIwNzgtMjA3OX0sCnllYXIgPSB7MjAwOX0sCmRvaSA9IHsxMC4xMDkzL2Jpb2luZm9ybWF0aWNzL2J0cDM1Mn0sClVSTCA9IHtodHRwOi8vZHguZG9pLm9yZy8xMC4xMDkzL2Jpb2luZm9ybWF0aWNzL2J0cDM1Mn0sCmVwcmludCA9IHsvb3VwL2JhY2tmaWxlL2NvbnRlbnRfcHVibGljL2pvdXJuYWwvYmlvaW5mb3JtYXRpY3MvMjUvMTYvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHAzNTIvMi9idHAzNTIucGRmfQp9CgpAYXJ0aWNsZXtzbmFrZW1ha2UyMDEyLAphdXRob3IgPSB7S8O2c3RlciwgSm9oYW5uZXMgYW5kIFJhaG1hbm4sIFN2ZW59LAp0aXRsZSA9IHtTbmFrZW1ha2XigJRhIHNjYWxhYmxlIGJpb2luZm9ybWF0aWNzIHdvcmtmbG93IGVuZ2luZX0sCmpvdXJuYWwgPSB7QmlvaW5mb3JtYXRpY3N9LAp2b2x1bWUgPSB7Mjh9LApudW1iZXIgPSB7MTl9LApwYWdlcyA9IHsyNTIwLTI1MjJ9LAp5ZWFyID0gezIwMTJ9LApkb2kgPSB7MTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHM0ODB9LApVUkwgPSB7aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHM0ODB9LAplcHJpbnQgPSB7L291cC9iYWNrZmlsZS9jb250ZW50X3B1YmxpYy9qb3VybmFsL2Jpb2luZm9ybWF0aWNzLzI4LzE5LzEwLjEwOTMvYmlvaW5mb3JtYXRpY3MvYnRzNDgwLzIvYnRzNDgwLnBkZn0KfQoKQGFydGljbGV7QkgxOTk1LAogSVNTTiA9IHswMDM1OTI0Nn0sCiBVUkwgPSB7aHR0cDovL3d3dy5qc3Rvci5vcmcvc3RhYmxlLzIzNDYxMDF9LAogYWJzdHJhY3QgPSB7VGhlIGNvbW1vbiBhcHByb2FjaCB0byB0aGUgbXVsdGlwbGljaXR5IHByb2JsZW0gY2FsbHMgZm9yIGNvbnRyb2xsaW5nIHRoZSBmYW1pbHl3aXNlIGVycm9yIHJhdGUgKEZXRVIpLiBUaGlzIGFwcHJvYWNoLCB0aG91Z2gsIGhhcyBmYXVsdHMsIGFuZCB3ZSBwb2ludCBvdXQgYSBmZXcuIEEgZGlmZmVyZW50IGFwcHJvYWNoIHRvIHByb2JsZW1zIG9mIG11bHRpcGxlIHNpZ25pZmljYW5jZSB0ZXN0aW5nIGlzIHByZXNlbnRlZC4gSXQgY2FsbHMgZm9yIGNvbnRyb2xsaW5nIHRoZSBleHBlY3RlZCBwcm9wb3J0aW9uIG9mIGZhbHNlbHkgcmVqZWN0ZWQgaHlwb3RoZXNlcy10aGUgZmFsc2UgZGlzY292ZXJ5IHJhdGUuIFRoaXMgZXJyb3IgcmF0ZSBpcyBlcXVpdmFsZW50IHRvIHRoZSBGV0VSIHdoZW4gYWxsIGh5cG90aGVzZXMgYXJlIHRydWUgYnV0IGlzIHNtYWxsZXIgb3RoZXJ3aXNlLiBUaGVyZWZvcmUsIGluIHByb2JsZW1zIHdoZXJlIHRoZSBjb250cm9sIG9mIHRoZSBmYWxzZSBkaXNjb3ZlcnkgcmF0ZSByYXRoZXIgdGhhbiB0aGF0IG9mIHRoZSBGV0VSIGlzIGRlc2lyZWQsIHRoZXJlIGlzIHBvdGVudGlhbCBmb3IgYSBnYWluIGluIHBvd2VyLiBBIHNpbXBsZSBzZXF1ZW50aWFsIEJvbmZlcnJvbmktdHlwZSBwcm9jZWR1cmUgaXMgcHJvdmVkIHRvIGNvbnRyb2wgdGhlIGZhbHNlIGRpc2NvdmVyeSByYXRlIGZvciBpbmRlcGVuZGVudCB0ZXN0IHN0YXRpc3RpY3MsIGFuZCBhIHNpbXVsYXRpb24gc3R1ZHkgc2hvd3MgdGhhdCB0aGUgZ2FpbiBpbiBwb3dlciBpcyBzdWJzdGFudGlhbC4gVGhlIHVzZSBvZiB0aGUgbmV3IHByb2NlZHVyZSBhbmQgdGhlIGFwcHJvcHJpYXRlbmVzcyBvZiB0aGUgY3JpdGVyaW9uIGFyZSBpbGx1c3RyYXRlZCB3aXRoIGV4YW1wbGVzLn0sCiBhdXRob3IgPSB7WW9hdiBCZW5qYW1pbmkgYW5kIFlvc2VmIEhvY2hiZXJnfSwKIGpvdXJuYWwgPSB7Sm91cm5hbCBvZiB0aGUgUm95YWwgU3RhdGlzdGljYWwgU29jaWV0eS4gU2VyaWVzIEIgKE1ldGhvZG9sb2dpY2FsKX0sCiBudW1iZXIgPSB7MX0sCiBwYWdlcyA9IHsyODktLTMwMH0sCiBwdWJsaXNoZXIgPSB7W1JveWFsIFN0YXRpc3RpY2FsIFNvY2lldHksIFdpbGV5XX0sCiB0aXRsZSA9IHtDb250cm9sbGluZyB0aGUgRmFsc2UgRGlzY292ZXJ5IFJhdGU6IEEgUHJhY3RpY2FsIGFuZCBQb3dlcmZ1bCBBcHByb2FjaCB0byBNdWx0aXBsZSBUZXN0aW5nfSwKIHZvbHVtZSA9IHs1N30sCiB5ZWFyID0gezE5OTV9Cn0KCgoKQE1hbnVhbHtSLWFwZWdsbSwKICB0aXRsZSA9IHthcGVnbG06IEFwcHJveGltYXRlIHBvc3RlcmlvciBlc3RpbWF0aW9uIGZvciBHTE0gY29lZmZpY2llbnRzfSwKICBhdXRob3IgPSB7QW5xaSBaaHUgYW5kIEpvc2VwaCBHLiBJYnJhaGltIGFuZCBNaWNoYWVsIEkuIExvdmV9LAogIHllYXIgPSB7MjAxOH0sCiAgbm90ZSA9IHtSIHBhY2thZ2UgdmVyc2lvbiAxLjQuMH0sCn0KQE1hbnVhbHtSLWJhc2UsCiAgdGl0bGUgPSB7UjogQSBMYW5ndWFnZSBhbmQgRW52aXJvbm1lbnQgZm9yIFN0YXRpc3RpY2FsIENvbXB1dGluZ30sCiAgYXV0aG9yID0ge3tSIENvcmUgVGVhbX19LAogIG9yZ2FuaXphdGlvbiA9IHtSIEZvdW5kYXRpb24gZm9yIFN0YXRpc3RpY2FsIENvbXB1dGluZ30sCiAgYWRkcmVzcyA9IHtWaWVubmEsIEF1c3RyaWF9LAogIHllYXIgPSB7MjAxOH0sCiAgdXJsID0ge2h0dHBzOi8vd3d3LlItcHJvamVjdC5vcmcvfSwKfQpATWFudWFse1ItREVTZXEyLAogIHRpdGxlID0ge0RFU2VxMjogRGlmZmVyZW50aWFsIGdlbmUgZXhwcmVzc2lvbiBhbmFseXNpcyBiYXNlZCBvbiB0aGUgbmVnYXRpdmUKYmlub21pYWwgZGlzdHJpYnV0aW9ufSwKICBhdXRob3IgPSB7TWljaGFlbCBMb3ZlIGFuZCBTaW1vbiBBbmRlcnMgYW5kIFdvbGZnYW5nIEh1YmVyfSwKICB5ZWFyID0gezIwMTh9LAogIG5vdGUgPSB7UiBwYWNrYWdlIHZlcnNpb24gMS4yMi4wfSwKICB1cmwgPSB7aHR0cHM6Ly9naXRodWIuY29tL21pa2Vsb3ZlL0RFU2VxMn0sCn0KQE1hbnVhbHtSLXBjYU1ldGhvZHMsCiAgdGl0bGUgPSB7cGNhTWV0aG9kczogQSBjb2xsZWN0aW9uIG9mIFBDQSBtZXRob2RzfSwKICBhdXRob3IgPSB7V29sZnJhbSBTdGFja2xpZXMgYW5kIEhlbm5pbmcgUmVkZXN0aWcgYW5kIEtldmluIFdyaWdodH0sCiAgeWVhciA9IHsyMDE4fSwKICBub3RlID0ge1IgcGFja2FnZSB2ZXJzaW9uIDEuNzQuMH0sCiAgdXJsID0ge2h0dHBzOi8vZ2l0aHViLmNvbS9ocmVkZXN0aWcvcGNhbWV0aG9kc30sCn0KQE1hbnVhbHtSLXJtYXJrZG93biwKICB0aXRsZSA9IHtybWFya2Rvd246IER5bmFtaWMgRG9jdW1lbnRzIGZvciBSfSwKICBhdXRob3IgPSB7SkogQWxsYWlyZSBhbmQgWWlodWkgWGllIGFuZCBKb25hdGhhbiBNY1BoZXJzb24gYW5kIEphdmllciBMdXJhc2NoaSBhbmQgS2V2aW4gVXNoZXkgYW5kIEFyb24gQXRraW5zIGFuZCBIYWRsZXkgV2lja2hhbSBhbmQgSm9lIENoZW5nIGFuZCBXaW5zdG9uIENoYW5nfSwKICB5ZWFyID0gezIwMTh9LAogIG5vdGUgPSB7UiBwYWNrYWdlIHZlcnNpb24gMS4xMH0sCiAgdXJsID0ge2h0dHBzOi8vQ1JBTi5SLXByb2plY3Qub3JnL3BhY2thZ2U9cm1hcmtkb3dufSwKfQpATWFudWFse1ItUnN1YnJlYWQsCiAgdGl0bGUgPSB7UnN1YnJlYWQ6IFN1YnJlYWQgc2VxdWVuY2UgYWxpZ25tZW50IGFuZCBjb3VudGluZyBmb3IgUn0sCiAgYXV0aG9yID0ge1dlaSBTaGkgYW5kIFlhbmcgTGlhbyB3aXRoIGNvbnRyaWJ1dGlvbnMgZnJvbSBHb3Jkb24gSyBTbXl0aCBhbmQgSmVubnkgRGFpIGFuZCBUaW1vdGh5IHtUcmljaGUsIEpyLn19LAogIHllYXIgPSB7MjAxOH0sCiAgbm90ZSA9IHtSIHBhY2thZ2UgdmVyc2lvbiAxLjMyLjB9LAogIHVybCA9IHtodHRwOi8vYmlvY29uZHVjdG9yLm9yZy9wYWNrYWdlcy9yZWxlYXNlL2Jpb2MvaHRtbC9Sc3VicmVhZC5odG1sfSwKfQpATWFudWFse1ItU2hvcnRSZWFkLAogIHRpdGxlID0ge1Nob3J0UmVhZDogRkFTVFEgaW5wdXQgYW5kIG1hbmlwdWxhdGlvbn0sCiAgYXV0aG9yID0ge01hcnRpbiBNb3JnYW4gYW5kIE1pY2hhZWwgTGF3cmVuY2UgYW5kIFNpbW9uIEFuZGVyc30sCiAgeWVhciA9IHsyMDE4fSwKICBub3RlID0ge1IgcGFja2FnZSB2ZXJzaW9uIDEuNDAuMH0sCn0K"

write(base64decode(bibliography, what=character, size=1)[1], file="skeleton.bib", append=FALSE)


sagrudd/cDNA_tutorial documentation built on May 30, 2019, 2:13 p.m.