knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte')) options(htmltools.dir.version = FALSE) knitr::opts_chunk$set(fig.width=9, fig.height=6, warning=FALSE, message=FALSE) library(TuPoreIal) library(data.table) library(dplyr) library(plyr) library(emojifont) library(extrafont) library(fontawesome) # devtools::install_github(rstudio/fontawesome) library(ggplot2) library(knitr) library(RColorBrewer) library(tufte) library(caTools) library(yaml) config <- yaml.load_file("config.yaml") inputFile <- config$inputFile flowcellId <- config$flowcellId libraryPrepKit <- config$libraryPrepKit scaling <- 1 reportDPI <- 90
r newthought("Summary Statistics and QC tutorial.")
is intended as a guide to help assess and question the quality characteristics of a Nanopore sequence collection. There are a number of tools for the QC analysis of short-sequence reads, this tutorial aims to enable an objective assessment as to the relative performance of a Nanopore flowcell run and to assess the sequence characteristics to benchmark quality.
r newthought("Sufficient information")
is provided in the tutorial such that the workflow can be tested, validated, and replicated. The tutorial is provided with an example dataset from a barcoded sequence library. The tutorial is intended to address important questions;
r newthought("Methods utilised")
within this tutorial include
R
for statistical analysis and reportingsequencing_summary.txt
as data source for parsingr newthought("Computational requirements")
for this tutorial include
\pagebreak
conda
, install as described at cd
into itenvironment.yaml
wget https://gist.githubusercontent.com/sagrudd/609b9124b96f6d388951dad9e8bc4964/raw/environment.yaml
conda env create --name sumStatTutorial --file environment.yaml
source activate sumStatTutorial
R --slave -e 'Sys.setenv(TAR = "/bin/tar"); devtools::install_github("sagrudd/tutorials")'
R --slave -e 'TuPoreIal::InitTutorial("SeqStatsQC")'
config.yaml
file to match your own study designR --slave -e 'rmarkdown::render("Nanopore_SumStatQC_Tutorial.Rmd", "tufte_handout")'
\pagebreak
This tutorial aims to summarise the data characteristics from an Oxford Nanopore Technologies sequencing run. Observations from basecalled reads and their quality characteristics, temporal performance, and barcoded content are presented. The information presented is derived from the sequence_summary.txt
file produced during basecalling by MinKNOW (soon), Guppy and Albacore.
This report has been produced from an Rmarkdown
template and is intended as a starting template for QC review of data produced during a run and as a tutorial for the exploration and assessement of Oxford Nanopore DNA sequence data. The goals from this tutorial include
Several of the plots included in this report have been replicated from publicly available projects such as POREquality ^1, minion_qc ^2, and pycoQC ^3.
r newthought('The sequence\\_summary.txt')
file is produced by either the Albacore or Guppy base-calling software. The file contains rich metadata for each sequence read produced during a run. These data include timestamp, quality, and channel information, in addition to the characteristics of the called DNA sequence. This tutorial uses this starting file for performance reasons.
Other QC tools such as wub ^4 utilise the fastq file for quality metrics, and other tools make extensive use of the fast5
files. Parsing the fast5
files provides additional analytical context but is much more demanding in terms of compute resource and time. This tutorial is lightweight and is intended to run within a few minutes on a desktop computer.
Using the sequencing_summary.txt file makes this analysis quick to perform and report. Fastq / Fast5 files are much larger and parsing them to calculate the required metrics associated with base-calling takes a considerable amount of time.
The Rmarkdown
script contains both the script and code to perform the analysis, but also contains the parameters and pointer to the file that will be analysed. The script, fresh from github
contains a bzip2
compressed file summarising base-calling for approximately 800,000 sequences. Please edit the line defining the inputFile
variable. This should point to a sequencing_summary file - this file may be compressed (.gz
or .bz2
). The file may either be monolithic from a single base-calling process or a concatenation of multiple summary_statistic
files from e.g. a cluster based base-calling.
Best practices recommend that you place your summary_statistic
file into the RawData
folder of the project created from the github
clone
.
Only the **`inputFile`** is critical; the *`flowcellId`* and *`libraryPrepKit`* parameters are only used for legends at present.
knitr::include_graphics("Analysis/StaticImages/sumstatEditParams.png")
\pagebreak
The analysis of the sequences specified within the Rmarkdown
file will be performed as part of the knit
process. This will load the summary_statistics
file, will prepare a sequence analysis, render figures and prepare the report. To start the analysis, it is only necessary to click the knit
button in the Rstudio
software - please see figure \ref{fig:KnitIt}.
knitr::include_graphics("Analysis/StaticImages/KnitIt.png")
\pagebreak
# Using fread for fast and friendly import of sequence_summary file # no definition of column types to speed import and allow for merged files # could be worthwhile to fread(select=) to select only a subset of columns - this could # preclude e.g. barcode data or different versions? sequencedata <- data.table::fread(inputFile, stringsAsFactors=FALSE) # remove the redundant headers from merged files if (length(which(sequencedata[,1]=="filename")) > 0) { sequencedata <- sequencedata[-which(sequencedata[,1]=="filename"),] } # coerce the columns used in analytics into more appropriate data-types sequencedata$channel<-as.numeric(sequencedata$channel) sequencedata$start_time<-as.numeric(sequencedata$start_time) sequencedata$duration<-as.numeric(sequencedata$duration) sequencedata$num_events<-as.numeric(sequencedata$num_events) sequencedata$sequence_length_template<-as.numeric(sequencedata$sequence_length_template) sequencedata$mean_qscore_template<-as.numeric(sequencedata$mean_qscore_template) # passes_filtering is a useful flag; but there are examples of sequencing_summary.txt where this # is not present - https://github.com/a-slide/pycoQC/blob/master/pycoQC/data/sequencing_summary_1D_DNA_Albacore_1.2.1.txt if (! "passes_filtering" %in% colnames(sequencedata)) { # set all of the reads to pass? apply a cutoff? sequencedata$passes_filtering <- TRUE } else { sequencedata$passes_filtering <- as.logical(sequencedata$passes_filtering) } # create a convenient separation of pass and fail ... passedSeqs <- sequencedata[which(sequencedata$passes_filtering), ] failedSeqs <- sequencedata[which(!sequencedata$passes_filtering), ]
# calculate some basec, but key, metrics flowcellId <- "undefined flowId" readCount <- formatC(nrow(sequencedata), big.mark=",") totalBases = sum(sequencedata$sequence_length_template,na.rm=T)/10^9 passedBases = sum(passedSeqs$sequence_length_template,na.rm=T)/10^9 gigabases <- round(totalBases,2) # prepare a data object to render a summary graphic figures <- 3 df <- data.frame( x = cumsum(c(2, rep(6.5, figures-1))), y = rep(2, figures), h = rep(4, figures), w = rep(6, figures)) df$info <- c("flowcell", readCount, gigabases) df$key <- c(flowcellId,"Reads produced", "gigabases called") df$icon <- fontawesome(c('fa-qrcode', 'fa-filter', 'fa-file-text-o')) df$colour <- rep("steelblue", figures) # and display the plot ExecutiveSummaryValueBoxes <- ggplot(df, aes(x, y, height = h, width = w, label = key, fill = colour)) + geom_tile(fill = brewer.pal(9,"Blues")[7]) + geom_text(color = brewer.pal(9,"Blues")[3], hjust="left", nudge_y=-1.5, nudge_x=-2.6, size=5) + geom_text(label=df$info, size=10, color = brewer.pal(9,"Blues")[2], fontface = "bold", nudge_x=-2.6, hjust="left") + geom_text(label=df$icon, family='fontawesome-webfont', colour=brewer.pal(9,"Blues")[5], size=23, hjust="right", nudge_x=2.85, nudge_y=0.8) + coord_fixed() + theme_void() + guides(fill = F) # some juggling here to prepare a plot that can be rendered across platforms = windows/linux/osx ggsave("ExecutiveSummaryValueBoxes.png", plot=ExecutiveSummaryValueBoxes, device="png", units="cm", width=25, height=5, dpi=reportDPI) knitr::include_graphics("ExecutiveSummaryValueBoxes.png")
Basecalling was performed using the Albacore software. Called reads were classified as either pass or fail depending on their overall quality score. For this analysis, a total of r formatC(nrow(sequencedata), big.mark=",")
reads were basecalled and of these r formatC(nrow(passedSeqs), big.mark=",")
(r round(nrow(passedSeqs) / nrow(sequencedata) * 100, 1)
%) were passed as satsifying the quality metric. The passed reads contain a total of r round(passedBases, 2)
Gbp of DNA sequence. This passed-fraction amounts to r round(passedBases / totalBases * 100, 1)
% of the total DNA sequenced.
df <- data.frame(matrix(nrow=1, ncol = 3)) names(df) <- c("variable", "percentage","label") df$variable <- c("pass") df$percentage <- c(round(length(which(sequencedata$passes_filtering==TRUE)) / nrow(sequencedata), 3)) df <- df %>% mutate(group=ifelse(percentage <0.6, "red", ifelse(percentage>=0.6 & percentage<0.8, "orange","green")), label=paste0(df$percentage*100, "%")) title="Percentage of reads\npassing QC filter" ggplot(df, aes(fill = group, ymax = percentage, ymin = 0, xmax = 2, xmin = 1)) + geom_rect(aes(ymax=1, ymin=0, xmax=2, xmin=1), fill ="#ece8bd") + geom_rect() + coord_polar(theta = "y",start=-pi/2) + xlim(c(0, 2)) + ylim(c(0,2)) + guides(fill=FALSE) + guides(colour=FALSE) + theme_void() + theme(strip.background = element_blank(), strip.text.x = element_blank()) + geom_text(aes(x = 0, y = 0, label = label), size=13) + geom_text(aes(x=1.5, y=1.5, label=title), size=11) + scale_fill_manual(values = c("red"="#C9146C", "orange"="#DA9112", "green"="#129188")) + scale_colour_manual(values = c("red"="#C9146C", "orange"="#DA9112", "green"="#129188"))
\newpage
The nanopores through which DNA is passed, and signal collected, are arrayed on a 2-dimensional matrix. A heatmap can be plotted such that channel productivity can be shown against spatial position on the matrix. Such a plot enables the identification of spatial artifacts that could result from membrane damage through e.g. the introduction of an air-bubble. This heatmap representation of spatial activity shows only gross spatial aberations; the activity plot shows the number of sequences produced per channel, not per pore.
# create an empty read count container ... MinION or PromethION?? # https://gist.github.com/roblanf/df47b9748c3aae00809cc675aca79989 # build the map for R9.5 flowcell, as a long-form dataframe that translates # channels into rows and columns on the flowcell. Good for plotting in R. p1 = data.frame(channel=33:64, row=rep(1:4, each=8), col=rep(1:8, 4)) p2 = data.frame(channel=481:512, row=rep(5:8, each=8), col=rep(1:8, 4)) p3 = data.frame(channel=417:448, row=rep(9:12, each=8), col=rep(1:8, 4)) p4 = data.frame(channel=353:384, row=rep(13:16, each=8), col=rep(1:8, 4)) p5 = data.frame(channel=289:320, row=rep(17:20, each=8), col=rep(1:8, 4)) p6 = data.frame(channel=225:256, row=rep(21:24, each=8), col=rep(1:8, 4)) p7 = data.frame(channel=161:192, row=rep(25:28, each=8), col=rep(1:8, 4)) p8 = data.frame(channel=97:128, row=rep(29:32, each=8), col=rep(1:8, 4)) q1 = data.frame(channel=1:32, row=rep(1:4, each=8), col=rep(16:9, 4)) q2 = data.frame(channel=449:480, row=rep(5:8, each=8), col=rep(16:9, 4)) q3 = data.frame(channel=385:416, row=rep(9:12, each=8), col=rep(16:9, 4)) q4 = data.frame(channel=321:352, row=rep(13:16, each=8), col=rep(16:9, 4)) q5 = data.frame(channel=257:288, row=rep(17:20, each=8), col=rep(16:9, 4)) q6 = data.frame(channel=193:224, row=rep(21:24, each=8), col=rep(16:9, 4)) q7 = data.frame(channel=129:160, row=rep(25:28, each=8), col=rep(16:9, 4)) q8 = data.frame(channel=65:96, row=rep(29:32, each=8), col=rep(16:9, 4)) # long form as a data frame, i.e. map$channel[[1]] returns 33 channelMap = rbind(p1, p2, p3, p4, p5, p6, p7, p8, q1, q2, q3, q4, q5, q6, q7, q8) hm.palette <- colorRampPalette(brewer.pal(9, 'Blues'), space='Lab') #RdPu, Oranges, Greens, YlOrRd, Purples channelCounts <- as.data.frame(matrix(rep(0, 512), ncol=1)) channelCountRaw <- as.data.frame(table(unlist(sequencedata[, "channel"])), row.names=1) channelCounts[row.names(channelCountRaw),] <- channelCountRaw[,1] #channelMap <- cbind(channelMap[channelMap$channel,], frequency=channelCounts[channelMap$channel,]) channelMap <- merge(channelMap, channelCounts, by.x="channel", by.y=0) colnames(channelMap)[4]<-"count" channelMapMatrix <- reshape2::acast(channelMap, col ~ row, value.var = "count") theme_update(plot.title = element_text(hjust = 0.5)) ggplot(channelMap, aes(x = row, y = col, fill = count)) + geom_tile() + geom_text(data=channelMap,aes(x=row, y=col,label=count,color=count),show.legend = F, size=2.5) + scale_x_discrete(breaks=NULL) + scale_y_discrete(breaks=NULL) + coord_equal() + scale_fill_gradientn(colours = hm.palette(100)) + scale_color_gradient2(low = hm.palette(100), high = hm.palette(1)) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title="Channel activity plot showing number of reads per flowcell channel") + theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(), legend.position="bottom", legend.key.width=unit(5.6,"cm"))
\pagebreak
The distribution of base-called DNA sequence lengths and their accompanying qualities are key metrics for the review of a sequencing library. This section of the QC review tutorial assesses the length and quality distributions for reads from this flowcell. This section is reviewing the total collection of sequences that both pass and fail the mean quality filter.
#N50.length = seq_dt$sequence_length_template[min(which(seq_dt$cumulative_bases > (total.bases/2)))] lenSorted <- rev(sort(passedSeqs$sequence_length_template)) passedMeanLength = round(mean(lenSorted), digits = 0) N50 <- lenSorted[cumsum(lenSorted) >= sum(lenSorted)*0.5][1] passedMeanQ = round(mean(passedSeqs$mean_qscore_template), digits = 1) failedMeanQ = round(mean(failedSeqs$mean_qscore_template), digits = 1) #N50 length is the length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs
figures <- 5 df <- data.frame( x = cumsum(c(2, rep(6.5, figures-1))), y = rep(2, figures), h = rep(4, figures), w = rep(6, figures)) df$info <- c(passedMeanLength, N50, passedMeanQ, failedMeanQ, prettyNum(max(passedSeqs$sequence_length_template), big.mark=",")) df$key <- c("Mean Read Length (nt)","N50","Mean Read Quality (QV)","Mean Failed QV","Longest Read") df$icon <- fontawesome(c("fa-bar-chart", "fa-play", "fa-area-chart", "fa-bug", "fa-sort")) df$colour <- rep("steelblue", figures) ReadCharacteristicsValueBoxes <- ggplot(df, aes(x, y, height = h, width = w, label = key, fill = colour)) + geom_tile(fill = brewer.pal(9,"Blues")[7]) + geom_text(color = brewer.pal(9,"Blues")[3], hjust="left", nudge_y=-1.5, nudge_x=-2.6, size=3.5) + geom_text(label=df$info, size=5.5, color = brewer.pal(9,"Blues")[2], fontface = "bold", nudge_x=-2.6, hjust="left") + geom_text(label=df$icon, family='fontawesome-webfont', colour=brewer.pal(9,"Blues")[5], size=13.3, hjust="right", nudge_x=2.85, nudge_y=0.8) + coord_fixed() + scale_fill_brewer(type = "qual",palette = "Dark2") + theme_void() + guides(fill = F) ggsave("ReadCharacteristicsValueBoxes.png", plot=ReadCharacteristicsValueBoxes, device="png", units="cm", width=25, height=5, dpi=reportDPI) knitr::include_graphics("ReadCharacteristicsValueBoxes.png")
The distribution of sequence lengths will be dependent on the protocols that have been used to extract the starting DNA. Sequences from amplicon DNA will have a tight distribution of read lengths, while sequences from genomic DNA will have a broader distribution of sheared DNA product. The distribution will be further influenced if a size-selection step has been used, and will also be dependent on the choice of sequencing library preparation kits. The read length distribution should be assessed to see if the distribution is concordant with that expected.
\hfill\break
# there is a challenge to scale the data intelligently - this would be different for e.g. 16S and WGS # instead of hard coding this, perhaps we can place the logic into ggplot2? # quantile(x=sequencedata$sequence_length_template, probs=c(0.975)) # https://stackoverflow.com/questions/6461209/how-to-round-up-to-the-nearest-10-or-100-or-x roundUpNice <- function(x, nice=seq(from=1, to=10, by=0.25)) { if(length(x) != 1) stop("'x' must be of length 1") 10^floor(log10(x)) * nice[[which(x <= 10^floor(log10(x)) * nice)[[1]]]] } # pick a friendly upper limit to render sequence lengths into a histogram # here we're aiming for a robustly rounded up 97.5 quantile of the data (skip a few outliers ...) upperLimit <- roundUpNice(as.numeric(quantile(x=sequencedata$sequence_length_template, probs=c(0.975)))) # an ideal histogram will have 40 or so bins histogramBinCount <- 40 breakVal = roundUpNice(upperLimit / histogramBinCount) breaks <- seq(0, to=upperLimit, by=breakVal) assgnd.pass <- table( cut(subset(sequencedata, passes_filtering==TRUE & sequence_length_template > 1)$sequence_length_template, breaks, include.lowest = TRUE, right=FALSE)) assgnd.fail <- table(cut(subset(sequencedata, passes_filtering==FALSE & sequence_length_template > 1)$sequence_length_template, breaks, include.lowest = TRUE, right=FALSE)) lengthdist <- data.frame(length=head(breaks, -1), pass=as.vector(assgnd.pass), fail=as.vector(assgnd.fail)) lendistmlt <- reshape2::melt(lengthdist, id.vars=c("length")) ggplot(lendistmlt, aes(x=length, fill=variable, y=value)) + geom_bar(stat="identity") + xlab("Read length\n") + ylab("Number of reads\n") + scale_fill_brewer(palette="Paired") + scale_x_continuous(limits=c(-breakVal,upperLimit), breaks=pretty(passedSeqs$sequence_length_template,n=40)) + labs(title="Histogram showing distribution of read lengths across quality passing sequences", fill="QV filter")+ geom_vline(xintercept = N50, size = 1) + annotate("text", x=N50, y=max(assgnd.pass + assgnd.fail), label = " N50", hjust=0, colour="SteelBlue") + geom_vline(xintercept = passedMeanLength, size = 1) + annotate("text", x=passedMeanLength, y=max(assgnd.pass + assgnd.fail), label = " Mean", hjust=0, colour="SteelBlue")
The distribution of read lengths has been coloured by reads which either pass or fail the mean QV filter. The mean and N50 values for the QV passing sequences are super-imposed on the histogram.
\hfill\break
A histogram of mean QV scores reveals the relative abundance of sequences of different quality.
\hfill\break
ggplot(sequencedata, aes(x=mean_qscore_template, fill=passes_filtering)) + geom_histogram(breaks=seq(from=0, to=15, by=0.1)) + scale_fill_brewer(palette="Paired") + labs(title="Plot showing distribution of quality scores across all reads") + xlab("Mean Q score of read") + ylab("Number of reads")
The distribution of sequence qualities is by QV filter pass status. This QV filter is applied in the base-calling workflow as a modifiable parameter.
\hfill\break The density plot of mean sequence quality plotted against log10 sequence length appears as a favourite in the various publicly available QC tools. I am not sure on the overall point of this plot ... It reinforces the point that the short(est) sequences tend to have low quality values. In terms of per flowcell QC and as a plot used for the systematic review of many libraries I feel that this is superfluous ... would welcome some feedback. \hfill\break
binFilter <- 5 # background speckle from low density bins is meh - this can be prepared ... # prepare the density plot, but do not render lq_dens <- ggplot(sequencedata, aes(log10(sequence_length_template), mean_qscore_template)) + geom_bin2d(bins=100) # extract the density map from the plot lq_dens_counts <- ggplot_build(lq_dens)$data[[1]] if (binFilter > 0) { # remove the bins from the density map that do not contain sequence count above threshold lq_dens_counts <- lq_dens_counts[-which(lq_dens_counts$count <= binFilter),] } # directly plot this modified density map (stat=="identity") ggplot(lq_dens_counts) + geom_bin2d(aes(x,y,fill=count), stat="identity") + scale_fill_distiller(palette="Blues") + geom_hline(yintercept = 7, size = 1) + xlab("log10(read length)") + ylab("read mean quality") + scale_x_continuous(breaks = c(1,2,3,4,5), label = c("10", "100", "1000", "10,000", "100,000")) + annotation_logticks(base = 10, sides = "b", scaled = TRUE) + labs(title="Contour Plot showing distribution of quality scores against log10 read lengths (all reads)")
The density plot of log sequence length against mean read quality has been sharpened for presentation. The sequence bins containing less than or equal to 5 sequences have been removed; this removes background speckle from the graph - the regions of increased density remain.
\pagebreak
Another key metric in the quality review of a sequencing run is an analysis of the temporal performance of the run. During a run each sequencing channel may address a number of different pores (mux) and the individual pores may become temporarily or permanently blocked. It is therefore expected that during a run sequencing productivity will decrease; it is useful to review at to whether the observed productivity decline is normal or if it happens more rapidly than expected. A rapid pore decline could be indicative of contaminants with the sequencing library.
\hfill\break
Plotting the number of bases generated per unit time over the course of the run can reveal unexpected behaviours. In an ideal experiment there will not be many sudden decreases in performance.
# there is a fundamental change in the summary_statistics files generated by newer and older versions of # the base callers - the scale of sequence_data$start_time has changed and this will produce some FUBAR # graphs sequencedata$start_time <- sequencedata$start_time - min(sequencedata$start_time) sequencedata$start_time <- sequencedata$start_time / scaling # assuming a 48 hour run, 5 minute intervals sampleHours = 48 sampleIntervalMinutes = 60 breaks = seq(0, sampleHours*60*60, by=60*sampleIntervalMinutes) binass <- findInterval(sequencedata$start_time, breaks) mergeItPerHour <- function(interval, binnedAssignments, filter) { totalbases = 0 if (length(which(binnedAssignments==interval))>0) { subset <- sequencedata[which(binnedAssignments==interval), ] if (length(which(subset$passes_filtering == filter)) > 0) { totalbases = sum(subset[which(subset$passes_filtering == filter), "sequence_length_template"]) } } # need to scale what is being returned - totalbases value is total bases within an interval (sampleIntervalMinutes) return(totalbases / 1e9 / sampleIntervalMinutes * 60) } binnedTemporalDataPerHour <- data.frame( cbind( time=breaks, pass=unlist(lapply(seq(breaks), mergeItPerHour, binnedAssignments=binass,filter=TRUE)), fail=unlist(lapply(seq(breaks), mergeItPerHour, binnedAssignments=binass, filter=FALSE)) ) ) binnedTemporalDataPerHour$time <- binnedTemporalDataPerHour$time / 60 / 60 ggplot(binnedTemporalDataPerHour, aes(time)) + geom_line(aes(y = fail, colour = "fail"), size=1) + geom_line(aes(y = pass, colour = "pass"), size=1) + scale_color_brewer(name="Reads passing QV 7", palette="Paired") + xlab("Time (hours)") + ylab("Gigabases sequenced per hour (Gbp/hr)") + labs(title="Plot showing sequence throughput against time")
The temporal data has been scaled to sequence collected per hour. For a finer resolution of performance the code can be modified for a more frequent sample. I observe a lot more noise and sampling effect when the graph is sampled at 5 or 10 minute intervals.
# binnedTemporalDataPerHour is scaled to Gbp per hour - rescale to raw for cumulative plotting binnedTemporalDataPerHour$pass <- binnedTemporalDataPerHour$pass / 60 * sampleIntervalMinutes binnedTemporalDataPerHour$fail <- binnedTemporalDataPerHour$fail / 60 * sampleIntervalMinutes # https://stackoverflow.com/questions/31404679/can-ggplot2-find-the-intersections-or-is-there-any-other-neat-way targetInterpolate <- approxfun(x=binnedTemporalDataPerHour$time, y=cumsum(binnedTemporalDataPerHour$pass)) base50 <- sum(passedSeqs$sequence_length_template)/1e9*0.5 base90 <- sum(passedSeqs$sequence_length_template)/1e9*0.9 T50 <- optimize(function(t0) abs(targetInterpolate(t0) - base50), interval = range(binnedTemporalDataPerHour$time)) T90 <- optimize(function(t0) abs(targetInterpolate(t0) - base90), interval = range(binnedTemporalDataPerHour$time))
In addition to plotting the temporal production of data, a cumulative plot shows how data is accumulated during the run. From this dataset, we have measured a total of r round(sum(passedSeqs$sequence_length_template)/1e9, 2)
Gigabases of quality passing sequence. We can identify the timepoint T50 such that 50% of sequenced bases have been collected within this time - or r round(T50$minimum, 2)
hours in this example. This is displayed on the graph.
ggplot(binnedTemporalDataPerHour, aes(time)) + geom_line(aes(y = cumsum(fail), colour = "fail"), size=1) + geom_line(aes(y = cumsum(pass), colour = "pass"), size=1) + scale_color_brewer(name="Reads passing QV 7", palette="Paired") + geom_segment(x=T50$minimum, y=0, xend=T50$minimum, yend=base50, colour="darkgray", size=1) + geom_segment(x=0, y=base50, xend=T50$minimum, yend=base50, colour="darkgray", size=1) + annotate("text", x=T50$minimum, y=base50, label=" T50", vjust=1, hjust=0, colour="SteelBlue") + geom_segment(x=T90$minimum, y=0, xend=T90$minimum, yend=base90, colour="darkgray", size=1) + geom_segment(x=0, y=base90, xend=T90$minimum, yend=base90, colour="darkgray", size=1) + annotate("text", x=T90$minimum, y=base90, label=" T90", vjust=1, hjust=0, colour="SteelBlue") + xlab("Time (hours)") + ylab("Number of bases sequenced (Gigabases)") + labs(title="Plot showing cumulative bases sequenced against time")
This cumulative plot has been augmented with information that shows the points within the run that account for 50% and 90% of the total bases sequenced.
\hfill\break
In addition to the cumulative plot of sequenced bases, an equivalent plot for the sequenced reads can be plotted. This is not too different in structure or morphology to the cumulative baseplot. It would be recommended to consider either a cumulative base plot or a cumulative read plot - the information overlap is sufficient that both are unlikely to be required.
mergeItReadsPerHour <- function(interval, binnedAssignments,filter) { totalreads = 0 if (length(which(binnedAssignments==interval))>0) { subset <- sequencedata[which(binnedAssignments==interval), ] if (length(which(subset$passes_filtering == filter)) > 0) { totalreads = nrow(subset[which(subset$passes_filtering == filter),]) } } # scale results to mean millions of reads per hour return(totalreads/ 1e6 / sampleIntervalMinutes * 60) } binnedTemporalDataReadsPerHour <- data.frame( cbind(time=breaks, pass=unlist(lapply(seq(breaks), mergeItReadsPerHour, binnedAssignments=binass, filter=TRUE)), fail=unlist(lapply(seq(breaks), mergeItReadsPerHour, binnedAssignments=binass, filter=FALSE)) ) ) binnedTemporalDataReadsPerHour$time <- binnedTemporalDataReadsPerHour$time / 60 / 60 # binnedTemporalDataReadsPerHour is scaled to Gbp per hour - rescale to raw for cumulative plotting binnedTemporalDataReadsPerHour$pass <- binnedTemporalDataReadsPerHour$pass / 60 * sampleIntervalMinutes binnedTemporalDataReadsPerHour$fail <- binnedTemporalDataReadsPerHour$fail / 60 * sampleIntervalMinutes ggplot(binnedTemporalDataReadsPerHour, aes(time)) + geom_line(aes(y = cumsum(fail), colour = "fail"), size=1) + geom_line(aes(y = cumsum(pass), colour = "pass"), size=1) + scale_color_brewer(name="Reads passing QV 7", palette="Paired") + xlab("Time (hours)") + ylab("Number of reads sequenced (Millions)") + labs(title="Plot showing cumulative reads sequenced against time")
\hfill\break
The speed/time plot is a valuable metric in that changed in sequencing speed can be measured. A slow-down in sequencing speed can indicate a shortage of the sequencing fuel required by the motor protein.
speedTime <- data.frame(segment=binass, rate=sequencedata$sequence_length_template / (sequencedata$duration/scaling)) ggplot(speedTime, aes(x=segment, y=rate, group=segment)) + geom_boxplot(fill="steelblue", outlier.shape=NA) +scale_x_continuous(name="Time (hours)") + ylab("Sequencing rate (bases per second)") + labs(title="boxplot showing distribution of sequencing speed against time")
This speed-time plot has been modified such that datapoints corresponding to outlying sequences have been masked. This does not change the box and whisker.
mergeActiveChannels <- function(interval, binnedAssignments) { totalChannels = 0 if (length(which(binnedAssignments==interval))>0) { subset <- sequencedata[which(binnedAssignments==interval), ] totalChannels = length(unique(subset$channel)) } return(totalChannels) } binnedTemporalChannels <- data.frame(time=breaks, channels=unlist(lapply(seq(breaks), mergeActiveChannels, binnedAssignments=binass) ) ) binnedTemporalChannels$time <- binnedTemporalChannels$time / 60 / 60 ggplot(binnedTemporalChannels, aes(time)) + geom_step(aes(y = channels), size=1, colour = "Steelblue") + xlab("Time (hours)") + ylab("Number of channels producing data") + labs(title="Plot showing number of functional channels against time")
\pagebreak
barcodes = 0 barcodeUnass = 0 barcodeRange <- c(0,0) if ("barcode_arrangement" %in% names(passedSeqs)) { barcode=plyr::count(passedSeqs$barcode_arrangement) barcode=subset(barcode, freq > 150) if ("unclassified" %in% barcode$x) { barcodes <- nrow(barcode[-which(barcode$x=="unclassified"),]) barcodeUnass <- sum(barcode[-which(barcode$x=="unclassified"),"freq"]) / sum(barcode$freq) * 100 barcodeRange <- range(subset(barcode, x!="unclassified")$freq) } else { barcodes <- nrow(barcode) barcodeUnass = 100 barcodeRange <- range(barcode$freq) } } figures <- 3 df <- data.frame( x = cumsum(c(2, rep(6.5, figures-1))), y = rep(2, figures), h = rep(4, figures), w = rep(6, figures)) df$info <- c(round(barcodeUnass, digits = 1), barcodes, paste(barcodeRange,collapse="\n")) df$key <- c("Reads with barcode (%)","Barcoded libraries", "barcode variance") df$icon <- fontawesome(c('fa-pie-chart', 'fa-barcode', 'fa-sliders')) df$colour <- rep("steelblue", figures)
if (barcodes > 0) { MultiplexCharacteristicsValueBoxes <- ggplot(df, aes(x, y, height = h, width = w, label = key, fill = colour)) + geom_tile(fill = brewer.pal(9,"Blues")[7]) + geom_text(color = brewer.pal(9,"Blues")[3], hjust="left", nudge_y=-1.5, nudge_x=-2.6, size=5) + geom_text(label=df$info, size=10, color = brewer.pal(9,"Blues")[2], fontface = "bold", nudge_x=-2.6, hjust="left") + geom_text(label=df$icon, family='fontawesome-webfont', colour=brewer.pal(9,"Blues")[5], size=23, hjust="right", nudge_x=2.85, nudge_y=0.8) + coord_fixed() + scale_fill_brewer(type = "qual",palette = "Dark2") + theme_void() + guides(fill = F) ggsave("MultiplexCharacteristicsValueBoxes.png", plot=MultiplexCharacteristicsValueBoxes, device="png", units="cm", width=25, height=5, dpi=reportDPI) knitr::include_graphics("MultiplexCharacteristicsValueBoxes.png") }
if (barcodes > 0) { # https://www.ncbi.nlm.nih.gov/assembly/help/ ncalc <- function(len.vector, n) { # N50 - length such that scaffolds of this length or longer include half the bases of the assembly len.sorted <- rev(sort(len.vector)) len.sorted[cumsum(len.sorted) >= sum(len.sorted)*n][1] } lcalc <- function(len.vector, n) { # L50 - number of scaffolds that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly len.sorted <- rev(sort(len.vector)) which(cumsum(len.sorted) >= sum(len.sorted)*n)[1] } N50 <- ncalc(passedSeqs$sequence_length_template, 0.5) N90 <- ncalc(passedSeqs$sequence_length_template, 0.9) L50 <- lcalc(passedSeqs$sequence_length_template, 0.5) L90 <- lcalc(passedSeqs$sequence_length_template, 0.9) seqSummary <- function(barcodeId, myBarcode, myVector, myMethod, xlist=NA) { subVector <- myVector[which(myBarcode == barcodeId)] params <- list(subVector) if (!is.na(xlist)) { params <- append(params, xlist) } do.call(myMethod, params) } barcode <- cbind(barcode, mbps=round(unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="sum")) / 10e6, digits=0)) barcode <- cbind(barcode, min=unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="min"))) barcode <- cbind(barcode, max=unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="max"))) barcode <- cbind(barcode, mean=round(unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="mean")), digits=0)) barcode <- cbind(barcode, N50=unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="ncalc", xlist=list(n=0.5)))) barcode <- cbind(barcode, L50=unlist(lapply(as.character(barcode$x), seqSummary, myBarcode=passedSeqs$barcode_arrangement, myVector=passedSeqs$sequence_length_template, myMethod="lcalc", xlist=list(n=0.5)))) kable(barcode, format="markdown", caption="Table summarising barcode content") }
if (barcodes > 0) { if ("barcode_arrangement" %in% names(passedSeqs)) { # it's a run that used the --barcoding flag #library(extrafont) #loadfonts(device = "win") ggplot(barcode, aes(x, freq, fill=x)) + geom_bar(stat="identity", width=0.5, fill="#9ecae1") + xlab("\nDemultiplexed barcodes") + ylab("\nFrequency") + scale_y_continuous(expand = c(0,0)) + labs(title="Histogram showing abundance of different barcodes") + theme(axis.text.x = element_text(angle=45, hjust=1)) } }
\pagebreak
df <- data.frame( x = seq(2, 21.5, 6.5), y = rep(2, 4), h = rep(4, 4), w = rep(6, 4), info = c("meaningless plots", "hours wasted", "zombies prefer brains", "useless metric"), data = c("78%","+10k","8/10", "ZZZ"), icon = emojifont::fontawesome(c("fa-qrcode", "fa-hashtag", "fa-file-text-o", "fa-hashtag")), color = factor(1:4) ) ggplot(df, aes(x, y, height = h, width = w, label = info, fill = color)) + geom_tile() + geom_text(color = "white", hjust="left", nudge_y=-1.5, nudge_x=-2.6, size=4.5) + geom_text(label=df$data, size=7, color = "gray", fontface = "bold", nudge_x=-2.6, hjust="left") + geom_text(label=df$icon, family="fontawesome-webfont", colour="darkgray", size=16, hjust="right", nudge_x=2.85, nudge_y=0.8) + coord_fixed() + scale_fill_brewer(type="qual", palette="Dark2") + theme_void() + guides(fill = F)
This report has been created using Rmarkdown
, publicly available R
packages, and the \LaTeX document typesetting software for reproducibility. For clarity the R
packages used, and their versions, is listed below.
\fontsize{8}{12}
options(width = 100) utils:::print.sessionInfo(sessionInfo()[-7], locale=FALSE)
\fontsize{10}{14}
r newthought("Final thoughts.")
Behind this Rmarkdown
file (and its glossy pdf) is a modest amount of R code
- please explore the Rmarkdown
template; modify it, and run with your own samples.
To extract the whole set of R code
from the Rmarkdown
, use the purl
command - this will extract the R code into its own file.
knitr::purl("Nanopore_cDNA_Tutorial.Rmd", quiet=TRUE)
\pagebreak
knit is the command to render an Rmarkdown file. The knitr package is used to embed code, the results of R analyses and their figures within the typeset text from the document.
L50 the number of sequences (or contigs etc) that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly
N50 length such that sequences (or contigs etc) of this length or longer include half the bases of the sequence collection
Rmarkdown is an extension to markdown. Functional R code can be embedded in a plain-text document and subsequently rendered to other formats including the PDF format of this report.
QV the quality value - -log10(p) that any given base is incorrect. QV may be either at the individual base level, or may be averaged across whole sequences
sequencing_summary.txt a summary file
\pagebreak
The conda
package management system is available for Linux and macOS. There are some R-package
dependencies for the tutorial that are challenged by the system files distributed by conda
. For macOS, it would be recommended that R
, Rstudio
, and \LaTeX installation is managed at the system level; other bioinformatics software will be installed through conda
.
R
statistical software from https://cran.r-project.org/bin/macosx/Rstudio
software from https://www.rstudio.comMacTex
from, https://tug.org/mactex/mactex-download.htmlR
packagesR --slave -e 'install.packages(c("BiocManager", "devtools"), ask=F, update=F)' R --slave -e 'BiocManager::install(c("sysfonts", "showtext"), ask=F, update=F, type="source")' R --slave -e 'BiocManager::install(c("kableExtra", "roxygen2", "emojifont", "extrafont", "R.utils"), ask=F, update=F) R --slave -e 'BiocManager::install(c("DESeq2", "caTools", "pcaMethods", "dplyr", "ggplot2", "plyr", "RColorBrewer", "rmarkdown", "tufte", "xfun", "xlsx", "yaml", "Rsubread", "ShortRead"), ask=F, update=F)'
conda
, install as described at cd
into itmacenv.yaml
wget https://gist.github.com/sagrudd/63d7c76035555def83a336336a73365d/raw/macenv.yaml
conda env create --name sumStatTutorial --file macenv.yaml
source activate sumStatTutorial
R --slave -e 'Sys.setenv(TAR = "/bin/tar"); devtools::install_github("sagrudd/tutorials")'
R --slave -e 'TuPoreIal::InitTutorial("SeqStatsQC")'
config.yaml
file to match your own study designR --slave -e 'rmarkdown::render("Nanopore_SumStatQC_Tutorial.Rmd", "tufte_handout")'
\pagebreak
# this command will nuke the existing bib file ... #knitr::write_bib(c('base', 'rmarkdown', 'DESeq2', 'Rsubread', 'apeglm', 'pcaMethods', 'ShortRead'), file = 'skeleton.bib') bibliography <- "QGFydGljbGV7bWluaW1hcDIyMDE4LAphdXRob3IgPSB7TGksIEhlbmd9LAp0aXRsZSA9IHtNaW5pbWFwMjogcGFpcndpc2UgYWxpZ25tZW50IGZvciBudWNsZW90aWRlIHNlcXVlbmNlc30sCmpvdXJuYWwgPSB7QmlvaW5mb3JtYXRpY3N9LAp2b2x1bWUgPSB7MzR9LApudW1iZXIgPSB7MTh9LApwYWdlcyA9IHszMDk0LTMxMDB9LAp5ZWFyID0gezIwMTh9LApkb2kgPSB7MTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHkxOTF9LApVUkwgPSB7aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHkxOTF9LAplcHJpbnQgPSB7L291cC9iYWNrZmlsZS9jb250ZW50X3B1YmxpYy9qb3VybmFsL2Jpb2luZm9ybWF0aWNzLzM0LzE4LzEwLjEwOTNfYmlvaW5mb3JtYXRpY3NfYnR5MTkxLzEvYnR5MTkxLnBkZn0KfQoKQGFydGljbGV7c2FtdG9vbHMyMDA5LAphdXRob3IgPSB7TGksIEhlbmcgYW5kIEhhbmRzYWtlciwgQm9iIGFuZCBXeXNva2VyLCBBbGVjIGFuZCBGZW5uZWxsLCBUaW0gYW5kIFJ1YW4sIEp1ZSBhbmQgSG9tZXIsIE5pbHMgYW5kIE1hcnRoLCBHYWJvciBhbmQgQWJlY2FzaXMsIEdvbmNhbG8gYW5kIER1cmJpbiwgUmljaGFyZCBhbmQgMTAwMCBHZW5vbWUgUHJvamVjdCBEYXRhIFByb2Nlc3NpbmcgU3ViZ3JvdXB9LAp0aXRsZSA9IHtUaGUgU2VxdWVuY2UgQWxpZ25tZW50L01hcCBmb3JtYXQgYW5kIFNBTXRvb2xzfSwKam91cm5hbCA9IHtCaW9pbmZvcm1hdGljc30sCnZvbHVtZSA9IHsyNX0sCm51bWJlciA9IHsxNn0sCnBhZ2VzID0gezIwNzgtMjA3OX0sCnllYXIgPSB7MjAwOX0sCmRvaSA9IHsxMC4xMDkzL2Jpb2luZm9ybWF0aWNzL2J0cDM1Mn0sClVSTCA9IHtodHRwOi8vZHguZG9pLm9yZy8xMC4xMDkzL2Jpb2luZm9ybWF0aWNzL2J0cDM1Mn0sCmVwcmludCA9IHsvb3VwL2JhY2tmaWxlL2NvbnRlbnRfcHVibGljL2pvdXJuYWwvYmlvaW5mb3JtYXRpY3MvMjUvMTYvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHAzNTIvMi9idHAzNTIucGRmfQp9CgpAYXJ0aWNsZXtzbmFrZW1ha2UyMDEyLAphdXRob3IgPSB7S8O2c3RlciwgSm9oYW5uZXMgYW5kIFJhaG1hbm4sIFN2ZW59LAp0aXRsZSA9IHtTbmFrZW1ha2XigJRhIHNjYWxhYmxlIGJpb2luZm9ybWF0aWNzIHdvcmtmbG93IGVuZ2luZX0sCmpvdXJuYWwgPSB7QmlvaW5mb3JtYXRpY3N9LAp2b2x1bWUgPSB7Mjh9LApudW1iZXIgPSB7MTl9LApwYWdlcyA9IHsyNTIwLTI1MjJ9LAp5ZWFyID0gezIwMTJ9LApkb2kgPSB7MTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHM0ODB9LApVUkwgPSB7aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHM0ODB9LAplcHJpbnQgPSB7L291cC9iYWNrZmlsZS9jb250ZW50X3B1YmxpYy9qb3VybmFsL2Jpb2luZm9ybWF0aWNzLzI4LzE5LzEwLjEwOTMvYmlvaW5mb3JtYXRpY3MvYnRzNDgwLzIvYnRzNDgwLnBkZn0KfQoKQGFydGljbGV7QkgxOTk1LAogSVNTTiA9IHswMDM1OTI0Nn0sCiBVUkwgPSB7aHR0cDovL3d3dy5qc3Rvci5vcmcvc3RhYmxlLzIzNDYxMDF9LAogYWJzdHJhY3QgPSB7VGhlIGNvbW1vbiBhcHByb2FjaCB0byB0aGUgbXVsdGlwbGljaXR5IHByb2JsZW0gY2FsbHMgZm9yIGNvbnRyb2xsaW5nIHRoZSBmYW1pbHl3aXNlIGVycm9yIHJhdGUgKEZXRVIpLiBUaGlzIGFwcHJvYWNoLCB0aG91Z2gsIGhhcyBmYXVsdHMsIGFuZCB3ZSBwb2ludCBvdXQgYSBmZXcuIEEgZGlmZmVyZW50IGFwcHJvYWNoIHRvIHByb2JsZW1zIG9mIG11bHRpcGxlIHNpZ25pZmljYW5jZSB0ZXN0aW5nIGlzIHByZXNlbnRlZC4gSXQgY2FsbHMgZm9yIGNvbnRyb2xsaW5nIHRoZSBleHBlY3RlZCBwcm9wb3J0aW9uIG9mIGZhbHNlbHkgcmVqZWN0ZWQgaHlwb3RoZXNlcy10aGUgZmFsc2UgZGlzY292ZXJ5IHJhdGUuIFRoaXMgZXJyb3IgcmF0ZSBpcyBlcXVpdmFsZW50IHRvIHRoZSBGV0VSIHdoZW4gYWxsIGh5cG90aGVzZXMgYXJlIHRydWUgYnV0IGlzIHNtYWxsZXIgb3RoZXJ3aXNlLiBUaGVyZWZvcmUsIGluIHByb2JsZW1zIHdoZXJlIHRoZSBjb250cm9sIG9mIHRoZSBmYWxzZSBkaXNjb3ZlcnkgcmF0ZSByYXRoZXIgdGhhbiB0aGF0IG9mIHRoZSBGV0VSIGlzIGRlc2lyZWQsIHRoZXJlIGlzIHBvdGVudGlhbCBmb3IgYSBnYWluIGluIHBvd2VyLiBBIHNpbXBsZSBzZXF1ZW50aWFsIEJvbmZlcnJvbmktdHlwZSBwcm9jZWR1cmUgaXMgcHJvdmVkIHRvIGNvbnRyb2wgdGhlIGZhbHNlIGRpc2NvdmVyeSByYXRlIGZvciBpbmRlcGVuZGVudCB0ZXN0IHN0YXRpc3RpY3MsIGFuZCBhIHNpbXVsYXRpb24gc3R1ZHkgc2hvd3MgdGhhdCB0aGUgZ2FpbiBpbiBwb3dlciBpcyBzdWJzdGFudGlhbC4gVGhlIHVzZSBvZiB0aGUgbmV3IHByb2NlZHVyZSBhbmQgdGhlIGFwcHJvcHJpYXRlbmVzcyBvZiB0aGUgY3JpdGVyaW9uIGFyZSBpbGx1c3RyYXRlZCB3aXRoIGV4YW1wbGVzLn0sCiBhdXRob3IgPSB7WW9hdiBCZW5qYW1pbmkgYW5kIFlvc2VmIEhvY2hiZXJnfSwKIGpvdXJuYWwgPSB7Sm91cm5hbCBvZiB0aGUgUm95YWwgU3RhdGlzdGljYWwgU29jaWV0eS4gU2VyaWVzIEIgKE1ldGhvZG9sb2dpY2FsKX0sCiBudW1iZXIgPSB7MX0sCiBwYWdlcyA9IHsyODktLTMwMH0sCiBwdWJsaXNoZXIgPSB7W1JveWFsIFN0YXRpc3RpY2FsIFNvY2lldHksIFdpbGV5XX0sCiB0aXRsZSA9IHtDb250cm9sbGluZyB0aGUgRmFsc2UgRGlzY292ZXJ5IFJhdGU6IEEgUHJhY3RpY2FsIGFuZCBQb3dlcmZ1bCBBcHByb2FjaCB0byBNdWx0aXBsZSBUZXN0aW5nfSwKIHZvbHVtZSA9IHs1N30sCiB5ZWFyID0gezE5OTV9Cn0KCgoKQE1hbnVhbHtSLWFwZWdsbSwKICB0aXRsZSA9IHthcGVnbG06IEFwcHJveGltYXRlIHBvc3RlcmlvciBlc3RpbWF0aW9uIGZvciBHTE0gY29lZmZpY2llbnRzfSwKICBhdXRob3IgPSB7QW5xaSBaaHUgYW5kIEpvc2VwaCBHLiBJYnJhaGltIGFuZCBNaWNoYWVsIEkuIExvdmV9LAogIHllYXIgPSB7MjAxOH0sCiAgbm90ZSA9IHtSIHBhY2thZ2UgdmVyc2lvbiAxLjQuMH0sCn0KQE1hbnVhbHtSLWJhc2UsCiAgdGl0bGUgPSB7UjogQSBMYW5ndWFnZSBhbmQgRW52aXJvbm1lbnQgZm9yIFN0YXRpc3RpY2FsIENvbXB1dGluZ30sCiAgYXV0aG9yID0ge3tSIENvcmUgVGVhbX19LAogIG9yZ2FuaXphdGlvbiA9IHtSIEZvdW5kYXRpb24gZm9yIFN0YXRpc3RpY2FsIENvbXB1dGluZ30sCiAgYWRkcmVzcyA9IHtWaWVubmEsIEF1c3RyaWF9LAogIHllYXIgPSB7MjAxOH0sCiAgdXJsID0ge2h0dHBzOi8vd3d3LlItcHJvamVjdC5vcmcvfSwKfQpATWFudWFse1ItREVTZXEyLAogIHRpdGxlID0ge0RFU2VxMjogRGlmZmVyZW50aWFsIGdlbmUgZXhwcmVzc2lvbiBhbmFseXNpcyBiYXNlZCBvbiB0aGUgbmVnYXRpdmUKYmlub21pYWwgZGlzdHJpYnV0aW9ufSwKICBhdXRob3IgPSB7TWljaGFlbCBMb3ZlIGFuZCBTaW1vbiBBbmRlcnMgYW5kIFdvbGZnYW5nIEh1YmVyfSwKICB5ZWFyID0gezIwMTh9LAogIG5vdGUgPSB7UiBwYWNrYWdlIHZlcnNpb24gMS4yMi4wfSwKICB1cmwgPSB7aHR0cHM6Ly9naXRodWIuY29tL21pa2Vsb3ZlL0RFU2VxMn0sCn0KQE1hbnVhbHtSLXBjYU1ldGhvZHMsCiAgdGl0bGUgPSB7cGNhTWV0aG9kczogQSBjb2xsZWN0aW9uIG9mIFBDQSBtZXRob2RzfSwKICBhdXRob3IgPSB7V29sZnJhbSBTdGFja2xpZXMgYW5kIEhlbm5pbmcgUmVkZXN0aWcgYW5kIEtldmluIFdyaWdodH0sCiAgeWVhciA9IHsyMDE4fSwKICBub3RlID0ge1IgcGFja2FnZSB2ZXJzaW9uIDEuNzQuMH0sCiAgdXJsID0ge2h0dHBzOi8vZ2l0aHViLmNvbS9ocmVkZXN0aWcvcGNhbWV0aG9kc30sCn0KQE1hbnVhbHtSLXJtYXJrZG93biwKICB0aXRsZSA9IHtybWFya2Rvd246IER5bmFtaWMgRG9jdW1lbnRzIGZvciBSfSwKICBhdXRob3IgPSB7SkogQWxsYWlyZSBhbmQgWWlodWkgWGllIGFuZCBKb25hdGhhbiBNY1BoZXJzb24gYW5kIEphdmllciBMdXJhc2NoaSBhbmQgS2V2aW4gVXNoZXkgYW5kIEFyb24gQXRraW5zIGFuZCBIYWRsZXkgV2lja2hhbSBhbmQgSm9lIENoZW5nIGFuZCBXaW5zdG9uIENoYW5nfSwKICB5ZWFyID0gezIwMTh9LAogIG5vdGUgPSB7UiBwYWNrYWdlIHZlcnNpb24gMS4xMH0sCiAgdXJsID0ge2h0dHBzOi8vQ1JBTi5SLXByb2plY3Qub3JnL3BhY2thZ2U9cm1hcmtkb3dufSwKfQpATWFudWFse1ItUnN1YnJlYWQsCiAgdGl0bGUgPSB7UnN1YnJlYWQ6IFN1YnJlYWQgc2VxdWVuY2UgYWxpZ25tZW50IGFuZCBjb3VudGluZyBmb3IgUn0sCiAgYXV0aG9yID0ge1dlaSBTaGkgYW5kIFlhbmcgTGlhbyB3aXRoIGNvbnRyaWJ1dGlvbnMgZnJvbSBHb3Jkb24gSyBTbXl0aCBhbmQgSmVubnkgRGFpIGFuZCBUaW1vdGh5IHtUcmljaGUsIEpyLn19LAogIHllYXIgPSB7MjAxOH0sCiAgbm90ZSA9IHtSIHBhY2thZ2UgdmVyc2lvbiAxLjMyLjB9LAogIHVybCA9IHtodHRwOi8vYmlvY29uZHVjdG9yLm9yZy9wYWNrYWdlcy9yZWxlYXNlL2Jpb2MvaHRtbC9Sc3VicmVhZC5odG1sfSwKfQpATWFudWFse1ItU2hvcnRSZWFkLAogIHRpdGxlID0ge1Nob3J0UmVhZDogRkFTVFEgaW5wdXQgYW5kIG1hbmlwdWxhdGlvbn0sCiAgYXV0aG9yID0ge01hcnRpbiBNb3JnYW4gYW5kIE1pY2hhZWwgTGF3cmVuY2UgYW5kIFNpbW9uIEFuZGVyc30sCiAgeWVhciA9IHsyMDE4fSwKICBub3RlID0ge1IgcGFja2FnZSB2ZXJzaW9uIDEuNDAuMH0sCn0K" write(base64decode(bibliography, what=character, size=1)[1], file="skeleton.bib", append=FALSE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.