In sagrudd/cDNA_tutorial: Tutorial For Nanopore Sequence Based Gene Expression Profiling

knitr::opts_chunk$set(tidy=TRUE, tidy.opts=list(width.cutoff=80), 
                      fig.width=9, fig.height=6, warning=FALSE, message=FALSE, 
                      kable.force.latex=TRUE,
                      highlight=TRUE,
                      cache.path="Analysis/Cache/cDNATutorial/")

# library(TuPoreIal)  # ONT container package; deprecated ...
library(tufte)
library(digest)
library(ShortRead)   # oh the irony!
library(ggplot2)
library(plyr)
library(dplyr)       # used for the %>% filtering
library(Rsubread)
library(DESeq2)
library(pcaMethods)
library(kableExtra)  # rendering of tables...
library(caTools)     # base64decode for embedded Snakefile
library(xlsx)        # writing results to Excel files
library(yaml)

source("common.R")

resultDir <- file.path("Analysis", "Results")
dir.create(resultDir, showWarnings = FALSE, recursive=TRUE)

options(htmltools.dir.version = FALSE)

# as this tutorial evolves there is a possibility to include some additional logic and validation of what has been provided; further abstraction and accommodation of more complex experimental designs. Further abstraction required for handling input data with varying extensions and degrees of compression ...

# reality check of provided metadata ....

# does the provided reference genome exist - is it fasta (or fasta.gz)?

# does the required gene features gtf file exist?

# are each of the sample files specified existing - are they fastq?

# are there at least 2 replicates for at least 2 conditions - required for statistical analysis

Statement of Tutorial Objectives

The aim of this tutorial is to demonstrate a workflow for long-read differential gene expression analysis based on replicated cDNA sequence data. This workflow is suitable for fastq sequence collections with a paired design (e.g. tumour/normal) where a reference genome sequence, and its gene annotation, is available.

The tutorial is packaged with example data, such that the workflow can be replicated to address gene expression questions such as

which genes are expressed in my study of interest?
which genes are upregulated in my tumour sample?
show grouped expression levels for gene ENSG00000117523?

Editing of the workflow's configuration file, config.yaml will allow the workflow to be run with different starting cDNA sequence collections, reference genomes, and with different statistical requirements for selection of candiate genes.

Methods utilised include:

conda for management of bioinformatics software
snakemake for managing the bioinformatics workflow
minimap2 for mapping sequence reads to reference genome
samtools for SAM/BAM handling and mapping statistics
RSubread and DESeq2 for differential expression analysis

The Computational requirements include

Computer running Linux (Centos7, Ubuntu 18_10, Fedora 29) - also valided on macOS (Mojave); please see Appendix
At least 16Gb RAM - swap space of at least 8Gb ideal
At least 15Gb spare disk space for analysis and indices
Runtime with provided example data - approximately 20 minutes

\pagebreak

Quick start

Most software dependecies are managed though conda, install as described at
https://conda.io/docs/install/quick.html#linux-miniconda-install.
Download github gist for environment.yaml
wget https://gist.githubusercontent.com/sagrudd/609b9124b96f6d388951dad9e8bc4964/raw/environment.yaml
Install conda software dependencies with
conda env create --name diffExprTutorial --file environment.yaml
Initialise conda environment with source activate diffExprTutorial
Download Nanopore cDNA tutorial & example files into folder named cDNA_tutorial svn export https://github.com/sagrudd/tutorials.git/trunk/cDNA_deseq2 cDNA_tutorial
Change into the created cDNA_tutorial directory cd cDNA_tutorial
Install the Rpackage dependencies

    R --slave -e 'install.packages("BiocManager", repos="https://ftp.fau.de/cran/")'
    R --slave -e 'BiocManager::install(readLines("R.dependencies"), ask=F, update=F)'

optional edit the provided config.yaml file to match your own study design
Snakemake perform the bioinformatics analysis of the specified samples snakemake
Render the report using results from the analysis above R --slave -e 'rmarkdown::render("Nanopore_cDNA_Tutorial.Rmd", "tufte_handout")'

\pagebreak

Introduction

Differential gene expression analysis aims to identify genes that show statistically (and magnitudinally) altered expression patterns in a biological system of interest, when compared to a biological control. The results of the differential expression analysis are presented in a quantitative form and therefore the degree of change (up or down regulation) can be ascertained for each gene identified.

Differential gene expression analysis requires a "snapshot" of gene activity. In this context, gene activity corresponds to the quantity of messenger RNAs (mRNA) transcribed from each gene within the organism/tissue/culture being investigated. The greater the number of mRNA molecules observed from a given gene, the higher the expression level. In order to determine expression levels across the whole genome, sequence data specifically targeting the mRNA molecules can be generated.

Oxford Nanopore Technologies provides a number of sequencing solutions to allow users to generate such a "snapshot"" of gene expression. This can be achieved by both sequencing the mRNA directly, or via a complementary DNA (cDNA) proxy. In contrast to short read sequencing technologies, entire mRNA transcripts can be captured as single reads. The example data provided with this tutorial is from a study based on the PCR-cDNA kit. This is a robust choice for performing differential gene expression study. This kit is suitable for preparation of sequence libraries from low mRNA input quantities, and, the cDNA population is enriched through PCR with low bias; an important pre-requisite for the subsequent statistical analysis.

Once sequencing data has been produced from both the experimental and paired control samples (with an appropriate number of biological replicates), the sequence reads can be mapped to the host's reference genome. The number of sequences mapping to each gene can be counted, and it is this count data that forms the basis for differential gene expression analysis.

There are five goals for this tutorial:

To introduce a literate framework for analysing Oxford Nanopore cDNA data prepared using MinION or PromethION flowcells
To utilise best data management practices
To provide basic cDNA sequence QC metrics such that review and consideration of the starting experimental data can be performed
To map sequence reads to the reference genome and to identify the genes that are expressed and the number of sequence reads that are observed from each gene
To perform a statistical analysis using DESeq2 (@R-DESeq2) to identify differentially expressed genes

This tutorial does not aim to provide an exhaustive analysis or annotation of the differentially expressed genes.

Getting started and Best Practices

This tutorial requires that we are working at a computer workstation running a Linux operating system. An update for MacOS will be released. The workflow described has been tested using Fedora 29, Centos 7, and Ubuntu 18_04. This tutorial is written in the Rmarkdown file format. This merges both markdown (and easy-to-write plain text format as used in many Wiki systems). See @R-rmarkdown for more information about rmarkdown. The document template contains chunks of embedded R code (and some linux bash commands). The included workflow makes extensive use of the conda package management and the snakemake workflow software. These software packages and the functionality of Rmarkdown provide the source for a rich, reproducible, and extensible tutorial document.

The workflow contained within this Tutorial performs a real bioinformatics analysis and uses the whole human genome as an example. There are some considerations in terms of memory and processor requirement. Indexing the whole human genome for sequence read mapping using minimap2 for example will use at least 18Gb of memory. The minimal recommended hardware setup for this tutorial is therefore an 8 threaded computer with at least 16Gb of RAM and 15Gb of storage space. If you have modest amounts of RAM (<24Gb) then please consider increasing the amount of swap space available for indexing reference genomes and holding genome index in memory whilst mapping.

There are a few dependencies that need to be installed at the system level - please refer to the section below on system requirements. The conda package management software will coordinate the installation of key software and dependencies in user space - this is dependent on a robust internet connection.

As a best practice this tutorial will separate primary DNA sequence data (the base-called fastq files) from the Rmarkdown source, and the genome reference data. The analysis results and figures will again be placed in a separate working directory. The required layout for the primary data is shown in figure \ref{fig:FolderLayout}. This minimal structure will be prepared over the next few sections of this tutorial. The DNA sequences must be placed within a folder called RawData, the reference genome and annotation files must be placed in a folder named ReferenceData.

knitr::include_graphics("Analysis/StaticImages/FolderLayout.png")

Experimental design

The first required step for performing a cDNA sequence analysis requires some information on the biological samples and the replicates that are to be used in the analysis. The minimally required information is a path to the sequence file, and experimental group assignment, and a replicate assignment.

The process of DNA sequencing and base-calling within a MinION, GridION, or PromethION, may generate hundreds of individual fastq files. Each of these files needs to be concatenated into a single fastq file per sample for this workflow. This file should be named appropriately; the files may be compressed with either gzip or bzip2.

knitr::include_graphics("Analysis/StaticImages/ExperimentalDesign.png")

The example data included with this tutorial describes a replicated study comparing an experimental sample against a linked control. This design is described in a separate file that should be named config.yaml - an example file has been provided with the tutorial. The content of this file is highlighted in figure \ref{fig:EditPhenoData}. The sequence files are defined within the Sample block; experimental groups, and their discrete biological samples are defined within this block.

reference_genome refers to the genome against which the sequence reads will be mapped; genome_annotation refers to the gene annotations assigned to this genome sequence. A URL should ideally be provided for both and the Snakemake workflow will download these files and manage the indexing of the genome reference.

The only other parameters that should be considered are readCountMinThreshold, lfcThreshold, and adjPValueThreshold. These correspond to the required depth of coverage to analyse a gene, the log2-fold-change filter to be applied in differential testing and the false-discovery corrected P-value threshold to be used, respectively.

config <- yaml.load_file("config.yaml")
studyDesign <- data.frame()
for (i in 1:length(config$Samples)) {
  studyDesign <- rbind(studyDesign, 
                       data.frame(samples=names(config$Samples[[i]][[1]]), 
                                  filename=unlist(config$Samples[[i]][[1]]), 
                                  group=names(config$Samples[[i]])))
}

studyDesign$replicate <- sapply(1:nrow(studyDesign), function(x)sum(studyDesign$group[1:x]==studyDesign$group[x]))
studyDesign$md5 <- lapply(as.character(studyDesign$filename), md5sum)
# let's use the provided filename as the key??
studyDesign$group <- relevel(studyDesign$group, ref=config$referenceGroup)
# quick tidy
studyDesign <- studyDesign[,-which(colnames(studyDesign)=="samples")]

knitr::kable(studyDesign, caption="Study design for samples evaluated within this report", booktabs=TRUE, table.envir='table*', linesep="")  %>%
  kable_styling(latex_options=c("hold_position", "scale_down")) %>%
  add_footnote(c("please note md5 column - see glossary of terms"))

\pagebreak

Raw sequence review

It is worth considering some of the sequence characteristics for each of the sequence collections that has been included. The statistics shown in table \ref{tab:summaryStatsTable} have been produced using the R ShortRead package (see @R-ShortRead). Loading the sequence files into computer memory has allowed for the rapid calculation of sequence length distribution, a review of sequence qualities, and the preparation of some minimal statistics that allow for the qualitative and quantitative review of these different sequence libraries. These statistics have been parsed directly from the provided Fastq files.

processQCFastq <- function(rowname) {
  row <- which(row.names(studyDesign)==rowname)
  file <- as.character(studyDesign[row, "filename"])
  fastq <- readFastq(file)
  c(
    reads = formatC(length(fastq), big.mark=","),
    mbs = formatC(round(sum(width(fastq)) / 1000 / 1000, digits=1), big.mark=","),
    min = min(width(fastq)),
    max = max(width(fastq)),
    mean = round(mean(width(fastq)), digits=1),
    median = round(median(width(fastq)), digits=0),
    qval = round(mean(alphabetScore(fastq) / width(fastq)), digits=1),
    gc = round(mean(letterFrequency(sread(fastq), "GC")  / width(fastq)) * 100, digits=1),
    n50 = ncalc(width(fastq), n=0.5),
    l50 = lcalc(width(fastq), n=0.5),
    n90 = ncalc(width(fastq), n=0.9),
    l90 = lcalc(width(fastq), n=0.9)
  )
}

data <- lapply(row.names(studyDesign), processQCFastq)
qcData <- data.frame(data)
colnames(qcData) <- row.names(studyDesign)

knitr::kable(qcData, caption="Summary statistics for the cDNA libraries imported", booktabs=TRUE, table.envir='table*', linesep="")  %>%
  kable_styling(latex_options=c("hold_position", font_size=9)) %>%
  add_footnote(c("please note md5 column - see glossary of terms"))

The summary statistics displayed include observational metrics including the number of reads, information on the longest, shortest and mean sequence read lengths and other information that includes GC content, N50 and N90 read lengths. Ideally there will be similar characteristics across each of the sequence libraries - large differences in e.g. the number of reads, overall read quality and GC content may indicate non-biological differences between the samples - such differences may confound the statistical analysis and hinder the identification of differentially expressed genes.

\pagebreak

Violin plots of sequence characteristics

While the summary statistics displayed in table \ref{tab:summaryStatsTable} provide a feel for the equivalence of the sequence libraries, a grahical representation may make the comparison of the libraries simpler. In this version of the cDNA expression profiling tutorial, violin plots are used to summarise the distribution of sequence characteristics within and between libraries. The violin plot is similar to a boxplot but the thickness of the bar illustrates how abundant a value is within the given collection. The violin plot is thus a simple graphic that will enable the visualisation of similarity or dis-similarity within a sequence collection.

extractLengths <- function(rowname) {
  row <- which(row.names(studyDesign)==rowname)
  file <- as.character(studyDesign[row, "filename"])
  fastq <- readFastq(file)
  t(as.matrix(width(fastq)))
}
lengthData <- mapply(extractLengths, row.names(studyDesign))
lengthDataMatrix <- data.frame(t(plyr::rbind.fill.matrix(lengthData)))
colnames(lengthDataMatrix) <-  row.names(studyDesign)

lengthMatrixMelt <- reshape2::melt(lengthDataMatrix, na.rm=TRUE, measure.vars=row.names(studyDesign))
lengthMatrixMelt <- cbind(lengthMatrixMelt, group=studyDesign[match(lengthMatrixMelt$variable,  rownames(studyDesign)), "group"])

plot <- ggplot(lengthMatrixMelt, aes(x=variable, y=value, fill=group)) + geom_violin() + scale_y_continuous(limits=c(0, as.numeric(quantile(lengthMatrixMelt$value, probs=c(0.975))))) + xlab("study sample") +  ylab("Distribution of Read Lengths (bp)") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_brewer(palette="Paired") + labs(title="Violin plot showing distribution of read lengths across samples")

print(plot)

extractQualities <- function(rowname) {
  row <- which(row.names(studyDesign)==rowname)
  file <- as.character(studyDesign[row, "filename"])
  fastq <- readFastq(file)
  t(as.matrix(alphabetScore(fastq) / width(fastq)))
}
qualityData <- mapply(extractQualities, row.names(studyDesign))
qualityDataMatrix <- data.frame(t(plyr::rbind.fill.matrix(qualityData)))
colnames(qualityDataMatrix) <-  row.names(studyDesign)

qualityMatrixMelt <- reshape2::melt(qualityDataMatrix, na.rm=TRUE, measure.vars=row.names(studyDesign))
qualityMatrixMelt <- cbind(qualityMatrixMelt, group=studyDesign[match(qualityMatrixMelt$variable,  rownames(studyDesign)), "group"])

plotQ <- ggplot(qualityMatrixMelt, aes(x=variable, y=value, fill=group)) + geom_violin() + scale_y_continuous(limits=c(min(qualityMatrixMelt$value), max(qualityMatrixMelt$value))) + xlab("study sample") +  ylab("Distribution of Read Qualities (QV)") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_brewer(palette="Paired") + labs(title="Violin plot showing distribution of read qualities across samples")

print(plotQ)

It is recommended to review the violin plots of sequence lengths (figure \ref{fig:violinLength}) and sequence quality (figure \ref{fig:violinQuality}) characteristics. The aim here is to assess the integrity and quality for each sample and to ensure that the quality, depth and characteristics will be suitable for the objectives of the expression profiling study. If there is considerable variability within and between the experimental samples then it is possible that a statistical analysis may not lead to the identification of differentially expressed gene transcripts.

\newpage

Snakemake

This tutorial for expression profiling from cDNA sequence data makes use of snakemake (@snakemake2012). Snakemake is a workflow management system implemented in Python. The aim of the Snakemake tool is to enable reproducible and scalable data analyses. The workflow produced within this document should be portable between computer servers and other larger scale IT deployments. The Snakemake workflow additionally defines the sets of required software (and software versions where appropriate) and will automate the installation and deployment of these software through the conda package management system.

The snakemake workflow will call methods that include minimap2 @minimap22018 and samtools @samtools2009. The planned workflow is shown in \ref{fig:DAG}. The provided reference genome sequence will be indexed using Minimap2, each sequence collection will be mapped to the index (again using minimap2 with parameters tuned for the mapping of long reads whilst accommodating exon matches interspersed by introns) and summary statistics will be prepared using the samtools software.

knitr::include_graphics("Analysis/StaticImages/dag1.png")

The precise commands within the Snakefile include

Download the specified reference genome
Download the specified genome annotations
Use minimap2 to index the reference genome
Map cDNA sequence reads against the reference genome index using minimap2
convert minimap2 output (SAM) into a sorted BAM format using samtools
prepare summary mapping statistics using samtools

Run the generated Snakemake file

r margin_note("The snakemake command is responsible for orchestrating the analytical workflow. The command with the parameters --forceall --dag is responsible for the preparation of the Directed Acyclic Graph that describes the workflow. This is displayed in figure \\ref{fig:DAG}.")

\fontsize{8}{12}

snakemake --forceall --dag | dot -Tpng > Analysis/StaticImages/dag1.png
snakemake

\fontsize{10}{14}

\pagebreak

Review of cDNA read mapping.

From the DAG presented in \ref{fig:DAG}, we can see that the Snakemake workflow included a samtools flagstat mapping statistic summary. These mapping statistics will have been placed in a file in the Analysis/flagstat folder. These mapping statistics are useful for again evaluating the quality of the study - ideally there should be robust similarities in the fraction of mapping sequence reads / multi-mapping etc.

flagstatTargets <- file.path("Analysis", "flagstat", 
    paste(basename(as.character(studyDesign$filename)),".txt",sep="")
)
loadFlagstat <- function(file) {
  x <- read.table(file, header=FALSE, sep=" ", fill=NA)[c(1:5),c(1,3)]
  paste(x[,1], x[,2], sep="/")
}

flagstatRes <- data.frame(matrix(unlist(lapply(flagstatTargets, loadFlagstat)), ncol=length(flagstatTargets)), stringsAsFactors = FALSE)
colnames(flagstatRes) <- rownames(studyDesign)
rownames(flagstatRes) <- c("readMaps.1", "Secondary.2", "Supplementary.3", "Duplicates.4", "Mapped.5")

# include the measured number of reads; this is real
flagstatRes[nrow(flagstatRes)+1,] <- as.numeric(gsub(",","",t(qcData)[, "reads"]))
rownames(flagstatRes)[6] <- "nreads"

getVal <- function(word) {
  sum(as.numeric(unlist(strsplit(word, "/"))))
}

zreads <- unlist(lapply(flagstatRes["readMaps.1", ], getVal)) -
  unlist(lapply(flagstatRes["Secondary.2", ], getVal)) -
  unlist(lapply(flagstatRes["Supplementary.3", ], getVal)) - 
  unlist(lapply(flagstatRes["Duplicates.4", ], getVal)) 

flagstatRes[nrow(flagstatRes)+1,] <- zreads
rownames(flagstatRes)[7] <- "zreads"

flagstatRes[nrow(flagstatRes)+1,] <- as.numeric(flagstatRes["zreads", ]) / as.numeric(flagstatRes["nreads", ]) * 100
rownames(flagstatRes)[8] <- "%mapping"


flagstatRes <- flagstatRes[c(6,1,2,3,4,5,7,8),]

#knitr::kable(flagstatRes, caption="Summary statistics from the Minimap2 long read spliced mapping.")

knitr::kable(flagstatRes, caption="Summary statistics from the Minimap2 long read spliced mapping.", booktabs=TRUE, table.envir='table*', linesep="")  %>%
  kable_styling(latex_options=c("hold_position", font_size=11)) %>%
  add_footnote(c("information from samtools flagstat"))

The read mapping statistics shown in table \ref{tab:flagstatStats} are derived from the bam files produced during the minimap2 sequence mapping against the reference genome (index). The total number of reads (readMaps.1) reported in the samtools flagstat analysis refers to the total number of sequence read mappings, not the total number of reads. The zread is calculated as the (readMaps.1 - Secondary.2 - Supplementary.3 - Duplicates.4). This should be close to nreads, the number of reads counted in the fastq file. The values with the / separator denote the number of reads that pass QC vs fail QC.

\pagebreak

Analysis of reads mapped to genes

The first step in performing a differential expression analysis is to count the number of reads that are mapped to each of the genes annotated in the reference genome. In the previous section we mapped reads to the reference genome using the minimap2 software - this has created positional information but mapped cDNA sequences do not yet correspond to genes. To establish the linkage between (spliced) genomic location and a gene feature requires read summarisation and read quantification. These tasks can be performed using the featureCounts method from the Rsubread package (see @R-Rsubread).

\fontsize{7}{7}

readCountTargets <- file.path("Analysis", "Minimap", 
    paste(basename(as.character(studyDesign$filename)), ".bam", sep="")
)

ExternalAnnotation = file.path("ReferenceData", basename(config$genome_annotation))

geneCounts <- featureCounts(files=readCountTargets,
                      annot.ext=ExternalAnnotation,
                      isGTFAnnotationFile=TRUE,
                      GTF.featureType="exon",
                      GTF.attrType="gene_id",
                      isLongRead=TRUE,
                      nthreads = 4)$counts

# Rename column hedders for clarity
colnames(geneCounts) <- rownames(studyDesign)

\fontsize{10}{14}

knitr::kable(geneCounts[order(rowSums(geneCounts), decreasing=TRUE)[1:10],], caption="Table showing the 10 annotated gene features with the highest number of mapped reads", booktabs=TRUE, table.envir='table*', linesep="") %>%
  kable_styling(latex_options=c("hold_position", font_size=11)) %>%
  add_footnote(c("This is raw count data and no normalisation or transformation has been performed"))


xlsExpressedGenes <- file.path(resultDir, "ExpressedGenes.xlsx")
write.xlsx(geneCounts[which(rowSums(geneCounts) > 0),], file = xlsExpressedGenes,
      sheetName = "ExpressedGenes", append = FALSE)

Table \ref{tab:geneCounts} shows the top 10 genes with the highest number of mapped reads. This is for information only. An Excel file containing the r length(which(rowSums(geneCounts) > 0)) genes with one or more sequence reads has been prepared - this file is available at the following path

`r xlsExpressedGenes`

DESeq2

The analysis objective once we have have the matrix of read counts per gene per sample, is the quantification and statistical analysis of any systematic differences between conditions. The DESeq command (@R-DESeq2) utilises the raw read count data from the previous section to model the between condition variability and to identify the differentially expressed genes. DESeq2 does not normalise the data, but corrects for the relative library size - this is based on the requirement that only the count values allow for the correct assessment of the measurement precision.

Pre-filtering is performed at this step of the analysis. In the initial setup of the Rmarkdown script we evaluated a variable, readCountMinThreshold, that describes the minimum number of sequence reads that should be considered for statistical analysis. Pre-filtering may not be strictly required - it does improve the performance of the method when time and memory footprint are considered. Results are extracted from the DESeq2 data object - additional filtering is performed according the both an adjusted p-value (specified with the adjPValueThreshold parameter) and a log fold change threshold (specified with the lfcThreshold parameter). The p-value correction is performed according to the fdr method of Benjamini & Hochberg (@BH1995).

deSeqRaw <- DESeqDataSetFromMatrix(countData=geneCounts, colData=studyDesign, design=~group)
# filter out the features that do not contain at least min threshold of expressed genes
deSeqRaw <- deSeqRaw[-which(rowSums(counts(deSeqRaw)) < config$readCountMinThreshold), ]

# perform the differential expression analysis; use local dispersion fit
deSeqObj <- DESeq(deSeqRaw, fitType="local")

# and filter for differentially represented, according to LFC
deSeqRes <- results(deSeqObj, alpha=config$adjPValueThreshold, pAdjustMethod="BH", lfcThreshold=config$lfcThreshold)

# and show a brief summary for this differential expression object
summaryDE <- capture.output(summary(deSeqRes))
cat(paste(summaryDE[1:8], collapse="\n"))

deData <- data.frame(results(deSeqObj, pAdjustMethod="BH"))


xlsDiffExprGenes <- file.path(resultDir, "DiffExpressedGenes.xlsx")
write.xlsx(deData[order(deData$padj),], file = xlsDiffExprGenes,
      sheetName = "DiffExpressedGenes", append = FALSE)

The message above shows the characteristics of the DESeq2 based differential expression analysis. The summary shows both the number of (and percentage) of genes that have satisfed the thresholds of logFoldChange and adjusted p-value. The complete matrix of data results (no filtering by fold-change or statistical threshold) has been written to an Excel file. This file can be found at the following path.

`r xlsDiffExprGenes`

Table \ref{tab:diffExprTable} shows the top 15 differentially expressed genes as ranked by their adjusted p-value. This list is not ranked for either statistical (adj.P.value) or magnitudinal (logFoldChange) relevance. The complete dataset is presented in the Excel file.

knitr::kable(deData[order(deData$padj)[1:15],], digits = c(2,2,2,2,45,45), caption="The top 15 genes, ranked by adjusted P-value, from the DESeq2 analysis. Key information shown includes the mean reads per sample, log fold change between samples and the false discovery corrected p-value", booktabs=TRUE, table.envir='table*', linesep="")  %>%
  kable_styling(latex_options=c("hold_position"), font_size=9) %>%
  add_footnote(c("No statistical or magnitudinal filter has been applied"))

\pagebreak

Principal Component Analysis

With multiple samples, different experimental conditions and hundreds or thousands of genes, gene expression data is highly dimensional. Principal Component Analysis (PCA) is a widely used method for the reduction of dimensionality within such data. Linear combinations of orthogonal gene expressions, or principal components, are calculated that in turn describe decreasing amounts of the total explainable variation within the sequence collection.

Figure \ref{fig:pca} is a PCA plot showing the distribution of sample data for the first two principal components. The first principal component is shown on the X-axis; the second on the Y. The total amount of variation explained is shown on the axis legends. In an ideal study a large amount of variation will be explained with these principal components - and the samples corresponding to the different treatment groups should demonstrate both spatial clustering within groups and clear separation between groups. This is an illustrative plot that should again reinforce expectations from such a study.

pcaMatrix <- counts(deSeqObj)
md <- prep(t(pcaMatrix), scale="none", center=TRUE)
pca <- pca(md, method="svd", center=TRUE, nPcs=3)
xdata <- as.data.frame(pca@scores)
xdata <- cbind(xdata, group=studyDesign$group[match(rownames(xdata), rownames(studyDesign))])
x <- 1
y <- 2
xpercent <- round(pca@R2[x]*100, digits=1)
ypercent <- round(pca@R2[y]*100, digits=1)
xlab <- paste("Prin.Comp. ",x," (",xpercent,"%)",sep="")
ylab <- paste("Prin.Comp. ",y," (",ypercent,"%)",sep="")

plot <- ggplot(xdata, aes(x=PC1, y=PC2, colour=group)) + 
  geom_point(size=5, shape=18) + 
  scale_colour_brewer(palette="Paired") + 
  geom_vline(xintercept=0, color="darkgray") + 
  geom_hline(yintercept=0, color="darkgray") + 
  ggtitle("PCA analysis of experimental samples") + 
  ylab(paste("PC2 (",ypercent,"%)",sep=""))  + 
  xlab(paste("PC1 (",xpercent,"%)",sep=""))

print(plot)

\pagebreak

Volcano plot for condition `r gsub('.+: ', '', deSeqRes@elementMetadata@listData$description[2])`

A volcano plot is a scatter-plot that is used to quickly identify changes in large gene expression data sets composed of replicate data. It plots significance versus fold-change on the y and x axes, respectively. The significance is scaled as a -log10 of the (adjusted) p-value. Volcano plots allow for a simple assessment of systematic variance between sample groups; statistically significant and magnitudinally pronouced changes in gene expression will be manifested as a scattering of data points towards the far top-left or top-right of the graph.

volcanoSubstr <- results(deSeqObj)
logUp <- which(volcanoSubstr$log2FoldChange >= config$lfcThreshold)
logDown <- which(volcanoSubstr$log2FoldChange <= -config$lfcThreshold)
withStat <- which(volcanoSubstr$padj <= config$adjPValueThreshold)
colours <- c(noDifference="gray", upRegulated="red", downRegulated="green")
gene <- rep("noDifference", nrow(volcanoSubstr))
gene[logUp[logUp %in% withStat]] <- "upRegulated"
gene[logDown[logDown %in% withStat]] <- "downRegulated"

plot <- ggplot(data.frame(volcanoSubstr), aes(x=log2FoldChange, y=-log10(padj))) + 
    geom_point(size=1.2) + 
    geom_hline(yintercept=-log10(config$adjPValueThreshold), color="orange") + 
    geom_vline(xintercept=-config$lfcThreshold, color="green") + 
    geom_vline(xintercept=config$lfcThreshold, color="red") + 
    aes(colour=gene) + 
    scale_colour_manual(values=colours) +
    ggtitle("Volcano plot showing distribution of lfc vs adjp for expressed genes")

print(plot)

\pagebreak

Exploring Genes of Interest ...

While a cDNA gene expression profiling study may be used to identify a set of genes that are differentially expressed, it is also worth considering if a known gene of interest is differentially expressed. This tutorial is analysing a small set of pre-selected sequence reads from a human expression study. From earlier tables (table \ref{tab:geneCounts} and table \ref{tab:diffExprTable}) we have seen the identification of genes that are both well expressed within this dataset and that appear differentially expressed between the two experimental conditions.

For this demonstration let us imagine that the ENSEMBL gene with id ENSG00000081870 is of interest. This appears within the mock dataset and has read counts in samples from both conditions. The code within this tutorial has created a number of data objects; these include deData (a data.frame containing the differential expression results from DESeq2), deSeqObj (a DESeqDataSeq containing the context for the DESeq2 differential expression analysis), and geneCounts (the matrix of read counts from RSubread).

Simple R code can be used to extract the mapping characteristics for this gene.

\fontsize{8}{12}

geneOfInterest <- "ENSG00000081870"
cat(paste("Raw Read Counts for gene",geneOfInterest))
geneOfInterestRawExpression <- as.data.frame(
  t(rbind(
    count=geneCounts[geneOfInterest,], 
    group=as.character(studyDesign[colnames(geneCounts), "group"]
  ))))
print(geneOfInterestRawExpression)

\fontsize{10}{14}

This shows the number of reads mapped from each of the sequence libraries considered to the gene space annotated for the gene. This is a raw unscaled value; it also makes sense to consider the normalised / corrected count as used for the statistical analysis. This corrected count, as explained earlier, allows for the differences in sequence depth that may be observed across the different sequence libraries. These data when plotted allow for an eye-ball analysis of differential expression (see figure \ref{fig:exprBoxPlot}.)

\fontsize{8}{12}

geneOfInterestNormalExpression <- plotCounts(deSeqObj, gene=geneOfInterest, intgroup="group", returnData = TRUE)
print(geneOfInterestNormalExpression)

\fontsize{10}{14}

plot <- ggplot(geneOfInterestNormalExpression, aes(x=group, y=count, colour=group)) + 
  geom_point(position=position_jitter(w=0.1,h=0), size=2) + 
  scale_y_log10() + ggtitle("Distribution of normalised read counts for experimental conditions") + xlab("Experimental Condition") +ylab("normalised read count - log10 scale") + scale_colour_brewer(palette="Paired")
print(plot)

The final question is whether this observation is statistically meaningful?

\fontsize{8}{12}

options(width = 100)
deData[geneOfInterest,]

\fontsize{10}{14}

If we look at the output we can see that the logFoldChange is r deData[geneOfInterest,]$log2FoldChange and the p-value is r deData[geneOfInterest,]$pvalue (before FDR correction) or r deData[geneOfInterest,]$padj following FDR correction.

Annotation and Further Analysis of Differentially Expressed Genes

There is a substantial amount of further analysis and annotation that could be performed using the study data produced. This tutorial does not pursue these analyses since there will be considerable reliance on reference resources and databases that will tie the analysis to a specific organism or genome database. A good starting place to continue the analysis started within this report would be

http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

\pagebreak

Reproducible Research - Produce your own report

The R code to prepare your own report is included in the distributed Rmarkdown file. The Rmarkdown file can be loaded, viewed, and edited in the RStudio software. Within your conda environment (and within your tutorial folder), simply type

\fontsize{8}{12}

rstudio Nanopore_cDNA_Tutorial.Rmd

\fontsize{10}{14}

The Rmarkdown script can be run usimg the knit dialog in the Rstudio software - please see figure \ref{fig:KnitIt}. Selecting Knit to tufte_handout will prepare a PDF document summarising the analysis while Knit to HTML will prepare a portable HTML file.

knitr::include_graphics("Analysis/StaticImages/KnitIt.png")

The document can also be rendered from the command-line with either of the following commands

\fontsize{8}{12}

R --slave -e 'rmarkdown::render("Nanopore_cDNA_Tutorial.Rmd", "tufte_handout")'

R --slave -e 'rmarkdown::render("Nanopore_cDNA_Tutorial.Rmd", "html_document")'

\fontsize{10}{14}

This report has been created using Rmarkdown, publicly available R packages, and the \LaTeX document typesetting software for reproducibility. For clarity the R packages used, and their versions, is listed below.

\fontsize{8}{12}

options(width = 100)
utils:::print.sessionInfo(sessionInfo()[-7], locale=FALSE)

\fontsize{10}{14}

It is also worth considering the versions of the software that have been used for the analysis.

\fontsize{8}{12}

source ~/.bashrc
source activate cDNA_expression
conda list "samtools|minimap2|snakemake|rstudio"

\fontsize{10}{14}

r newthought("Final thoughts.") Behind this Rmarkdown file (and its glossy pdf) is a modest amount of R code - please explore the Rmarkdown template; modify it, and run with your own samples.

To extract the whole set of R code from the Rmarkdown, use the purl command - this will extract the R code into its own file.

knitr::purl("Nanopore_cDNA_Tutorial.Rmd", quiet=TRUE)

\pagebreak

Glossary of Terms

knit is the command to render an Rmarkdown file. The knitr package is used to embed code, the results of R analyses and their figures within the typeset text from the document.
L50 the number of sequences (or contigs etc) that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly
N50 length such that sequences (or contigs etc) of this length or longer include half the bases of the sequence collection
Rmarkdown is an extension to markdown. Functional R code can be embedded in a plain-text document and subsequently rendered to other formats including the PDF format of this report.
QV the quality value - -log10(p) that any given base is incorrect. QV may be either at the individual base level, or may be averaged across whole sequences

\pagebreak

Appendix - Installation on macOS

The conda package management system is available for Linux and macOS. There are some R-package dependencies for the tutorial that are challenged by the system files distributed by conda. For macOS, it would be recommended that R, Rstudio, and \LaTeX installation is managed at the system level; other bioinformatics software will be installed through conda.

Download and install the R statistical software from https://cran.r-project.org/bin/macosx/
Install the gfortran and clang compilers provided at https://cran.r-project.org/bin/macosx/tools/
Download and install Rstudio software from https://www.rstudio.com
Download and install MacTex from, https://tug.org/mactex/mactex-download.html
Install and update required macOS R packages

R --slave -e 'install.packages(c("BiocManager", "devtools"), ask=F, update=F)'
R --slave -e 'BiocManager::install(c("sysfonts", "showtext"), ask=F, update=F, type="source")'
R --slave -e 'BiocManager::install(c("kableExtra", "roxygen2", "emojifont", "extrafont", "R.utils"), ask=F, update=F)
R --slave -e 'BiocManager::install(c("DESeq2", "caTools", "pcaMethods", "dplyr", "ggplot2", "plyr", "RColorBrewer", "rmarkdown", "tufte", "xfun", "xlsx", "yaml", "Rsubread", "ShortRead"), ask=F, update=F)'

Most software dependecies are managed though conda, install as described at
https://conda.io/docs/user-guide/install/macos.html.
create a working directory for the project and cd into it
Download github gist for macOS macenv.yaml
wget https://gist.github.com/sagrudd/63d7c76035555def83a336336a73365d/raw/macenv.yaml
Install conda software dependencies with
conda env create --name sumStatTutorial --file macenv.yaml
Initialise conda environment with source activate sumStatTutorial
Install the Nanopore tutorials Rpackage R --slave -e 'Sys.setenv(TAR = "/bin/tar"); devtools::install_github("sagrudd/tutorials")'
Initialise the Nanopore QC tutorial R --slave -e 'TuPoreIal::InitTutorial("SeqStatsQC")'
optional edit the provided config.yaml file to match your own study design
Render the report using results from the analysis above R --slave -e 'rmarkdown::render("Nanopore_SumStatQC_Tutorial.Rmd", "tufte_handout")'

\pagebreak

References and Citations

# this command will nuke the existing bib file ...
#knitr::write_bib(c('base', 'rmarkdown', 'DESeq2', 'Rsubread', 'apeglm', 'pcaMethods', 'ShortRead'), file = 'skeleton.bib')
bibliography <- "QGFydGljbGV7bWluaW1hcDIyMDE4LAphdXRob3IgPSB7TGksIEhlbmd9LAp0aXRsZSA9IHtNaW5pbWFwMjogcGFpcndpc2UgYWxpZ25tZW50IGZvciBudWNsZW90aWRlIHNlcXVlbmNlc30sCmpvdXJuYWwgPSB7QmlvaW5mb3JtYXRpY3N9LAp2b2x1bWUgPSB7MzR9LApudW1iZXIgPSB7MTh9LApwYWdlcyA9IHszMDk0LTMxMDB9LAp5ZWFyID0gezIwMTh9LApkb2kgPSB7MTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHkxOTF9LApVUkwgPSB7aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHkxOTF9LAplcHJpbnQgPSB7L291cC9iYWNrZmlsZS9jb250ZW50X3B1YmxpYy9qb3VybmFsL2Jpb2luZm9ybWF0aWNzLzM0LzE4LzEwLjEwOTNfYmlvaW5mb3JtYXRpY3NfYnR5MTkxLzEvYnR5MTkxLnBkZn0KfQoKQGFydGljbGV7c2FtdG9vbHMyMDA5LAphdXRob3IgPSB7TGksIEhlbmcgYW5kIEhhbmRzYWtlciwgQm9iIGFuZCBXeXNva2VyLCBBbGVjIGFuZCBGZW5uZWxsLCBUaW0gYW5kIFJ1YW4sIEp1ZSBhbmQgSG9tZXIsIE5pbHMgYW5kIE1hcnRoLCBHYWJvciBhbmQgQWJlY2FzaXMsIEdvbmNhbG8gYW5kIER1cmJpbiwgUmljaGFyZCBhbmQgMTAwMCBHZW5vbWUgUHJvamVjdCBEYXRhIFByb2Nlc3NpbmcgU3ViZ3JvdXB9LAp0aXRsZSA9IHtUaGUgU2VxdWVuY2UgQWxpZ25tZW50L01hcCBmb3JtYXQgYW5kIFNBTXRvb2xzfSwKam91cm5hbCA9IHtCaW9pbmZvcm1hdGljc30sCnZvbHVtZSA9IHsyNX0sCm51bWJlciA9IHsxNn0sCnBhZ2VzID0gezIwNzgtMjA3OX0sCnllYXIgPSB7MjAwOX0sCmRvaSA9IHsxMC4xMDkzL2Jpb2luZm9ybWF0aWNzL2J0cDM1Mn0sClVSTCA9IHtodHRwOi8vZHguZG9pLm9yZy8xMC4xMDkzL2Jpb2luZm9ybWF0aWNzL2J0cDM1Mn0sCmVwcmludCA9IHsvb3VwL2JhY2tmaWxlL2NvbnRlbnRfcHVibGljL2pvdXJuYWwvYmlvaW5mb3JtYXRpY3MvMjUvMTYvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHAzNTIvMi9idHAzNTIucGRmfQp9CgpAYXJ0aWNsZXtzbmFrZW1ha2UyMDEyLAphdXRob3IgPSB7S8O2c3RlciwgSm9oYW5uZXMgYW5kIFJhaG1hbm4sIFN2ZW59LAp0aXRsZSA9IHtTbmFrZW1ha2XigJRhIHNjYWxhYmxlIGJpb2luZm9ybWF0aWNzIHdvcmtmbG93IGVuZ2luZX0sCmpvdXJuYWwgPSB7QmlvaW5mb3JtYXRpY3N9LAp2b2x1bWUgPSB7Mjh9LApudW1iZXIgPSB7MTl9LApwYWdlcyA9IHsyNTIwLTI1MjJ9LAp5ZWFyID0gezIwMTJ9LApkb2kgPSB7MTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHM0ODB9LApVUkwgPSB7aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHM0ODB9LAplcHJpbnQgPSB7L291cC9iYWNrZmlsZS9jb250ZW50X3B1YmxpYy9qb3VybmFsL2Jpb2luZm9ybWF0aWNzLzI4LzE5LzEwLjEwOTMvYmlvaW5mb3JtYXRpY3MvYnRzNDgwLzIvYnRzNDgwLnBkZn0KfQoKQGFydGljbGV7QkgxOTk1LAogSVNTTiA9IHswMDM1OTI0Nn0sCiBVUkwgPSB7aHR0cDovL3d3dy5qc3Rvci5vcmcvc3RhYmxlLzIzNDYxMDF9LAogYWJzdHJhY3QgPSB7VGhlIGNvbW1vbiBhcHByb2FjaCB0byB0aGUgbXVsdGlwbGljaXR5IHByb2JsZW0gY2FsbHMgZm9yIGNvbnRyb2xsaW5nIHRoZSBmYW1pbHl3aXNlIGVycm9yIHJhdGUgKEZXRVIpLiBUaGlzIGFwcHJvYWNoLCB0aG91Z2gsIGhhcyBmYXVsdHMsIGFuZCB3ZSBwb2ludCBvdXQgYSBmZXcuIEEgZGlmZmVyZW50IGFwcHJvYWNoIHRvIHByb2JsZW1zIG9mIG11bHRpcGxlIHNpZ25pZmljYW5jZSB0ZXN0aW5nIGlzIHByZXNlbnRlZC4gSXQgY2FsbHMgZm9yIGNvbnRyb2xsaW5nIHRoZSBleHBlY3RlZCBwcm9wb3J0aW9uIG9mIGZhbHNlbHkgcmVqZWN0ZWQgaHlwb3RoZXNlcy10aGUgZmFsc2UgZGlzY292ZXJ5IHJhdGUuIFRoaXMgZXJyb3IgcmF0ZSBpcyBlcXVpdmFsZW50IHRvIHRoZSBGV0VSIHdoZW4gYWxsIGh5cG90aGVzZXMgYXJlIHRydWUgYnV0IGlzIHNtYWxsZXIgb3RoZXJ3aXNlLiBUaGVyZWZvcmUsIGluIHByb2JsZW1zIHdoZXJlIHRoZSBjb250cm9sIG9mIHRoZSBmYWxzZSBkaXNjb3ZlcnkgcmF0ZSByYXRoZXIgdGhhbiB0aGF0IG9mIHRoZSBGV0VSIGlzIGRlc2lyZWQsIHRoZXJlIGlzIHBvdGVudGlhbCBmb3IgYSBnYWluIGluIHBvd2VyLiBBIHNpbXBsZSBzZXF1ZW50aWFsIEJvbmZlcnJvbmktdHlwZSBwcm9jZWR1cmUgaXMgcHJvdmVkIHRvIGNvbnRyb2wgdGhlIGZhbHNlIGRpc2NvdmVyeSByYXRlIGZvciBpbmRlcGVuZGVudCB0ZXN0IHN0YXRpc3RpY3MsIGFuZCBhIHNpbXVsYXRpb24gc3R1ZHkgc2hvd3MgdGhhdCB0aGUgZ2FpbiBpbiBwb3dlciBpcyBzdWJzdGFudGlhbC4gVGhlIHVzZSBvZiB0aGUgbmV3IHByb2NlZHVyZSBhbmQgdGhlIGFwcHJvcHJpYXRlbmVzcyBvZiB0aGUgY3JpdGVyaW9uIGFyZSBpbGx1c3RyYXRlZCB3aXRoIGV4YW1wbGVzLn0sCiBhdXRob3IgPSB7WW9hdiBCZW5qYW1pbmkgYW5kIFlvc2VmIEhvY2hiZXJnfSwKIGpvdXJuYWwgPSB7Sm91cm5hbCBvZiB0aGUgUm95YWwgU3RhdGlzdGljYWwgU29jaWV0eS4gU2VyaWVzIEIgKE1ldGhvZG9sb2dpY2FsKX0sCiBudW1iZXIgPSB7MX0sCiBwYWdlcyA9IHsyODktLTMwMH0sCiBwdWJsaXNoZXIgPSB7W1JveWFsIFN0YXRpc3RpY2FsIFNvY2lldHksIFdpbGV5XX0sCiB0aXRsZSA9IHtDb250cm9sbGluZyB0aGUgRmFsc2UgRGlzY292ZXJ5IFJhdGU6IEEgUHJhY3RpY2FsIGFuZCBQb3dlcmZ1bCBBcHByb2FjaCB0byBNdWx0aXBsZSBUZXN0aW5nfSwKIHZvbHVtZSA9IHs1N30sCiB5ZWFyID0gezE5OTV9Cn0KCgoKQE1hbnVhbHtSLWFwZWdsbSwKICB0aXRsZSA9IHthcGVnbG06IEFwcHJveGltYXRlIHBvc3RlcmlvciBlc3RpbWF0aW9uIGZvciBHTE0gY29lZmZpY2llbnRzfSwKICBhdXRob3IgPSB7QW5xaSBaaHUgYW5kIEpvc2VwaCBHLiBJYnJhaGltIGFuZCBNaWNoYWVsIEkuIExvdmV9LAogIHllYXIgPSB7MjAxOH0sCiAgbm90ZSA9IHtSIHBhY2thZ2UgdmVyc2lvbiAxLjQuMH0sCn0KQE1hbnVhbHtSLWJhc2UsCiAgdGl0bGUgPSB7UjogQSBMYW5ndWFnZSBhbmQgRW52aXJvbm1lbnQgZm9yIFN0YXRpc3RpY2FsIENvbXB1dGluZ30sCiAgYXV0aG9yID0ge3tSIENvcmUgVGVhbX19LAogIG9yZ2FuaXphdGlvbiA9IHtSIEZvdW5kYXRpb24gZm9yIFN0YXRpc3RpY2FsIENvbXB1dGluZ30sCiAgYWRkcmVzcyA9IHtWaWVubmEsIEF1c3RyaWF9LAogIHllYXIgPSB7MjAxOH0sCiAgdXJsID0ge2h0dHBzOi8vd3d3LlItcHJvamVjdC5vcmcvfSwKfQpATWFudWFse1ItREVTZXEyLAogIHRpdGxlID0ge0RFU2VxMjogRGlmZmVyZW50aWFsIGdlbmUgZXhwcmVzc2lvbiBhbmFseXNpcyBiYXNlZCBvbiB0aGUgbmVnYXRpdmUKYmlub21pYWwgZGlzdHJpYnV0aW9ufSwKICBhdXRob3IgPSB7TWljaGFlbCBMb3ZlIGFuZCBTaW1vbiBBbmRlcnMgYW5kIFdvbGZnYW5nIEh1YmVyfSwKICB5ZWFyID0gezIwMTh9LAogIG5vdGUgPSB7UiBwYWNrYWdlIHZlcnNpb24gMS4yMi4wfSwKICB1cmwgPSB7aHR0cHM6Ly9naXRodWIuY29tL21pa2Vsb3ZlL0RFU2VxMn0sCn0KQE1hbnVhbHtSLXBjYU1ldGhvZHMsCiAgdGl0bGUgPSB7cGNhTWV0aG9kczogQSBjb2xsZWN0aW9uIG9mIFBDQSBtZXRob2RzfSwKICBhdXRob3IgPSB7V29sZnJhbSBTdGFja2xpZXMgYW5kIEhlbm5pbmcgUmVkZXN0aWcgYW5kIEtldmluIFdyaWdodH0sCiAgeWVhciA9IHsyMDE4fSwKICBub3RlID0ge1IgcGFja2FnZSB2ZXJzaW9uIDEuNzQuMH0sCiAgdXJsID0ge2h0dHBzOi8vZ2l0aHViLmNvbS9ocmVkZXN0aWcvcGNhbWV0aG9kc30sCn0KQE1hbnVhbHtSLXJtYXJrZG93biwKICB0aXRsZSA9IHtybWFya2Rvd246IER5bmFtaWMgRG9jdW1lbnRzIGZvciBSfSwKICBhdXRob3IgPSB7SkogQWxsYWlyZSBhbmQgWWlodWkgWGllIGFuZCBKb25hdGhhbiBNY1BoZXJzb24gYW5kIEphdmllciBMdXJhc2NoaSBhbmQgS2V2aW4gVXNoZXkgYW5kIEFyb24gQXRraW5zIGFuZCBIYWRsZXkgV2lja2hhbSBhbmQgSm9lIENoZW5nIGFuZCBXaW5zdG9uIENoYW5nfSwKICB5ZWFyID0gezIwMTh9LAogIG5vdGUgPSB7UiBwYWNrYWdlIHZlcnNpb24gMS4xMH0sCiAgdXJsID0ge2h0dHBzOi8vQ1JBTi5SLXByb2plY3Qub3JnL3BhY2thZ2U9cm1hcmtkb3dufSwKfQpATWFudWFse1ItUnN1YnJlYWQsCiAgdGl0bGUgPSB7UnN1YnJlYWQ6IFN1YnJlYWQgc2VxdWVuY2UgYWxpZ25tZW50IGFuZCBjb3VudGluZyBmb3IgUn0sCiAgYXV0aG9yID0ge1dlaSBTaGkgYW5kIFlhbmcgTGlhbyB3aXRoIGNvbnRyaWJ1dGlvbnMgZnJvbSBHb3Jkb24gSyBTbXl0aCBhbmQgSmVubnkgRGFpIGFuZCBUaW1vdGh5IHtUcmljaGUsIEpyLn19LAogIHllYXIgPSB7MjAxOH0sCiAgbm90ZSA9IHtSIHBhY2thZ2UgdmVyc2lvbiAxLjMyLjB9LAogIHVybCA9IHtodHRwOi8vYmlvY29uZHVjdG9yLm9yZy9wYWNrYWdlcy9yZWxlYXNlL2Jpb2MvaHRtbC9Sc3VicmVhZC5odG1sfSwKfQpATWFudWFse1ItU2hvcnRSZWFkLAogIHRpdGxlID0ge1Nob3J0UmVhZDogRkFTVFEgaW5wdXQgYW5kIG1hbmlwdWxhdGlvbn0sCiAgYXV0aG9yID0ge01hcnRpbiBNb3JnYW4gYW5kIE1pY2hhZWwgTGF3cmVuY2UgYW5kIFNpbW9uIEFuZGVyc30sCiAgeWVhciA9IHsyMDE4fSwKICBub3RlID0ge1IgcGFja2FnZSB2ZXJzaW9uIDEuNDAuMH0sCn0K"

write(base64decode(bibliography, what=character, size=1)[1], file="skeleton.bib", append=FALSE)

sagrudd/cDNA_tutorial documentation built on May 30, 2019, 2:13 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

sagrudd/cDNA_tutorial
Tutorial For Nanopore Sequence Based Gene Expression Profiling

In sagrudd/cDNA_tutorial: Tutorial For Nanopore Sequence Based Gene Expression Profiling

Statement of Tutorial Objectives

Methods utilised include:

The Computational requirements include

Quick start

Introduction

Getting started and Best Practices

Experimental design

Raw sequence review

Violin plots of sequence characteristics

Snakemake

Run the generated Snakemake file

Review of cDNA read mapping.

Analysis of reads mapped to genes

DESeq2

Principal Component Analysis

Volcano plot for condition `r gsub('.+: ', '', deSeqRes@elementMetadata@listData$description[2])`

Exploring Genes of Interest ...

Annotation and Further Analysis of Differentially Expressed Genes

Reproducible Research - Produce your own report

Glossary of Terms

Appendix - Installation on macOS

References and Citations

R Package Documentation

Browse R Packages

We want your feedback!

sagrudd/cDNA_tutorial Tutorial For Nanopore Sequence Based Gene Expression Profiling

In sagrudd/cDNA_tutorial: Tutorial For Nanopore Sequence Based Gene Expression Profiling

Statement of Tutorial Objectives

Methods utilised include:

The Computational requirements include

Quick start

Introduction

Getting started and Best Practices

Experimental design

Raw sequence review

Violin plots of sequence characteristics

Snakemake

Run the generated Snakemake file

Review of cDNA read mapping.

Analysis of reads mapped to genes

DESeq2

Principal Component Analysis

Volcano plot for condition r gsub('.+: ', '', deSeqRes@elementMetadata@listData$description[2])

Exploring Genes of Interest ...

Annotation and Further Analysis of Differentially Expressed Genes

Reproducible Research - Produce your own report

Glossary of Terms

Appendix - Installation on macOS

References and Citations

R Package Documentation

Browse R Packages

We want your feedback!

sagrudd/cDNA_tutorial
Tutorial For Nanopore Sequence Based Gene Expression Profiling

Volcano plot for condition `r gsub('.+: ', '', deSeqRes@elementMetadata@listData$description[2])`