Nanopore cDNA tutorial for differential gene expression analysis

The aim of this tutorial is to perform differential gene expression analysis based on replicated cDNA data. This workflow is for fastq sequence datasets where a reference genome sequence and its gene annotation are available.
Sufficient information is provided in the tutorial, and example data, such that the workflow can be replicated with a study design to address gene expression questions such as
Methods utilised within this tutorial include:
- conda for management of bioinformatics software
- snakemake for managing the bioinformatics workflow
- minimap2 for mapping sequence reads to the reference genome
- samtools for SAM/BAM handling and mapping statistics
- Rsubread and DESeq2 for differential expression analysis

Computational requirements for this tutorial include
There are four goals for this tutorial:
This tutorial aims to identify, where possible, a set of differentially expressed gene transcripts from the long read sequence data. This tutorial does not aim to provide an exhaustive analysis or annotation of the differentially expressed genes. The tutorial specifies some available reference data so that a workflow can be reproduced with a known outcome.
This tutorial requires that we are working at a computer workstation running a Linux operating system. An update for macOS will be released. The workflow described has been tested using Fedora 29, CentOS 7, and Ubuntu 18.04.

This tutorial is written in the Rmarkdown file format, which combines markdown (an easy-to-write plain text format as used in many Wiki systems) with embedded code. See Allaire et al. (2018) for more information about rmarkdown. The document template contains chunks of embedded R code (and some Linux bash commands). The included workflow makes extensive use of the conda package management software and the snakemake workflow software. These software packages and the functionality of Rmarkdown provide the source for a rich, reproducible, and extensible tutorial document.
The workflow contained within this tutorial performs a real bioinformatics analysis and uses the whole human genome as an example. This has consequences for memory and processor requirements. Indexing the whole human genome for sequence read mapping using minimap2, for example, will use at least 18 GB of memory.

The minimal recommended hardware setup for this tutorial is therefore a computer with 8 threads, at least 16 GB of RAM, and 15 GB of storage space. If you have modest amounts of RAM (<24 GB) then please consider increasing the amount of swap space available for indexing reference genomes and for holding the genome index in memory whilst mapping.
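
The hardware figures above can be checked from the console before starting. The following is a minimal sketch, assuming a Linux system that provides nproc, awk, df, and /proc/meminfo; the report helper is a hypothetical name introduced here for illustration and is not part of the tutorial. The thresholds are the ones quoted in the text.

```shell
#!/usr/bin/env bash
# Minimum values quoted in the tutorial text; the check itself is illustrative.
MIN_THREADS=8
MIN_RAM_GB=16
MIN_DISK_GB=15

# report LABEL HAVE NEED: print OK or LOW for a whole-number measurement.
# (hypothetical helper, not part of the tutorial)
report() {
    if [ "$2" -ge "$3" ]; then
        echo "OK  $1: $2 (minimum $3)"
    else
        echo "LOW $1: $2 (minimum $3)"
    fi
}

# Number of processing units available to this shell (0 if nproc is missing).
threads=$(nproc 2>/dev/null || echo 0)

# /proc/meminfo reports MemTotal and SwapTotal in kB; convert to rounded GB.
ram_gb=$(awk '/^MemTotal:/ {printf "%.0f", $2 / 1048576}' /proc/meminfo 2>/dev/null)
swap_gb=$(awk '/^SwapTotal:/ {printf "%.0f", $2 / 1048576}' /proc/meminfo 2>/dev/null)

# Space available in the current directory in whole GB (POSIX df, 1 kB blocks).
disk_gb=$(df -P -k . | awk 'NR == 2 {printf "%d", $4 / 1048576}')

report "threads" "${threads:-0}" "$MIN_THREADS"
report "RAM (GB)" "${ram_gb:-0}" "$MIN_RAM_GB"
report "free disk (GB)" "${disk_gb:-0}" "$MIN_DISK_GB"
echo "swap (GB): ${swap_gb:-0}"
```

A LOW line for RAM is the case where increasing swap space, as suggested above, is worth considering. The script only reports values and does not change any system settings.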
There are a few dependencies that need to be installed at the system level - please refer to the section below on system requirements. The conda package management software will coordinate the installation of key software and dependencies in user space - this is dependent on a robust internet connection.
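
Once the installation has been attempted, it can help to confirm which of the command-line tools named in this tutorial are visible on the PATH. The sketch below is illustrative only: check_tools is a hypothetical helper introduced here, and it tests only for the presence of a command, not its version or the R packages.

```shell
#!/usr/bin/env bash
# check_tools NAME...: report which commands are on PATH (hypothetical helper).
# Returns non-zero if any of the named commands is missing.
check_tools() {
    _missing=0
    for _tool in "$@"; do
        if command -v "$_tool" >/dev/null 2>&1; then
            echo "found:   $_tool"
        else
            echo "missing: $_tool"
            _missing=1
        fi
    done
    return "$_missing"
}

# Command-line tools named in this tutorial.
check_tools conda snakemake minimap2 samtools ||
    echo "Some tools are not on PATH; they may not be installed, or the conda environment may not be active."
```

A tool reported as missing is not necessarily an installation failure: software installed by conda in user space is usually only visible after the relevant environment has been activated.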
As a best practice this tutorial will separate primary DNA sequence data
(the base-called fastq files) from the Rmarkdown source, and the
genome reference data. The analysis results and figures will again be
placed in a separate working directory. The required layout for the
primary data is shown in figure . This minimal structure will be
prepared over the next few sections of this tutorial - it is recommended
to follow this tutorial with the provided working data before attempting
to run the tutorial with your own data. Following the tutorial,
preparing the computer, downloading data, and running the analysis
should take approximately 90 minutes.
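
The principle of keeping primary data, reference data, and results apart can be sketched with a few mkdir calls. Every directory name below is a hypothetical placeholder chosen for illustration; the layout actually required by the tutorial is the one given in its figure and prepared in the following sections, and it takes precedence over this sketch.

```shell
#!/usr/bin/env bash
# All directory names below are hypothetical placeholders; use the names
# required by the tutorial's layout figure when preparing a real project.
PROJECT_DIR="${PROJECT_DIR:-cdna_tutorial_sketch}"

mkdir -p "$PROJECT_DIR/raw_data"        # base-called fastq files
mkdir -p "$PROJECT_DIR/reference_data"  # reference genome and gene annotation
mkdir -p "$PROJECT_DIR/analysis"        # analysis results and figures

# The Rmarkdown source would sit at the top level of the project directory.
find "$PROJECT_DIR" -type d | sort
```

Keeping the base-called fastq files in their own directory means the analysis directory can be deleted and regenerated without touching the primary data.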
This tutorial requires a small amount of interaction with the computer console. The bulk of the analysis will be managed through the GUI provided by the RStudio software.