
cDNA_tutorial

Objective Statement

This Nanopore cDNA tutorial performs differential gene expression analysis based on replicated cDNA data. The workflow is intended for fastq sequence datasets for which a reference genome sequence and its gene annotation are available.

Sufficient information is provided in the tutorial, and example data, such that the workflow can be replicated with a study design to address gene expression questions such as

Methods utilised within this tutorial include

Computational requirements for this tutorial include

Introduction

There are four goals for this tutorial:

  1. To introduce a literate framework for analysing cDNA data prepared from MinION or PromethION flowcells; this document should be reproducible and extensible
  2. To provide basic QC metrics such that a review and consideration of experimental data can be undertaken
  3. To map sequence reads to the reference genome and to identify the genes that are expressed and the number of sequence reads that are observed from each gene
  4. To perform a statistical analysis to identify the genes that appear differentially expressed between the experimental conditions.

This tutorial aims to identify, where possible, a set of differentially expressed gene transcripts from the long read sequence data. This tutorial does not aim to provide an exhaustive analysis or annotation of the differentially expressed genes. The tutorial specifies some available reference data so that a workflow can be reproduced with a known outcome.
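As an illustration of the kind of basic QC metric mentioned in the second goal, the sketch below counts the reads in a fastq file and reports the mean read length. It is not the tutorial's own QC code. It assumes an uncompressed fastq file with exactly four lines per record, and the function name `fastq_summary` and the file name in the usage comment are hypothetical.

```shell
# Summarise a fastq file: number of reads, total bases, mean read length.
# Assumption: uncompressed fastq with exactly four lines per record,
# so the sequence is always the second line of each record.
fastq_summary() {
    awk 'NR % 4 == 2 { reads++; bases += length($0) }
         END {
             mean = (reads > 0) ? bases / reads : 0
             printf "reads=%d bases=%d mean_length=%.1f\n", reads, bases, mean
         }' "$1"
}

# Usage (hypothetical file name):
#   fastq_summary reads.fastq
```

A summary like this is only a first sanity check; it says nothing about read quality or mapping rate.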

Getting started and Best Practices

This tutorial requires a computer workstation running a Linux operating system. An update for macOS will be released. The workflow described has been tested using Fedora 29, CentOS 7, and Ubuntu 18.04. This tutorial is written in the Rmarkdown file format, which merges markdown (an easy-to-write plain text format as used in many Wiki systems) with chunks of embedded R code (and some Linux bash commands). See Allaire et al. (2018) for more information about rmarkdown. The included workflow makes extensive use of the conda package management software and the snakemake workflow software. These software packages and the functionality of Rmarkdown provide the source for a rich, reproducible, and extensible tutorial document.

The workflow contained within this tutorial performs a real bioinformatics analysis and uses the whole human genome as an example. This has consequences for memory and processor requirements. For example, indexing the whole human genome for sequence read mapping using minimap2 will use at least 18 GB of memory. The minimal recommended hardware setup for this tutorial is therefore a computer with 8 threads, at least 16 GB of RAM, and 15 GB of storage space. If you have a modest amount of RAM (<24 GB), please consider increasing the amount of swap space available for indexing reference genomes and for holding the genome index in memory whilst mapping.
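The minimum hardware quoted above (8 threads, 16 of RAM, 15 of storage, in gigabytes) can be compared against the current machine with a short shell sketch. This is not part of the tutorial itself. It assumes a Linux system that exposes `/proc/meminfo`, and the function name `check_minimum` is hypothetical.

```shell
# Minimum values quoted in the tutorial text.
MIN_THREADS=8
MIN_RAM_GB=16
MIN_DISK_GB=15

# Print OK or LOW for one resource.
# $1 = label, $2 = observed integer value, $3 = required minimum
check_minimum() {
    if [ "$2" -ge "$3" ]; then
        echo "OK $1: $2 (minimum $3)"
    else
        echo "LOW $1: $2 (minimum $3)"
    fi
}

# Logical processors currently online; falls back to 0 if getconf cannot report it.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 0)

# Total RAM, rounded to the nearest whole number (MemTotal is reported in kB).
# Assumption: Linux with a readable /proc/meminfo; otherwise 0 is reported.
ram_gb=0
if [ -r /proc/meminfo ]; then
    ram_gb=$(awk '/^MemTotal:/ { printf "%d", $2 / 1048576 + 0.5 }' /proc/meminfo)
fi

# Free space (rounded down) on the filesystem holding the current directory.
disk_gb=$(df -Pk . | awk 'NR == 2 { printf "%d", $4 / 1048576 }')

check_minimum "threads" "${threads:-0}" "$MIN_THREADS"
check_minimum "RAM (GB)" "${ram_gb:-0}" "$MIN_RAM_GB"
check_minimum "free disk (GB)" "${disk_gb:-0}" "$MIN_DISK_GB"
```

The kernel reports slightly less memory than is physically installed, which is why the RAM figure is rounded to the nearest whole number rather than rounded down. A LOW result for RAM is a prompt to review the swap space, not necessarily a reason to stop.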

A few dependencies need to be installed at the system level; please refer to the section below on system requirements. The conda package management software will coordinate the installation of key software and dependencies in user space, which requires a reliable internet connection.
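Before starting, it can help to confirm which of the named tools are already visible on the `PATH`. The sketch below is an illustration only: the helper name `have_tool` is hypothetical, and the list of tools is limited to conda and snakemake, the two programs named in this README. snakemake may legitimately be reported as missing if it is only installed later inside a conda environment.

```shell
# Return success if the named program can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report the status of each tool named in this README.
for tool in conda snakemake; do
    if have_tool "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool"
    fi
done
```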

As a best practice, this tutorial separates the primary DNA sequence data (the base-called fastq files) from the Rmarkdown source and from the genome reference data. The analysis results and figures are placed in a further, separate working directory. The required layout for the primary data is shown in the figure below. This minimal structure will be prepared over the next few sections of this tutorial. It is recommended to follow this tutorial with the provided working data before attempting to run it with your own data. Working through the tutorial, including preparing the computer, downloading data, and running the analysis, should take approximately 90 minutes.
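The separation described above can be sketched in shell as follows. The folder names `raw_data`, `reference_data`, and `analysis` are placeholders chosen for this illustration and are not the names the tutorial requires; the later sections of the tutorial prepare the actual structure. The function name `make_layout` is also hypothetical.

```shell
# Create a project root holding three separate folders:
#   raw_data        base-called fastq files (placeholder name)
#   reference_data  genome sequence and gene annotation (placeholder name)
#   analysis        results and figures (placeholder name)
# The Rmarkdown source would sit in the project root itself.
make_layout() {
    root=$1
    for dir in raw_data reference_data analysis; do
        mkdir -p "$root/$dir"
    done
}

# Usage (hypothetical project path):
#   make_layout "$HOME/cdna_project"
```

Keeping the raw data and reference data apart from the results means the analysis folder can be deleted and regenerated without touching the inputs.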

This tutorial requires a small amount of interaction with the computer console. The bulk of the analysis will be managed through the GUI provided by the RStudio software.

Figure: Tree representation of the required folder layout for the included cDNA expression profiling workflow.



sagrudd/cDNA_tutorial documentation built on May 30, 2019, 2:13 p.m.