README.md
In sagrudd/cDNA_tutorial: Tutorial For Nanopore Sequence Based Gene Expression Profiling

cDNA_tutorial

Objective Statement

Nanopore cDNA tutorial for differential gene expression analysis. The aim of this tutorial is to perform differential gene expression analysis on replicated cDNA data. The workflow is intended for fastq sequence datasets for which a reference genome sequence and its gene annotation are available.

The tutorial and its example data provide sufficient information for the workflow to be replicated with a study design that addresses gene expression questions such as

which genes are expressed in my study of interest?
which genes are upregulated in my tumour sample?
what are the grouped expression levels for gene ENSG00000117523?

Methods utilised within this tutorial include

conda for management of bioinformatics software
snakemake for managing the bioinformatics workflow
minimap2 for mapping sequence reads to reference genome
samtools for SAM/BAM handling and mapping statistics
RSubread and DESeq2 for differential expression analysis
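
As a rough illustration of how the mapping tools listed above fit together, the sketch below assembles (but does not run) shell commands that map reads with minimap2 and then sort, index, and summarise the alignments with samtools. The file names, the thread count, and the choice of the `-ax splice` preset are assumptions made for this example; the tutorial's snakemake workflow may use different parameters.

```python
import shlex


def mapping_commands(reference, fastq, bam, threads=8):
    """Return shell command strings for mapping, indexing and summarising.

    Nothing is executed here; the strings are only assembled for inspection.
    All file names are hypothetical placeholders.
    """
    ref, fq, out = (shlex.quote(str(p)) for p in (reference, fastq, bam))
    return [
        # -ax splice: minimap2 preset for spliced long-read alignment (SAM output),
        # piped into samtools sort to write a coordinate-sorted BAM file
        f"minimap2 -t {threads} -ax splice {ref} {fq} "
        f"| samtools sort -@ {threads} -o {out} -",
        # index the sorted BAM file
        f"samtools index {out}",
        # report basic mapping statistics
        f"samtools flagstat {out}",
    ]


if __name__ == "__main__":
    for command in mapping_commands("reference.fa", "sample1.fastq", "sample1.bam"):
        print(command)
```

Building the commands as strings keeps the example free of side effects; in the tutorial itself these steps are coordinated by snakemake rather than typed by hand.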

Computational requirements for this tutorial include

Computer running Linux (CentOS 7, Ubuntu 18.10, Fedora 29)
At least 16 GB RAM - swap space of at least 8 GB is ideal
At least 15 GB spare disk space for analysis and indices
Runtime with provided example data - approximately 20 minutes

Introduction

There are four goals for this tutorial:

To introduce a literate framework for analysing cDNA data prepared from MinION or PromethION flowcells; this document should be reproducible and extensible
To provide basic QC metrics such that a review and consideration of experimental data can be undertaken
To map sequence reads to the reference genome and to identify the genes that are expressed and the number of sequence reads that are observed from each gene
To perform a statistical analysis to identify the genes that appear differentially expressed between the experimental conditions.
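
To make the last two goals more concrete, the sketch below shows the kind of per-gene read count table the mapping step produces and a deliberately naive comparison between two conditions. It is not the method used in this tutorial: DESeq2 fits a negative binomial model with its own normalisation, whereas this example only computes counts per million and a log2 fold change. The gene names and counts are invented toy values, and the function names are hypothetical.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def mean_cpm(replicates):
    """Average the counts per million of each gene across replicates."""
    cpms = [counts_per_million(sample) for sample in replicates]
    return {gene: sum(c[gene] for c in cpms) / len(cpms) for gene in cpms[0]}


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Naive log2 fold change (treated vs control); the pseudocount avoids log(0)."""
    c, t = mean_cpm(control), mean_cpm(treated)
    return {g: math.log2((t[g] + pseudocount) / (c[g] + pseudocount)) for g in c}


if __name__ == "__main__":
    # Toy read counts per gene for two replicates per condition (invented values)
    control = [{"geneA": 100, "geneB": 900}, {"geneA": 120, "geneB": 880}]
    treated = [{"geneA": 400, "geneB": 600}, {"geneA": 380, "geneB": 620}]
    for gene, lfc in log2_fold_changes(control, treated).items():
        print(f"{gene}: log2 fold change = {lfc:+.2f}")
```

A positive value indicates higher relative expression in the treated group; deciding whether such a difference is statistically meaningful is the job of the DESeq2 analysis.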

This tutorial aims to identify, where possible, a set of differentially expressed gene transcripts from the long read sequence data. This tutorial does not aim to provide an exhaustive analysis or annotation of the differentially expressed genes. The tutorial specifies some available reference data so that a workflow can be reproduced with a known outcome.

Getting started and Best Practices

This tutorial requires a computer workstation running a Linux operating system. An update for macOS will be released. The workflow described has been tested using Fedora 29, CentOS 7, and Ubuntu 18.04. The tutorial is written in the Rmarkdown file format, which combines markdown (an easy-to-write plain text format as used in many Wiki systems) with chunks of embedded R code (and some Linux bash commands). See Allaire et al. (2018) for more information about rmarkdown. The included workflow makes extensive use of the conda package manager and the snakemake workflow software. Together with the functionality of Rmarkdown, these software packages provide the basis for a rich, reproducible, and extensible tutorial document.

The workflow contained within this tutorial performs a real bioinformatics analysis and uses the whole human genome as an example, so memory and processor requirements need some consideration. Indexing the whole human genome for sequence read mapping using minimap2, for example, will use at least 18 GB of memory. The minimal recommended hardware setup for this tutorial is therefore a computer with 8 threads, at least 16 GB of RAM, and 15 GB of storage space. If you have a modest amount of RAM (<24 GB), please consider increasing the amount of swap space available for indexing reference genomes and for holding the genome index in memory whilst mapping.
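
Before starting, it may help to compare a workstation against the recommendations above. The following is a minimal sketch, assuming a Linux system where `os.sysconf` can report physical memory; the function names and the exact thresholds are choices made for this example, not part of the tutorial's workflow.

```python
import os
import shutil

GIB = 1024 ** 3


def total_ram_gib():
    """Return physical RAM in GiB, or None where sysconf cannot report it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    except (AttributeError, ValueError, OSError):
        return None


def check_resources(path=".", min_threads=8, min_ram_gib=16, min_disk_gib=15):
    """Compare threads, RAM and free disk space with the given minimums.

    Returns a dict mapping each resource to a (value, meets_minimum) tuple.
    """
    threads = os.cpu_count() or 1
    ram = total_ram_gib()
    free_disk = shutil.disk_usage(path).free / GIB
    return {
        "threads": (threads, threads >= min_threads),
        "ram_gib": (ram, ram is not None and ram >= min_ram_gib),
        "free_disk_gib": (free_disk, free_disk >= min_disk_gib),
    }


if __name__ == "__main__":
    for name, (value, ok) in check_resources().items():
        shown = "unknown" if value is None else f"{value:.1f}"
        print(f"{name}: {shown} ({'ok' if ok else 'below recommendation'})")
```

A machine sold with 16 GB of RAM usually reports slightly less than 16 GiB to the operating system, so the thresholds are best treated as approximate and adjusted through the function arguments.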

There are a few dependencies that need to be installed at the system level - please refer to the section below on system requirements. The conda package management software will coordinate the installation of key software and dependencies in user space - this is dependent on a robust internet connection.

As a best practice, this tutorial separates the primary DNA sequence data (the base-called fastq files) from the Rmarkdown source and from the genome reference data. The analysis results and figures are placed in a further, separate working directory. The required layout for the primary data is shown in the figure below. This minimal structure will be prepared over the next few sections of this tutorial. It is recommended to follow the tutorial with the provided working data before attempting to run it with your own data. Preparing the computer, downloading the data, and running the analysis should together take approximately 90 minutes.

This tutorial requires a small amount of interaction with the computer console. The bulk of the analysis will be managed through the GUI provided by the Rstudio software.

Figure: Tree representation of the required folder layout for the included cDNA expression profiling workflow
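
The separation of sequence data, reference data, and analysis results described above could be prepared with a few lines of Python. This is a sketch only: the folder names `RawData`, `ReferenceData`, and `Analysis` are hypothetical placeholders, since the actual names are defined by the tutorial's figure and later sections.

```python
import tempfile
from pathlib import Path

# Hypothetical folder names; replace them with those required by the tutorial
LAYOUT = ("RawData", "ReferenceData", "Analysis")


def create_layout(root, folders=LAYOUT):
    """Create one sub-folder per entry under root and return the created paths."""
    root = Path(root)
    created = []
    for name in folders:
        folder = root / name
        # exist_ok=True makes the call safe to repeat on an existing project
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demonstrate in a temporary directory so that nothing is left behind
    with tempfile.TemporaryDirectory() as project_root:
        for folder in create_layout(project_root):
            print(folder.name)
```

Keeping the base-called fastq files in their own folder means that results can be deleted and regenerated without touching the primary data.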

sagrudd/cDNA_tutorial documentation built on May 30, 2019, 2:13 p.m.
