knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
In this part we show data pre-processing steps. It includes quality assessment (QC), Trimming, sequences alignment and creating a count table.
Most tools used here are packed in Docker containers. To see original commands for each tool check our GitHub Docker Images repository.
The Docker image 'qualim' contains FastQC and Trimmomatic.
To use the following code first we define variables for the in-container script 'quali.sh'.
Primary variable ANALYSIS
has two options trimm
and fastqc
. First runs Trimmomatic then FastQC, while second just runs FastQC.
Place all FASTQ files in the working directory. The script detects files with '.gz' extension.
An example run: ```{bash qc,eval=FALSE} docker run --rm --name="qc" \ -e ANALYSIS="fastqc" \ --mount 'type=bind,source=path/workingdir,target=/usr/local/src/rnaSeq/workingdrive' \ qualim
**IMPORTANT** Do not perform 'trimm' on original FASTQ files. They will be deleted. When using 'trimm' option with container based on 'qualim' **REMEMBER**, by default after trimming it will **DELETE** all original fastq and unpaired files. It leaves only paired reads files. This feature was created to optimise the use of disc space on cloud platforms. ```{bash trim,eval=FALSE} docker run --rm --name="trim" \ -e THREADS=6 \ -e CROP_LEN=110 \ -e MIN_LEN=99 \ -e HEADCROP=10 \ -e ANALYSIS="trimm" \ --mount 'type=bind,source=path/workingdir,target=/usr/local/src/rnaSeq/workingdrive' \ qualim
To align with STAR (v.2.7) we first need to create a genome index. This step needs to be done only once.
Because it is not a routine step we use interactive mode with STAR docker image.
To run the following command, save in the working directory a genome FASTA file and corresponding GTF annotation.
```{bash star index1, eval=FALSE} docker run -it --mount 'type=bind,source=path/workingdir,target=/usr/local/src/rnaSeq/workingdrive' staraligner /bin/bash
STAR \ --runThreadN 8 \ --runMode genomeGenerate \ --genomeSAindexNbases 11 \ --genomeDir starIndex \ --genomeFastaFiles cp_genome.fa \ --sjdbGTFfile cp_genome.gtf
On clound platforms (AWS) preferably use memory optimised instances. Alignment of large FASTQ files will require 16-32Gb of RAM (fungal genome). ```{bash star index2, eval=FALSE} docker run --detach\ -e THREADS=8 \ --mount 'type=bind,source=path/workingdir,target=/seqdata/workingdir' \ staraligner /seqdata/star_InContainer_v2.sh *_paired.fastq.gz # Find container ID docker ps # Follow stdout in detached container docker logs -f <container ID>
On this step we create a count table as SummarizedExperiment object.
Launch the container in the same directory with BAM files from the previous step.
First we create annotation data base from GFF file.
library("Rsamtools") library("GenomicFeatures") library("GenomicAlignments") GFFFILE <- "cp_genome.gff" SPECIES_NAME <- "Cryphonectria parasitica" TXDB_FILE <- "crypa_annotation.sqlite" Format <- "gff3" transdb<-makeTxDbFromGFF(GFFFILE, format = Format, organism = SPECIES_NAME) saveDb(transdb, TXDB_FILE)
Create SummarizedExperiment Object.
library("Rsamtools") library("GenomicFeatures") library("GenomicAlignments") SE_NAME <- "se.RData" TXDB_FILE <- "crypa_annotation.sqlite" # Create BAM files list filenames <-list.files(".", recursive=TRUE, pattern=".bam$", full=TRUE) bamfiles <- BamFileList(filenames, yieldSize=200000) #Read annotation DB transdb <- loadDb(TXDB_FILE) genes <- GenomicFeatures::genes(transdb) se <- summarizeOverlaps(features=genes, reads=bamfiles, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE ) saveRDS(se, file = SE_NAME)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.