In UPSCb/RnaSeqTutorial: RNA-Seq Tutorial

    startQuest("First things first")

    h1("Raw Data Pre-processing")

    h2("Getting your share of the data")

The P. tremula sexual dimorphism dataset [@Robinson:2014p6362] contains 17 samples in total. You will be working with all of the available files. This RNA-Seq dataset is paired-end. You can identify read pairs by looking at their file names. Note how half end with 1.fq.gz while the others end with 2.fq.gz. Some steps in the analysis pipeline will require processing of files individually (e.g. FastQC), while others work on pairs (e.g. Trimmomatic, STAR).

| Step | Requires both files| |------|--------------------| | Quality Controlusing FastQC | NO | | rRNA filtering using SortMeRna | YES | | Quality Control using FastQC | NO | | Quality trimming and adapter removal using Trimmomatic | YES | | Quality Control using FastQC | NO | | Trimmed reads alignment using STAR | YES | | BAM format manipulation using samtools | NO | | Read count summarization using HTSeq htseq-count | NO |

    quest(1)
    endQuest()
    startQuest("Raw data quality control")

    h2("Raw data FastQC")

Note that in the following we will refer to any of the 17 samples as "read", so if you selected sample 01, "read" means "202_subset" for you.

The first pre-processing step is to assess the quality of the raw data received from the sequencing facility. This data is commonly delivered in FASTQ format [@Cock:2009p249].

Upon receiving the RNA-Seq FASTQ files from the sequencing facility, it is essential that some initial QC assessments be performed. Most importantly, one should check the overall sequence quality, the GC percentage distribution ( the proportion of guanine and cytosine bp across the reads) and the presence / absence of overrepresented sequences. FastQC has become a de-facto standard for performing this task http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. A FastQC command line example is as follows: ```{bash fqc, eval=FALSE} # First, make an output directory called "qa" and a sub-directory of # that called "raw" mkdir -p ~/results/qa/raw

# We can also make a directory to hold the raw data mkdir ~/results/raw

# Then link the files in our home folder. Always try to # work on links or copies of the raw source data to keep them safe! cd ~/results/raw ln -s share/Day1/fastq/*.fq.gz .

# Then you can run fastqc on the file(s) you have selected. FastQC can run # a number of files in parallel (see the -t option) fastqc -o ~/results/qa/raw -t 16 ~/results/raw/*.fq.gz

# Finally, also run multiqc to summarize all data in one plot cd ~/results/qa/raw multiqc .

The output of FastQC is a zip archive and an HTML document, which is
sub-divided into sections describing the specific metrics that were
analyzed. These sections are detailed below. In order to view the html
page that was created, go to the course webpage and click on "Connect to Apache2"
, then click on home > qa > raw and finally on the html file. View any _single_
report and go through the following categories.

```r
    h3("Basic Statistics")

Most metrics within this section are self-explanatory. For PE reads, the total number of sequences should match between the forward and reverse read files. It is good practice to take note of the FASTQ Phred encoding, as some downstream tools require the user to specify whether Phred64 or Phred33 encoding should be used. Finally, the %GC should lie within the expected values for the sample species.

    h3("Per base sequence quality")

The Phred scale quality represents the probability p that the corresponding base call is incorrect. A Phred score Q is an integer mapping of p where: $$ Q = -10 \cdot log10(p) $$ Briefly, a Phred score of 10 corresponds to one error in every 10 base calls or 90% accuracy; a Phred score of 20 to one error in every 100 base calls or 99% accuracy. The maximum Phred score is 40 (41 for Illumina version 1.8+ encoding). See http://en.wikipedia.org/wiki/FASTQ_format#Quality for more details on the quality and http://en.wikipedia.org/wiki/FASTQ_format#Encoding for information on the corresponding encoding.

  quest(2)
  endQuest()
  startQuest("Overview of FastQC output")

The second FastQC section details the Phred scaled quality as a function of the position in the read. It is very common to observe a quality decrease as a function of the read length (Figure [2][example]) and this pattern is often more pronounced for read2 than it is for read1; this is due to cumulative stochastic errors of the sequencing progresses, largely as a result of the enzyme “tiring out”, and the increasing likelihood that a read cluster becomes out of sync, for example.

![Example of Quality Control reports at different stage of the pipeline. A. The “Per sequence GC content” of the raw data. B. The same data shown in A but after rRNA filtering. C. “Per base sequence content” of the raw data. D. The same data after quality-based trimming has been performed.][example]

    h3("Per sequence quality scores")

This section details the quality distribution at the read level, in contrast to the quality per base position of the previous section. If the data is of good quality, the histogram will be skewed to the right.

    h3("Per base sequence content")

In this section, the average proportion of individual bases (A, C, G and T) is plotted as a line across the length of the reads. The 12 first bases often show a bias that is characteristic of Illumina RNA-Seq data. This is in contrast with the DNA-Seq protocol, which does not show the same bias. The difference between protocols lies in three additional steps performed during the conversion of mRNA to cDNA, which is subsequently sequenced as if it were genomic DNA. Several hypotheses have been proposed as to the cause of this bias: during reverse transcription of the captured cDNA, random hexamer primers are used and these may introduce a positional bias of the reads; artifacts from end repair; and possibly a tenuous sequence specificity of the polymerase may each play a role either singularly in, most likely, in combination.

    h3("Per base GC content")

Similar to the previous section, the GC content is shown as a function of the position in the read. As previously observed, a bias for the first base pairs (once more in contrast to DNA sequencing data) will often be observed. In addition, for non-strand specific RNA-Seq data, the amount of G and C and of A and T should be similar, as an average, at any position within reads. Moreover the proportion of G+C should match the expected GC content of the sample. For strand-specific data, if the RNA was selected using poly-dT beads, enrichment for T over A may be observed.

    h3("Per sequence GC content")

The plot in this section (see Figure [2][example] for an example) represents the distribution of GC content per read, where the data (red curve) is expected to approximately follow the theoretical distribution (blue curve). If the curve presents a shoulder in a region of high GC content, this is usually an indication that rRNA is present in the sample. However, it may also represent contamination by an organism with a higher GC content (such as bacteria or fungi). In contrast, a peak on the left hand side would indicate enrichment for A/T rich sequences. In particular a sharp peak for very low GC content (in the 0-3 range) is usually indicative of the sequencing of the mRNA poly-A tails. If this plot still shows issues after quality and rRNA filtering, additional steps would have to be taken to filter contaminants.

![A comparison of the “theoretical” and “observed” GC distribution, the blue and read lines of FastQC “Per sequence GC content”. A. Examples of “observed” GC distribution with a poly-A enrichment (green), rRNA enrichment (red) or no (black) bias. B. The corresponding “theoretical” curve that FastQC would devise from such read GC content distribution.][theory]

    h3("Per base N content")

This plot shows the fraction of indistinguishable bases as a function of the base position in the reads. In high quality sequence data this is expected to be close to zero. Deviations from the expected values indicate problems during the sequencing.

    h3("Sequence length distribution")

This section shows the distribution of read lengths. Prior to trimming, there should be exactly one peak located at the sequenced read length.

    h3("Sequence duplication level")

This plot represents the level of duplicate sequences in the library. FastQC assumes that the library is diverse, with even representation of all sequences, it assumes a uniform coverage as would usually be obtained for DNA-Seq experiments. However, this assumption is not valid for RNA-Seq libraries, which have a large dynamic range, possibly containing a million fold difference between lowly and highly expressed genes. As a result it is common to observe high duplication levels for sequences originating from highly expressed genes. It is worth noting that before version 0.11 of FastQC, all duplication levels

= 10 were aggregated into a single bin. In more recent version this has been made more comprehensive in order to provide a more accurate representation of the data.

    h3("Overrepresented sequences")

This table shows sequences that are present at unusually large frequency in the reads. These are most commonly sequencing adapters and will be identified as such. If unidentified sequences are detailed these may originate from rRNA or other contaminants, in which case contaminant filtering should be considered. Often a BLAST search of the unidentified sequence using the NCBI nt database will be informative.

    h3("Kmer content")

The final plot of the FastQC report details the occurrence of kmers - nucleotide sequences of fixed k length - that were present at a higher than expected frequency as a function of their position within the read. Commonly, the early bases show the aforementioned Illumina sequencing bias (see section d), whereas the last bases may show enrichment for sequencing adapters.

Run FastQC for your assigned read file. Compare the results of different files and of read pairs. Does the QC looks fine to you or are there issues you would address?

Also, after looking through FastQC itself, look at the MultiQC report, which will give you a nice summary of all files.

Take a comparative look at samples 202, 207 and 213. Do these 3 samples have identical qualities? If not, what differentiates them? What does the sample you have analysed look like most among these 3?

There are 4 samples that have differing %GC characteristics as well as possibly some over-represented sequences: samples 202, 213.1, 303 and 310.3. All samples otherwise show some adapter contamination and the usual average quality drop as sequencing cycles progresses. Both these issues can be remediated and this is what will be addressed in the following.

    quest(3)
    endQuest()
    startQuest("rRNA filtering")

    h2("Filtering rRNA")

Typically, wet-lab protocols to extract mRNA include a step to deplete the sample of rRNA or to enrich it for poly-adenylated transcripts (rRNA is not poly-adenylated). Common approaches to achieve this are to use RiboMinus^TM kits (Life Technologies) or poly-dT beads, respectively or to include a precipitation step that selectively precipitates only long (usually >200 bp) nucleotidefragments. No approach will be fully sensitive and, as a result, some rRNA carryover is to be expected. This is not a problem per se as long as the remaining proportion accounts for a low percentage of the reads (commonly between 0.1 and 3%). Larger proportions will have an effect on the usable number of reads obtained from the samples, as fewer sequence reads would originate from expressed mRNAs. This is not to be overlooked as these rRNAs will produce valid alignments (in all reference genomes and for most de novo assembled transcriptomes and genomes) and hence other metrics (such as the alignment rate) will fail to identify such a bias. To control for the rRNA quantity in our sample FastQ files, we use SortMeRna, a tool originally developed to identify rRNA in metagenomics analyses [@Kopylova:2012p5297]. The tool accepts FASTQ files (SE or PE) as input and includes a set of reference libraries containing the most common rRNAs (5,5.8,16, 18, 23 and 26-28S). Example command lines for a single PE sample are (you shouldn't run samples one by one):

```{bash, eval=FALSE} # THIS IS AN EXAMPLE OF THE COMMANDS! GO ON READING, do NOT execute them!

# First uncompress your forward and reverse reads:
gunzip read_1.fq.gz read_2.fq.gz

# Merge forward and reverse reads into a single file (this is a requirement
# of the sortmerna software)
merge-paired-reads.sh read_1.fq read_2.fq read-interleaved.fq

# Run sortmerna
sortmerna --ref $SORTMERNA_DB --reads read-interleaved.fq --aligned \
read-rRNA-hits --other read-sortmerna --log -a 16 -v --paired_in --fastx

# Split the file again in two separate files, one for forward, one for reverse
# orientation
unmerge-paired-reads.sh read-sortmerna.fq read-sortmerna_1.fq read-sortmerna_2.fq

# Delete the interleaved files and compress the result files to save space
rm read-interleaved.fq read-sortmerna.fq
gzip read-sortmerna_1.fq read-sortmerna_2.fq

It can be very tedious to process files one by one, especially in large projects.
Let's create a script that runs all these steps for us. Open a text editor 
(e.g. `nano` on the command line) and copy the following lines:
```{bash, eval=FALSE}
    #!/bin/bash

    READ_FW="$1"
    READ_RV="$2"

    FILEBASE=$(basename "${READ_FW/_1.fq.gz/}")

    echo "Uncompressing FASTQ data of $FILEBASE"
    gunzip -c "$READ_FW" > ${FILEBASE}_1.fq 
    gunzip -c "$READ_RV" > ${FILEBASE}_2.fq 

    READ_FW="${READ_FW%.gz}"
    READ_RV="${READ_RV%.gz}"

    echo "Merging pairs of $FILEBASE"
    merge-paired-reads.sh "$READ_FW" "$READ_RV" "${FILEBASE}_interleaved.fq"

    echo "Running SortMeRNA for $FILEBASE"
    sortmerna --ref $SORTMERNA_DB --reads "${FILEBASE}_interleaved.fq" --aligned \
    "${FILEBASE}-rRNA-hits" --other  "${FILEBASE}-sortmerna" --log -a 16 \
    -v --paired_in --fastx

    echo "Unmerging SortMeRNA filtered pairs for $FILEBASE"
    unmerge-paired-reads.sh "${FILEBASE}-sortmerna.fq" \
    "${FILEBASE}-sortmerna_1.fq" "${FILEBASE}-sortmerna_2.fq"

    echo "Doing cleanup for $FILEBASE"
    gzip "$READ_FW" "$READ_RV" "${FILEBASE}-sortmerna_1.fq" \
    "${FILEBASE}-sortmerna_2.fq" "${FILEBASE}-rRNA-hits.fq"
    rm "${FILEBASE}_interleaved.fq" "${FILEBASE}-sortmerna.fq"

Press Ctrl+O in nano and write the filename ~/results/runSortMeRNA.sh. Then press return and exit nano with Ctrl+X.

Now we can use a loop to execute this script for two samples (doing more would be very time consuming):

```{bash, eval=FALSE} mkdir ~/results/sortmerna cd ~/results/sortmerna find ~/results/raw -name "*.fq.gz" | sort | head -n 4 | while read READ_FW do read READ_RV bash ~/results/runSortMeRNA.sh $READ_FW $READ_RV done

Now we can link the rest of the files from the pre-calculated example data:

```{bash, eval=FALSE}
   ln -s ~/share/Day1/sortmerna/* .

The format conversion step is not required for SE samples, nor is the “–paired_in” argument. The SORTMERNA_DB environment variable needs to be set at installation and the “-a” argument details the number of CPUs/threads to be used. The manual provides a comprehensive description of all functionalities. SortMeRna does not currently support compressed input, hence the first and last step to (de)compress the data.

Run SortMeRna on your sample (both read_1 and read_2). This may take a while (about 10 mins). Once done, look at the produced log file.

MultiQC understands SortMeRNA log files, so you can nicely summarize the data with:

```{bash, eval=FALSE} cd ~/results/sortmerna multiqc .

```r
    quest(4)
    endQuest()
    startQuest("Quality trimming")

Next, we perform the QC on the filtered data.

    h2("Filtered data FastQC")

The filtered data is again subjected to a QC assessment by FastQC to ensure the validity of the filtering steps.

Redo the FastQC on the SortMeRna files you have generated.

```{bash, eval=FALSE} mkdir ~/results/qa/sortmerna fastqc -o ~/results/qa/sortmerna -t 16 ~/results/sortmerna/sortmerna.fq.gz

The GC content plot should show the biggest change, now fitting more
closely to the theoretical distribution, as shown in Figure [2A][example]
and Figure [2B][example], which represent the raw and filtered GC content
respectively. Shoulders, which were present in regions of higher GC
content, should be noticeably smaller or be absent altogether. rRNA
overrepresented sequences should no longer be identified in the
corresponding table of over-represented sequences. Finally, the
theoretical GC curve should be centered closer to the expected GC value
of the sample organism.

```r
    h2("Quality trimming and adapter removal")

It is a fact that on Illumina sequencers, the quality of a base pair is linked to its position in the read, bases in the later cycles of the sequencing process have a lower average quality than the earliest cycles (as was observed in the QC report above). This effect depends on the sequencing facility and on the chemistry used and it is only recently that sequencing aligners have integrated methods to correct for this - and not all alignment software does so. A common approach to increase the mapping rate of reads is to trim (remove) low quality bases from the 3’ end until the quality reaches a user-selected Phred-quality threshold. A threshold of 20 is widely accepted as it corresponds to a base call error of 1 in a 100, which is approximately the inherent technical error rate of the Illumina sequencing platform.

An additional issue encountered with Illumina sequencing is the presence of partial adapter sequences within sequenced reads. This arises when the sample fragment size has a large variance and fragments shorter than the sequencer read-length are sequenced. As the resulting reads may contain a significant part of the adapter - a bp sequence of artificial origin - earlier generation alignment software ( those that do not use Maximum Exact Matching and require global alignment of an entire read) may not be able to map such reads. Being able to identify the adapter-like sequences at the end of the read and clip/trim them - a process termed adapter removal - may consequently significantly increase the aligned read proportion.

There are numerous tools available to perform either or both of these tasks (quality trimming and adapter removal). Here, we exemplify using Trimmomatic [@Bolger:2014p6305], a tool that does both. The command line to run the first two samples is as follows:

```{bash, eval=FALSE} mkdir ~/results/trimmomatic cd ~/results/trimmomatic find ~/results/sortmerna -name "*sortmerna_[12].fq.gz" | sort | head -n 4 | while read FW_READ do read RV_READ FILEBASE=$(basename "${FW_READ%_1.fq.gz}") trimmomatic PE -threads 16 -phred64 "$FW_READ" "$RV_READ" \ "$FILEBASE-trimmomatic_1.fq.gz" "$FILEBASE-trimmomatic-unpaired_1.fq.gz" \ "$FILEBASE-trimmomatic_2.fq.gz" "$FILEBASE-trimmomatic-unpaired_2.fq.gz" \ ILLUMINACLIP:"/usr/share/Trimmomatic-0.38/adapters/TruSeq3-PE-2.fa":2:30:10 \ SLIDINGWINDOW:5:20 MINLEN:50 2> "$FILEBASE-timmomatic.log" done

Run Trimmomatic on your filtered FASTQ data.

Now we can copy the rest of the files and run MultiQC.

```{bash, eval=FALSE}
    cd ~/results/trimmomatic
    ln -s ~/share/Day1/trimmomatic/* .
    multiqc .

Once again what would you do after having quality-trimmed and removed the sequencing adapter from your reads?

Yes! Quality Control! As in the next section.

    h2("Trimmed data FastQC")

A final FastQC run is performed to ensure that the previous quality trimming and/or adapter removal steps successfully conserved high quality reads without being too stringent and without introducing any newly apparent technical biases.

Redo the FastQC on the Trimmomatic files you have generated.

```{bash, eval=FALSE} mkdir ~/results/qa/trimmomatic fastqc -o ~/results/qa/trimmomatic -t 16 ~/results/trimmomatic/*trimmomatic_[12].fq.gz

Several changes should be observed in comparison with the previous QC
report: first, the per-base quality scores should be noticeably
different. As shown in Figure [2C,D][example] the per-sequence quality
distribution is now shifted towards higher scores (the trimming effect)
and sequencing adapters are no longer identified as over-represented
(the adapter removal effect). If over-represented sequences remain, this
indicates that an additional kind of contamination may be present and
should be investigated.

<!-- Finally, let's create a full, final directory with all logs to have a complete
MultiQC report:

```{bash, eval=FALSE}
    mkdir ~/results/qc_report
    ln -s ~/results/qa/raw/*zip ~/results/qc_report
    ln -s ~/results/qa/sortmerna/*zip ~/results/qc_report
    ln -s ~/results/qa/trimmomatic/*zip ~/results/qc_report
    ln -s ~/results/sortmerna/*.log ~/results/qc_report
    ln -s ~/results/trimmomatic/*.log ~/results/qc_report
    multiqc -o ~/results/qc_report ~/results/qc_report

-->