In UPSCb/RnaSeqTutorial: RNA-Seq Tutorial

    startQuest("Pseudo-alignment and read quantification with Kallisto")

    h1("Pseudo-alignment and read quantification with Kallisto")

Kallisto is a modern read quantification software that makes heavy use of k-mer hash tables and De-Bruijn graphs to "pseudo-align" reads [@Bray2016]. This means that Kallisto does not map a read to a given position in a genome/transcriptome, but rather to an abstract representation of a transcript. Kallisto then performs a likelihood estimation and expectation maximization algorithms to calculate a count and - at request - bootstraps a confidence value.

Step 1 is to create the Kallisto index from a transcriptome:

```{bash, eval=FALSE} cd ~/results kallisto index -i Potra01-mRNA.idx \ ~/share/Day1/reference/fasta/Potra01-mRNA.fa.gz

We could modify the _k_-mer size using the `-k` option. This is - however - often
a very sensitive parameter in _k_-mer/hash based applications. Lower values will
provide higher sensitivity, but lower specificity and vice versa for higher
values of _k_. The optimal choice for a _k_-mer would be the longest read length
without a single mismatch due to natural polymorphisms or technical artefacts. In
practise, this is very difficult to estimate unless you have access to a large
pool of population genomics data.

Next we can quantify the input data using this newly created index:

```{bash, eval=FALSE}
    mkdir -p ~/results/kallisto
    cd ~/results/kallisto
    find ../trimmomatic -name "*trimmomatic_[12].fq.gz" | sort | head -n 4 | while read FW_READ
    do
      read RV_READ
      FILEBASE=$(basename "${FW_READ%_1.fq.gz}")
      kallisto quant -i ~/results/Potra01-mRNA.idx -b 100 \
      -o . -t 16 "$FW_READ" "$RV_READ" | tee $FILEBASE.log
      # Kallisto doesn't let us specify an output filename so we rename all 
      # output files
      mv "abundance.tsv" $FILEBASE"_abundance.tsv"
      mv "abundance.h5" $FILEBASE"_abundance.h5"
      mv "run_info.json" $FILEBASE"_run_info.json"
    done

The option -b 100 performs a bootstrap confidence estimation using 100 iterations.

Kallisto outputs several files:

abundance.tsv: This file is the main file of interest as it contains transcript counts, tpm, etc.
abundance.h5: This file contains the bootstrap values so confidence intervals can be estimated for the gene counts
run_info.json: This is a file containing the parameters supplied to Kallisto

Now let us run multiqc a final couple times multiqc can read the kallisto log file

```{bash, eval=FALSE} ~/results/kallisto multiqc.

Finally, let's create a full, final directory with all logs to have a complete
MultiQC report:

```{bash, eval=FALSE}
    mkdir ~/results/qc_report
    multiqc -o ~/results/qc_report ~/results

UPSCb/RnaSeqTutorial documentation built on Nov. 24, 2020, 12:40 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com