Introduction

Kallisto is software developed by Nicolas Bray, Harold Pimentel, Pall Melsted, and Lior Pachter (UC Berkeley) that analyzes 30 million unaligned paired-end RNA-Seq reads in less than 5 minutes on a standard laptop computer. Kallisto quantifies transcript abundance from input RNA-Seq reads by using a process, known as pseudoalignment, which identifies the read-transcript compatibility matrix. arkas is a BioConductor package that extends functions and utilities for RNA-Seq analysis from raw reads to results in minutes.

Reads to Quantification to Annotation

Arkas was designed to reduce the programmative steps required to quantify and annotate multitudes of sample directories. Arkas calls Kallisto to perform on- the-fly transcriptome indexing and quantification recursively for numerous sample directories. For RNA- Seq projects with numerous sequenced samples, Arkas encapsulates expensive preparatory routines. Arkas programmatically orders FASTQ files output from DNA sequencers and inputs a list required by Kallisto for processing multitudes of demultiplexed reads. The Arkas function 'runKallisto' recursively indexes transcriptomes and quantifies abundances for any number of samples. The function 'mergeKallisto' merges quantified output into an object of ofsubclass a KallistoExperiment-class, SummarizedExperiment-class. Standard mutators and accessor methods from SummarizedExperiment- methods are preserved in KallistoExperiment-methods. Gene annotation is performed from user-selected bundled transcriptomes (ERCC, Ensembl, and/or RepBase) simultaneously merging annotated samples into one R object: KallistoExperiment. Arkas annotates genes for Homo-Sapiens GrCh38 and Mouse GrCm38 (NCBI). Routines such as 'annotateBundles' yields annotated genes from transcriptomes such as External RNA Control Consortium (ERCC), Ensembl release 81 of non-coding RNA, coding RNA, and a hg38 repeatome for both species.

Kallisto Installation

For linux systems, after installing the dependencies, kallisto is installed via:

mkdir /KallistoSource           
cd /KallistoSource             
git clone https://github.com/pachterlab/kallisto.git         
cd ./kallisto       
mkdir ./build        
cd ./build       
cmake ..              
make          
make install   

Reference File Preparation

Arkas imports methods for identifying duplicated RepBase sequences, and imports these methods from TxDbLite. It is not uncommon for RepBase files with fixed species to have duplicated sequences and duplicated sequence names across fasta files. The package arkasData stores supplementary data for the package arkas; we've included raw RepBase files with duplicated repeats, and a set of fasta with manually excised duplicated RepBase sequences across fasta files.

Identifying Duplicated Sequences in RepBase References

Kallisto source must be built locally by indexing the reference files. There can not be duplicated sequences across reference fasta; Arkas will identify where in there reference duplicates occur.

suppressWarnings(suppressPackageStartupMessages(library(arkas)))    
suppressPackageStartupMessages(library(TxDbLite))      
pathBase<-system.file("extdata",package="arkasData")       
FastaPath <- paste0(pathBase, "/fasta")
fastaFilesDuplicates <- c( "ERCC.fa", ## spike-in controls             
                 "Homo_sapiens.RepBase.20_05.humrep.fa", ## repeats         
                 "Homo_sapiens.RepBase.20_05.humsub.fa")  ## ALUs and such             
findDupes(paste0(FastaPath,"/",fastaFilesDuplicates)) #return a df with column of 0 

Example of output with duplicated repeat sequences in Repeatome:

                                          duplicates
Homo_sapiens.RepBase.20_12.merged.fa1     LTR26B
Homo_sapiens.RepBase.20_12.merged.fa2      SVA_A

The duplicated repeats LTR26B at row number 222 have identical sequences: TRUE     
The duplicated repeats LTR26B at row number 1166 have identical sequences: TRUE    
The duplicated repeats SVA_A at row number 671 have identical sequences: TRUE     
The duplicated repeats SVA_A at row number 1207 have identical sequences: TRUE     
There are duplicated sequence names in your FASTA files:    

Quantifying Transcripts with Unique Reference Transcripts in Repeatome

Arkas saves time by generating a reference fasta file only if one does not exist. Note that Arkas enforces the uniformity of references generated by the same version of Kallisto, including the Kallisto version which generates the reference within the name. The reference name that Arkas generates specifies the Kallisto version, and all of the group transcriptomes. In this case, we have already generated the reference fasta, and can regenerate the reference by deleting it.

suppressWarnings(suppressPackageStartupMessages(library(arkas)))
suppressPackageStartupMessages(library(TxDbLite))
pathBase<-system.file("extdata",package="arkasData")
FastaPath <- paste0(pathBase, "/fasta")
if(file.exists("/data")==FALSE){
    system("sudo mkdir /data && sudo mkdir /data/output")
     system("sudo chmod -R 777 /data")
 }
   if(file.exists("/data/output")==FALSE){
     system("sudo mkdir /data/output && sudo chmod -R 777 /data/output")
   }
OutputPath<-"/data/output"
fastqPath <- paste0(pathBase, "/fastq/demoFastqDir/")
samples <- c(MrN="MrN", MrT="MrT")
FastaFiles <- c( "ERCC.fa.gz", ## spike-in controls                    
                  "Homo_sapiens.RepBase.20_05.merged.fa.gz") #no duplicates
indexName <- indexKallisto(fastaFiles=FastaFiles, fastaPath=FastaPath)$indexName
indexName
Xtension<-"_*"

library(parallel)
results <- suppressWarnings(lapply(samples, 
                    runKallisto,
                    indexName=indexName,
                    fastqPath=fastqPath,
                    fastaPath=FastaPath,
                    bootstraps=20,
                    outputPath=OutputPath,
                    extension=Xtension,
                    bias=FALSE))

Merging Kallisto Quantification

Arkas runs kallisto pseudo-alignment quantification into a KallistoExperiment object, as an S4 object. The SummarizedExperiment accessors are the same for KallistoExperiment objects because KallistoExperiment containers are a sub-class of SummarizedExperiments We enforce uniformity between kallisto versions because if the kallisto versions are not the same, then there exists data variance between different KallistoExperiments' assay data and indexed references. In order to annotate the merged KallistoExperiment, TxDbLite is used to create SQL-lite transcriptome/repeatome databases. This tutorial already created the transcriptome package which can be accessed using transcriptomes() accessor on the KallistoExperiment object.

suppressPackageStartupMessages(library(arkas))

suppressPackageStartupMessages(library(TxDbLite))
samples<-c("n1","n2","n4","s1","s2","s4")
pathBase<-system.file("extdata",package="arkasData")
merged <- mergeKallisto(samples, outputPath=pathBase)
assays(merged)      
kallistoVersion(merged)     
transcriptomes(merged)
tail(tpm(merged))   

Annotations and RPKM/FPKM to Transcripts-Per-Million (TPM)

Many RNA-Seq experiments derive counts as reads per kilobase per million, or fragments per kilobase per million. The quantification from RNA-Seq transcript abundances is dependent on transcript length in proportion to relative abundance. The estimated probability of reads generated by a transcript is given by counting the number of reads that align to transcript divided by total number of mapped reads. Where the definition of transcript-per-million, TPM, is given as mean transcript length in kilobases multiplied by RPKM. The mean for transcript dependent lengths gives weighted measure for expression of the lengths across iso-forms. Arkas handles RPKM to TPM conversion by dividing the selected transcript RPKM score by the sum of all RPKM, and multiplying by 1e6 (Li,2009) TxDbLite is a sister package to arkas which can annotate kallisto-experiment classes.

Repeat Analysis

After creating the repeatome, arkas can be used to analyze repetitive elements For instance, one can plot repeat element transcripts using (counts/bootstrap MADs) as effect size, or analyze families of Alu, LTR, or Endongenous Retroviruses. Below we plot a heat map of quantified transcripts from our repeatome for species Homo Sapiens.

References

Bo, Li, et. al, "RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty."
Oxford Journals Bioinformatics. Oxford Journal, 9 Dec. 2009. Web. 20 June 2015.
http://bioinformatics.oxfordjournals.org/content/26/4/493.full.



RamsinghLab/arkas_staging documentation built on March 14, 2021, 11:40 a.m.