Kallisto is software developed by Nicolas Bray, Harold Pimentel, Pall Melsted, and Lior Pachter (UC Berkeley) that analyzes 30 million unaligned paired-end RNA-Seq reads in less than 5 minutes on a standard laptop computer. Kallisto quantifies transcript abundance from input RNA-Seq reads by using a process, known as pseudoalignment, which identifies the read-transcript compatibility matrix. arkas is a BioConductor package that extends functions and utilities for RNA-Seq analysis from raw reads to results in minutes.
Arkas was designed to reduce the programmative steps required to quantify and annotate multitudes of sample directories. Arkas calls Kallisto to perform on- the-fly transcriptome indexing and quantification recursively for numerous sample directories. For RNA- Seq projects with numerous sequenced samples, Arkas encapsulates expensive preparatory routines. Arkas programmatically orders FASTQ files output from DNA sequencers and inputs a list required by Kallisto for processing multitudes of demultiplexed reads. The Arkas function 'runKallisto' recursively indexes transcriptomes and quantifies abundances for any number of samples. The function 'mergeKallisto' merges quantified output into an object of ofsubclass a KallistoExperiment-class, SummarizedExperiment-class. Standard mutators and accessor methods from SummarizedExperiment- methods are preserved in KallistoExperiment-methods. Gene annotation is performed from user-selected bundled transcriptomes (ERCC, Ensembl, and/or RepBase) simultaneously merging annotated samples into one R object: KallistoExperiment. Arkas annotates genes for Homo-Sapiens GrCh38 and Mouse GrCm38 (NCBI). Routines such as 'annotateBundles' yields annotated genes from transcriptomes such as External RNA Control Consortium (ERCC), Ensembl release 81 of non-coding RNA, coding RNA, and a hg38 repeatome for both species.
For linux systems, after installing the dependencies, kallisto is installed via:
mkdir /KallistoSource cd /KallistoSource git clone https://github.com/pachterlab/kallisto.git cd ./kallisto mkdir ./build cd ./build cmake .. make make install
Arkas imports methods for identifying duplicated RepBase sequences, and imports these methods from TxDbLite. It is not uncommon for RepBase files with fixed species to have duplicated sequences and duplicated sequence names across fasta files. The package arkasData stores supplementary data for the package arkas; we've included raw RepBase files with duplicated repeats, and a set of fasta with manually excised duplicated RepBase sequences across fasta files.
Kallisto source must be built locally by indexing the reference files. There can not be duplicated sequences across reference fasta; Arkas will identify where in there reference duplicates occur.
suppressWarnings(suppressPackageStartupMessages(library(arkas))) suppressPackageStartupMessages(library(TxDbLite)) pathBase<-system.file("extdata",package="arkasData") FastaPath <- paste0(pathBase, "/fasta") fastaFilesDuplicates <- c( "ERCC.fa", ## spike-in controls "Homo_sapiens.RepBase.20_05.humrep.fa", ## repeats "Homo_sapiens.RepBase.20_05.humsub.fa") ## ALUs and such findDupes(paste0(FastaPath,"/",fastaFilesDuplicates)) #return a df with column of 0
Example of output with duplicated repeat sequences in Repeatome:
duplicates Homo_sapiens.RepBase.20_12.merged.fa1 LTR26B Homo_sapiens.RepBase.20_12.merged.fa2 SVA_A The duplicated repeats LTR26B at row number 222 have identical sequences: TRUE The duplicated repeats LTR26B at row number 1166 have identical sequences: TRUE The duplicated repeats SVA_A at row number 671 have identical sequences: TRUE The duplicated repeats SVA_A at row number 1207 have identical sequences: TRUE There are duplicated sequence names in your FASTA files:
Arkas saves time by generating a reference fasta file only if one does not exist. Note that Arkas enforces the uniformity of references generated by the same version of Kallisto, including the Kallisto version which generates the reference within the name. The reference name that Arkas generates specifies the Kallisto version, and all of the group transcriptomes. In this case, we have already generated the reference fasta, and can regenerate the reference by deleting it.
suppressWarnings(suppressPackageStartupMessages(library(arkas))) suppressPackageStartupMessages(library(TxDbLite)) pathBase<-system.file("extdata",package="arkasData") FastaPath <- paste0(pathBase, "/fasta") if(file.exists("/data")==FALSE){ system("sudo mkdir /data && sudo mkdir /data/output") system("sudo chmod -R 777 /data") } if(file.exists("/data/output")==FALSE){ system("sudo mkdir /data/output && sudo chmod -R 777 /data/output") } OutputPath<-"/data/output" fastqPath <- paste0(pathBase, "/fastq/demoFastqDir/") samples <- c(MrN="MrN", MrT="MrT") FastaFiles <- c( "ERCC.fa.gz", ## spike-in controls "Homo_sapiens.RepBase.20_05.merged.fa.gz") #no duplicates indexName <- indexKallisto(fastaFiles=FastaFiles, fastaPath=FastaPath)$indexName indexName Xtension<-"_*" library(parallel) results <- suppressWarnings(lapply(samples, runKallisto, indexName=indexName, fastqPath=fastqPath, fastaPath=FastaPath, bootstraps=20, outputPath=OutputPath, extension=Xtension, bias=FALSE))
Arkas runs kallisto pseudo-alignment quantification into a KallistoExperiment object, as an S4 object. The SummarizedExperiment accessors are the same for KallistoExperiment objects because KallistoExperiment containers are a sub-class of SummarizedExperiments We enforce uniformity between kallisto versions because if the kallisto versions are not the same, then there exists data variance between different KallistoExperiments' assay data and indexed references. In order to annotate the merged KallistoExperiment, TxDbLite is used to create SQL-lite transcriptome/repeatome databases. This tutorial already created the transcriptome package which can be accessed using transcriptomes() accessor on the KallistoExperiment object.
suppressPackageStartupMessages(library(arkas)) suppressPackageStartupMessages(library(TxDbLite)) samples<-c("n1","n2","n4","s1","s2","s4") pathBase<-system.file("extdata",package="arkasData") merged <- mergeKallisto(samples, outputPath=pathBase) assays(merged) kallistoVersion(merged) transcriptomes(merged) tail(tpm(merged))
Many RNA-Seq experiments derive counts as reads per kilobase per million, or fragments per kilobase per million. The quantification from RNA-Seq transcript abundances is dependent on transcript length in proportion to relative abundance. The estimated probability of reads generated by a transcript is given by counting the number of reads that align to transcript divided by total number of mapped reads. Where the definition of transcript-per-million, TPM, is given as mean transcript length in kilobases multiplied by RPKM. The mean for transcript dependent lengths gives weighted measure for expression of the lengths across iso-forms. Arkas handles RPKM to TPM conversion by dividing the selected transcript RPKM score by the sum of all RPKM, and multiplying by 1e6 (Li,2009) TxDbLite is a sister package to arkas which can annotate kallisto-experiment classes.
After creating the repeatome, arkas can be used to analyze repetitive elements For instance, one can plot repeat element transcripts using (counts/bootstrap MADs) as effect size, or analyze families of Alu, LTR, or Endongenous Retroviruses. Below we plot a heat map of quantified transcripts from our repeatome for species Homo Sapiens.
Bo, Li, et. al, "RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty."
Oxford Journals Bioinformatics. Oxford Journal, 9 Dec. 2009. Web. 20 June 2015.
http://bioinformatics.oxfordjournals.org/content/26/4/493.full.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.