Kallisto is software developed by Nicolas Bray, Harold Pimentel, Pall Melsted, and Lior Pachter (UC Berkeley) that analyzes 30 million unaligned paired-end RNA-Seq reads in less than 5 minutes on a standard laptop computer. Kallisto quantifies transcript abundance from input RNA-Seq reads by using a process, known as pseudoalignment, which identifies the read-transcript compatibility matrix. Arkas is a BioConductor package that extends functions and utilities for RNA-Seq analysis from raw reads to results in minutes.
Arkas was designed to reduce the programmative steps required to quantify and annotate multitudes of sample directories. Arkas calls Kallisto to perform on- the-fly transcriptome indexing and quantification recursively for numerous sample directories. For RNA- Seq projects with numerous sequenced samples, Arkas encapsulates expensive preparatory routines. Arkas programmatically orders FASTQ files output from DNA sequencers and inputs a list required by Kallisto for processing multitudes of demultiplexed reads. The Arkas function 'runKallisto' recursively indexes transcriptomes and quantifies abundances for any number of samples. The function 'mergeKallisto' merges quantified output into an object of ofsubclass a KallistoExperiment-class, SummarizedExperiment-class. Standard mutators and accessor methods from SummarizedExperiment- methods are preserved in KallistoExperiment-methods. Gene annotation is performed from user-selected bundled transcriptomes (ERCC, Ensembl, and/or RepBase) simultaneously merging annotated samples into one R object: KallistoExperiment. Arkas annotates genes for Homo-Sapiens GrCh38 and Mouse GrCm38 (NCBI). Routines such as 'annotateBundles' yields annotated genes from transcriptomes such as External RNA Control Consortium (ERCC), Ensembl release 81 of non-coding RNA, coding RNA, and a hg38 repeatome for both species.
For linux systems, after installing the dependencies, kallisto is installed via:
mkdir /KallistoSource cd /KallistoSource git clone https://github.com/pachterlab/kallisto.git cd ./kallisto mkdir ./build cd ./build cmake .. make make install
Arkas supports various levels of analysis, namely transcript-level or gene-level analysis which involves the Limma package for differential expression analysis.
Gene Wise Analysis is founded on the idea that groups of transcripts by a fixed Ensembl Gene ID is termed a "gene"; where "gene" counts are defined as the sum of all transcripts identified by the same unique Ensembl Gene Id. Gene Wise analysis generates bundled and aggregated transcripts associated with a specific Ensembl Gene ID. Arkas wraps limma around another method titled "collapseBundles", which collapses transcripts into appropriate groups and sums the quantified transcript counts of the group; these transcript aggregated counts are defined as "gene" counts.
Not all transcripts have the same function homology. Most folks agree that genes are made up by transcripts defined by the transcripts' coordinate location on the genome. However there are transcipt isoforms in DNMT3A and WT1 that have radically different biological function depending on the transcript isoform that is present. The problem with conducting only a gene level analysis is that many genes can have the same total gene level total quantified counts; however the biological mechanisms for the same "gene" can vary greatly by a single transcript isoform.
suppressWarnings(suppressPackageStartupMessages(library(arkas))) jsonFile <- system.file("extdata", "NS.JSON", package="arkas") appSession <- fetchAppSession(jsonFile) ## a names(appSession$samples) <- appSession$samples ## so column names get set appSession$outputPath <- system.file("extdata", "", package="arkasData") pathBase<-system.file("extdata",package="arkasData") fastaPath <- paste0(pathBase, "/fasta") appSession$fastaPath<-fastaPath NS <- mergeKallisto(appSession$samples, outputPath=appSession$outputPath)
In order to analyze bundle-aggregated transcripts defined as "genes", we create a design matrix which controls for individual effects and contrasts treatment effects across individual subjects.
NS$subject <- factor(substr(colnames(NS), 2, 2)) NS$treatment <- substr(colnames(NS), 1, 1) == "s" NS$ID <- NULL design <- with(as(colData(NS), "data.frame"), model.matrix( ~ treatment + subject )) rownames(design) <- colnames(NS) metadata(NS)$design <- design design
In order to run gene-wise analysis, Arkas requires that the merged KallistoExperiment must be annotated; this is because we must collapse transcripts into groups linked to unique Ensembl Gene Ids.
Library Annotations are built using TxDbLite; these annotation databases allow for lite annotations parsing gene names, bio-types and family type from reference fastas from ERCC, Ensembl, or RepBase. Currently exonic, intronic, or other coordinate dependent information is not included in TxDbLite. The supplemental package arkasData stores the ready-to-load annotation libraries under /extdata/Libraries directory. For demonstration, we build the libraries under the arkasData/extdata/fasta/tmp directory.
#library(SummarizedExperiment) #library(GenomicRanges) suppressPackageStartupMessages(library(TxDbLite)) suppressWarnings(suppressPackageStartupMessages(library(arkas))) jsonFile <- system.file("extdata", "NS.JSON", package="arkas") appSession <- fetchAppSession(jsonFile) names(appSession$samples) <- appSession$samples appSession$outputPath <- system.file("extdata", package="arkasData") fastaPath<-system.file("extdata","fasta",package="arkasData") appSession$fastaPath<-fastaPath pathBase<-system.file("extdata",package="arkasData") load(paste0(pathBase,"/annotatedKexp/NS.RData")) #pre-annotated NS$subject <- factor(substr(colnames(NS), 2, 2)) NS$treatment <- substr(colnames(NS), 1, 1) == "s" NS$ID <- NULL design <- with(as(colData(NS), "data.frame"), model.matrix( ~ subject+treatment )) rownames(design) <- colnames(NS) metadata(NS)$design <- design #returns a KallistoExperiment at the gene level GWA<-geneWiseAnalysis(NS,design=design, how="cpm", p.cutoff=0.05, fold.cutoff=1, read.cutoff=1, species="Homo.sapiens", fitOnly="TRUE") head(GWA$topGenes,n=20) metaGwa<-geneWiseAnalysis(NS,design=design, how="cpm", p.cutoff=0.05, fold.cutoff=1, read.cutoff=1, species="Homo.sapiens", fitOnly="FALSE") head(metaGwa$limmaWithMeta)
Arkas has a function "annotateFeatures.R" which annotates ERCC, Ensembl, and RepBase databases for species Homo-Sapiens, Mus-musculus, and Rattus norvegicus. The method "annotateFeatures.R" annotates the merged KallistoExperiment against every TxDbLite library simulatenously. These annotation databases are defined as 'lite' because they do not store exonic or intronic coordinates.
Gene wise analysis collapses transcripts into groups related to specific ensembl "gene" Ids. The package TxDbLite parses the Ensembl, or RepBase transcript fasta files and stores the respective gene id's associated with the given transcript documented in the transcript fasta header. Arkas' method for gene wise analysis calls "collapseBundles.R" which then calculates the aggregated total counts of transcripts for each unique gene id association. Thus the "gene" count is defined as the sum of all quantified transcripts associated with a specific gene identifier.
The expression results were generated by limma/voom and have the meta biotype, gene name, etc information included in the gene wise analysis results.
For fitOnly, the output contains only the results from limma/voom such as adj.P.Val, and Avg Expr, etc. If fitOnly is set to FALSE, then the output contains a list of limma derived expression values, and entrezID, gene name, and gene biotypes derived by biomaRt and TxDbLite respectively, and a Gene.Title description. The expression results were generated by limma/voom and have the meta biotype, gene name, etc information included in the gene wise analysis results. These meta informaiton results can be piped into Advaita iPathway Guide for an exhaustive pathway analysis.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.