Introduction

Kallisto is software developed by Nicolas Bray, Harold Pimentel, Pall Melsted, and Lior Pachter (UC Berkeley) that analyzes 30 million unaligned paired-end RNA-Seq reads in less than 5 minutes on a standard laptop computer. Kallisto quantifies transcript abundance from input RNA-Seq reads by using a process, known as pseudoalignment, which identifies the read-transcript compatibility matrix. Arkas is a BioConductor package that extends functions and utilities for RNA-Seq analysis from raw reads to results in minutes.

Reads to Quantification to Annotation

Arkas was designed to reduce the programmative steps required to quantify and annotate multitudes of sample directories. Arkas calls Kallisto to perform on- the-fly transcriptome indexing and quantification recursively for numerous sample directories. For RNA- Seq projects with numerous sequenced samples, Arkas encapsulates expensive preparatory routines. Arkas programmatically orders FASTQ files output from DNA sequencers and inputs a list required by Kallisto for processing multitudes of demultiplexed reads. The Arkas function 'runKallisto' recursively indexes transcriptomes and quantifies abundances for any number of samples. The function 'mergeKallisto' merges quantified output into an object of ofsubclass a KallistoExperiment-class, SummarizedExperiment-class. Standard mutators and accessor methods from SummarizedExperiment- methods are preserved in KallistoExperiment-methods. Gene annotation is performed from user-selected bundled transcriptomes (ERCC, Ensembl, and/or RepBase) simultaneously merging annotated samples into one R object: KallistoExperiment. Arkas annotates genes for Homo-Sapiens GrCh38 and Mouse GrCm38 (NCBI). Routines such as 'annotateBundles' yields annotated genes from transcriptomes such as External RNA Control Consortium (ERCC), Ensembl release 81 of non-coding RNA, coding RNA, and a hg38 repeatome for both species.

Kallisto Installation

For linux systems, after installing the dependencies, kallisto is installed via:

mkdir /KallistoSource           
cd /KallistoSource             
git clone https://github.com/pachterlab/kallisto.git         
cd ./kallisto       
mkdir ./build        
cd ./build       
cmake ..              
make          
make install   

Gene Wise Analysis

Arkas supports various levels of analysis, namely transcript-level or gene-level analysis which involves the Limma package for differential expression analysis.
Gene Wise Analysis is founded on the idea that groups of transcripts by a fixed Ensembl Gene ID is termed a "gene"; where "gene" counts are defined as the sum of all transcripts identified by the same unique Ensembl Gene Id. Gene Wise analysis generates bundled and aggregated transcripts associated with a specific Ensembl Gene ID. Arkas wraps limma around another method titled "collapseBundles", which collapses transcripts into appropriate groups and sums the quantified transcript counts of the group; these transcript aggregated counts are defined as "gene" counts.

The Measure Depends On The Level

Not all transcripts have the same function homology. Most folks agree that genes are made up by transcripts defined by the transcripts' coordinate location on the genome. However there are transcipt isoforms in DNMT3A and WT1 that have radically different biological function depending on the transcript isoform that is present. The problem with conducting only a gene level analysis is that many genes can have the same total gene level total quantified counts; however the biological mechanisms for the same "gene" can vary greatly by a single transcript isoform.

suppressWarnings(suppressPackageStartupMessages(library(arkas)))    

jsonFile <- system.file("extdata", "NS.JSON", package="arkas")
appSession <- fetchAppSession(jsonFile) ## a
names(appSession$samples) <- appSession$samples ## so column names get set 
appSession$outputPath <- system.file("extdata", "", package="arkasData")
pathBase<-system.file("extdata",package="arkasData")
fastaPath <- paste0(pathBase, "/fasta")
appSession$fastaPath<-fastaPath
NS <- mergeKallisto(appSession$samples, 
                      outputPath=appSession$outputPath)

Creating The Design Matrix

In order to analyze bundle-aggregated transcripts defined as "genes", we create a design matrix which controls for individual effects and contrasts treatment effects across individual subjects.

NS$subject <- factor(substr(colnames(NS), 2, 2))
NS$treatment <- substr(colnames(NS), 1, 1) == "s"
NS$ID <- NULL
design <- with(as(colData(NS), "data.frame"),
                  model.matrix( ~ treatment + subject ))
rownames(design) <- colnames(NS)
metadata(NS)$design <- design
design

Annotate!

In order to run gene-wise analysis, Arkas requires that the merged KallistoExperiment must be annotated; this is because we must collapse transcripts into groups linked to unique Ensembl Gene Ids.

Buiding Annotation libraries

Library Annotations are built using TxDbLite; these annotation databases allow for lite annotations parsing gene names, bio-types and family type from reference fastas from ERCC, Ensembl, or RepBase. Currently exonic, intronic, or other coordinate dependent information is not included in TxDbLite. The supplemental package arkasData stores the ready-to-load annotation libraries under /extdata/Libraries directory. For demonstration, we build the libraries under the arkasData/extdata/fasta/tmp directory.

#library(SummarizedExperiment)
#library(GenomicRanges)

suppressPackageStartupMessages(library(TxDbLite))
suppressWarnings(suppressPackageStartupMessages(library(arkas)))
jsonFile <- system.file("extdata", "NS.JSON", package="arkas")
appSession <- fetchAppSession(jsonFile)
names(appSession$samples) <- appSession$samples
appSession$outputPath <- system.file("extdata", package="arkasData")
fastaPath<-system.file("extdata","fasta",package="arkasData")
appSession$fastaPath<-fastaPath

pathBase<-system.file("extdata",package="arkasData")
load(paste0(pathBase,"/annotatedKexp/NS.RData")) #pre-annotated
NS$subject <- factor(substr(colnames(NS), 2, 2))
NS$treatment <- substr(colnames(NS), 1, 1) == "s"
NS$ID <- NULL
design <- with(as(colData(NS), "data.frame"),
                  model.matrix( ~ subject+treatment ))
rownames(design) <- colnames(NS)
metadata(NS)$design <- design
#returns a KallistoExperiment at the gene level
GWA<-geneWiseAnalysis(NS,design=design,
                       how="cpm",
                       p.cutoff=0.05,
                       fold.cutoff=1,
                       read.cutoff=1,
                       species="Homo.sapiens",
                        fitOnly="TRUE")

head(GWA$topGenes,n=20)


metaGwa<-geneWiseAnalysis(NS,design=design,
                       how="cpm",
                       p.cutoff=0.05,
                       fold.cutoff=1,
                       read.cutoff=1,
                       species="Homo.sapiens",
                        fitOnly="FALSE")

head(metaGwa$limmaWithMeta)

Annotating Merged KallistoExperiment Containers

Arkas has a function "annotateFeatures.R" which annotates ERCC, Ensembl, and RepBase databases for species Homo-Sapiens, Mus-musculus, and Rattus norvegicus. The method "annotateFeatures.R" annotates the merged KallistoExperiment against every TxDbLite library simulatenously. These annotation databases are defined as 'lite' because they do not store exonic or intronic coordinates.

Gene Wise Analysis

Gene wise analysis collapses transcripts into groups related to specific ensembl "gene" Ids. The package TxDbLite parses the Ensembl, or RepBase transcript fasta files and stores the respective gene id's associated with the given transcript documented in the transcript fasta header. Arkas' method for gene wise analysis calls "collapseBundles.R" which then calculates the aggregated total counts of transcripts for each unique gene id association. Thus the "gene" count is defined as the sum of all quantified transcripts associated with a specific gene identifier.

Expression Results

The expression results were generated by limma/voom and have the meta biotype, gene name, etc information included in the gene wise analysis results.

Understanding Gene Wise Analysis Output

For fitOnly, the output contains only the results from limma/voom such as adj.P.Val, and Avg Expr, etc. If fitOnly is set to FALSE, then the output contains a list of limma derived expression values, and entrezID, gene name, and gene biotypes derived by biomaRt and TxDbLite respectively, and a Gene.Title description. The expression results were generated by limma/voom and have the meta biotype, gene name, etc information included in the gene wise analysis results. These meta informaiton results can be piped into Advaita iPathway Guide for an exhaustive pathway analysis.



RamsinghLab/arkas_staging documentation built on March 14, 2021, 11:40 a.m.