In alexvpickering/rkal: Run, Import, and Export 'kallisto' Quantifications

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(rkal)

Installation

rkal can be installed as follows:

remotes::install.packages('alexvpickering/rkal')

Overview of rkal

kallisto is the fastest program for accurately quantifying transcript abundance from bulk RNA-seq data. rkal provides an R interface to kallisto functions index (for building a genome index) and quant (for running pseudoalignment). rkal automates argument specification for fastq files downloaded with GEOfastq and provides a convenient GUI for personal fastq files. After pseudoalignment, counts can be imported into R into a data structure that is compatible with crossmeta (for differential expression and meta analyses) and dseqr (a GUI for bulk and single-cell exploratory analyses as well as connectivity mapping). Alternatively, counts can be imported for GEO samples pre-aligned as part of the ARCHS4 project.

Getting Started using rkal

Prior to pseudoalignment, an index of the transcriptome must first be built:

#This will build the human Ensembl94 index for kallisto in the working directory
#this only needs to be run once
indices_dir <- getwd()
build_kallisto_index(indices_dir = indices_dir,
                     species = 'homo_sapiens', release = '94')

Next, we download an example fastq file with GEOfastq:

library(GEOfastq)
data_dir <- tempdir()

# first get metadata needed and download example fastq file
srp_meta <- crawl_gsms('GSM4875733')
res <- get_fastqs(srp_meta, data_dir)

Next we collect fastq file metadata needed to run pseudoalignement (are fastq files paired or single-end? Are there any samples split across multiple files?):

# we can get the necessary info automatically for fastqs from GEOfastq
quant_meta <- get_quant_meta(srp_meta, data_dir)

# for personal fastq files, a GUI will request this info in the next step

We are now ready to run pseudoalignment:

# exclude quant_meta for personal fastqs (will invoke GUI)
res <- run_kallisto_bulk(indices_dir, data_dir, quant_meta)

If you plan to use crossmeta or dseqr, you can easily generate a suitably annotated ExpressionSet:

eset <- load_seq(data_dir)

# if you downloaded the ARCHS4 pre-aligned GEO data
archs4_file <- '/path/to/human_matrix_v*.h5'
# eset <- load_archs4_seq(archs4_file, 'GSM4875733')

Alternative, see e.g. the tximport vignette if you prefer to run limma or DESeq2 differential expression analyses.