Importing transcript abundance with tximport

Introduction

Import and summarize transcript-level abundance estimates for transcript- and gene-level analysis with Bioconductor packages, such as edgeR, DESeq2, and limma-voom. The motivation and methods for the functions provided by the tximport package are described in the following article [@Soneson2015]:

Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015): Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research http://dx.doi.org/10.12688/f1000research.7563.1

In particular, the tximport pipeline offers the following benefits: (i) this approach corrects for potential changes in gene length across samples (e.g. from differential isoform usage) [@Trapnell2013Differential], (ii) some of the upstream quantification methods (Salmon, Sailfish, kallisto) are substantially faster and require less memory and disk usage compared to alignment-based methods that require creation and storage of BAM files, and (iii) it is possible to avoid discarding those fragments that can align to multiple genes with homologous sequence, thus increasing sensitivity [@Robert2015Errors].

Note: another Bioconductor package, tximeta [@Love2020], extends tximport, offering the same functionality, plus the additional benefit of automatic addition of annotation metadata for commonly used transcriptomes (GENCODE, Ensembl, RefSeq for human and mouse). See the tximeta package vignette for more details. Whereas tximport outputs a simple list of matrices, tximeta will output a SummarizedExperiment object with appropriate GRanges added if the transcriptome is from one of the sources above for human and mouse.

library(knitr)
opts_chunk$set(tidy=TRUE,message=FALSE)

Import transcript-level estimates

We begin by locating some prepared files that contain transcript abundance estimates for six samples, from the tximportData package. The tximport pipeline will be nearly identical for various quantification tools, usually only requiring one change the type argument. We begin with quantification files generated by the Salmon software, and later show the use of tximport with any of:

First, we locate the directory containing the files. (Here we use system.file to locate the package directory, but for a typical use, we would just provide a path, e.g. "/path/to/dir".)

library(tximportData)
dir <- system.file("extdata", package="tximportData")
list.files(dir)

Next, we create a named vector pointing to the quantification files. We will create a vector of filenames first by reading in a table that contains the sample IDs, and then combining this with dir and "quant.sf.gz". (We gzipped the quantification files to make the data package smaller, this is not a problem for R functions that we use to import the files.)

samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
samples
files <- file.path(dir, "salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
all(file.exists(files))

Transcripts need to be associated with gene IDs for gene-level summarization. If that information is present in the files, we can skip this step. For Salmon, Sailfish, and kallisto the files only provide the transcript ID. We first make a data.frame called tx2gene with two columns: 1) transcript ID and 2) gene ID. The column names do not matter but this column order must be used. The transcript ID must be the same one used in the abundance files.

Creating this tx2gene data.frame can be accomplished from a TxDb object and the select function from the AnnotationDbi package. The following code could be used to construct such a table:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
k <- keys(txdb, keytype="TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")

Note: if you are using an Ensembl transcriptome, the easiest way to create the tx2gene data.frame is to use the ensembldb packages. The annotation packages can be found by version number, and use the pattern EnsDb.Hsapiens.vXX. The transcripts function can be used with return.type="DataFrame", in order to obtain something like the df object constructed in the code chunk above. See the ensembldb package vignette for more details.

In this case, we've used the Gencode v27 CHR transcripts to build our index, and we used makeTxDbFromGFF and code similar to the chunk above to build the tx2gene table. We then read in a pre-constructed tx2gene table:

library(readr)
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
head(tx2gene)

The tximport package has a single function for importing transcript-level estimates. The type argument is used to specify what software was used for estimation. A simple list with matrices, "abundance", "counts", and "length", is returned, where the transcript level information is summarized to the gene-level. Typically, abundance is provided by the quantification tools as TPM (transcripts-per-million), while the counts are estimated counts (possibly fractional), and the "length" matrix contains the effective gene lengths. The "length" matrix can be used to generate an offset matrix for downstream gene-level differential analysis of count matrices, as shown below.

Note: While tximport works without any dependencies, it is significantly faster to read in files using the readr package. If tximport detects that readr is installed, then it will use the readr::read_tsv function by default. A change from version 1.2 to 1.4 is that the reader is not specified by the user anymore, but chosen automatically based on the availability of the readr package. Advanced users can still customize the import of files using the importer argument.

library(tximport)
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
names(txi)
head(txi$counts)

We could alternatively generate counts from abundances, using the argument countsFromAbundance, scaled to library size, "scaledTPM", or additionally scaled using the average transcript length, averaged over samples and to library size, "lengthScaledTPM". Using either of these approaches, the counts are not correlated with length, and so the length matrix should not be provided as an offset for downstream analysis packages. As of tximport version 1.10, we have added a new countsFromAbundance option "dtuScaledTPM". This scaling option is designed for use with txOut=TRUE for differential transcript usage analyses. See ?tximport for details on the various countsFromAbundance options.

We can avoid gene-level summarization by setting txOut=TRUE, giving the original transcript level estimates as a list of matrices.

txi.tx <- tximport(files, type="salmon", txOut=TRUE)

These matrices can then be summarized afterwards using the function summarizeToGene. This then gives the identical list of matrices as using txOut=FALSE (default) in the first tximport call.

txi.sum <- summarizeToGene(txi.tx, tx2gene)
all.equal(txi$counts, txi.sum$counts)

Salmon

Salmon or Sailfish quant.sf files can be imported by setting type to "salmon" or "sailfish".

files <- file.path(dir,"salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
txi.salmon <- tximport(files, type="salmon", tx2gene=tx2gene)
head(txi.salmon$counts)

We quantified with Sailfish against a different transcriptome, so we need to read in a different tx2gene for this next code chunk.

tx2knownGene <- read_csv(file.path(dir, "tx2gene.csv"))
files <- file.path(dir,"sailfish", samples$run, "quant.sf")
names(files) <- paste0("sample",1:6)
txi.sailfish <- tximport(files, type="sailfish", tx2gene=tx2knownGene)
head(txi.sailfish$counts)

Note: for previous version of Salmon or Sailfish, in which the quant.sf files start with comment lines, it is recommended to specify the importer argument as a function which reads in the lines beginning with the header. For example, using the following code chunk (un-evaluated):

txi <- tximport("quant.sf", type="none", txOut=TRUE,
                txIdCol="Name", abundanceCol="TPM",
                countsCol="NumReads", lengthCol="Length",
                importer=function(x) read_tsv(x, skip=8))

Salmon with inferential replicates

If inferential replicates (Gibbs or bootstrap samples) are present in expected locations relative to the quant.sf file, tximport will import these as well. Here we demonstrate using Salmon, run with only 5 Gibbs replicates (usually more Gibbs samples, such as 20 or 30, would be useful for capturing inferential uncertainty).

files <- file.path(dir,"salmon_gibbs", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
txi.inf.rep <- tximport(files, type="salmon", txOut=TRUE)
names(txi.inf.rep)
names(txi.inf.rep$infReps)
dim(txi.inf.rep$infReps$sample1)

The tximport arguments varReduce and dropInfReps can be used to summarize the inferential replicates into a single variance per transcript/gene and per sample, or to not import inferential replicates, respectively.

kallisto

kallisto abundance.h5 files can be imported by setting type to "kallisto". Note that this requires that you have the Bioconductor package rhdf5 installed. (Here we only demonstrate reading in transcript-level information.)

files <- file.path(dir, "kallisto_boot", samples$run, "abundance.h5")
names(files) <- paste0("sample",1:6)
txi.kallisto <- tximport(files, type="kallisto", txOut=TRUE)
head(txi.kallisto$counts)

kallisto with inferential replicates

Because the kallisto_boot directory also has inferential replicate information, it was imported as well.

names(txi.kallisto)
names(txi.kallisto$infReps)
dim(txi.kallisto$infReps$sample1)

kallisto with TSV files

kallisto abundance.tsv files can be imported as well, but this is typically slower than the approach above. Note that we add an additional argument in this code chunk, ignoreAfterBar=TRUE. This is because the Gencode transcripts have names like "ENST00000456328.2|ENSG00000223972.5|...", though our tx2gene table only includes the first "ENST" identifier. We therefore want to split the incoming quantification matrix rownames at the first bar "|", and only use this as an identifier. We didn't use this option earlier with Salmon, because we used the argument --gencode when running Salmon, which itself does the splitting upstream of tximport. Note that ignoreTxVersion and ignoreAfterBar are only to facilitating the summarization to gene level.

files <- file.path(dir, "kallisto", samples$run, "abundance.tsv.gz")
names(files) <- paste0("sample",1:6)
txi.kallisto.tsv <- tximport(files, type="kallisto", tx2gene=tx2gene,
                             ignoreAfterBar=TRUE)
head(txi.kallisto.tsv$counts)

RSEM

RSEM sample.genes.results files can be imported by setting type to "rsem", and txIn and txOut to FALSE.

files <- file.path(dir,"rsem", samples$run, paste0(samples$run, ".genes.results.gz"))
names(files) <- paste0("sample",1:6)
txi.rsem <- tximport(files, type="rsem", txIn=FALSE, txOut=FALSE)
head(txi.rsem$counts)

RSEM sample.isoforms.results files can be imported by setting type to "rsem", and txIn and txOut to TRUE.

files <- file.path(dir,"rsem", samples$run, paste0(samples$run, ".isoforms.results.gz"))
names(files) <- paste0("sample",1:6)
txi.rsem <- tximport(files, type="rsem", txIn=TRUE, txOut=TRUE)
head(txi.rsem$counts)

StringTie

StringTie t_data.ctab files giving the coverage and abundances for transcripts can be imported by setting type to stringtie. These files can be generated with the following command line call:

stringtie -eB -G transcripts.gff <source_file.bam> 

tximport will compute counts from the coverage information, by reversing the formula that StringTie uses to calculate coverage (see ?tximport). The read length is used in this formula, and so if you've set a different read length when using StringTie, you can provide this information with the readLength argument. The tx2gene table should connect transcripts to genes, and can be pulled out of one of the t_data.ctab files. The tximport call would look like the following (here not evaluated):

tmp <- read_tsv(files[1])
tx2gene <- tmp[,c("t_name","gene_name")]
txi <- tximport(files, type="stringtie", tx2gene=tx2gene)

Alevin

scRNA-seq data quantified with Alevin can be easily imported using tximport. The following unevaluated example shows import of the quants matrix (for a live example, see the unit test file test_alevin.R). A single file should be specified which will import a gene-by-cell matrix of data.

files <- "path/to/alevin/quants_mat.gz"
txi <- tximport(files, type="alevin")

Downstream DGE in Bioconductor

Note: there are two suggested ways of importing estimates for use with differential gene expression (DGE) methods. The first method, which we show below for edgeR and for DESeq2, is to use the gene-level estimated counts from the quantification tools, and additionally to use the transcript-level abundance estimates to calculate a gene-level offset that corrects for changes to the average transcript length across samples. The code examples below accomplish these steps for you, keeping track of appropriate matrices and calculating these offsets. For edgeR you need to assign a matrix to y$offset, but the function DESeqDataSetFromTximport takes care of creation of the offset for you. Let's call this method "original counts and offset".

The second method is to use the tximport argument countsFromAbundance="lengthScaledTPM" or "scaledTPM", and then to use the gene-level count matrix txi$counts directly as you would a regular count matrix with these software. Let's call this method "bias corrected counts without an offset"

Note: Do not manually pass the original gene-level counts to downstream methods without an offset. The only case where this would make sense is if there is no length bias to the counts, as happens in 3' tagged RNA-seq data (see section below). The original gene-level counts are in txi$counts when tximport was run with countsFromAbundance="no". This is simply passing the summed estimated transcript counts, and does not correct for potential differential isoform usage (the offset), which is the point of the tximport methods [@Soneson2015] for gene-level analysis. Passing uncorrected gene-level counts without an offset is not recommended by the tximport package authors. The two methods we provide here are: "original counts and offset" or "bias corrected counts without an offset". Passing txi to DESeqDataSetFromTximport as outlined below is correct: the function creates the appropriate offset for you to perform gene-level differential expression.

3' tagged RNA-seq

If you have 3' tagged RNA-seq data, then correcting the counts for gene length will induce a bias in your analysis, because the counts do not have length bias. Instead of using the default full-transcript-length pipeline, we recommend to use the original counts, e.g. txi$counts as a counts matrix, e.g. providing to DESeqDataSetFromMatrix or to the edgeR or limma functions without calculating an offset and without using countsFromAbundance.

edgeR

An example of creating a DGEList for use with edgeR [@Robinson2010]:

library(edgeR)
library(csaw)
cts <- txi$counts
normMat <- txi$length

# Obtaining per-observation scaling factors for length,
# adjusted to avoid changing the magnitude of the counts.
normMat <- normMat / exp(rowMeans(log(normMat)))
normCts <- cts / normMat

# Computing effective library sizes from scaled counts,
# to account for composition biases between samples.
library(edgeR)
eff.lib <- calcNormFactors(normCts) * colSums(normCts)

# Combining effective library sizes with the length factors,
# and calculating offsets for a log-link GLM.
normMat <- sweep(normMat, 2, eff.lib, "*")
normMat <- log(normMat)

# Creating a DGEList object for use in edgeR.
y <- DGEList(cts)
y <- scaleOffset(y, normMat)
# filtering
keep <- filterByExpr(y)
y <- y[keep,]
# y is now ready for estimate dispersion functions
# see edgeR User's Guide

For creating a matrix of CPMs within edgeR, the following code chunk can be used:

se <- SummarizedExperiment(assays=list(counts=y$counts, offset=y$offset))
se$totals <- y$samples$lib.size
library(csaw)
cpms <- calculateCPM(se, use.offsets=TRUE, log=FALSE)

DESeq2

An example of creating a DESeqDataSet for use with DESeq2 [@Love2014]:

library(DESeq2)

The user should make sure the rownames of sampleTable align with the colnames of txi$counts, if there are colnames. The best practice is to read sampleTable from a CSV file, and to construct files from a column of sampleTable, as was shown in the tximport examples above.

sampleTable <- data.frame(condition=factor(rep(c("A","B"),each=3)))
rownames(sampleTable) <- colnames(txi$counts)
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~ condition)
# dds is now ready for DESeq()
# see DESeq2 vignette

limma-voom

An example of creating a data object for use with limma-voom [@Law2014]. Because limma-voom does not use the offset matrix stored in y$offset, we recommend using the scaled counts generated from abundances, either "scaledTPM" or "lengthScaledTPM":

files <- file.path(dir,"salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
txi <- tximport(files, type="salmon",
                tx2gene=tx2gene,
                countsFromAbundance="lengthScaledTPM")
library(limma)
y <- DGEList(txi$counts)
# filtering
keep <- filterByExpr(y)
y <- y[keep,]
y <- calcNormFactors(y)
design <- model.matrix(~ condition, data=sampleTable)
v <- voom(y, design)
# v is now ready for lmFit()
# see limma User's Guide

Acknowledgments

The development of tximport has benefited from contributions and suggestions from:

Rob Patro (inferential replicates import), Andrew Parker Morgan (RHDF5 support), Ryan C. Thompson (RHDF5 support), Matt Shirley (ignoreTxVersion), Avi Srivastava (alevin import), Scott Van Buren (infReps testing), Stephen Turner, Richard Smith-Unna, Rory Kirchner, Martin Morgan, Jenny Drnevich, Patrick Kimes, Leon Fodoulian, Koen Van den Berge, Aaron Lun,

Session info

sessionInfo()

References



Try the tximport package in your browser

Any scripts or data that you put into this service are public.

tximport documentation built on Nov. 8, 2020, 8:20 p.m.