knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    txcutr = citation("txcutr")[1]
)

Basics

Motivation

Some RNA-sequencing library preparations generate sequencing reads from specific locations in transcripts. For example, 3'-end tag methods, which include methods like Lexogen's Quant-Seq or 10X's Chromium Single Cell 3', generate reads from the 3' ends of transcripts. For applications interested in quantifying alternative cleavage site usage, using a truncated form of transcriptome annotation can help simplify downstream analysis and, in some pipelines, provide for more accurate estimates. This package implements methods to simplify the generation of such truncated annotations.

Install txcutr

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. r Biocpkg("txcutr") is a R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install r Biocpkg("txcutr") by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("txcutr")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

Required Background

r Biocpkg("txcutr") is based on many other packages and in particular on those that have implemented the infrastructure needed for dealing with transcriptome annotations. That is, packages like r Biocpkg("GenomicRanges") and r Biocpkg("GenomicFeatures"). While we anticipate most use cases for txcutr to not require manipulating GRanges or TxDb objects directly, it would be beneficial to be familiar with the vignettes provided in these packages.

Asking for Help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. We would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the txcutr tag and check any previous posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

Citing txcutr

We hope that r Biocpkg("txcutr") will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation("txcutr")

Quick Start to Using txcutr

Starting from TxDb Objects

Many transcriptome annotations, especially for model organisms, are already directly available as TxDb annotation packages. One can browse a list of available TxDb objects on the Bioconductor website.

Example TxDb

For demonstration purposes, we will work with the SGD genes for the yeast Saccharomyces cerevisiae, and restrict the processing to chrI.

library(GenomicFeatures)
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene

## constrain to "chrI"
seqlevels(txdb) <- "chrI"

## view the lengths of first ten transcripts
transcriptLengths(txdb)[1:10,]

As seen from the above output, the transcript lengths are variable, but in a 3'-end tag-based RNA-sequencing library, we expect reads to only come from a narrow region at the 3' end. We can use the methods in the r Biocpkg("txcutr") package to restrict the annotation to that region for each transcript.

Transcriptome Truncation

Let's use the truncateTxome method to restrict transcripts to their last 500 nucleotides (nts).

library(txcutr)

txdb_w500 <- truncateTxome(txdb, maxTxLength=500)

transcriptLengths(txdb_w500)[1:10,]

Observe how the all transcripts that were previously longer than 500 nts are now exactly 500 nts. Also, any transcripts that were already short enough remain unchanged.

Exporting a New Annotation

We can now export this new transcriptome as a GTF file that could be used in an alignment pipeline or genome viewer. Note that ending the file name with .gz will automatically signal to that the file should be exported with gzip compression.

gtf_file <- tempfile("sacCer3_chrI.sgd.txcutr_w500", fileext=".gtf.gz")
exportGTF(txdb_w500, file=gtf_file)

Exporting Sequences

If we need to prepare an index for alignment or pseudo-alignment, this requires exporting the corresponding sequences of the truncated transcripts. To do this, will need to load a BSgenome object for sacCer3.

library(BSgenome.Scerevisiae.UCSC.sacCer3)
sacCer3 <- BSgenome.Scerevisiae.UCSC.sacCer3

fasta_file <- tempfile("sacCer3_chrI.sgd.txcutr_w500", fileext=".fa.gz")
exportFASTA(txdb_w500, genome=sacCer3, file=fasta_file)

Merge Table

Another file that might be needed by some quantification tools is a merge table. That is, a file that represents a map from the input transcripts to either output transcripts or genes. In some pipelines, one may wish to merge any transcripts that have any overlap. In others, there might be some minimum amount of distance between 3' ends below which it may be uninteresting to discern between. For this, txcutr provides a generateMergeTable method, together with a minDistance argument.

To create a merge for any overlaps whatsoever, one would set the minDistance to be identical to the maximum transcript length.

df_merge <- generateMergeTable(txdb_w500, minDistance=500)

head(df_merge, 10)

In this particular case, these first transcripts each map to themselves. This should not be a surprise, since very few genes in the input annotation had alternative transcripts. However, we can filter to highlight any transcripts that would get merged.

df_merge[df_merge$tx_in != df_merge$tx_out,]

Algorithmically, txcutr will give preference to more distal transcripts in the merge assignment. For example, looking at transcripts from gene YAL026C:

transcripts(txdb_w500, columns=c("tx_name", "gene_id"),
            filter=list(gene_id=c("YAL026C")))

we see that they are on the negative strand and YAL026C-A is more distal in this orientation. This is consistent with YAL026C being merged into YAL026C-A if we are merging all overlapping transcripts.

Note that one can also directly export the merge table using the exportMergeTable function.

Notes on Usage

Making TxDb Objects

The r Biocpkg("GenomicFeatures") package provides a number of means to create custom TxDb objects that can then be used by txcutr. Alternative workflows might include starting from a GTF/GFF3 transcriptome annotation and using the GenomicRanges::makeTxDbFromGFF method, or using the GenomicRanges::makeTxDbFromBiomart method to create the object from Biomart resources.

BiocParallel

The truncation step can be computationally intensive, especially for species whose annotations include many alternative transcript isoforms, such as mouse or human. The truncateTxome method makes use of BiocParallel::bplapply when applicable, and will respect the BiocParallel::register() setting, unless an explicit BPPARAM argument is supplied.

We encourage use of a parallel parameter option, such as MulticoreParam, but it should be noted that with mammalian transcriptomes this results in a maximum RAM usage between 3-4GB per core. Please ensure adequate memory per core is available.

Alternative Cleavage and Polyadenylation

This tool developed out of a pipeline for quantifying 3' UTR isoform expression from scRNA-seq. In such an application, it can be helpful to pre-filter transcripts that may be included in an annotation, but do not have validated 3' ends. Of particular note are the GENCODE annotations: we recommend pre-filtering GENCODE GTF/GFF files to remove any entries with the tag mRNA_end_NF.

Reproducibility

The r Biocpkg("txcutr") package r Citep(bib[["txcutr"]]) was made possible thanks to:

This package was developed using r BiocStyle::Biocpkg("biocthis").

Code for creating the vignette

## Create the vignette
library("rmarkdown")
system.time(render("intro.Rmd", "BiocStyle::html_document"))

## Extract the R code
library("knitr")
knit("intro.Rmd", tangle = TRUE)

Date the vignette was generated.

## Date the vignette was generated
Sys.time()

Wallclock time spent generating the vignette.

## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)

R session information.

## Session info
library("sessioninfo")
options(width = 120)
session_info()

Bibliography

This vignette was generated using r Biocpkg("BiocStyle") r Citep(bib[["BiocStyle"]]) with r CRANpkg("knitr") r Citep(bib[["knitr"]]) and r CRANpkg("rmarkdown") r Citep(bib[["rmarkdown"]]) running behind the scenes.

Citations made with r CRANpkg("RefManageR") r Citep(bib[["RefManageR"]]).

## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))


mfansler/txcutr documentation built on Dec. 21, 2021, 4:59 p.m.