knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Originally, this package was written when the kallisto | bustools concept was still experimental, to test a new and fast way to generate the gene count matrix from fastq files for scRNA-seq. In the past year, kallisto | bustools has matured. There is now a wrapper, kb-python, that can download a prebuilt kallisto index for human and mouse and call kallisto bus and bustools to get the gene count matrix. As a result, the old way of calling kallisto bus and bustools, and some functionalities of BUSpaRse, such as getting the transcript to gene mapping, are largely obsolete.
So now the focus of BUSpaRse has shifted to finer control of the transcripts that go into the transcriptome and more options. Now all tr2g_* functions (except tr2g_ensembl) can filter transcripts for gene and transcript biotypes, keep only standard chromosomes (so no scaffolds and haplotypes), and extract the filtered transcripts from the transcriptome. GTF files from Ensembl, Ensembl fasta files, GFF3 files from Ensembl and RefSeq, TxDb, and EnsDb can all be used here.
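As a rough picture of what filtering for biotypes and standard chromosomes amounts to, here is a minimal base R sketch on a made-up annotation table. It is not how the tr2g_* functions are implemented; the column names, the chosen biotypes, and the set of standard chromosomes (human-style 1 to 22, X, Y, MT) are all assumptions for illustration.

```r
# Made-up annotation table; column names are assumptions for this sketch.
annot <- data.frame(
  transcript   = c("tx1", "tx2", "tx3", "tx4"),
  gene_biotype = c("protein_coding", "lncRNA", "rRNA", "protein_coding"),
  chromosome   = c("1", "X", "MT", "KI270728.1"),  # last one is scaffold-style
  stringsAsFactors = FALSE
)

# Hypothetical choices: which biotypes to keep and what counts as "standard"
keep_biotypes <- c("protein_coding", "lncRNA")
standard_chrs <- c(as.character(1:22), "X", "Y", "MT")

# Keep rows that pass both the biotype filter and the chromosome filter
filtered <- annot[annot$gene_biotype %in% keep_biotypes &
                    annot$chromosome %in% standard_chrs, ]
filtered$transcript
```

Only the transcripts that pass both filters remain; the rRNA transcript and the one on the scaffold are dropped.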
library(BUSpaRse)
library(BSgenome.Hsapiens.UCSC.hg38)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(EnsDb.Hsapiens.v86)
The transcriptome can be downloaded from a specified version of Ensembl and filtered for biotypes and standard chromosomes, not only from the vertebrate database (www.ensembl.org and its mirrors), but also from other Ensembl sites for plants, fungi, protists, and metazoa. Setting gene_biotype_use = "cellranger" means that the same gene biotypes Cell Ranger uses for its reference package are used here. By default, only standard chromosomes are kept. The dl_transcriptome function not only downloads the transcriptome and filters it, it also outputs the tr2g.tsv file of all transcripts in the filtered transcriptome, without column names, so it can be directly used for
Wonder which biotypes are available? The lists of all gene and transcript biotypes from Ensembl are now provided in this package, and can be queried by
Resources for common invertebrate model organisms such as Drosophila melanogaster and C. elegans are actually available on the vertebrate site (www.ensembl.org).
# For Drosophila
dl_transcriptome("Drosophila melanogaster", out_path = "fly",
                 gene_biotype_use = "cellranger", verbose = FALSE)
list.files("fly")
The first file is the original fasta file. The second is the tr2g file without column names. The third is the filtered fasta file.
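Because the tr2g file has no column names, reading it back into R needs header = FALSE and names supplied by hand. The sketch below uses a small temporary file with made-up IDs as a stand-in for the real file, and assumes two tab-separated columns (transcript, then gene); a real file may carry additional columns.

```r
# Toy stand-in for a header-less tr2g.tsv; the IDs are made up for illustration.
toy_path <- tempfile(fileext = ".tsv")
writeLines(c("tx1.1\tgeneA.1", "tx2.1\tgeneA.1", "tx3.2\tgeneB.4"), toy_path)

# header = FALSE because the file has no column names; name the columns by hand.
tr2g <- read.delim(toy_path, header = FALSE, stringsAsFactors = FALSE,
                   col.names = c("transcript", "gene"))

# Number of transcripts per gene
table(tr2g$gene)
```

To read the actual output instead, replace toy_path with the path to the tr2g.tsv file in the output directory.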
For C. elegans, from an archived version of Ensembl
dl_transcriptome("Caenorhabditis elegans", out_path = "worm", verbose = FALSE,
                 gene_biotype_use = "cellranger", ensembl_version = 97)
list.files("worm")
For Saccharomyces cerevisiae. Note that the Ensembl sites for plants, fungi, etc. are actually Ensembl Genomes (www.ensemblgenomes.org), whose versioning differs from that of the vertebrate site.
dl_transcriptome("Saccharomyces cerevisiae", out_path = "yeast",
                 type = "fungus", gene_biotype_use = "cellranger",
                 verbose = FALSE)
list.files("yeast")
The transcript to gene data frame can be generated by directly querying Ensembl with biomaRt. This can query not only the vertebrate database (www.ensembl.org), but also the Ensembl databases for other organisms, such as plants (plants.ensembl.org) and fungi (fungi.ensembl.org). By default, this uses the most recent version of Ensembl, but older versions can also be used. By default, the Ensembl transcript ID (with version number), gene ID (with version number), and gene symbol are downloaded, but other attributes available on Ensembl can be downloaded as well. Make sure that the Ensembl version matches the Ensembl version of the transcriptome used to build the kallisto index.
# Specify other attributes
tr2g_mm <- tr2g_ensembl("Mus musculus", ensembl_version = 99,
                        other_attrs = "description",
                        gene_biotype_use = "cellranger")
# Plants
tr2g_at <- tr2g_ensembl("Arabidopsis thaliana", type = "plant")
We need a FASTA file for the transcriptome used to build the kallisto index. Transcriptome FASTA files from Ensembl contain gene annotation in the sequence name of each transcript, so transcript and gene information can be extracted from the sequence name. At present, only Ensembl FASTA files, or FASTA files with sequence names formatted as in Ensembl, are accepted.
By default, the tr2g.tsv file and filtered fasta file (if filtering for biotypes and chromosomes) are written to disk, but these can be turned off so only the tr2g data frame is returned into the R session.
# Subset of a real Ensembl FASTA file
toy_fasta <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse")
tr2g_fa <- tr2g_fasta(file = toy_fasta, write_tr2g = FALSE,
                      save_filtered = FALSE)
head(tr2g_fa)
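To make "extracting information from the sequence name" more concrete, here is a minimal base R sketch that pulls a few fields out of one header line with regular expressions. It is only an illustration and not how tr2g_fasta is implemented. The header follows the key:value layout of Ensembl cDNA FASTA files, but the IDs and symbol in it are made up, and get_field is a hypothetical helper.

```r
# Made-up header in the Ensembl cDNA layout; IDs and symbol are not real records.
header <- paste(">ENST00000000001.1 cdna chromosome:GRCh38:1:100:200:1",
                "gene:ENSG00000000001.1 gene_biotype:protein_coding",
                "transcript_biotype:protein_coding gene_symbol:TOY1",
                "description:made-up record")

# Hypothetical helper: return the value after "key:" up to the next space,
# or NA if the key is absent.
get_field <- function(x, key) {
  pattern <- paste0(".*\\s", key, ":(\\S+).*")
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

parsed <- data.frame(
  transcript   = sub("^>(\\S+).*", "\\1", header, perl = TRUE),
  gene         = get_field(header, "gene"),
  gene_name    = get_field(header, "gene_symbol"),
  gene_biotype = get_field(header, "gene_biotype"),
  stringsAsFactors = FALSE
)
parsed
```

Requiring whitespace before the key keeps "gene" from matching inside a longer key, and the trailing colon keeps it from matching "gene_biotype" or "gene_symbol".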
If you have GTF or GFF3 files for other purposes, these can also be used to generate the transcript to gene file. Now tr2g_gff3 can extract the transcriptome from a genome that is either a BSgenome or a
# Subset of a real GTF file from Ensembl
toy_gtf <- system.file("testdata/gtf_test.gtf", package = "BUSpaRse")
tr2g_tg <- tr2g_gtf(toy_gtf, Genome = BSgenome.Hsapiens.UCSC.hg38,
                    gene_biotype_use = "cellranger", out_path = "gtf")
head(tr2g_tg)
A new GTF or GFF3 file after filtering biotypes and chromosomes is also written, and this can be turned off by setting save_filtered_gtf = FALSE or save_filtered_gff = FALSE. The transcriptome, with biotypes filtered and only standard chromosomes, is in
compress_fa = TRUE to gzip it.
TxDb is a class for storing transcript annotations from the Bioconductor package
TxDb.Hsapiens.UCSC.hg38.knownGene does not have biotype information or gene symbols.
tr2g_hs <- tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene,
                     get_transcriptome = FALSE, write_tr2g = FALSE)
head(tr2g_hs)
EnsDb is a class for Ensembl gene annotations, from the Bioconductor package ensembldb. Ensembl annotations as EnsDb are available on AnnotationHub (since version 87), and some older versions are standalone packages (e.g.
tr2g_hs86 <- tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE,
                        write_tr2g = FALSE, gene_biotype_use = "cellranger",
                        use_gene_version = FALSE,
                        use_transcript_version = FALSE)
head(tr2g_hs86)
There used to be sections about save_tr2g_bustools, but these functions have been superseded by the new version of the tr2g functions and dl_transcriptome, which sort the transcriptome after extracting it so the tr2g and the transcriptome are in the same order. The new version of the tr2g functions and dl_transcriptome also, by default, write the tr2g.tsv to disk without column names, with the first column as transcript and the second as gene.
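The idea of keeping the tr2g and the transcriptome in the same order, and then writing a header-less two-column file, can be pictured with a short base R sketch. This is only an illustration with made-up names, not the package's internal code.

```r
# Made-up transcript order, as it might appear in a transcriptome FASTA file
fasta_order <- c("tx3", "tx1", "tx2")

# Made-up transcript-to-gene table in a different order
tr2g <- data.frame(transcript = c("tx1", "tx2", "tx3"),
                   gene = c("geneA", "geneA", "geneB"),
                   stringsAsFactors = FALSE)

# match() gives, for each FASTA entry, the row of tr2g that holds it
tr2g_sorted <- tr2g[match(fasta_order, tr2g$transcript), ]
rownames(tr2g_sorted) <- NULL

# Write without column names: transcript first, gene second
out_file <- tempfile(fileext = ".tsv")
write.table(tr2g_sorted, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
readLines(out_file)
```

With the current functions this reordering should not be needed by hand, since the text above says the sorting is done for you.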