tr2g_EnsDb: Get transcript and gene info from EnsDb objects


tr2g_EnsDb R Documentation

Get transcript and gene info from EnsDb objects


Description

Bioconductor provides Ensembl genome annotation in AnnotationHub; older versions of Ensembl annotation can be obtained from packages like EnsDb.Hsapiens.v86. This is an alternative to querying Ensembl with biomart; Ensembl's server seems to be less stable than that of Bioconductor. However, more information and species are available on Ensembl biomart than on AnnotationHub.


Usage

tr2g_EnsDb(
  ensdb,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  other_attrs = NULL,
  use_gene_name = TRUE,
  use_transcript_version = TRUE,
  use_gene_version = TRUE,
  transcript_biotype_col = "TXBIOTYPE",
  gene_biotype_col = "GENEBIOTYPE",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  compress_fa = FALSE,
  overwrite = FALSE
)



Arguments

ensdb
An EnsDb object, such as from AnnotationHub or EnsDb.Hsapiens.v86.


Genome
Either a BSgenome or an XStringSet object of genomic sequences, from which the intronic sequences will be extracted. Use genomeStyles to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the chromosome names in the genome and the annotation should be set manually so that they are consistent.
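The style check described for Genome can be sketched with GenomeInfoDb, the package that provides genomeStyles. This is a minimal illustration, not part of tr2g_EnsDb itself: chrs and ucsc_chrs are hypothetical names, and the input is a plain character vector of chromosome names rather than a genome or annotation object.

```r
library(GenomeInfoDb)

# List the chromosome naming styles GenomeInfoDb knows for human
head(genomeStyles("Homo_sapiens"))

# Hypothetical chromosome names, written the way Ensembl names them
chrs <- c("1", "2", "X", "MT")

# Guess which style(s) these names follow
seqlevelsStyle(chrs)

# Map the names to UCSC style, e.g. to match a UCSC-style genome
ucsc_chrs <- mapSeqlevels(chrs, "UCSC")
ucsc_chrs
```

If the two styles differ, one of them can be converted before the genome and the annotation are passed to the function.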


get_transcriptome
Logical, whether to extract transcriptome from genome with the GTF file. If filtering biotypes or chromosomes, the filtered GRanges will be used to extract transcriptome.


out_path
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.


write_tr2g
Logical, whether to write tr2g to disk. If TRUE, then a file tr2g.tsv will be written into out_path.


other_attrs
Character vector. Other attributes to get from the EnsDb object, such as gene symbol and position on the genome. Use columns to see which attributes are available.


use_gene_name
Logical, whether to get gene names.


use_transcript_version
Logical, whether to include version number in the Ensembl transcript ID. To decide whether to include transcript version number, check whether version numbers are included in the transcripts.txt in the kallisto output directory. If that file includes version numbers, then transcript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.
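The check described for use_transcript_version can be sketched in base R. This is only an illustration under stated assumptions: tx_ids is a hypothetical stand-in for the lines of transcripts.txt (which could be read with readLines() from the kallisto output directory), and has_version and ids_no_version are names invented for this sketch.

```r
# Hypothetical entries standing in for lines of kallisto's transcripts.txt
tx_ids <- c("ENST00000456328.2", "ENST00000450305.2")

# TRUE if every ID ends in a dot followed by digits (a version suffix)
has_version <- all(grepl("\\.[0-9]+$", tx_ids))

# The result is the value to pass as use_transcript_version
use_transcript_version <- has_version

# For comparison, the same IDs with the version suffix removed
ids_no_version <- sub("\\.[0-9]+$", "", tx_ids)
```

Passing the resulting logical keeps the transcript IDs in tr2g consistent with those kallisto wrote.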


use_gene_version
Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number.


transcript_biotype_col
Character vector of length 1. Tag in attribute field corresponding to transcript biotype.


gene_biotype_col
Character vector of length 1. Tag in attribute field corresponding to gene biotype.


transcript_biotype_use
Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, retained_intron is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See data("ensembl_tx_biotypes") for all available transcript biotypes from Ensembl.


gene_biotype_use
Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See data("cellranger_biotypes") for gene biotypes the Cell Ranger reference uses. See data("ensembl_gene_biotypes") for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
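The difference between filtering by gene biotype and by transcript biotype can be illustrated with a toy table in base R. This is a sketch of the idea only, not the package's implementation: the table tx_annot and its IDs are placeholders, and only the column names and biotype terms are taken from the text above.

```r
# Toy annotation table; IDs are placeholders, biotypes are Ensembl terms
tx_annot <- data.frame(
  transcript = c("tx1", "tx2", "tx3", "tx4"),
  gene = c("g1", "g1", "g2", "g3"),
  TXBIOTYPE = c("protein_coding", "retained_intron",
                "lincRNA", "protein_coding"),
  GENEBIOTYPE = c("protein_coding", "protein_coding",
                  "lincRNA", "protein_coding"),
  stringsAsFactors = FALSE
)

gene_biotype_use <- "protein_coding"
transcript_biotype_use <- "all"

# Start by keeping every row, then apply each filter unless it is "all"
keep <- rep(TRUE, nrow(tx_annot))
if (!identical(gene_biotype_use, "all")) {
  keep <- keep & tx_annot$GENEBIOTYPE %in% gene_biotype_use
}
if (!identical(transcript_biotype_use, "all")) {
  keep <- keep & tx_annot$TXBIOTYPE %in% transcript_biotype_use
}
filtered <- tx_annot[keep, ]
```

In this toy case tx2 is kept even though its transcript biotype is retained_intron, because its gene is protein_coding; filtering on transcript biotype instead would drop it.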


chrs_only
Logical, whether to include chromosomes only, since GTF and GFF files can contain annotations for scaffolds that are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to TRUE. Only applicable to species found in genomeStyles().


compress_fa
Logical, whether to compress the output fasta file. If TRUE, then the fasta file will be gzipped.


overwrite
Logical, whether to overwrite existing files that have the same names as the outputs written to disk.


Value

A data frame with at least 2 columns: gene for gene ID and transcript for transcript ID, and optionally gene_name for gene names. If other_attrs has been specified, then those will also be columns in the data frame returned.

See Also

ensembl_gene_biotypes, ensembl_tx_biotypes, cellranger_biotypes

Other functions to retrieve transcript and gene info: sort_tr2g(), tr2g_TxDb(), tr2g_ensembl(), tr2g_fasta(), tr2g_gff3(), tr2g_gtf(), transcript2gene()


Examples

library(EnsDb.Hsapiens.v86)
# Skip transcriptome extraction and do not write tr2g.tsv
tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE, write_tr2g = FALSE,
           use_transcript_version = FALSE,
           use_gene_version = FALSE)

lambdamoses/BUStoolsR documentation built on Aug. 28, 2022, 1:35 p.m.