tr2g_TxDb: Get transcript and gene info from TxDb objects
In BUSpaRse: kallisto | bustools R utilities


The genome and gene annotations of some species can be obtained from Bioconductor packages, which is more convenient than downloading GTF files from Ensembl and reading them into R. In these packages, the gene annotation is stored in a TxDb object, which has standardized names for gene IDs, transcript IDs, exon IDs, and so on. In GTF and GFF3 files, by contrast, this information is stored in metadata fields whose names are not standardized. This function extracts transcript and corresponding gene information from the gene annotation stored in a TxDb object.

tr2g_TxDb(
  txdb,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  chrs_only = TRUE,
  compress_fa = FALSE,
  overwrite = FALSE
)

`txdb`	A `TxDb` object with gene annotation.
`Genome`	Either a `BSgenome` or an `XStringSet` object of genomic sequences, from which the intronic sequences will be extracted. Use `genomeStyles` to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the chromosome names in the genome and the annotation must be set manually so that they are consistent.
`get_transcriptome`	Logical, whether to extract transcriptome from genome with the GTF file. If filtering biotypes or chromosomes, the filtered `GRanges` will be used to extract transcriptome.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`write_tr2g`	Logical, whether to write tr2g to disk. If `TRUE`, then a file `tr2g.tsv` will be written into `out_path`.
`chrs_only`	Logical, whether to include chromosomes only. GTF and GFF files can contain annotations for scaffolds that are not incorporated into chromosomes; setting this to `TRUE` excludes them, along with haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`overwrite`	Logical, whether to overwrite if files with names of outputs written to disk already exist.
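
The arguments above can be combined to write only `tr2g.tsv` to a chosen directory. The following is a minimal sketch, not a definitive recipe: it assumes that `Genome` may be left as `NULL` when `get_transcriptome = FALSE` (as the default `Genome = NULL` suggests), and the directory name `tr2g_TxDb_demo` is made up for illustration. The call is skipped if the required packages are not installed.

```r
# Write outputs to a temporary directory rather than the working directory.
# "tr2g_TxDb_demo" is a hypothetical directory name.
out_dir <- file.path(tempdir(), "tr2g_TxDb_demo")

# Only run the call if both packages are available.
pkgs <- c("BUSpaRse", "TxDb.Hsapiens.UCSC.hg38.knownGene")
have_pkgs <- all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))

if (have_pkgs) {
  txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene::TxDb.Hsapiens.UCSC.hg38.knownGene
  # Assumption: no genome is needed when the transcriptome is not extracted.
  tr2g <- BUSpaRse::tr2g_TxDb(
    txdb,
    get_transcriptome = FALSE, # skip sequence extraction
    out_path = out_dir,        # created if it does not exist
    write_tr2g = TRUE,         # writes tr2g.tsv into out_path
    chrs_only = TRUE,          # drop scaffolds and haplotypes
    overwrite = TRUE           # replace an existing tr2g.tsv
  )
  head(tr2g)
}
```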

A data frame with 3 columns: gene for gene ID, transcript for transcript ID, and tx_id for internal transcript IDs used to avoid duplicate transcript names. For TxDb packages from Bioconductor, the gene ID is the Entrez ID, while transcript IDs are Ensembl IDs with version numbers for TxDb.Hsapiens.UCSC.hg38.knownGene. In some cases, transcript IDs are duplicated, and this is resolved by adding numbers to make the IDs unique.
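
To illustrate what resolving duplicated transcript IDs by adding numbers can look like, here is a small base R sketch. The IDs are made up, and `make.unique()` is used only as a stand-in; the exact numbering scheme applied by `tr2g_TxDb` is not specified here.

```r
# Toy transcript-to-gene table; all IDs are invented for illustration.
tr2g <- data.frame(
  gene = c("1", "1", "2", "3"),
  transcript = c("tx_a", "tx_b", "tx_b", "tx_c"),
  stringsAsFactors = FALSE
)

# Transcript IDs that occur more than once.
dup_ids <- unique(tr2g$transcript[duplicated(tr2g$transcript)])

# make.unique() appends ".1", ".2", ... to repeated values,
# leaving the first occurrence unchanged.
tr2g$transcript <- make.unique(tr2g$transcript)

# No duplicated transcript IDs remain.
anyDuplicated(tr2g$transcript) == 0
```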

A data frame with 3 columns: gene for gene ID, transcript for transcript ID, and gene_name for gene names. If other_attrs has been specified, then those will also be columns in the data frame returned.

Other functions to retrieve transcript and gene info: sort_tr2g(), tr2g_EnsDb(), tr2g_ensembl(), tr2g_fasta(), tr2g_gff3(), tr2g_gtf(), transcript2gene()

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(BSgenome.Hsapiens.UCSC.hg38)
tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, BSgenome.Hsapiens.UCSC.hg38)