tr2g_TxDb: Get transcript and gene info from TxDb objects

View source: R/tr2g.R

tr2g_TxDbR Documentation

Get transcript and gene info from TxDb objects

Description

The genome and gene annotations of some species can be conveniently obtained from Bioconductor packages. This is more convenient than downloading GTF files from Ensembl and reading it into R. In these packages, the gene annotation is stored in a TxDb object, which has standardized names for gene IDs, transcript IDs, exon IDs, and so on, which are stored in the metadata fields in GTF and GFF3 files, which are not standardized. This function extracts transcript and corresponding gene information from gene annotation stored in a TxDb object.

Usage

tr2g_TxDb(
  txdb,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  chrs_only = TRUE,
  compress_fa = FALSE,
  overwrite = FALSE
)

Arguments

txdb

A TxDb object with gene annotation.

Genome

Either a BSgenome or a XStringSet object of genomic sequences, where the intronic sequences will be extracted from. Use genomeStyles to check which styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then the style of chromosome names in the genome and annotation should be manually set to be consistent.

get_transcriptome

Logical, whether to extract transcriptome from genome with the GTF file. If filtering biotypes or chromosomes, the filtered GRanges will be used to extract transcriptome.

out_path

Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.

write_tr2g

Logical, whether to write tr2g to disk. If TRUE, then a file tr2g.tsv will be written into out_path.

chrs_only

Logical, whether to include chromosomes only, for GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to TRUE. Only applicable to species found in genomeStyles().

compress_fa

Logical, whether to compress the output fasta file. If TRUE, then the fasta file will be gzipped.

overwrite

Logical, whether to overwrite if files with names of outputs written to disk already exist.

Value

A data frame with 3 columns: gene for gene ID, transcript for transcript ID, and tx_id for internal transcript IDs used to avoid duplicate transcript names. For TxDb packages from Bioconductor, gene ID is Entrez ID, while transcript IDs are Ensembl IDs with version numbers for TxDb.Hsapiens.UCSC.hg38.knownGene. In some cases, the transcript ID have duplicates, and this is resolved by adding numbers to make the IDs unique.

A data frame with 3 columns: gene for gene ID, transcript for transcript ID, and gene_name for gene names. If other_attrs has been specified, then those will also be columns in the data frame returned.

See Also

Other functions to retrieve transcript and gene info: sort_tr2g(), tr2g_EnsDb(), tr2g_ensembl(), tr2g_fasta(), tr2g_gff3(), tr2g_gtf(), transcript2gene()

Other functions to retrieve transcript and gene info: sort_tr2g(), tr2g_EnsDb(), tr2g_ensembl(), tr2g_fasta(), tr2g_gff3(), tr2g_gtf(), transcript2gene()

Examples

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(BSgenome.Hsapiens.UCSC.hg38)
tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, BSgenome.Hsapiens.UCSC.hg38)
# Clean up
file.remove("transcriptome.fa", "tr2g.tsv")

BUStools/BUSpaRse documentation built on March 3, 2024, 9:11 a.m.