tr2g_gtf: Get transcript and gene info from GTF file

Description Usage Arguments Details Value See Also Examples

View source: R/tr2g.R

Description

This function reads a GTF file and extracts the transcript ID and corresponding gene ID. This function assumes that the GTF file is properly formatted. See http://mblab.wustl.edu/GTF2.html for a detailed description of proper GTF format. Note that GFF3 files have a somewhat different and more complicated format in the attribute field, which this function does not support. See http://gmod.org/wiki/GFF3 for a detailed description of proper GFF3 format. To extract transcript and gene information from GFF3 files, see the function tr2g_gff3 in this package.

Usage

1
2
3
4
tr2g_gtf(file, type_use = "exon", transcript_id = "transcript_id",
  gene_id = "gene_id", gene_name = "gene_name",
  transcript_version = "transcript_version",
  gene_version = "gene_version", version_sep = ".", verbose = TRUE)

Arguments

file

Path to a GTF file to be read. The file can remain gzipped.

type_use

Character vector, the values taken by the type field in the GTF file that denote the desired transcripts. This can be "exon", "transcript", "mRNA", and etc.

transcript_id

Character vector of length 1. Tag in attribute field corresponding to transcript IDs. This argument must be supplied and cannot be NA or NULL. Will throw error if tag indicated in this argument does not exist.

gene_id

Character vector of length 1. Tag in attribute field corresponding to gene IDs. This argument must be supplied and cannot be NA or NULL. Note that this is different from gene symbols, which do not have to be unique. This can be Ensembl or Entrez IDs. However, if the gene symbols are in fact unique for each gene, you may supply the tag for human readable gene symbols to this argument. Will throw error if tag indicated in this argument does not exist.

gene_name

Character vector of length 1. Tag in attribute field corresponding to gene symbols. This argument can be NA or NULL if you are fine with non-human readable gene IDs and do not wish to extract human readable gene symbols.

transcript_version

Character vector of length 1. Tag in attribute field corresponding to transcript version number. If your GTF file does not include transcript version numbers, or if you do not wish to include the version number, then use NULL for this argument. To decide whether to include transcript version number, check whether version numbers are included in the transcripts.txt in the kallisto output directory. If that file includes version numbers, then trannscript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.

gene_version

Character vector of length 1. Tag in attribute field corresponding to gene version number. If your GTF file does not include gene version numbers, or if you do not wish to include the version number, then use NULL for this argument. Unlike transcript version number, it's up to you whether to include gene version number.

version_sep

Character to separate bewteen the main ID and the version number. Defaults to ".", as in Ensembl.

verbose

Whether to display progress.

Details

Transcript and gene versions may not be present in all GTF files, so these arguments are optional. This function has arguments for transcript and gene version numbers because Ensembl IDs have version numbers. For Ensembl IDs, we recommend including the version number, since a change in version number signals a change in the entity referred to by the ID after reannotation. If a version is used, then it will be appended to the ID, separated by version_sep.

The transcript and gene IDs are The attribute field (the last field) of GTF files can be complicated and inconsistent across different sources. Please check the attribute tags in your GTF file and consider the arguments of this function carefully. The defaults are set according to Ensembl GTF files; defaults may not work for files from other sources. Due to the general lack of standards for the attribute field, you may need to further clean up the output of this function.

Value

A data frame at least 2 columns: gene for gene ID, transcript for transcript ID, and optionally, gene_name for gene names.

See Also

Other functions to retrieve transcript and gene info: sort_tr2g, tr2g_EnsDb, tr2g_TxDb, tr2g_ensembl, tr2g_fasta, tr2g_gff3, transcript2gene

Examples

1
2
3
4
5
6
7
toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- paste(toy_path, "gtf_test.gtf", sep = "/")
# Default
tr2g <- tr2g_gtf(file = file_use, verbose = FALSE)
# Excluding version numbers
tr2g <- tr2g_gtf(file = file_use, transcript_version = NULL,
  gene_version = NULL)

sarangian/deaRscripts documentation built on Dec. 12, 2019, 12:48 a.m.