tr2g_gff3: Get transcript and gene info from GFF3 file
In lambdamoses/BUStoolsR: kallisto | bustools R utilities

Description

This function reads a GFF3 file and extracts each transcript ID and its corresponding gene ID. It assumes that the GFF3 file is properly formatted; see http://gmod.org/wiki/GFF3 for a detailed description of the GFF3 format. GTF files use a somewhat different and simpler format in the attribute field, which this function does not support; see http://mblab.wustl.edu/GTF2.html for a detailed description of the GTF format. To extract transcript and gene information from GTF files, use the function tr2g_gtf in this package. Some files bearing the .gff3 extension are in fact closer to the GTF format. If so, change the extension to .gtf and use tr2g_gtf instead.
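
One rough way to tell the two attribute styles apart is to look at how tags and values are joined: GFF3 uses `tag=value` pairs, while GTF uses `tag "value"` pairs. The sketch below is a heuristic in base R and is not part of this package; the helper name `guess_attr_style` and the sample attribute strings are made up for illustration.

```r
# Hypothetical helper (not part of this package): guess whether an
# attribute field follows the GFF3 or the GTF convention.
guess_attr_style <- function(attr) {
  # GFF3: tag=value pairs, e.g. ID=transcript:T1;Parent=gene:G1
  is_gff3 <- grepl("^[^\\s=;\"]+=", attr, perl = TRUE)
  # GTF: tag "value" pairs, e.g. gene_id "G1"; transcript_id "T1";
  is_gtf <- grepl("^[^\\s=;\"]+\\s+\"", attr, perl = TRUE)
  ifelse(is_gff3, "gff3", ifelse(is_gtf, "gtf", "unknown"))
}

# Made-up attribute fields (the 9th column of an annotation line)
gff3_attr <- "ID=transcript:T1;Parent=gene:G1;biotype=protein_coding"
gtf_attr <- "gene_id \"G1\"; transcript_id \"T1\";"

styles <- guess_attr_style(c(gff3_attr, gtf_attr))
styles
```

If a file with the .gff3 extension comes back as "gtf" under a check like this, that is a hint to rename it and use tr2g_gtf, as described above.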

Usage

tr2g_gff3(
  file,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  gene_name = "Name",
  transcript_version = "version",
  gene_version = "version",
  version_sep = ".",
  transcript_biotype_col = "biotype",
  gene_biotype_col = "biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  compress_fa = FALSE,
  save_filtered_gff = TRUE,
  overwrite = FALSE,
  source = c("ensembl", "refseq")
)

Arguments

`file`	Path to a GFF3 file to be read. The file can remain gzipped. Use `getGFF` from the `biomartr` package to download GFF3 files from Ensembl and RefSeq; for GTF files from Ensembl, use `getGTF` from `biomartr` together with `tr2g_gtf`.
`Genome`	Either a `BSgenome` or an `XStringSet` object of genomic sequences, from which the sequences will be extracted. Use `genomeStyles` to check which chromosome naming styles are supported for your organism of interest; supported styles can be interconverted. If the style in your genome or annotation is not supported, then set the chromosome names in the genome and the annotation manually so that they are consistent.
`get_transcriptome`	Logical, whether to extract the transcriptome from the genome with the GFF3 file. If filtering biotypes or chromosomes, the filtered `GRanges` will be used to extract the transcriptome.
`out_path`	Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.
`write_tr2g`	Logical, whether to write tr2g to disk. If `TRUE`, then a file `tr2g.tsv` will be written into `out_path`.
`transcript_id`	Character vector of length 1. Tag in `attribute` field corresponding to transcript IDs. This argument must be supplied and cannot be `NA` or `NULL`. An error is thrown if the tag given in this argument does not exist.
`gene_id`	Character vector of length 1. Tag in `attribute` field corresponding to gene IDs. This argument must be supplied and cannot be `NA` or `NULL`. Note that gene IDs are different from gene symbols, which do not have to be unique. The IDs can be Ensembl or Entrez IDs. However, if the gene symbols are in fact unique for each gene, you may supply the tag for human readable gene symbols to this argument. An error is thrown if the tag given in this argument does not exist. This is typically "gene_id" for annotations from Ensembl and "gene" for RefSeq.
`gene_name`	Character vector of length 1. Tag in `attribute` field corresponding to gene symbols. This argument can be `NA` or `NULL` if you are fine with non-human readable gene IDs and do not wish to extract human readable gene symbols.
`transcript_version`	Character vector of length 1. Tag in `attribute` field corresponding to transcript version number. If your GFF3 file does not include transcript version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. To decide whether to include the transcript version number, check whether version numbers are included in the `transcripts.txt` in the `kallisto` output directory. If that file includes version numbers, then transcript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.
`gene_version`	Character vector of length 1. Tag in `attribute` field corresponding to gene version number. If your GFF3 file does not include gene version numbers, or if you do not wish to include the version number, then use `NULL` for this argument. Unlike the transcript version number, it is up to you whether to include the gene version number.
`version_sep`	Character separating the main ID from the version number. Defaults to ".", as in Ensembl.
`transcript_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to transcript biotype.
`gene_biotype_col`	Character vector of length 1. Tag in `attribute` field corresponding to gene biotype.
`transcript_biotype_use`	Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, `retained_intron` is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See `data("ensembl_tx_biotypes")` for all available transcript biotypes from Ensembl.
`gene_biotype_use`	Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See `data("cellranger_biotypes")` for gene biotypes the Cell Ranger reference uses. See `data("ensembl_gene_biotypes")` for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.
`chrs_only`	Logical, whether to include chromosomes only, since GTF and GFF files can contain annotations for scaffolds that are not incorporated into chromosomes. This will also exclude haplotypes. Defaults to `TRUE`. Only applicable to species found in `genomeStyles()`.
`compress_fa`	Logical, whether to compress the output fasta file. If `TRUE`, then the fasta file will be gzipped.
`save_filtered_gff`	Logical. If filtering type, biotypes, and/or chromosomes, whether to save the filtered `GRanges` as a GFF3 file.
`overwrite`	Logical, whether to overwrite if files with names of outputs written to disk already exist.
`source`	Name of the database from which this GFF3 file was downloaded. Must be either "ensembl" or "refseq".
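
The advice for `transcript_version` comes down to checking whether the IDs in `transcripts.txt` end in a version suffix. Below is a minimal sketch in base R, assuming IDs whose version follows the separator as trailing digits (as in Ensembl). The helper name `has_version` is hypothetical and the IDs are illustrative stand-ins; in practice they would be read from the file with `readLines`.

```r
# Hypothetical helper: TRUE if an ID ends in <sep><digits>, e.g. "ENST00000456328.2".
has_version <- function(ids, sep = ".") {
  # \\Q...\\E makes the separator match literally rather than as a regex
  pattern <- paste0("\\Q", sep, "\\E[0-9]+$")
  grepl(pattern, ids, perl = TRUE)
}

# Stand-ins for the contents of transcripts.txt in a kallisto output directory
ids_with <- c("ENST00000456328.2", "ENST00000450305.2")
ids_without <- c("ENST00000456328", "ENST00000450305")

all(has_version(ids_with))     # versions present: keep transcript_version
any(has_version(ids_without))  # no versions: use transcript_version = NULL
```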

Details

Transcript and gene versions may not be present in all GFF3 files, so these arguments are optional. This function has arguments for transcript and gene version numbers because Ensembl IDs have version numbers. For Ensembl IDs, we recommend including the version number, since a change in version number signals a change in the entity referred to by the ID after reannotation. If a version is used, then it will be appended to the ID, separated by version_sep.
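
The appending step described above amounts to pasting the ID, the separator, and the version together. A minimal sketch with illustrative values follows; it mirrors the described behavior and is not necessarily how the package implements it internally. The variable names `tx_ids`, `tx_versions`, and `tx_full` are made up for this example.

```r
# Illustrative values; in a real file these come from the attribute field.
tx_ids <- c("ENST00000456328", "ENST00000450305")
tx_versions <- c("2", "2")
version_sep <- "."

# Append the version to the ID, separated by version_sep.
tx_full <- paste(tx_ids, tx_versions, sep = version_sep)
tx_full
```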

The transcript and gene IDs are extracted from the attribute field. The attribute field (the last field) of GFF3 files can be complicated and inconsistent across different sources. Please check the attribute tags in your GFF3 file and consider the arguments of this function carefully. The defaults are set according to Ensembl GFF3 files; defaults may not work for files from other sources. Due to the general lack of standards for the attribute field, you may need to further clean up the output of this function.
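
Checking the attribute tags can be done before calling this function. The sketch below uses base R only and a toy GFF3 file written to a temporary location. The helper name `list_attr_tags` and the toy records are made up for illustration, and the parsing assumes well-formed `tag=value` pairs separated by `;` in the ninth tab-separated column.

```r
# Hypothetical helper: collect the distinct attribute tags in a GFF3 file.
list_attr_tags <- function(path) {
  lines <- readLines(path)
  # Drop comment and directive lines (they start with "#") and blank lines
  lines <- lines[!grepl("^#", lines) & nzchar(lines)]
  # The attribute field is the 9th tab-separated column
  attrs <- vapply(strsplit(lines, "\t", fixed = TRUE), `[`, character(1), 9)
  pairs <- unlist(strsplit(attrs, ";", fixed = TRUE))
  # Keep the part before the first "=" in each pair
  unique(sub("=.*$", "", pairs))
}

# Toy file with one gene record and one transcript record
toy <- tempfile(fileext = ".gff3")
writeLines(c(
  "##gff-version 3",
  "1\ttoy\tgene\t100\t900\t.\t+\t.\tID=gene:G1;Name=abc;biotype=protein_coding;gene_id=G1;version=1",
  "1\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=transcript:T1;Parent=gene:G1;biotype=protein_coding;transcript_id=T1;version=2"
), toy)

tags <- list_attr_tags(toy)
tags
```

The tags found this way can then be compared against the defaults for `transcript_id`, `gene_id`, `gene_name`, and the version and biotype arguments.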

Value

A data frame with at least 2 columns: gene for gene ID, transcript for transcript ID, and optionally, gene_name for gene names.

Note

The defaults here are for Ensembl GFF3 files. To see all attribute tags for Ensembl and RefSeq GFF3 files, see data("ensembl_gff_mcols") and data("refseq_gff_mcols").

Examples

library(BUSpaRse)
# Locate the toy GFF3 file shipped with the package
toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- file.path(toy_path, "gff3_test.gff3")
# Default
tr2g <- tr2g_gff3(file = file_use, write_tr2g = FALSE,
  get_transcriptome = FALSE, save_filtered_gff = FALSE)
# Excluding version numbers
tr2g <- tr2g_gff3(file = file_use, transcript_version = NULL,
  gene_version = NULL, write_tr2g = FALSE, get_transcriptome = FALSE,
  save_filtered_gff = FALSE)

lambdamoses/BUStoolsR documentation built on Aug. 1, 2024, 6:30 a.m.