tr2g_gff3 | R Documentation |
This function reads a GFF3 file and extracts the transcript ID and
corresponding gene ID. This function assumes that the GFF3 file is properly
formatted. See http://gmod.org/wiki/GFF3 for a detailed
description of proper GFF3 format. Note that GTF files have a somewhat
different and simpler format in the attribute field, which this function does
not support. See http://mblab.wustl.edu/GTF2.html for a detailed
description of proper GTF format. To extract transcript and gene information
from GTF files, see the function tr2g_gtf in this package.
Some files bearing the .gff3 extension are in fact closer to the GTF format.
If this is so, then change the extension to .gtf and use the function
tr2g_gtf in this package instead.
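To check which format a file actually follows, it can help to look at the attribute field directly. GFF3 attributes are written as tag=value pairs separated by semicolons, while GTF attributes are written as tag "value"; pairs. The following is a minimal base R sketch of that check. The helper guess_attribute_style() is hypothetical and not part of this package, and the annotation lines and IDs are made up for illustration.

```r
# Hypothetical helper, not part of BUSpaRse: guess the attribute style
# from the last tab-separated column of a single annotation line.
guess_attribute_style <- function(line) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  attrs <- fields[length(fields)]
  if (grepl("^[^=;\" ]+=", attrs)) {
    "gff3"     # tag=value pairs, e.g. ID=...;Parent=...
  } else if (grepl("^[^ =;]+ \"", attrs)) {
    "gtf"      # tag "value"; pairs, e.g. gene_id "..."; transcript_id "...";
  } else {
    "unknown"
  }
}

# Made-up example lines
gff3_line <- paste("1", "ensembl", "mRNA", "100", "900", ".", "+", ".",
  "ID=transcript:ENST0001;Parent=gene:ENSG0001;biotype=protein_coding",
  sep = "\t")
gtf_line <- paste("1", "ensembl", "transcript", "100", "900", ".", "+", ".",
  "gene_id \"ENSG0001\"; transcript_id \"ENST0001\";",
  sep = "\t")

style_gff3 <- guess_attribute_style(gff3_line)
style_gtf <- guess_attribute_style(gtf_line)
```

If a file with the .gff3 extension is reported as "gtf" by a check like this, renaming it and using tr2g_gtf is likely the better route.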
tr2g_gff3(
  file,
  Genome = NULL,
  get_transcriptome = TRUE,
  out_path = ".",
  write_tr2g = TRUE,
  transcript_id = "transcript_id",
  gene_id = "gene_id",
  gene_name = "Name",
  transcript_version = "version",
  gene_version = "version",
  version_sep = ".",
  transcript_biotype_col = "biotype",
  gene_biotype_col = "biotype",
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  compress_fa = FALSE,
  save_filtered_gff = TRUE,
  overwrite = FALSE,
  source = c("ensembl", "refseq")
)
file |
Path to a GFF3 file to be read. The file can remain gzipped. Use
|
Genome |
Either a |
get_transcriptome |
Logical, whether to extract transcriptome from
genome with the GTF file. If filtering biotypes or chromosomes, the filtered
|
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
write_tr2g |
Logical, whether to write tr2g to disk. If |
transcript_id |
Character vector of length 1. Tag in |
gene_id |
Character vector of length 1. Tag in |
gene_name |
Character vector of length 1. Tag in |
transcript_version |
Character vector of length 1. Tag in |
gene_version |
Character vector of length 1. Tag in |
version_sep |
Character that separates the main ID from the version number. Defaults to ".", as in Ensembl. |
transcript_biotype_col |
Character vector of length 1. Tag in
|
gene_biotype_col |
Character vector of length 1. Tag in |
transcript_biotype_use |
Character, can be "all" or
a vector of transcript biotypes to be used. Transcript biotypes aren't
entirely the same as gene biotypes. For instance, in Ensembl annotation,
|
gene_biotype_use |
Character, can be "all", "cellranger", or
a vector of gene biotypes to be used. If "cellranger", then the biotypes
used by Cell Ranger's reference are used. See |
chrs_only |
Logical, whether to include chromosomes only, since GTF and
GFF files can contain annotations for scaffolds that are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
compress_fa |
Logical, whether to compress the output fasta file. If
|
save_filtered_gff |
Logical. If filtering type, biotypes, and/or
chromosomes, whether to save the filtered |
overwrite |
Logical, whether to overwrite if files with names of outputs written to disk already exist. |
source |
Name of the database where this GFF3 file was downloaded. Must be either "ensembl" or "refseq". |
Transcript and gene versions may not be present in all GTF files, so these
arguments are optional. This function has arguments for transcript and gene
version numbers because Ensembl IDs have version numbers. For Ensembl IDs, we
recommend including the version number, since a change in version number
signals a change in the entity referred to by the ID after reannotation. If a
version is used, then it will be appended to the ID, separated by version_sep.
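As an illustration of how a version number is joined to an ID, here is a small base R sketch. It is not the package's internal code; the IDs and version numbers are made up, and the variable names other than version_sep are assumptions for this example.

```r
# Made-up Ensembl-style IDs and version numbers, for illustration only
ids <- c("ENST00000000001", "ENST00000000002")
versions <- c("4", "11")
version_sep <- "."

# Append the version to the ID, separated by version_sep
versioned <- paste(ids, versions, sep = version_sep)
versioned
# "ENST00000000001.4"  "ENST00000000002.11"

# Strip the version again, e.g. to match IDs from a source without versions
unversioned <- sub("\\.[0-9]+$", "", versioned)
```

Keeping the versioned form makes it possible to notice when two annotation releases refer to different versions of the same ID.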
The attribute field (the last field) of GTF files can be complicated and
inconsistent across different sources. Please check the attribute tags in
your GTF file and consider the arguments of this function carefully. The
defaults are set according to Ensembl GTF files; defaults may not work for
files from other sources. Due to the general lack of standards for the
attribute field, you may need to further clean up the output of this function.
A data frame with at least 2 columns: gene for gene ID, transcript for
transcript ID, and optionally, gene_name for gene names.
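To make the shape of the return value concrete, the sketch below builds a mock data frame with the three columns named above and uses it to map transcripts to genes. The IDs and gene names are made up, tr2g_mock is a stand-in rather than real output, and a real call may return additional columns.

```r
# Mock stand-in for the returned data frame; all values are made up
tr2g_mock <- data.frame(
  gene = c("ENSG0001.1", "ENSG0001.1", "ENSG0002.3"),
  transcript = c("ENST0001.2", "ENST0002.1", "ENST0003.5"),
  gene_name = c("GeneA", "GeneA", "GeneB"),
  stringsAsFactors = FALSE
)

# Look up the gene ID for a set of transcript IDs
query <- c("ENST0003.5", "ENST0001.2")
genes <- tr2g_mock$gene[match(query, tr2g_mock$transcript)]

# Count how many transcripts each gene has
n_tx <- table(tr2g_mock$gene)
```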
The defaults here are for Ensembl GFF3 files. To see all attribute tags for
Ensembl and RefSeq GFF3 files, see data("ensembl_gff_mcols") and
data("refseq_gff_mcols").
ensembl_gene_biotypes, ensembl_tx_biotypes, cellranger_biotypes, ensembl_gtf_mcols, ensembl_gff_mcols, refseq_gff_mcols
Other functions to retrieve transcript and gene info:
sort_tr2g(), tr2g_EnsDb(), tr2g_TxDb(), tr2g_ensembl(), tr2g_fasta(), tr2g_gtf(), transcript2gene()
# Locate the toy GFF3 file shipped with the package
toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- file.path(toy_path, "gff3_test.gff3")
# Default
tr2g <- tr2g_gff3(file = file_use, write_tr2g = FALSE,
  get_transcriptome = FALSE, save_filtered_gff = FALSE)
# Excluding version numbers
tr2g <- tr2g_gff3(file = file_use, transcript_version = NULL,
  gene_version = NULL, write_tr2g = FALSE, get_transcriptome = FALSE,
  save_filtered_gff = FALSE)