tr2g_fasta | R Documentation |
FASTA files, such as those for cDNA and ncRNA from Ensembl, might have genome annotation information in the name of each sequence entry. This function extracts the transcript and gene IDs from such FASTA files.
tr2g_fasta(
file,
out_path = ".",
write_tr2g = TRUE,
use_gene_name = TRUE,
use_transcript_version = TRUE,
use_gene_version = TRUE,
transcript_biotype_use = "all",
gene_biotype_use = "all",
chrs_only = TRUE,
save_filtered = TRUE,
compress_fa = FALSE,
overwrite = FALSE
)
file |
Path to the FASTA file to be read. The file can remain gzipped. |
out_path |
Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory. |
write_tr2g |
Logical, whether to write tr2g to disk. If |
use_gene_name |
Logical, whether to get gene names. |
use_transcript_version |
Logical, whether to include version number in
the Ensembl transcript ID. To decide whether to
include transcript version number, check whether version numbers are included
in the |
use_gene_version |
Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number. |
transcript_biotype_use |
Character, can be "all" or
a vector of transcript biotypes to be used. Transcript biotypes aren't
entirely the same as gene biotypes. For instance, in Ensembl annotation,
|
gene_biotype_use |
Character, can be "all", "cellranger", or
a vector of gene biotypes to be used. If "cellranger", then the biotypes
used by Cell Ranger's reference are used. See |
chrs_only |
Logical, whether to include chromosomes only. GTF and
GFF files can contain annotations for scaffolds, which are not incorporated
into chromosomes. This will also exclude haplotypes. Defaults to |
save_filtered |
If filtering for biotype and chromosomes, whether to
save the filtered fasta file. If |
compress_fa |
Logical, whether to compress the output fasta file. If
|
overwrite |
Logical, whether to overwrite if files with names of outputs written to disk already exist. |
At present, this function only works with FASTA files from Ensembl, and uses regex to extract vertebrate Ensembl IDs. Sequence names should be formatted as follows:
ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
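To illustrate the header layout above, here is a minimal base R sketch of how the IDs could be pulled out of such a sequence name with regular expressions. The helper grab() and the patterns are assumptions written for this example; they are not the regex that tr2g_fasta uses internally.

```r
# Illustrative only: grab() and the patterns below are hypothetical and are
# not taken from the BUSpaRse source code.
header <- paste(
  "ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1",
  "gene:ENSG00000282431.1 gene_biotype:TR_D_gene",
  "transcript_biotype:TR_D_gene gene_symbol:TRBD1",
  "description:T cell receptor beta diversity 1"
)

grab <- function(x, pattern) {
  # Return capture group 1 where the pattern matches, NA elsewhere.
  out <- rep(NA_character_, length(x))
  has <- grepl(pattern, x, perl = TRUE)
  out[has] <- sub(pattern, "\\1", x[has], perl = TRUE)
  out
}

tr2g_sketch <- data.frame(
  gene = grab(header, "^.*\\sgene:(\\S+).*$"),
  transcript = grab(header, "^(\\S+).*$"),
  gene_name = grab(header, "^.*\\sgene_symbol:(\\S+).*$"),
  stringsAsFactors = FALSE
)

# Dropping the version suffix, comparable in spirit to setting
# use_transcript_version = FALSE
transcript_no_version <- sub("\\.\\d+$", "", tr2g_sketch$transcript)
```

Because grab() returns NA for headers that do not match, malformed sequence names show up as missing values rather than silently shifting the columns.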
If your FASTA file sequence names are formatted differently, then you must extract the transcript and gene IDs by some other means. The Bioconductor package Biostrings is recommended; after reading the FASTA file into R, the sequence names can be accessed by the names function.
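As a rough sketch of that manual route, the example below writes a small FASTA file whose headers use a made-up "transcript|gene" layout, reads the sequence names, and splits them into a two-column data frame. The file contents, the delimiter, and the object names are all assumptions for illustration. Biostrings::readDNAStringSet is used when Biostrings is installed; otherwise the sketch falls back to reading the header lines with base R.

```r
# Toy FASTA file with a hypothetical "transcript|gene" header layout
fa <- tempfile(fileext = ".fa")
writeLines(
  c(">tx1|geneA", "ACGT", ">tx2|geneA", "GGCC", ">tx3|geneB", "TTAA"),
  fa
)

if (requireNamespace("Biostrings", quietly = TRUE)) {
  # names() on the DNAStringSet returns the header lines without ">"
  seq_names <- names(Biostrings::readDNAStringSet(fa))
} else {
  # Base R fallback: keep header lines and strip the leading ">"
  fa_lines <- readLines(fa)
  seq_names <- sub("^>", "", fa_lines[startsWith(fa_lines, ">")])
}

# Split each name on the literal "|" and collect the two fields
parts <- strsplit(seq_names, "|", fixed = TRUE)
tr2g_custom <- data.frame(
  gene = vapply(parts, `[`, character(1), 2),
  transcript = vapply(parts, `[`, character(1), 1),
  stringsAsFactors = FALSE
)
```

For real files, the split rule has to be adapted to whatever the header format actually is.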
Normally, you should call sort_tr2g to sort the transcript IDs from the output of the tr2g_* family of functions. However, if the FASTA file supplied here is the same as the one used to build the kallisto index, then the transcript IDs in the output of this function are already in the same order as in the kallisto index, so you can skip sort_tr2g and proceed directly to EC2gene with the output of this function.
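If in doubt about whether the order really matches, it can be checked by hand. The sketch below uses toy data and assumes the index order is available as a character vector, for example read from the transcripts.txt file that kallisto bus writes to its output directory. It only illustrates the idea of matching the order; it is not how sort_tr2g is implemented.

```r
# Toy stand-ins: in practice index_order might come from something like
# readLines(file.path(kallisto_out, "transcripts.txt")), where kallisto_out
# is a hypothetical path to the kallisto bus output directory.
index_order <- c("t1", "t2", "t3")
tr2g_toy <- data.frame(
  gene = c("g1", "g1", "g2"),
  transcript = c("t1", "t2", "t3"),
  stringsAsFactors = FALSE
)

# TRUE means the data frame already follows the index order
same_order <- identical(tr2g_toy$transcript, index_order)

# If the order differs, rows can be rearranged to follow the index
shuffled <- tr2g_toy[c(3, 1, 2), ]
resorted <- shuffled[match(index_order, shuffled$transcript), ]
```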
A data frame with at least 2 columns: gene for gene ID, transcript for transcript ID, and optionally gene_name for gene names.
ensembl_gene_biotypes, ensembl_tx_biotypes, cellranger_biotypes
Other functions to retrieve transcript and gene info: sort_tr2g(), tr2g_EnsDb(), tr2g_TxDb(), tr2g_ensembl(), tr2g_gff3(), tr2g_gtf(), transcript2gene()
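# Locate the toy FASTA file shipped with the BUSpaRse package
toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- file.path(toy_path, "fasta_test.fasta")
# Extract transcript and gene IDs without writing anything to disk
tr2g <- tr2g_fasta(file = file_use, save_filtered = FALSE, write_tr2g = FALSE)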