tr2g_fasta: Get transcript and gene info from names in FASTA files

View source: R/tr2g.R

tr2g_fasta    R Documentation

Get transcript and gene info from names in FASTA files

Description

FASTA files, such as those for cDNA and ncRNA from Ensembl, might have genome annotation information in the name of each sequence entry. This function extracts the transcript and gene IDs from such FASTA files.

Usage

tr2g_fasta(
  file,
  out_path = ".",
  write_tr2g = TRUE,
  use_gene_name = TRUE,
  use_transcript_version = TRUE,
  use_gene_version = TRUE,
  transcript_biotype_use = "all",
  gene_biotype_use = "all",
  chrs_only = TRUE,
  save_filtered = TRUE,
  compress_fa = FALSE,
  overwrite = FALSE
)

Arguments

file

Path to the FASTA file to be read. The file can remain gzipped.

out_path

Directory to save the outputs written to disk. If this directory does not exist, then it will be created. Defaults to the current working directory.

write_tr2g

Logical, whether to write tr2g to disk. If TRUE, then a file tr2g.tsv will be written into out_path.

use_gene_name

Logical, whether to get gene names.

use_transcript_version

Logical, whether to include the version number in the Ensembl transcript ID. To decide, check whether version numbers are included in the transcripts.txt in the kallisto output directory. If that file includes version numbers, then transcript version numbers must be included here as well. If that file does not include version numbers, then transcript version numbers must not be included here.

use_gene_version

Logical, whether to include version number in the Ensembl gene ID. Unlike transcript version number, it's up to you whether to include gene version number.

transcript_biotype_use

Character, can be "all" or a vector of transcript biotypes to be used. Transcript biotypes aren't entirely the same as gene biotypes. For instance, in Ensembl annotation, retained_intron is a transcript biotype, but not a gene biotype. If "cellranger", then a warning will be given. See data("ensembl_tx_biotypes") for all available transcript biotypes from Ensembl.

gene_biotype_use

Character, can be "all", "cellranger", or a vector of gene biotypes to be used. If "cellranger", then the biotypes used by Cell Ranger's reference are used. See data("cellranger_biotypes") for gene biotypes the Cell Ranger reference uses. See data("ensembl_gene_biotypes") for all available gene biotypes from Ensembl. Note that gene biotypes and transcript biotypes are not always the same.

chrs_only

Logical, whether to include chromosomes only. GTF and GFF files can contain annotations for scaffolds, which are not incorporated into chromosomes; setting this to TRUE excludes them. This will also exclude haplotypes. Defaults to TRUE. Only applicable to species found in genomeStyles().

save_filtered

Logical, whether to save the filtered FASTA file when filtering for biotype and chromosomes. If TRUE, the file will be tx_filtered.fa in out_path.

compress_fa

Logical, whether to compress the output fasta file. If TRUE, then the fasta file will be gzipped.

overwrite

Logical, whether to overwrite existing files that have the same names as the outputs written to disk.
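The check described under use_transcript_version can be sketched in base R. This is only an illustration, assuming transcripts.txt holds one transcript ID per line; the temporary file and the IDs in it are hypothetical stand-ins for a real kallisto output directory.

```r
# Hypothetical stand-in for transcripts.txt from a kallisto output directory
# (assumed: one transcript ID per line); point this at your real file instead.
transcripts_file <- tempfile(fileext = ".txt")
writeLines(c("ENST00000632684.1", "ENST00000434970.2"), transcripts_file)

tx_ids <- readLines(transcripts_file)

# An Ensembl version suffix is a dot followed by digits at the end of the ID
has_version <- grepl("\\.[0-9]+$", tx_ids)

# A mix of versioned and unversioned IDs suggests inconsistent input
if (any(has_version) && !all(has_version)) {
  warning("Only some transcript IDs carry version numbers")
}

# Value to pass as use_transcript_version in tr2g_fasta()
use_transcript_version <- all(has_version)
```

The resulting logical can then be passed on as the use_transcript_version argument.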

Details

At present, this function only works with FASTA files from Ensembl, and uses regex to extract vertebrate Ensembl IDs. Sequence names should be formatted as follows:

ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1
gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene
gene_symbol:TRBD1 description:T cell receptor beta diversity 1
[Source:HGNC Symbol;Acc:HGNC:12158]
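To make the expected format concrete, the following base R sketch pulls the transcript ID, gene ID, and gene symbol out of the sequence name shown above. It is not the regex used inside tr2g_fasta; the helper first_match is hypothetical, and the header is assumed to be a single line (it is wrapped here only for display).

```r
# The example sequence name, joined back into one line
header <- paste(
  "ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1",
  "gene:ENSG00000282431.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene",
  "gene_symbol:TRBD1 description:T cell receptor beta diversity 1",
  "[Source:HGNC Symbol;Acc:HGNC:12158]"
)

# Hypothetical helper (not part of BUSpaRse): first match of a pattern,
# or NA where the pattern is absent
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.integer(m) > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

# The transcript ID is the first whitespace-delimited token;
# the gene ID and symbol follow the "gene:" and "gene_symbol:" tags
tr2g <- data.frame(
  gene = first_match("(?<=gene:)\\S+", header),
  transcript = first_match("^\\S+", header),
  gene_name = first_match("(?<=gene_symbol:)\\S+", header),
  stringsAsFactors = FALSE
)

# Dropping the version suffix, as use_transcript_version = FALSE would imply
transcript_no_version <- sub("\\.[0-9]+$", "", tr2g$transcript)
```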

If your FASTA file sequence names are formatted differently, then you must extract the transcript and gene IDs by some other means. The Bioconductor package Biostrings is recommended; after reading the FASTA file into R, the sequence names can be accessed by the names function.
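As a rough illustration of extracting IDs by other means, the sketch below parses a toy FASTA file whose sequence names use a made-up "transcript|gene" format. The file, the format, and the name tr2g_custom are all assumptions for the example. Base R's readLines stands in for reading the file, so the snippet runs without extra packages; with Biostrings, the sequence names would come from calling names on the object read from the file.

```r
# Toy FASTA file with a hypothetical "transcript|gene" naming scheme
fa <- tempfile(fileext = ".fa")
writeLines(
  c(">tx1|geneA", "ACGT", ">tx2|geneA", "GGCC", ">tx3|geneB", "TTAA"),
  fa
)

# Keep only the header lines and strip the leading ">"
fa_lines <- readLines(fa)
seq_names <- sub("^>", "", fa_lines[startsWith(fa_lines, ">")])

# Split each name on the literal "|" and collect the two fields
parts <- strsplit(seq_names, "|", fixed = TRUE)
tr2g_custom <- data.frame(
  gene = vapply(parts, `[`, character(1), 2),
  transcript = vapply(parts, `[`, character(1), 1),
  stringsAsFactors = FALSE
)
```

The columns are named gene and transcript to mirror the data frame described under Value.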

Normally, you should call sort_tr2g to sort the transcript IDs in the output of the tr2g_* family of functions. However, if the FASTA file supplied here is the same as the one used to build the kallisto index, then the transcript IDs in the output of this function are already in the same order as in the kallisto index, so you can skip sort_tr2g and proceed directly to EC2gene with the output of this function.
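If you are unsure whether the order already matches, a quick check along the following lines may help. This is a base R sketch with hypothetical toy data, assuming kallisto_tx holds the transcript IDs in index order (for example, as read from transcripts.txt). The manual reordering with match is only for illustration; sort_tr2g remains the documented route.

```r
# Hypothetical stand-ins: a tr2g data frame in the wrong order, and the
# transcript IDs in the order assumed to match the kallisto index
tr2g <- data.frame(
  gene = c("g1", "g2", "g1"),
  transcript = c("t2", "t3", "t1"),
  stringsAsFactors = FALSE
)
kallisto_tx <- c("t1", "t2", "t3")

# TRUE means the sorting step can be skipped
same_order <- identical(tr2g$transcript, kallisto_tx)

if (!same_order) {
  # Illustrative manual reordering: one row per index transcript, in index order
  tr2g <- tr2g[match(kallisto_tx, tr2g$transcript), ]
  rownames(tr2g) <- NULL
}
```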

Value

A data frame with at least 2 columns: gene for gene ID, transcript for transcript ID, and optionally gene_name for gene names.

See Also

ensembl_gene_biotypes, ensembl_tx_biotypes, cellranger_biotypes

Other functions to retrieve transcript and gene info: sort_tr2g(), tr2g_EnsDb(), tr2g_TxDb(), tr2g_ensembl(), tr2g_gff3(), tr2g_gtf(), transcript2gene()

Examples

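# Locate the toy FASTA file shipped with BUSpaRse
toy_path <- system.file("testdata", package = "BUSpaRse")
file_use <- file.path(toy_path, "fasta_test.fasta")
# Extract transcript and gene IDs without writing anything to disk
tr2g <- tr2g_fasta(file = file_use, save_filtered = FALSE, write_tr2g = FALSE)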

BUStools/BUSpaRse documentation built on Aug. 2, 2024, 5:07 a.m.