prepare_annotation_files: Prepare comprehensive sets of annotated genomic features

View source: R/riboseqc.R

prepare_annotation_files    R Documentation

Prepare comprehensive sets of annotated genomic features

Description

This function processes a GTF file and a twobit file (created using faToTwoBit from the UCSC tools: http://hgdownload.soe.ucsc.edu/admin/exe/) to create a comprehensive set of genomic regions of interest in genomic and transcriptomic space (e.g. introns, UTRs, start/stop codons). In addition, by linking genome sequence and annotation, it extracts additional information, such as gene and transcript biotypes, genetic codes for different organelles, and chromosome and transcript lengths.
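
As a rough sketch of the conversion step mentioned above, faToTwoBit can be called from R via system2. The file names genome.fa and genome.2bit are hypothetical placeholders, and the sketch assumes the faToTwoBit binary is on the PATH; it skips the conversion otherwise.

```r
# Hypothetical file names; replace with your own paths.
fasta_file  <- "genome.fa"
twobit_file <- "genome.2bit"

# Locate the UCSC faToTwoBit binary on the PATH ("" if it is not installed).
fa_to_twobit <- unname(Sys.which("faToTwoBit"))

if (nzchar(fa_to_twobit) && file.exists(fasta_file)) {
  # faToTwoBit takes the input FASTA followed by the output .2bit path.
  status <- system2(fa_to_twobit, args = c(fasta_file, twobit_file))
  if (status != 0) stop("faToTwoBit failed with exit status ", status)
} else {
  message("faToTwoBit or ", fasta_file, " not found; skipping conversion.")
}
```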

Usage

prepare_annotation_files(annotation_directory, twobit_file, gtf_file,
  scientific_name = "Homo.sapiens", annotation_name = "genc25",
  export_bed_tables_TxDb = TRUE, forge_BSgenome = TRUE,
  genome_seq = NULL, circ_chroms = DEFAULT_CIRC_SEQS,
  create_TxDb = TRUE)

Arguments

annotation_directory

The target directory which will contain the output files

twobit_file

Full path to the genome file in twobit format

gtf_file

Full path to the annotation file in GTF format

scientific_name

A name to give to the organism studied; must be two words separated by a ".", defaults to Homo.sapiens

annotation_name

A name to give to the annotation used; defaults to genc25

export_bed_tables_TxDb

Export coordinates and info about different genomic regions in the annotation_directory? It defaults to TRUE

forge_BSgenome

Forge and install a BSgenome package? It defaults to TRUE

create_TxDb

Create a TxDb object and a *Rannot object? It defaults to TRUE
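
A minimal sketch of a call using the arguments above might look like the following. All paths are hypothetical placeholders, and the regular expression is only one assumed reading of the "two words separated by a '.'" rule for scientific_name. The call is skipped unless the input files exist and RiboseQC is installed.

```r
# Hypothetical inputs; replace with your own files.
annotation_directory <- "annotation"
twobit_file          <- "genome.2bit"
gtf_file             <- "annotation.gtf"
scientific_name      <- "Homo.sapiens"

# Assumed reading of the naming rule: exactly two words joined by one ".".
name_ok <- grepl("^[A-Za-z]+\\.[A-Za-z]+$", scientific_name)

inputs_ok <- name_ok && file.exists(twobit_file) && file.exists(gtf_file)

if (inputs_ok && requireNamespace("RiboseQC", quietly = TRUE)) {
  RiboseQC::prepare_annotation_files(
    annotation_directory   = annotation_directory,
    twobit_file            = normalizePath(twobit_file),  # full path, as documented
    gtf_file               = normalizePath(gtf_file),     # full path, as documented
    scientific_name        = scientific_name,
    annotation_name        = "genc25",
    export_bed_tables_TxDb = TRUE,
    forge_BSgenome         = TRUE,
    create_TxDb            = TRUE
  )
} else {
  message("Skipping: check scientific_name, the input files, ",
          "and that RiboseQC is installed.")
}
```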

Details

This function uses the makeTxDbFromGFF function to create a TxDb object and extract genomic regions and other info to a *Rannot R file; the mapToTranscripts and mapFromTranscripts functions are used to map features to genomic or transcript-level coordinates. The GTF file must contain "exon" and "CDS" lines, where each line contains "transcript_id" and "gene_id" values. Additional values such as "gene_biotype" or "gene_name" are also extracted. Regarding sequences, the twobit file, together with the input scientific and annotation names, is used to forge and install a BSgenome package using the forgeBSgenomeDataPkg function.
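
The GTF requirements above can be checked before running the function. The helper below, check_gtf, is a hypothetical base-R sketch and not part of RiboseQC: it only looks for "exon" and "CDS" in the feature column and for the strings "transcript_id" and "gene_id" in the attribute column, without validating the rest of the format.

```r
# Hypothetical helper (not part of RiboseQC): a rough pre-flight check of GTF lines.
check_gtf <- function(gtf_lines) {
  # Drop empty lines and comment lines (starting with "#").
  gtf_lines <- gtf_lines[nzchar(gtf_lines) & !startsWith(gtf_lines, "#")]

  # GTF is tab-separated: column 3 is the feature type, column 9 the attributes.
  fields  <- strsplit(gtf_lines, "\t", fixed = TRUE)
  feature <- vapply(fields, function(x) x[3], character(1))
  attrs   <- vapply(fields, function(x) x[9], character(1))

  # Only "exon" and "CDS" lines are required to carry both ids.
  needed  <- feature %in% c("exon", "CDS")
  has_ids <- grepl("transcript_id", attrs, fixed = TRUE) &
    grepl("gene_id", attrs, fixed = TRUE)

  list(
    has_exon          = "exon" %in% feature,
    has_cds           = "CDS" %in% feature,
    lines_missing_ids = sum(needed & !has_ids)
  )
}

# Two made-up GTF lines for illustration.
example_gtf <- c(
  "chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id \"g1\"; transcript_id \"t1\";",
  "chr1\tsrc\tCDS\t10\t90\t.\t+\t0\tgene_id \"g1\"; transcript_id \"t1\";"
)
check <- check_gtf(example_gtf)
print(check)
```

For a real annotation, the same helper could be applied to readLines(gtf_file), bearing in mind that this reads the whole file into memory.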

The resulting GTF_annotation object (obtained after running load_annotation) contains:

txs: annotated transcript boundaries.
txs_gene: GRangesList including transcripts grouped by gene.
seqinfo: indicating chromosomes and chromosome lengths.
start_stop_codons: the set of annotated start and stop codons, with respective transcript and gene_ids. reprentative_mostcommon, reprentative_boundaries and reprentative_5len represent the most common start/stop codons, the most upstream/downstream start/stop codons, and the start/stop codons residing on transcripts with the longest 5'UTRs.
cds_txs: GRangesList including CDS grouped by transcript.
introns_txs: GRangesList including introns grouped by transcript.
cds_genes: GRangesList including CDS grouped by gene.
exons_txs: GRangesList including exons grouped by transcript.
exons_bins: the list of exonic bins with associated transcripts and genes.
junctions: the list of annotated splice junctions, with associated transcripts and genes.
genes: annotated gene coordinates.
threeutrs: collapsed set of 3'UTR regions, with corresponding gene_ids. This set does not overlap CDS regions.
fiveutrs: collapsed set of 5'UTR regions, with corresponding gene_ids. This set does not overlap CDS regions.
ncIsof: collapsed set of exonic regions of protein_coding genes, with corresponding gene_ids. This set does not overlap CDS regions.
ncRNAs: collapsed set of exonic regions of non_coding genes, with corresponding gene_ids. This set does not overlap CDS regions.
introns: collapsed set of intronic regions, with corresponding gene_ids. This set does not overlap exonic regions.
intergenicRegions: set of intergenic regions, defined as regions with no annotated genes on either strand.
trann: DataFrame object including (when available) the mapping between gene_id, gene_name, gene_biotypes, transcript_id and transcript_biotypes.
cds_txs_coords: transcript-level coordinates of ORF boundaries, for each annotated coding transcript. Additional columns are the same as for the start_stop_codons object.
genetic_codes: an object containing the list of genetic code ids used for each chromosome/organelle. See GENETIC_CODE_TABLE for more info.
genome: the name of the forged BSgenome package, or an FaFile_Circ object. Loaded with the load_annotation function.
stop_in_gtf: stop codon, as defined in the annotation.
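
The elements listed above could be inspected roughly as follows. This sketch assumes that load_annotation takes the path to the *Rannot file as its first argument and makes an object named GTF_annotation available in the global environment; both points should be checked against the load_annotation help page. The directory name is a hypothetical placeholder.

```r
# Hypothetical output directory from an earlier prepare_annotation_files run.
annotation_directory <- "annotation"

# Look for the *Rannot file written to that directory.
rannot_files <- list.files(annotation_directory, pattern = "Rannot$",
                           full.names = TRUE)

if (length(rannot_files) > 0 && requireNamespace("RiboseQC", quietly = TRUE)) {
  # Assumption: the first argument is the path to the *Rannot file.
  RiboseQC::load_annotation(rannot_files[1])

  # Assumption: the loaded object is exposed as GTF_annotation.
  if (exists("GTF_annotation")) {
    print(names(GTF_annotation))        # element names, e.g. txs, cds_txs, trann
    print(GTF_annotation$seqinfo)       # chromosomes and chromosome lengths
    print(head(GTF_annotation$trann))   # gene/transcript id and biotype mapping
  }
} else {
  message("No *Rannot file found or RiboseQC not installed; nothing to load.")
}
```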

Value

A TxDb file and a *Rannot file are created in the specified annotation_directory. In addition, a BSgenome object is forged, installed, and linked to the *Rannot object.

Author(s)

Lorenzo Calviello, calviello.l.bio@gmail.com

See Also

load_annotation, forgeBSgenomeDataPkg, makeTxDbFromGFF.


ohlerlab/RiboseQC documentation built on Aug. 15, 2023, 7:30 a.m.