prepare_annotation_files: Prepare comprehensive sets of annotated genomic features
In ohlerlab/RiboseQC: RiboseQC, a Comprehensive Ribo-Seq Analysis Tool

prepare_annotation_files

R Documentation

Prepare comprehensive sets of annotated genomic features

Description

This function processes a GTF file and a twobit file (created using faToTwoBit from the UCSC tools: http://hgdownload.soe.ucsc.edu/admin/exe/ ) to create a comprehensive set of genomic regions of interest in genomic and transcriptomic space (e.g. introns, UTRs, start/stop codons). In addition, by linking genome sequence and annotation, it extracts additional information, such as gene and transcript biotypes, genetic codes for different organelles, and chromosome and transcript lengths.

Usage

prepare_annotation_files(annotation_directory, twobit_file, gtf_file,
  scientific_name = "Homo.sapiens", annotation_name = "genc25",
  export_bed_tables_TxDb = TRUE, forge_BSgenome = TRUE,
  genome_seq = NULL, circ_chroms = DEFAULT_CIRC_SEQS,
  create_TxDb = TRUE)

Arguments

`annotation_directory`	The target directory which will contain the output files
`twobit_file`	Full path to the genome file in twobit format
`gtf_file`	Full path to the annotation file in GTF format
`scientific_name`	A name to give to the organism studied; must be two words separated by a ".", defaults to Homo.sapiens
`annotation_name`	A name to give to the annotation used; defaults to genc25
`export_bed_tables_TxDb`	Export coordinates and info about different genomic regions in the annotation_directory? It defaults to `TRUE`
`forge_BSgenome`	Forge and install a `BSgenome` package? It defaults to `TRUE`
`create_TxDb`	Create a `TxDb` object and a *Rannot object? It defaults to `TRUE`
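
As an illustration of the arguments above, a call might look like the following sketch. The file paths are placeholders, and `is_valid_scientific_name` is a hypothetical helper that is not part of RiboseQC; it only illustrates the "two words separated by a `.`" rule, under the added assumption that each word consists of letters only.

```r
# Hypothetical helper (not part of RiboseQC): TRUE when the name is exactly
# two words separated by a single ".". Assumes words contain letters only.
is_valid_scientific_name <- function(x) {
  is.character(x) && length(x) == 1 && !is.na(x) &&
    grepl("^[A-Za-z]+\\.[A-Za-z]+$", x)
}

# Placeholder paths: replace with your own files.
annotation_directory <- "annotation"
twobit_file <- "genome.2bit"
gtf_file <- "annotation.gtf"
scientific_name <- "Homo.sapiens"

# Only attempt the call when the package and the input files are available.
if (requireNamespace("RiboseQC", quietly = TRUE) &&
    file.exists(twobit_file) && file.exists(gtf_file) &&
    is_valid_scientific_name(scientific_name)) {
  RiboseQC::prepare_annotation_files(
    annotation_directory = annotation_directory,
    twobit_file = twobit_file,
    gtf_file = gtf_file,
    scientific_name = scientific_name,
    annotation_name = "genc25"
  )
}
```

Checking the name before the call is a convenience only; the function itself may apply different or stricter rules.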

Details

This function uses the makeTxDbFromGFF function to create a TxDb object and extract genomic regions and other information to a *Rannot R file; the mapToTranscripts and mapFromTranscripts functions are used to map features to genomic or transcript-level coordinates. The GTF file must contain "exon" and "CDS" lines, where each line contains "transcript_id" and "gene_id" values. Additional values such as "gene_biotype" or "gene_name" are also extracted. Regarding sequences, the twobit file, together with the input scientific and annotation names, is used to forge and install a BSgenome package using the forgeBSgenomeDataPkg function.
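
The GTF requirements described above can be checked before running the function. The following is a minimal base R sketch; `check_gtf_requirements` is a hypothetical helper, not part of RiboseQC, and it assumes a standard tab-separated GTF with the feature type in column 3 and the attributes in column 9.

```r
# Hypothetical pre-flight check (not part of RiboseQC).
# Reports whether "exon" and "CDS" lines exist and whether every such line
# carries both a transcript_id and a gene_id attribute.
check_gtf_requirements <- function(gtf_file) {
  lines <- readLines(gtf_file)
  # Drop header/comment lines and empty lines.
  lines <- lines[!startsWith(lines, "#") & nzchar(lines)]
  fields <- strsplit(lines, "\t", fixed = TRUE)
  feature <- vapply(fields, function(f) f[3], character(1))
  attrs <- vapply(fields, function(f) f[9], character(1))
  needed <- feature %in% c("exon", "CDS")
  list(
    has_exon = "exon" %in% feature,
    has_cds = "CDS" %in% feature,
    ids_complete = all(
      grepl("transcript_id", attrs[needed], fixed = TRUE) &
        grepl("gene_id", attrs[needed], fixed = TRUE)
    )
  )
}
```

Calling `check_gtf_requirements("annotation.gtf")` (a placeholder path) returns a list of three logical values; all three should be `TRUE` for a file that meets the requirements stated above.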

The resulting GTF_annotation object (obtained after running load_annotation) contains:

txs: annotated transcript boundaries.
txs_gene: GRangesList including transcripts grouped by gene.
seqinfo: indicating chromosomes and chromosome lengths.
start_stop_codons: the set of annotated start and stop codons, with their respective transcript and gene ids. The columns reprentative_mostcommon, reprentative_boundaries and reprentative_5len mark, respectively, the most common start/stop codon, the most upstream/downstream start/stop codons, and the start/stop codons residing on transcripts with the longest 5'UTRs.
cds_txs: GRangesList including CDS grouped by transcript.
introns_txs: GRangesList including introns grouped by transcript.
cds_genes: GRangesList including CDS grouped by gene.
exons_txs: GRangesList including exons grouped by transcript.
exons_bins: the list of exonic bins with associated transcripts and genes.
junctions: the list of annotated splice junctions, with associated transcripts and genes.
genes: annotated gene coordinates.
threeutrs: collapsed set of 3'UTR regions, with corresponding gene_ids. This set does not overlap CDS regions.
fiveutrs: collapsed set of 5'UTR regions, with corresponding gene_ids. This set does not overlap CDS regions.
ncIsof: collapsed set of exonic regions of protein_coding genes, with corresponding gene_ids. This set does not overlap CDS regions.
ncRNAs: collapsed set of exonic regions of non_coding genes, with corresponding gene_ids. This set does not overlap CDS regions.
introns: collapsed set of intronic regions, with corresponding gene_ids. This set does not overlap exonic regions.
intergenicRegions: set of intergenic regions, defined as regions with no annotated genes on either strand.
trann: DataFrame object including (when available) the mapping between gene_id, gene_name, gene_biotypes, transcript_id and transcript_biotypes.
cds_txs_coords: transcript-level coordinates of ORF boundaries, for each annotated coding transcript. Additional columns are the same as for the start_stop_codons object.
genetic_codes: an object containing the list of genetic code ids used for each chromosome/organelle. See GENETIC_CODE_TABLE for more info.
genome: the name of the forged BSgenome package, or an FaFile_Circ object. Loaded with the load_annotation function.
stop_in_gtf: stop codon, as defined in the annotation.
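
The components listed above can be checked once the annotation has been loaded. The sketch below is illustrative only: `missing_components` is a hypothetical helper, the path is a placeholder, and it assumes that load_annotation takes the path to the *Rannot file and makes a list-like GTF_annotation object available in the calling environment.

```r
# Component names as documented in the list above.
documented_components <- c(
  "txs", "txs_gene", "seqinfo", "start_stop_codons", "cds_txs",
  "introns_txs", "cds_genes", "exons_txs", "exons_bins", "junctions",
  "genes", "threeutrs", "fiveutrs", "ncIsof", "ncRNAs", "introns",
  "intergenicRegions", "trann", "cds_txs_coords", "genetic_codes",
  "genome", "stop_in_gtf"
)

# Hypothetical helper (not part of RiboseQC): names of documented
# components that are absent from a list-like annotation object.
missing_components <- function(ann) {
  setdiff(documented_components, names(ann))
}

# Placeholder path to the *Rannot file in the annotation directory.
rannot_file <- "annotation/annotation.gtf_Rannot"

if (requireNamespace("RiboseQC", quietly = TRUE) && file.exists(rannot_file)) {
  # Assumption: this call creates GTF_annotation in the current environment.
  RiboseQC::load_annotation(rannot_file)
  if (exists("GTF_annotation")) {
    print(missing_components(GTF_annotation))  # ideally character(0)
    print(GTF_annotation$seqinfo)              # chromosomes and lengths
  }
}
```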

Value

A TxDb file and a *Rannot file are created in the specified annotation_directory. In addition, a BSgenome object is forged, installed, and linked to the *Rannot object.

Author(s)

Lorenzo Calviello, calviello.l.bio@gmail.com
