findOrfs: Finding ORFs in genomes
In microseq: Basic Biological Sequence Handling

Description

Finds all ORFs in prokaryotic genome sequences.

Usage

findOrfs(genome, circular = F, trans.tab = 11)

Arguments

`genome`	A fasta object (`tibble`) with the genome sequence(s).
`circular`	Logical indicating whether the genome sequences are complete, circular sequences.
`trans.tab`	Translation table, either 11 (the default) or 4; see Details.

Details

A prokaryotic Open Reading Frame (ORF) is defined as a sub-sequence starting with a start-codon (ATG, GTG or TTG), followed by an integer number of triplets (codons), and ending with a stop-codon (TAA, TGA or TAG, unless trans.tab = 4, see below). This function will locate all such ORFs in a genome.
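To make the definition concrete, the following is a minimal base R sketch that scans the three forward reading frames of a single toy sequence. The helper `find_orfs_forward` and the toy sequence are hypothetical and written only for illustration; this is not how findOrfs is implemented, and it ignores the reverse strand, truncated ORFs and circular sequences.

```r
# Hypothetical helper (not part of microseq): ORFs on the forward strand only
find_orfs_forward <- function(dna,
                              starts = c("ATG", "GTG", "TTG"),
                              stops  = c("TAA", "TGA", "TAG")) {
  n <- nchar(dna)
  hits <- list()
  for (frame in 1:3) {
    if (n - frame + 1 < 3) next          # no complete codon in this frame
    # Left coordinate of every complete codon in this reading frame
    pos <- seq(frame, n - 2, by = 3)
    codons <- substring(dna, pos, pos + 2)
    open.starts <- integer(0)            # start-codons seen since the last stop
    for (i in seq_along(codons)) {
      if (codons[i] %in% stops) {
        if (length(open.starts) > 0) {
          # Every open start-codon pairs with this stop-codon
          hits[[length(hits) + 1]] <- data.frame(Start = open.starts,
                                                 End = pos[i] + 2)
        }
        open.starts <- integer(0)
      } else if (codons[i] %in% starts) {
        open.starts <- c(open.starts, pos[i])
      }
    }
  }
  if (length(hits) == 0) {
    return(data.frame(Start = integer(0), End = integer(0)))
  }
  do.call(rbind, hits)
}

# Toy sequence: ATG at position 3, GTG at position 9, TAA at positions 15-17
toy.orfs <- find_orfs_forward("CCATGAAAGTGCCCTAAGG")
print(toy.orfs)
```

In this toy case both start-codons lie in the same reading frame as the single stop-codon, so the sketch reports two ORFs (3 to 17 and 9 to 17) that share the same end coordinate.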

The argument genome is a fasta object, i.e. a table with columns ‘Header’ and ‘Sequence’, and will typically have several sequences (chromosomes/plasmids/scaffolds/contigs). It is vital that the first token (characters before first space) of every ‘Header’ is unique, since this will be used to identify these genome sequences in the output.
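A quick way to check the uniqueness requirement before calling findOrfs could look like the sketch below. It uses base R only, and the toy table with its header texts is made up for illustration; with real data the table would come from readFasta.

```r
# Toy fasta-like table; the headers are invented for this example
genome <- data.frame(
  Header   = c("contig_1 length=5000", "contig_2 length=3200", "contig_1 plasmid"),
  Sequence = c("ATG", "GTG", "TTG"),
  stringsAsFactors = FALSE
)

# First token = all characters before the first space
first.token <- sub(" .*$", "", genome$Header)

# Tokens occurring more than once would be ambiguous in the output
dup.tokens <- unique(first.token[duplicated(first.token)])
if (length(dup.tokens) > 0) {
  message("Non-unique first tokens: ", paste(dup.tokens, collapse = ", "))
}
```

Here the first and third headers share the token contig_1, so one of them would have to be renamed first.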

By default the genome sequences are assumed to be linear, i.e. contigs or other incomplete fragments of a genome. In such cases there will usually be some truncated ORFs at each end, i.e. ORFs where either the start- or the stop-codon is lacking. In the orf.table returned by this function this is marked in the ‘Attributes’ column: the texts "Truncated=10" and "Truncated=01" indicate an ORF truncated at the beginning or at the end of the genomic sequence, respectively.

If the supplied genome is a completed genome, with circular chromosome/plasmids, set the flag circular = TRUE and no truncated ORFs will be listed. When an ORF runs across the origin of a circular genome sequence, its stop coordinate will be larger than the length of the genome sequence. This is in line with the specifications of the GFF3 format, where a ‘Start’ cannot be larger than the corresponding ‘End’.
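The two conventions described here can be illustrated with a small base R sketch. Everything in it is a toy: the table is built by hand rather than returned by findOrfs, and the value "Truncated=00" for a complete ORF is an assumption made for the example, not something stated in this documentation.

```r
# Hand-made table mimicking a few columns of an orf.table
orf.tbl <- data.frame(
  Seqid = c("contig_1", "contig_1", "contig_1"),
  Start = c(1, 40, 880),
  End   = c(30, 300, 900),
  # "Truncated=00" is an assumed marker for a complete ORF
  Attributes = c("Truncated=10", "Truncated=00", "Truncated=01"),
  stringsAsFactors = FALSE
)

# Discard ORFs flagged as truncated at either end
is.truncated <- grepl("Truncated=(10|01)", orf.tbl$Attributes)
complete.orfs <- orf.tbl[!is.truncated, ]

# Toy circular sequence of length 12 with an ORF crossing the origin:
# ATG at positions 10-12, AAA at 13-15 (i.e. 1-3), TAA at 16-18 (i.e. 4-6)
genome.seq <- "AAATAACCCATG"
orf.start <- 10
orf.end <- 18                        # larger than the sequence length
bases <- strsplit(genome.seq, "")[[1]]

# Wrap coordinates beyond the sequence length back to the beginning
idx <- ((orf.start:orf.end - 1) %% length(bases)) + 1
orf.seq <- paste(bases[idx], collapse = "")
```

Keeping the end coordinate larger than the sequence length means the ORF length is still End - Start + 1, and the modulo step recovers the actual positions on the circular sequence.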

An alternative translation table may be specified, and as of now the only alternative implemented is table 4. With table 4, the codon TGA is no longer a stop but codes for tryptophan. This coding is used by some bacteria (e.g. under the orders Entomoplasmatales and Mycoplasmatales).
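The effect of the translation table on where an ORF ends can be seen in a small base R sketch. The helper `stop_codons` is hypothetical and simply treats any value other than 4 as table 11, which is an assumption made for this example.

```r
# Hypothetical helper: stop-codons under translation table 11 or 4
stop_codons <- function(trans.tab = 11) {
  if (trans.tab == 4) c("TAA", "TAG") else c("TAA", "TGA", "TAG")
}

# Toy in-frame sequence: ATG AAA TGA CCC TAA
dna <- "ATGAAATGACCCTAA"
pos <- seq(1, nchar(dna) - 2, by = 3)
codons <- substring(dna, pos, pos + 2)

# Coordinate of the last base of the first stop-codon under each table
end.tab11 <- pos[which(codons %in% stop_codons(11))[1]] + 2
end.tab4  <- pos[which(codons %in% stop_codons(4))[1]] + 2
```

Under table 11 the toy ORF ends at the TGA (position 9), while under table 4 it reads through to the TAA (position 15).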

Note that for any given stop-codon there are usually multiple start-codons in the same reading frame. This function will return all such nested ORFs, i.e. the same stop position may appear multiple times. If you want ORFs with the most upstream start-codon only (LORFs), then filter the output from this function with lorfs.
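The idea behind keeping only the most upstream start-codon can be sketched in base R on a hand-made table. This is an illustration of the concept under simple assumptions (columns named Seqid, Strand, Start and End, as in GFF3), not the implementation used by lorfs.

```r
# Toy table with nested ORFs on both strands
orfs <- data.frame(
  Seqid  = c("contig_1", "contig_1", "contig_1", "contig_1"),
  Strand = c("+", "+", "-", "-"),
  Start  = c(3, 9, 50, 50),
  End    = c(17, 17, 70, 82),
  stringsAsFactors = FALSE
)

# The stop-codon sits at End on the + strand and at Start on the - strand
stop.pos <- ifelse(orfs$Strand == "+", orfs$End, orfs$Start)
key <- paste(orfs$Seqid, orfs$Strand, stop.pos)
orf.len <- orfs$End - orfs$Start + 1

# Among ORFs sharing a stop, the longest one has the most upstream start
keep <- orf.len == ave(orf.len, key, FUN = max)
lorf.sketch <- orfs[keep, ]
```

For the toy table this keeps the ORF from 3 to 17 on the + strand and the ORF from 50 to 82 on the - strand, one per stop position.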

Value

This function returns an orf.table, which is simply a tibble with columns adhering to the GFF3 format specifications (a gff.table), see readGFF. If you want to retrieve the actual ORF sequences, use gff2fasta.

Author(s)

Lars Snipen and Kristian Hovde Liland.

Examples

# Load microseq, and dplyr for the %>% and filter() used below
library(microseq)
library(dplyr)

# Using a genome file in this package
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")

# Reading genome and finding orfs
genome <- readFasta(genome.file)
orf.tbl <- findOrfs(genome)

# Pipeline for finding LORFs longer than 100 amino acids
# and collecting their sequences from the genome
findOrfs(genome) %>% 
  lorfs() %>% 
  filter(orfLength(., aa = TRUE) > 100) %>% 
  gff2fasta(genome) -> lorf.tbl

microseq documentation built on Aug. 21, 2023, 5:10 p.m.