findOrfs | R Documentation |
Finds all ORFs in prokaryotic genome sequences.
findOrfs(genome, circular = F, trans.tab = 11)
genome |
A fasta object ( |
circular |
Logical indicating if the genome sequences are completed, circular sequences. |
trans.tab |
Translation table. |
A prokaryotic Open Reading Frame (ORF) is defined as a sub-sequence
starting with a start-codon (ATG, GTG or TTG), followed by an integer number
of triplets (codons), and ending with a stop-codon (TAA, TGA or TAG, unless
trans.tab = 4, see below). This function will locate all such ORFs in
a genome.
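To make the definition concrete, here is a minimal base R sketch that tests whether a single forward-strand string satisfies it. The helper is_orf is hypothetical and not part of microseq, and it assumes that no in-frame stop-codon may occur before the final one, which the definition implies but does not state outright. It is not how findOrfs is implemented.

```r
# Hypothetical helper (not part of microseq): does `dna` satisfy the ORF
# definition above when read on the forward strand?
is_orf <- function(dna, trans.tab = 11) {
  dna <- toupper(dna)
  n <- nchar(dna)
  # Must be a whole number of codons, with at least a start and a stop
  if (n < 6 || n %% 3 != 0) return(FALSE)
  # Split into consecutive triplets
  pos <- seq(1, n, by = 3)
  codons <- substring(dna, pos, pos + 2)
  start.codons <- c("ATG", "GTG", "TTG")
  # With translation table 4, TGA is not a stop-codon
  stop.codons <- if (trans.tab == 4) c("TAA", "TAG") else c("TAA", "TGA", "TAG")
  k <- length(codons)
  # First codon is a start, last is a stop, and (assumption) no stop in between
  codons[1] %in% start.codons &&
    codons[k] %in% stop.codons &&
    !any(codons[-c(1, k)] %in% stop.codons)
}

is_orf("ATGAAATGA")                 # start, one codon, stop
is_orf("ATGTGATAA")                 # internal TGA under table 11
is_orf("ATGTGATAA", trans.tab = 4)  # TGA is read through under table 4
```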
The argument genome is a fasta object, i.e. a table with columns
‘Header’ and ‘Sequence’, and will typically have several sequences
(chromosomes/plasmids/scaffolds/contigs). It is vital that the first
token (characters before first space) of every ‘Header’ is
unique, since this will be used to identify these genome sequences in the
output.
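A quick way to check the uniqueness requirement before calling the function is sketched below in base R. The table and its headers are made up for illustration. A plain data.frame stands in for the fasta object here, since only the ‘Header’ column matters for this check.

```r
# Toy fasta-like table; headers and sequences are made up for illustration
genome <- data.frame(
  Header = c("contig_1 length=1200", "contig_2 length=800", "contig_1 plasmid"),
  Sequence = c("ATGAAATGA", "ATGCCCTAA", "TTGGGGTAG"),
  stringsAsFactors = FALSE
)
# First token = all characters before the first space
first.token <- sub(" .*$", "", genome$Header)
# Tokens occurring more than once would be ambiguous in the output
dup.tokens <- unique(first.token[duplicated(first.token)])
if (length(dup.tokens) > 0) {
  message("Non-unique first tokens: ", paste(dup.tokens, collapse = ", "))
}
```

In this toy table the token contig_1 appears twice, so the headers would need to be edited before use.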
By default the genome sequences are assumed to be linear, i.e. contigs or
other incomplete fragments of a genome. In such cases there will usually be
some truncated ORFs at each end, i.e. ORFs where either the start- or the
stop-codon is lacking. In the orf.table returned by this function this
is marked in the ‘Attributes’ column. The texts "Truncated=10" or
"Truncated=01" indicate truncation at the beginning or end of the genomic
sequence, respectively. If the supplied genome is a completed genome,
with circular chromosomes/plasmids, set the flag circular = TRUE and no
truncated ORFs will be listed. In cases where an ORF runs across the origin
of a circular genome sequence, the stop coordinate will be larger than the
length of the genome sequence. This is in line with the specifications of
the GFF3 format, where a ‘Start’ cannot be larger than the
corresponding ‘End’.
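The coordinate convention for ORFs crossing the origin can be illustrated with a little arithmetic. The helper wrap_coordinate and all numbers below are hypothetical and only serve as a sketch of how an ‘End’ beyond the sequence length maps back onto a circular sequence.

```r
# Hypothetical helper (not part of microseq): map a 1-based coordinate that
# runs past the end of a circular sequence back onto that sequence
wrap_coordinate <- function(pos, seq.length) {
  ((pos - 1) %% seq.length) + 1
}

# Made-up example: a circular sequence of 5000 bases with an ORF reported
# from 4900 to 5100, i.e. running across the origin
seq.length <- 5000
orf.start <- 4900
orf.end <- 5100
# Position on the circle where the ORF actually ends
wrapped.end <- wrap_coordinate(orf.end, seq.length)
# Because Start <= End is preserved, the length is still a simple difference
orf.length <- orf.end - orf.start + 1
```

Keeping ‘End’ unwrapped means that length calculations need no special case for the origin.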
An alternative translation table may be specified; currently the only alternative implemented is table 4. With this table the codon TGA is no longer a stop, but codes for tryptophan. This coding is used by some bacteria (e.g. in the orders Entomoplasmatales and Mycoplasmatales).
Note that for any given stop-codon there are usually multiple start-codons
in the same reading frame. This function will return all such nested ORFs,
i.e. the same stop position may appear multiple times. If you want ORFs with
the most upstream start-codon only (LORFs), then filter the output from this
function with lorfs.
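The idea of nested ORFs sharing a stop position can be sketched in base R on a toy table. This is an illustration of the concept only, not the implementation of lorfs. The table is made up, and the code assumes that in GFF3-style coordinates the stop-codon lies at ‘End’ on the plus strand and at ‘Start’ on the minus strand.

```r
# Toy table of nested ORFs; all coordinates are made up for illustration
orfs <- data.frame(
  Seqid  = c("contig_1", "contig_1", "contig_1", "contig_1"),
  Start  = c(10, 40, 200, 200),
  End    = c(99, 99, 259, 289),
  Strand = c("+", "+", "-", "-"),
  stringsAsFactors = FALSE
)
# Assumption: the stop-codon sits at End on the plus strand and at Start on
# the minus strand
stop.pos <- ifelse(orfs$Strand == "+", orfs$End, orfs$Start)
orf.len <- orfs$End - orfs$Start + 1
# ORFs with the same sequence, strand and stop position are nested
group <- paste(orfs$Seqid, orfs$Strand, stop.pos)
# The most upstream start-codon gives the longest ORF in each group
keep <- orf.len == ave(orf.len, group, FUN = max)
longest <- orfs[keep, ]
```

Here the four nested ORFs reduce to two, one per stop position.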
This function returns an orf.table, which is simply a tibble
with columns adhering to the GFF3 format specifications
(a gff.table), see readGFF. If you want to retrieve
the actual ORF sequences, use gff2fasta.
Lars Snipen and Kristian Hovde Liland.
readGFF, gff2fasta, lorfs.
# The pipeline below needs microseq, plus dplyr for %>% and filter()
library(microseq)
library(dplyr)

# Using a genome file in this package
genome.file <- file.path(path.package("microseq"), "extdata", "small.fna")

# Reading genome and finding orfs
genome <- readFasta(genome.file)
orf.tbl <- findOrfs(genome)

# Pipeline for finding LORFs of minimum length 100 amino acids
# and collecting their sequences from the genome
findOrfs(genome) %>%
  lorfs() %>%
  filter(orfLength(., aa = TRUE) > 100) %>%
  gff2fasta(genome) -> lorf.tbl