FindGenes: Find Genes in a Genome

Description

Predicts the start and stop positions of protein coding genes in a genome.

Usage

FindGenes(myDNAStringSet,
          geneticCode = getGeneticCode("11"),
          minGeneLength = 60,
          allowEdges = TRUE,
          allScores = FALSE,
          showPlot = FALSE,
          verbose = TRUE)

Arguments

myDNAStringSet

A DNAStringSet object of unaligned sequences representing a genome.

geneticCode

A named character vector defining the translation from codons to amino acids. Optionally, an "alt_init_codons" attribute can be used to specify alternative initiation codons. By default, the bacterial and archaeal genetic code is used, which has seven possible initiation codons: ATG, GTG, TTG, CTG, ATA, ATT, and ATC.

minGeneLength

Integer specifying the minimum length of genes to find in the genome.

allowEdges

Logical determining whether to allow genes that run off the edge of the sequences. If TRUE (the default), genes can be identified with implied starts or ends outside the boundaries of myDNAStringSet, although the boundary will be set to the last possible codon position. This is useful when genome sequences are circular or incomplete.

allScores

Logical indicating whether to return information about all possible open reading frames or only the predicted genes (the default).

showPlot

Logical determining whether a plot is displayed with the distribution of gene lengths and scores. (See details section below.)

verbose

Logical indicating whether to print information about the predictions on each iteration. (See details section below.)
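
The arguments above can be combined as in the following sketch. It assumes DECIPHER is installed and reuses the test genome shipped with the package; the value chosen for minGeneLength is arbitrary and only illustrative, not a recommendation.

```r
library(DECIPHER)

# test genome shipped with DECIPHER
fas <- system.file("extdata",
    "Chlamydia_trachomatis_NC_000117.fas.gz",
    package="DECIPHER")
genome <- readDNAStringSet(fas)

# the default genetic code and its alternative initiation codons
code <- getGeneticCode("11")
attr(code, "alt_init_codons")

# require longer genes (90 is an arbitrary illustrative value),
# return every open reading frame, and suppress progress output
orfs <- FindGenes(genome,
    geneticCode = code,
    minGeneLength = 90,
    allScores = TRUE,
    verbose = FALSE)
```

Passing the genetic code explicitly makes it easy to swap in a different table from getGeneticCode when a genome does not use the default code.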

Details

Protein coding genes are identified by learning their characteristic signature directly from the genome, i.e., ab initio prediction. Gene signatures are derived from the content of the open reading frame and surrounding signals that indicate the presence of a gene. Genes are assumed to not contain introns or frame shifts, making the function best suited for prokaryotic genomes.
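
To make the notion of an open reading frame concrete, the sketch below scans the forward strand of a sequence in base R. It is only an illustration of what an intron-free candidate looks like, not the method used by FindGenes, which additionally learns scores from the genome and considers both strands. The helper name findForwardORFs and its default codon sets are hypothetical and are not part of DECIPHER.

```r
# Hypothetical helper (not part of DECIPHER): list forward-strand open
# reading frames that begin at the earliest start codon after a stop and
# end at the next in-frame stop codon.
findForwardORFs <- function(dna,
                            starts = c("ATG", "GTG", "TTG"),
                            stops = c("TAA", "TAG", "TGA"),
                            minLength = 60) {
    dna <- toupper(dna)
    n <- nchar(dna)
    orfs <- list()
    for (frame in 1:3) {
        if (n - frame + 1 < 3) next # no complete codon in this frame
        pos <- seq(frame, n - 2, by = 3) # first base of each codon
        codons <- substring(dna, pos, pos + 2)
        begin <- NA # first base of the current candidate's start codon
        for (i in seq_along(codons)) {
            if (is.na(begin) && codons[i] %in% starts) {
                begin <- pos[i]
            } else if (codons[i] %in% stops) {
                end <- pos[i] + 2 # last base of the stop codon
                if (!is.na(begin) && end - begin + 1 >= minLength)
                    orfs[[length(orfs) + 1]] <-
                        data.frame(Frame = frame, Begin = begin, End = end)
                begin <- NA
            }
        }
    }
    do.call(rbind, orfs) # NULL when nothing qualifies
}

# a toy sequence with one short reading frame (ATG AAA TAG) in frame 3
findForwardORFs("CCATGAAATAGGG", minLength = 9)
```

With the toy input, the only candidate spans positions 3 to 11 in frame 3. A simple scan like this returns many spurious candidates on a real genome, which is why a scoring step is needed to separate genes from background open reading frames.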

If showPlot is TRUE, a plot is displayed with four panels. The upper left panel shows the fitted distribution of background open reading frame lengths. The upper right panel shows this distribution on top of the fitted distribution of predicted gene lengths. The lower left panel shows the fitted distribution of scores for the intergenic spacing between genes on the same and opposite genome strands. The lower right panel shows the total score of open reading frames and predicted genes by length.

If verbose is TRUE, information is shown about the predictions at each iteration of gene finding. The mean score difference between genes and non-genes is updated at each iteration, unless it is negative, in which case the score is dropped and a "-" is displayed. The columns denote the number of iterations ("Iter"), number of codon scoring models ("Models"), start codon scores ("Start"), upstream k-mer motif scores ("Motif"), mRNA folding scores ("Fold"), initial codon bias scores ("Init"), upstream nucleotide bias scores ("UpsNt"), termination codon bias scores ("Term"), ribosome binding site scores ("RBS"), codon autocorrelation scores ("Auto"), stop codon scores ("Stop"), and number of predicted genes ("Genes").

Value

An object of class Genes, which is stored as a matrix with information corresponding to each open reading frame.
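
Because the result is stored as a matrix, its shape and fields can be inspected with standard matrix accessors. The following is a minimal sketch, assuming DECIPHER is installed; it lists the column names rather than assuming them, and the Genes-class help page remains the reference for what each column means.

```r
library(DECIPHER)

# test genome shipped with DECIPHER
fas <- system.file("extdata",
    "Chlamydia_trachomatis_NC_000117.fas.gz",
    package="DECIPHER")
genome <- readDNAStringSet(fas)

x <- FindGenes(genome, verbose = FALSE)

class(x)    # class of the returned object
dim(x)      # one row per open reading frame
colnames(x) # fields recorded for each open reading frame
```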

Author(s)

Erik Wright eswright@pitt.edu

See Also

ExtractGenes, Genes-class, WriteGenes

Examples

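library(DECIPHER)

# import a test genome
fas <- system.file("extdata",
	"Chlamydia_trachomatis_NC_000117.fas.gz",
	package="DECIPHER")
genome <- readDNAStringSet(fas)

# predict genes and print a summary of the result
x <- FindGenes(genome)
x

# extract the predicted genes as DNA and as translated proteins
genes <- ExtractGenes(x, genome)
proteins <- ExtractGenes(x, genome, type="AAStringSet")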

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.