AssessGenes: Assess Genes

Description

Assess and categorize a set of genes for a genome using proteomics hits, evolutionarily conserved starts, and evolutionarily conserved stops as evidence.

Usage

AssessGenes(geneLeftPos,
            geneRightPos = NA_integer_,
            geneStrand = NA_character_,
            inputMapObj,
            geneSource = "",
            minCovNum = 10,
            minCovPct = 5,
            minConCovRatio_Strong = 0.99,
            limConCovRatio_NotCon = 0.8,
            maxN_AltConStart = 200,
            frac_AltConStart = 0.5,
            minConCovRatio_Stop = 0.5,
            noConStopsGeneFrac = 0.5,
            minNumProtHitsNORFs = 2L,
            minLenNORFs = 0,
            allowNestedNORFs = FALSE,
            useNTermProt = FALSE,
            verbose = TRUE)

Arguments

geneLeftPos

An integer vector with the left positions of each gene, in terms of the forward strand. Can also be a GRanges object from the GenomicRanges package that holds all of the positional information (including strand) for the genes. In that case, the next two parameters should be left as NA.

geneRightPos

An integer vector with the right positions of each gene, in terms of the forward strand. Should be left at the default value of NA_integer_ if geneLeftPos is a GRanges object.

geneStrand

A character vector consisting of "+" and "-", specifying which strand each gene is on. Should be left at the default value of NA_character_ if geneLeftPos is a GRanges object.

inputMapObj

Either an object of class Assessment and subclass DataMap, or a character string giving the strain identifier of one such object from AssessORFData.

geneSource

Optional character string that describes the source of the gene set, e.g., a database or gene prediction program. Used when viewing and identifying the object returned by the function.

minCovNum

Minimum number of related genomes required to have synteny to a position in the central genome. Recommended to use the default value.

minCovPct

Minimum percentage of related genomes required to have synteny to a position in the central genome. Must be an integer ranging from 0 to 100. Recommended to use the default value.

minConCovRatio_Strong

Minimum value of the start codon conservation to coverage ratio needed to call a start strongly conserved. Must range from 0 to 1. Lower values allow more conserved starts through. Recommended to use the default value.

limConCovRatio_NotCon

Maximum, non-inclusive value of the conservation to coverage ratio needed to call a possible conserved start not conserved. Used when deciding how to categorize the conserved start evidence. Must range from 0 to 1. Recommended to use the default value.

maxN_AltConStart

Maximum nucleotide distance that non-predicted, conserved starts can be from the start of an ORF (the previous in-frame stop) in order for such starts to be considered an alternative to the predicted start. Recommended to use the default value.

frac_AltConStart

Value from 0 to 1 describing the fractional range of positions in an ORF, starting from the previous in-frame stop and moving towards the ORF-ending stop, to use in searching for non-predicted, conserved starts in order for such starts to be considered an alternative to the predicted start. For example, a value of 0.25 means that the first quarter of the ORF is checked, a value of 0.5 corresponds to the first half of the ORF, etc. Recommended to use the default value.

minConCovRatio_Stop

Minimum value of the stop codon conservation to coverage ratio needed to say a position in the central genome corresponds to a conserved stop across the related genomes. Must range from 0 to 1. Lower values allow more conserved stops through. Recommended to use the default value.

noConStopsGeneFrac

Value from 0 to 1 describing the fractional range of positions in a gene, starting from the start of the gene and moving towards the stop of the gene, to use in searching for conserved stops. For example, a value of 0.25 means that the first quarter of the gene is checked for conserved stops, a value of 0.5 corresponds to the first half of the gene, etc. Recommended to use the default value.

minNumProtHitsNORFs

Minimum number of peptide hits that an ORF with protein hits but no given/predicted gene start must have in order to be included in the final output.

minLenNORFs

Minimum ORF length required to include an ORF with protein hits but no given/predicted gene start in the final output.

allowNestedNORFs

Logical indicating whether or not to include ORFs with protein hits but no given/predicted gene starts that are completely nested within an ORF in another frame in the final output.

useNTermProt

Logical indicating whether or not to treat proteomics evidence in the given mapping object as originating from N-terminal proteomics experiments. The mapping object must be built with N-terminal proteomics data. Default value is FALSE.

verbose

Logical indicating whether or not to display progress and status messages.
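The two fractional-range arguments, frac_AltConStart and noConStopsGeneFrac, both describe a window that begins at one end of a region and covers a fraction of its length. The sketch below illustrates that idea for a forward-strand region. The function name fractionalWindow, the use of floor() for rounding, and the inclusive end positions are assumptions made for illustration and are not taken from the package source.

```r
# Hypothetical helper: positions covered by the first `frac` of a region.
# regionStart and regionEnd are inclusive forward-strand positions.
fractionalWindow <- function(regionStart, regionEnd, frac) {
  stopifnot(frac >= 0, frac <= 1, regionStart <= regionEnd)
  regionLen <- regionEnd - regionStart + 1
  # Assumed rounding rule: round the window length down
  windowEnd <- regionStart + floor(frac * regionLen) - 1
  c(from = regionStart, to = windowEnd)
}

# A 400-nucleotide region spanning positions 101 to 500
firstQuarter <- fractionalWindow(101, 500, 0.25)
firstHalf <- fractionalWindow(101, 500, 0.5)
print(firstQuarter)
print(firstHalf)
```

For frac_AltConStart the region would begin at the previous in-frame stop, and for noConStopsGeneFrac it would begin at the gene start.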

Details

For each of the given genes, AssessGenes assigns a category based on where conserved starts, conserved stops, and/or proteomics hits are located in relation to the start of the gene. The category assignments for the genes are stored in the CategoryAssignments vector in the Results object returned by the function. Please see Assessment-class for a list of all possible categories and their descriptions.
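As a rough illustration of how the start-related thresholds minConCovRatio_Strong and limConCovRatio_NotCon might partition a start position, the sketch below computes a conservation to coverage ratio and compares it against both defaults. The function name classifyStartRatio, the definition of the ratio as conserved count divided by coverage count, and the three labels are assumptions for illustration only; the package's actual categories are listed in Assessment-class.

```r
# Hypothetical helper: label a start position by its conservation to coverage ratio.
# conserved: number of related genomes with a start codon at the position (assumed)
# coverage:  number of related genomes with synteny to the position (assumed)
classifyStartRatio <- function(conserved, coverage,
                               minConCovRatio_Strong = 0.99,
                               limConCovRatio_NotCon = 0.8) {
  ratio <- conserved / coverage
  ifelse(ratio >= minConCovRatio_Strong, "strong",
         # The lower limit is non-inclusive: a ratio equal to it is not "not conserved"
         ifelse(ratio < limConCovRatio_NotCon, "not conserved", "intermediate"))
}

startLabels <- classifyStartRatio(conserved = c(100, 90, 50), coverage = 100)
print(startLabels)
```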

If geneLeftPos is a GRanges object, then the left and right positions of each gene, along with the strand of each gene, are extracted from the object. Any sequence names given for the genes within the GRanges object are ignored, and the CategoryAssignments in the returned Results object follow the same order in which the genes are listed within the GRanges object.
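A minimal sketch of preparing a GRanges object for this function, assuming the GenomicRanges package is installed. The coordinates and the sequence name "chr1" are made up for illustration, and myMapObj stands for a hypothetical mapping object, so the AssessGenes call is left commented out.

```r
library(GenomicRanges)

# Hypothetical gene coordinates, given relative to the forward strand
myGenes <- GRanges(seqnames = "chr1",
                   ranges = IRanges(start = c(100L, 500L, 1200L),
                                    end = c(400L, 1100L, 1800L)),
                   strand = c("+", "-", "+"))

# The positional fields that are described as being extracted from the object
grLeft <- start(myGenes)
grRight <- end(myGenes)
grStrand <- as.character(strand(myGenes))

# geneRightPos and geneStrand stay at their NA defaults in this case
# myResObj <- AssessGenes(geneLeftPos = myGenes, inputMapObj = myMapObj)
```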

If gene positional information is instead given as three vectors, then the three vectors, geneLeftPos, geneRightPos, and geneStrand, must all be of the same length. The same index within each vector must provide information on the same gene (think of the vectors as columns of the same table). geneLeftPos and geneRightPos describe the upstream and downstream positions (respectively) for each gene in terms of the forward strand. For genes on the forward strand, geneLeftPos corresponds to the start positions and geneRightPos corresponds to stop positions. For genes on the reverse strand, geneLeftPos corresponds to the stop positions and geneRightPos corresponds to the start positions. Gene positions on the reverse strand must be relative to the 5' to 3' direction of the forward strand (as opposed to being relative to the 5' to 3' direction of the reverse strand). This means that none of the elements of geneLeftPos can be greater than (or equal to) the corresponding element in geneRightPos. The CategoryAssignments in the returned Results object has the same length as and aligns with the indexing of the three given gene positional information vectors.
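The requirements on the three vectors can be checked before calling the function. The sketch below uses made-up coordinates and base R only; the variable names isForward, geneStart, and geneStop are introduced here for illustration and are not part of the package.

```r
# Hypothetical gene coordinates, relative to the forward strand
geneLeftPos <- c(100L, 500L, 1200L)
geneRightPos <- c(400L, 1100L, 1800L)
geneStrand <- c("+", "-", "+")

# The vectors must line up index by index, like columns of one table
stopifnot(length(geneLeftPos) == length(geneRightPos),
          length(geneLeftPos) == length(geneStrand),
          all(geneStrand %in% c("+", "-")),
          all(geneLeftPos < geneRightPos))

# Which end is the start depends on the strand:
# forward genes start at the left position, reverse genes at the right
isForward <- geneStrand == "+"
geneStart <- ifelse(isForward, geneLeftPos, geneRightPos)
geneStop <- ifelse(isForward, geneRightPos, geneLeftPos)
print(data.frame(geneStrand, geneStart, geneStop))
```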

Please ensure that the same genome used in the mapping function is also used to derive the set of genes for this assessment function. The function raises an error only if any gene positions are outside the bounds of the genome; it makes no other checks to confirm that the genes are valid for the genome.

The minimum coverage required when determining conserved starts and stops is the larger of two values: minCovNum (option 1), or minCovPct divided by 100 and then multiplied by the number of related genomes (option 2).
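The coverage rule can be worked through with the default values. In the sketch below, the function name requiredCoverage and the genome counts are hypothetical, and no rounding is applied because the source does not say whether the package rounds the percentage-based value.

```r
# Hypothetical helper implementing the "larger of the two options" rule
requiredCoverage <- function(minCovNum, minCovPct, numRelatedGenomes) {
  max(minCovNum, minCovPct / 100 * numRelatedGenomes)
}

# With many related genomes, the percentage-based value (option 2) is larger
covManyGenomes <- requiredCoverage(minCovNum = 10, minCovPct = 5,
                                   numRelatedGenomes = 400)

# With few related genomes, minCovNum (option 1) is larger
covFewGenomes <- requiredCoverage(minCovNum = 10, minCovPct = 5,
                                  numRelatedGenomes = 100)
print(c(covManyGenomes, covFewGenomes))
```

With the defaults, option 2 takes over once there are more than 200 related genomes, since 5 percent of 200 equals the minCovNum default of 10.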

Additionally, open reading frames with proteomics evidence but no gene start are categorized based on whether or not there is a conserved start upstream of the proteomic evidence. The positions and lengths of these open reading frames are included in the N_CS-_PE+_ORFs and N_CS+_PE+_ORFs matrices within the final object that is returned.

If the proteomics evidence in the given mapping object comes from N-terminal proteomics experiments (i.e., if the value of the NTermProteomics item within the mapping object is TRUE), useNTermProt can be set to TRUE to impose stricter requirements on how proteomics evidence is used to judge the correctness of the given genes. When useNTermProt is TRUE, the start of the first peptide mapping to an ORF that contains a given gene must either align directly with the start of that gene or be one codon off from it (as happens when the protein product of the gene has undergone N-terminal methionine excision) for the gene to be considered as having supporting protein evidence. If the first peptide hit does not align in this way, the gene is considered as having disproving protein evidence. Currently, N-terminal proteomics does not produce enough N-terminal peptides, so setting this flag to TRUE does not provide meaningful results. It is recommended to leave this flag as FALSE in all situations.
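The alignment requirement for N-terminal evidence can be sketched as a simple offset check. The function name supportsGeneStart is hypothetical, the example covers forward-strand genes only, and it assumes that "one codon off" means the peptide begins three nucleotides downstream of the gene start.

```r
# Hypothetical helper: does the first peptide support the given gene start?
# Both arguments are forward-strand nucleotide positions.
supportsGeneStart <- function(geneStart, firstPeptideStart) {
  offset <- firstPeptideStart - geneStart
  # 0: peptide begins at the start codon
  # 3: peptide begins one codon in (initiator methionine excised, assumed direction)
  offset %in% c(0, 3)
}

nTermChecks <- supportsGeneStart(geneStart = c(100, 100, 100),
                                 firstPeptideStart = c(100, 103, 106))
print(nTermChecks)
```

Under the rule described above, a gene for which this check fails would be treated as having disproving protein evidence.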

Value

An object of class Assessment and subclass Results

See Also

Assessment-class

Examples

## Example showing the minimum number of arguments that need to be specified:

## Not run: 
myResObj <- AssessGenes(geneLeftPos = myGenesLeft,
                        geneRightPos = myGenesRight,
                        geneStrand = myGenesStrand,
                        inputMapObj = myMapObj)

## End(Not run)



## Example from vignette is shown below

currMapObj <- readRDS(system.file("extdata",
                                  "MGAS5005_PreSaved_DataMapObj.rds",
                                  package = "AssessORF"))

currProdigal <- readLines(system.file("extdata",
                                      "MGAS5005_Prodigal.sco",
                                      package = "AssessORF"))[-1:-2]

## Split each line on "_" once; fields 2 to 4 hold the left position,
## right position, and strand of each predicted gene
prodigalFields <- strsplit(currProdigal, "_", fixed = TRUE)

prodigalLeft <- as.numeric(sapply(prodigalFields, `[`, 2L))
prodigalRight <- as.numeric(sapply(prodigalFields, `[`, 3L))
prodigalStrand <- sapply(prodigalFields, `[`, 4L)

currResObj <- AssessGenes(geneLeftPos = prodigalLeft,
                          geneRightPos = prodigalRight,
                          geneStrand = prodigalStrand,
                          inputMapObj = currMapObj,
                          geneSource = "Prodigal")

print(currResObj)

DRK248/AssessORF documentation built on Jan. 30, 2020, 7:05 p.m.