Assess and categorize a set of genes for a genome using proteomics hits, evolutionarily conserved starts, and evolutionarily conserved stops as evidence
AssessGenes(geneLeftPos,
            geneRightPos = NA_integer_,
            geneStrand = NA_character_,
            inputMapObj,
            geneSource = "",
            minCovNum = 10,
            minCovPct = 5,
            minConCovRatio_Strong = 0.99,
            limConCovRatio_NotCon = 0.8,
            maxN_AltConStart = 200,
            frac_AltConStart = 0.5,
            minConCovRatio_Stop = 0.5,
            noConStopsGeneFrac = 0.5,
            minNumProtHitsNORFs = 2L,
            minLenNORFs = 0,
            allowNestedNORFs = FALSE,
            useNTermProt = FALSE,
            verbose = TRUE)
geneLeftPos |
An integer vector with the left positions of each gene, in terms of the forward strand.
Can also be a GRanges object (see Details). |
geneRightPos |
An integer vector with the right positions of each gene, in terms of the forward strand.
Should be left at the default value of NA_integer_ if geneLeftPos is a GRanges object. |
geneStrand |
A character vector consisting of "+" and "-", specifying which strand each gene is on.
Should be left at the default value of NA_character_ if geneLeftPos is a GRanges object. |
inputMapObj |
EITHER an object of class |
geneSource |
Optional character string that describes the source of the gene set, i.e. a database or gene prediction program. Used when viewing and identifying the object returned by the function. |
minCovNum |
Minimum number of related genomes required to have synteny to a position in the central genome. Recommended to use the default value. |
minCovPct |
Minimum percentage of related genomes required to have synteny to a position in the central genome. Must be an integer ranging from 0 to 100. Recommended to use the default value. |
minConCovRatio_Strong |
Minimum value of the start codon conservation to coverage ratio needed to call a start strongly conserved. Must range from 0 to 1. Lower values allow more conserved starts through. Recommended to use the default value. |
limConCovRatio_NotCon |
Maximum, non-inclusive value of the conservation to coverage ratio needed to call a possible conserved start not conserved. Used when making a decision on how to categorize the conserved start evidence. Must range from 0 to 1. Recommended to use the default value. |
maxN_AltConStart |
Maximum nucleotide distance that non-predicted, conserved starts can be away from the start of an ORF (the previous in-frame stop) in order for such starts to be considered an alternative to the predicted start. Recommended to use the default value. |
frac_AltConStart |
Value from 0 to 1 describing the fractional range of positions in an ORF, starting from the previous in-frame stop and moving towards the ORF-ending stop, to use in searching for non-predicted, conserved starts in order for such starts to be considered an alternative to the predicted start. For example, a value of 0.25 means that the first quarter of the ORF is checked, a value of 0.5 corresponds to the first half of the ORF, etc. Recommended to use the default value. |
minConCovRatio_Stop |
Minimum value of the stop codon conservation to coverage ratio needed to say a position in the central genome corresponds to a conserved stop across the related genomes. Must range from 0 to 1. Lower values allow more conserved stops through. Recommended to use the default value. |
noConStopsGeneFrac |
Value from 0 to 1 describing the fractional range of positions in a gene, starting from the start of the gene and moving towards the stop of the gene, to use in searching for conserved stops. For example, a value of 0.25 means that the first quarter of the gene is checked for conserved stops, a value of 0.5 corresponds to the first half of the gene, etc. Recommended to use the default value. |
minNumProtHitsNORFs |
Number of peptide hits required to be in an ORF with protein hits but no given/predicted gene start in order for such an ORF to be included in the final output. |
minLenNORFs |
Minimum ORF length required to include an ORF with protein hits but no given/predicted gene start in the final output. |
allowNestedNORFs |
Logical indicating whether or not to include ORFs with protein hits but no given/predicted gene starts that are completely nested within an ORF in another frame in the final output. |
useNTermProt |
Logical indicating whether or not to treat proteomics evidence in the given mapping object as originating from N-terminal proteomics experiments. The mapping object must be built with N-terminal proteomics data. Default value is FALSE. |
verbose |
Logical indicating whether or not to display progress and status messages. |
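The two start-codon thresholds described above, minConCovRatio_Strong and limConCovRatio_NotCon, split the conservation to coverage ratio into three bands. The following is a minimal sketch of that split, not the package's own code: the helper name classifyStartRatio and the three labels are hypothetical, and treating the strong threshold as inclusive is an assumption based on the word "minimum" in its description.

```r
# Hypothetical helper, not part of AssessORF: labels a start by its
# conservation-to-coverage ratio using the two documented thresholds.
classifyStartRatio <- function(conCovRatio,
                               minConCovRatio_Strong = 0.99,
                               limConCovRatio_NotCon = 0.8) {
  stopifnot(all(conCovRatio >= 0 & conCovRatio <= 1))
  ifelse(conCovRatio >= minConCovRatio_Strong,
         "strong",                    # at or above the minimum (assumed inclusive)
         ifelse(conCovRatio < limConCovRatio_NotCon,
                "not conserved",      # strictly below the non-inclusive maximum
                "intermediate"))      # label invented for this sketch
}

# Made-up ratios for illustration
exampleRatios <- c(1, 0.99, 0.9, 0.8, 0.5)
exampleLabels <- classifyStartRatio(exampleRatios)
```

With the default thresholds, a ratio of exactly 0.8 is not labelled "not conserved" in this sketch, because limConCovRatio_NotCon is described as a non-inclusive maximum.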
For each of the given genes, AssessGenes assigns a category based on where conserved starts, conserved stops, and/or proteomics hits are located in relation to the start of the gene. The category assignments for the genes are stored in the CategoryAssignments vector in the Results object returned by the function. Please see Assessment-class for a list of all possible categories and their descriptions.
If geneLeftPos is a GRanges object, then the left and right positions of each gene along with the strand of each gene are extracted from the object. Any sequence names given for the genes within the GRanges object are ignored, and the CategoryAssignments in the returned Results object follow the same order in which the genes are listed within the GRanges object.
If gene positional information is instead given as three vectors, then the three vectors, geneLeftPos, geneRightPos, and geneStrand, must all be of the same length. The same index within each vector must provide information on the same gene (think of the vectors as columns of the same table). geneLeftPos and geneRightPos describe the upstream and downstream positions (respectively) for each gene in terms of the forward strand. For genes on the forward strand, geneLeftPos corresponds to the start positions and geneRightPos corresponds to the stop positions. For genes on the reverse strand, geneLeftPos corresponds to the stop positions and geneRightPos corresponds to the start positions. Gene positions on the reverse strand must be relative to the 5' to 3' direction of the forward strand (as opposed to being relative to the 5' to 3' direction of the reverse strand). This means that none of the elements of geneLeftPos can be greater than (or equal to) the corresponding element in geneRightPos. The CategoryAssignments in the returned Results object has the same length as and aligns with the indexing of the three given gene positional information vectors.
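The constraints on the three vectors can be checked before calling AssessGenes. The sketch below is one way a caller might do that in base R; checkGeneVectors and the ex* variables are hypothetical names with made-up coordinates, and nothing here is taken from the package's internals.

```r
# Hypothetical pre-check, not part of AssessORF: enforces the documented
# constraints on the three positional vectors.
checkGeneVectors <- function(geneLeftPos, geneRightPos, geneStrand) {
  stopifnot(length(geneLeftPos) == length(geneRightPos),
            length(geneLeftPos) == length(geneStrand),
            all(geneStrand %in% c("+", "-")),
            all(geneLeftPos < geneRightPos))  # left must be strictly less than right
  invisible(TRUE)
}

# Made-up coordinates for three genes, all in forward-strand terms
exLeft <- c(100L, 500L, 900L)
exRight <- c(400L, 800L, 1200L)
exStrand <- c("+", "-", "+")
checkGeneVectors(exLeft, exRight, exStrand)

# Start and stop positions implied by the strand:
# forward genes start at the left position, reverse genes at the right.
exStart <- ifelse(exStrand == "+", exLeft, exRight)
exStop <- ifelse(exStrand == "+", exRight, exLeft)
```

Here the second gene is on the reverse strand, so its start is taken from exRight and its stop from exLeft.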
Please ensure that the same genome used in the mapping function is also used to derive the set of genes for this assessment function. The function will only error if any gene positions are outside the bounds of the genome and does not make any other checks to make sure the genes are valid for the genome.
The maximum of either minCovNum (option 1) or minCovPct divided by 100 then multiplied by the number of related genomes (option 2) is used as the minimum coverage required in determining conserved starts and stops.
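As a worked illustration of the coverage rule, the following sketch computes the larger of the two options. The function name minCoverageRequired and its numRelatedGenomes argument are hypothetical, and any rounding the package may apply to the percentage-based value is not shown.

```r
# Hypothetical helper, not part of AssessORF: combines the two coverage
# thresholds as described in the text (no rounding applied).
minCoverageRequired <- function(numRelatedGenomes,
                                minCovNum = 10,
                                minCovPct = 5) {
  max(minCovNum, minCovPct / 100 * numRelatedGenomes)
}

covSmall <- minCoverageRequired(100)   # 5% of 100 is 5, so minCovNum (10) is used
covLarge <- minCoverageRequired(1000)  # 5% of 1000 is 50, so the percentage is used
```

With the default values, the percentage-based option only takes over once there are more than 200 related genomes.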
Additionally, open reading frames with proteomics evidence but no gene start are categorized based on whether or not there is a conserved start upstream of the proteomic evidence. The positions and lengths of these open reading frames are included in the N_CS-_PE+_ORFs and N_CS+_PE+_ORFs matrices within the final object that is returned.
If the proteomics evidence provided in the given mapping object comes from N-terminal proteomics experiments (i.e., if the value of the NTermProteomics item within the mapping object is TRUE), useNTermProt can be set to TRUE to impose stricter requirements on the use of proteomics evidence in determining the correctness of the given genes. When useNTermProt is set to TRUE, the start of the first peptide mapping to an ORF where there is a given gene must directly align with the start of that gene or be one codon off from the start (in cases where the protein product of the gene has undergone N-terminal methionine excision) in order for the gene to be considered as having supporting protein evidence. If the first peptide hit does not align in this way, the gene is considered as having disproving protein evidence. Currently, N-terminal proteomics does not produce enough N-terminal peptides, so setting this flag to TRUE does not provide meaningful results. It is recommended to leave this flag as FALSE in all situations.
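The alignment rule for N-terminal evidence can be pictured with a small sketch. This is an illustration only, restricted to forward-strand genes: nTermSupports is a hypothetical helper, the coordinates are made up, and reading "one codon off" as exactly three nucleotides downstream of the gene start is an assumption drawn from the methionine excision case.

```r
# Hypothetical helper, not AssessORF code; forward-strand genes only.
# TRUE when the first peptide starts at the gene start or one codon
# (assumed to mean 3 nucleotides downstream) after it.
nTermSupports <- function(firstPeptideStart, geneStart) {
  offset <- firstPeptideStart - geneStart
  offset %in% c(0L, 3L)
}

# Made-up positions: exact match, one codon downstream, and a later hit
exSupport <- nTermSupports(c(100L, 103L, 130L), geneStart = 100L)
```

Under the rule described above, the third case would count as disproving rather than supporting evidence for the gene.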
An object of class Assessment and subclass Results
## Example showing the minimum number of arguments that need to be specified:
## Not run:
myResObj <- AssessGenes(geneLeftPos = myGenesLeft,
                        geneRightPos = myGenesRight,
                        geneStrand = myGenesStrand,
                        inputMapObj = myMapObj)
## End(Not run)
## Example from vignette is shown below
currMapObj <- readRDS(system.file("extdata",
                                  "MGAS5005_PreSaved_DataMapObj.rds",
                                  package = "AssessORF"))
currProdigal <- readLines(system.file("extdata",
                                      "MGAS5005_Prodigal.sco",
                                      package = "AssessORF"))[-1:-2]
prodigalLeft <- as.numeric(sapply(strsplit(currProdigal, "_", fixed=TRUE), `[`, 2L))
prodigalRight <- as.numeric(sapply(strsplit(currProdigal, "_", fixed=TRUE), `[`, 3L))
prodigalStrand <- sapply(strsplit(currProdigal, "_", fixed=TRUE), `[`, 4L)
currResObj <- AssessGenes(geneLeftPos = prodigalLeft,
                          geneRightPos = prodigalRight,
                          geneStrand = prodigalStrand,
                          inputMapObj = currMapObj,
                          geneSource = "Prodigal")
print(currResObj)
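The parsing step in the vignette example can be tried without the package data. The sketch below uses two made-up lines in the underscore-delimited layout that the example's indexing implies (an index field, then left position, right position, and strand); the exampleSco and sco* names are hypothetical. It splits each line once and uses vapply so the type of each extracted field is checked.

```r
# Made-up lines in the layout implied by the indexing in the example above
exampleSco <- c(">1_337_2799_+", ">2_2801_3733_-")

# Split each line once, then pull out fields 2 to 4
scoFields <- strsplit(exampleSco, "_", fixed = TRUE)
scoLeft <- as.integer(vapply(scoFields, `[`, character(1), 2L))
scoRight <- as.integer(vapply(scoFields, `[`, character(1), 3L))
scoStrand <- vapply(scoFields, `[`, character(1), 4L)
```

The resulting vectors have one element per gene and can be passed as geneLeftPos, geneRightPos, and geneStrand.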