View source: R/MapAssessmentData.R
Maps proteomics hits and evolutionarily conserved starts to a central genome
MapAssessmentData(genomes_DBFile,
                  tblName = "Seqs",
                  central_ID,
                  related_IDs,
                  protHits_Seqs,
                  protHits_Scores = rep.int(1, length(protHits_Seqs)),
                  strainID = "",
                  speciesName = "",
                  protHits_Threshold = 0,
                  protHits_IsNTerm = FALSE,
                  related_KMerLen = 8,
                  related_MinDist = 0.01,
                  related_MaxDistantN = 1000,
                  startCodons = c("ATG", "GTG", "TTG"),
                  ema_AlphaVal = 0.1,
                  ema_MinVal = 0.6,
                  useProt = TRUE,
                  useCons = TRUE,
                  processors = 1,
                  verbose = TRUE)
genomes_DBFile |
A SQLite connection object or a character string specifying the path to the database file. |
tblName |
Character string specifying the table where the genome sequences are located. |
central_ID |
Character string specifying which identifier corresponds to the central genome, the genome to which the proteomics data and evolutionary conservation data will be mapped. |
related_IDs |
Character vector of strings specifying identifiers that correspond to related genomes, the genomes that will be used to determine which start codons (ATG, GTG, and TTG) are evolutionarily conserved. |
protHits_Seqs |
Character vector of amino acid strings that correspond to the sequences for the proteomics hits. |
protHits_Scores |
Numeric vector of (confidence) scores for the proteomics hits. Scores cannot be negative. The default option assigns a score of one to each proteomics hit. |
strainID |
Optional character string that specifies the strain identifier that the central genome corresponds to. |
speciesName |
Optional character string that specifies the name of the species that the central genome corresponds to. |
protHits_Threshold |
Optional number that specifies what percent of the lowest scoring proteomics hits should be dropped. Must be a non-negative integer less than 100. |
protHits_IsNTerm |
Logical describing whether or not the proteomics hits come from N-terminal proteomics. Default value is false. |
related_KMerLen |
The k-mer length to be used when measuring distances between the central genome and related genomes. Default value is 8. Recommended to use the default value. |
related_MinDist |
The minimum fractional distance required for a related genome to be used in finding evolutionary conservation. Used to prevent the inclusion of related genomes that are too similar to the central genome. Default value is 0.01. Recommended to use the default value. |
related_MaxDistantN |
The maximum number of related genomes to use in finding evolutionary conservation after the related genomes have been sorted from most distantly related to most closely related in relation to the central genome. Default value is 1000. |
startCodons |
A character vector consisting of three-letter DNA strings to use as the start codons when finding evolutionarily conserved starts. |
ema_AlphaVal |
The alpha value to use when calculating the exponential moving average over an alignment derived from a synteny map. Default value is 0.1. Recommended to use the default value. |
ema_MinVal |
The minimum exponential moving average value required for an alignment position to be incorporated into the conservation vectors. Default value is 0.6. Recommended to use the default value. |
useProt |
Logical indicating whether or not proteomics evidence should be mapped to the genome. Default value is true. Cannot be false if useCons is false. |
useCons |
Logical indicating whether or not evolutionary conservation evidence should be mapped to the genome. Default value is true. Cannot be false if useProt is false. |
processors |
Number describing how many processors to use with DECIPHER functions. Should be either a positive integer that gives the number of processors to use or NULL to detect and use all available processors. |
verbose |
Logical indicating whether or not to display progress and status messages. |
MapAssessmentData maps the given data (either proteomics data, evolutionary conservation data, or both) to the given central genome and stores those mappings in the object returned by the function. The returned object can then be used to assess the quality of genes predicted for that same central genome.
All genomes used inside this function, including the central genome, must be inside the specified table of the specified database. If the central genome is not found, the function returns an error. Please see the Using AssessORF vignette for details on how to populate a database with genomic sequences.
Information on the proteomics hits is primarily given by protHits_Seqs and protHits_Scores. The sequences (protHits_Seqs) are mapped to the six-frame translations of the central genome, and the scores (protHits_Scores) are used in thresholding and plotting the proteomics hits. protHits_Scores can be a single number, in which case that number is used as the score for all proteomics hits. Otherwise, protHits_Scores must be the same length as protHits_Seqs.
Only proteomics hits with a score greater than the percentile value that corresponds to protHits_Threshold are kept; the rest are dropped. If all the proteomics hits have the same score or if protHits_Threshold is zero, no thresholding occurs and no hits are dropped.
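The thresholding rule described above can be sketched in base R. This is only an illustration, not the package's code: the helper name thresholdHits is hypothetical, and the use of R's default (type 7) sample quantile for the percentile cutoff is an assumption.

```r
## Illustrative helper (hypothetical name; not part of AssessORF).
thresholdHits <- function(seqs, scores = 1, threshold = 0) {
  ## A single score is recycled to every hit, as described above.
  if (length(scores) == 1L) scores <- rep(scores, length(seqs))
  stopifnot(length(scores) == length(seqs),
            all(scores >= 0),
            threshold >= 0, threshold < 100)

  ## No thresholding when the threshold is zero or all scores are equal.
  if (threshold == 0 || length(unique(scores)) == 1L) {
    return(list(seqs = seqs, scores = scores))
  }

  ## Assumption: the cutoff is the default (type 7) sample quantile.
  cutoff <- quantile(scores, probs = threshold / 100, names = FALSE)

  ## Keep only hits scoring strictly above the cutoff.
  keep <- scores > cutoff
  list(seqs = seqs[keep], scores = scores[keep])
}

hits <- thresholdHits(c("MKT", "LLA", "GQR", "PEV", "TTS"),
                      scores = c(10, 20, 30, 40, 50),
                      threshold = 25)
hits$seqs
## "GQR" "PEV" "TTS"
```

With these toy scores the 25th percentile is 20, so the two lowest-scoring hits are dropped. Because the comparison is strict, hits that tie with the cutoff are dropped as well.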
Please note that the logical parameter protHits_IsNTerm has no effect on how the proteomics evidence is mapped to the central genome, but it can be used to affect how genes are assessed and categorized in AssessGenes. The NTermProteomics item in the returned object is set to the value of protHits_IsNTerm (TRUE or FALSE). Users then have the option of requiring that AssessGenes specifically perform N-terminal proteomics assessment when categorizing genes via the useNTermProt parameter to the AssessGenes function. To summarize, the protHits_IsNTerm parameter in the MapAssessmentData function and the useNTermProt parameter in the AssessGenes function must both be set to TRUE in order to perform N-terminal proteomics assessment. See AssessGenes for more details.
Evolutionarily conserved starts and conserved stops are found by first measuring how far the related genomes are from the central genome using k-mer frequencies. Next, synteny is mapped between the central genome and each of the most distant related genomes, and alignments are built from those synteny maps. An exponential moving average (EMA) is calculated over each alignment (based on whether the central genome is identical to the related genome at that position) to filter out areas of poor alignment. The synteny maps and filtered alignments provide information on how often each position in the central genome is covered by syntenic matches to related genomes (coverage), how often those positions correspond to start codons in both genomes (start codon conservation), and how often those positions correspond to stop codons in related genomes (stop codon conservation). A ratio of conservation to coverage is used in downstream functions to measure the strength of both conserved starts and conserved stops.
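The EMA filtering step can be illustrated with a small base R sketch. The helper name emaFilter is hypothetical, and the starting value of 1, the single forward pass, and the use of >= against the minimum are all assumptions made for illustration; the package's internal calculation may differ.

```r
## Illustrative helper (hypothetical name; not part of AssessORF).
## isMatch: 1 where the central and related genomes are identical at an
## alignment position, 0 otherwise.
emaFilter <- function(isMatch, alpha = 0.1, minVal = 0.6, init = 1) {
  ema <- numeric(length(isMatch))
  prev <- init  # assumption: the average starts at 1
  for (i in seq_along(isMatch)) {
    ## Standard EMA update: weight alpha on the new observation.
    prev <- alpha * isMatch[i] + (1 - alpha) * prev
    ema[i] <- prev
  }
  ## Positions whose EMA meets the minimum are retained.
  ema >= minVal
}

## Five matches, ten mismatches, then five matches.
isMatch <- c(rep(1, 5), rep(0, 10), rep(1, 5))
keep <- emaFilter(isMatch)
which(keep)
## 1 2 3 4 5 6 7 8 9 20
```

In this toy alignment the average tolerates the first four mismatches, drops below 0.6 at the fifth, and only recovers after five consecutive matches. A smaller alpha makes the filter slower both to reject and to re-accept a region.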
Related genomes should be from species that are closely related to the given strain. related_IDs specifies the identifiers for the sequences of the related genomes inside the database. A related genome identifier (each element of related_IDs) is considered invalid and is not used when finding evolutionary conservation if it is not found in the database. Please note that the function will only raise an error when none of the related genomes are found. If there are fewer valid related genomes in the sequence database than the value of related_MaxDistantN, all valid related genomes will be used in finding evolutionary conservation.
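How related_MinDist and related_MaxDistantN interact can be sketched as follows. The helper name selectRelated and the toy distances are hypothetical, and treating the minimum distance as inclusive (>=) is an assumption.

```r
## Illustrative helper (hypothetical name; not part of AssessORF).
## dists: named numeric vector of distances from the central genome,
## with names giving the related genome identifiers.
selectRelated <- function(dists, minDist = 0.01, maxDistantN = 1000) {
  ## Drop genomes that are too similar to the central genome.
  eligible <- dists[dists >= minDist]
  ## Sort from most distantly related to most closely related.
  ordered <- sort(eligible, decreasing = TRUE)
  ## Take at most maxDistantN genomes; head() returns everything
  ## when fewer genomes are available.
  names(head(ordered, maxDistantN))
}

dists <- c("2" = 0.30, "3" = 0.005, "4" = 0.12, "5" = 0.45, "6" = 0.02)
selectRelated(dists, maxDistantN = 3)
## "5" "2" "4"
```

Genome "3" is excluded for being closer than the minimum distance. With the default maxDistantN of 1000, all four remaining genomes would be used.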
The logical flag useProt is used to indicate whether or not proteomics evidence has been provided and should be mapped to the genome. Error checking will not occur for any arguments that involve proteomics if it is false.
The logical flag useCons is used to indicate whether or not evolutionary conservation evidence has been provided and should be mapped to the genome. Error checking will not occur for any arguments that involve evolutionary conservation if it is false.
An object of class Assessment and subclass DataMap
## Example showing the minimum number of arguments that need to be specified
## to map both proteomics and evolutionary conservation data:
## Not run:
myMapObj <- MapAssessmentData(myDBFile, central_ID = "1",
                              related_IDs = as.character(2:1001),
                              protHits_Seqs = myProtSeqs)
## End(Not run)
## Runnable example that uses evolutionary conservation data only:
## Human adenovirus 1 is the strain of interest, and the set of Adenoviridae
## genomes will serve as the set of related genomes. The central genome, the
## genome of human adenovirus 1, is at identifier 1. The related genomes
## are at identifiers 2 - 13.
myMapObj <- MapAssessmentData(system.file("extdata",
                                          "Adenoviridae.sqlite",
                                          package = "AssessORF"),
                              central_ID = "1",
                              related_IDs = as.character(2:13),
                              speciesName = "Human adenovirus 1",
                              useProt = FALSE)