MapAssessmentData: Map Evidence to a Genome


View source: R/MapAssessmentData.R

Description

Maps proteomics hits and evolutionarily conserved starts to a central genome

Usage

MapAssessmentData(genomes_DBFile,
                  tblName = "Seqs",
                  central_ID,
                  related_IDs,
                  protHits_Seqs,
                  protHits_Scores = rep.int(1, length(protHits_Seqs)),
                  strainID = "",
                  speciesName = "",
                  protHits_Threshold = 0,
                  protHits_IsNTerm = FALSE,
                  related_KMerLen = 8,
                  related_MinDist = 0.01,
                  related_MaxDistantN = 1000,
                  startCodons = c("ATG", "GTG", "TTG"),
                  ema_AlphaVal = 0.1,
                  ema_MinVal = 0.6,
                  useProt = TRUE,
                  useCons = TRUE,
                  processors = 1,
                  verbose = TRUE)

Arguments

genomes_DBFile

A SQLite connection object or a character string specifying the path to the database file.

tblName

Character string specifying the table where the genome sequences are located.

central_ID

Character string specifying which identifier corresponds to the central genome, the genome to which the proteomics data and evolutionary conservation data will be mapped.

related_IDs

Character vector of strings specifying identifiers that correspond to related genomes, the genomes that will be used to determine which start codons (ATG, GTG, and TTG) are evolutionarily conserved.

protHits_Seqs

Character vector of amino acid strings that correspond to the sequences for the proteomics hits.

protHits_Scores

Numeric vector of (confidence) scores for the proteomics hits. Scores cannot be negative. The default option assigns a score of one to each proteomics hit.

strainID

Optional character string that specifies the strain identifier that the central genome corresponds to.

speciesName

Optional character string that specifies the name of the species that the central genome corresponds to.

protHits_Threshold

Optional number that specifies what percentage of the lowest-scoring proteomics hits should be dropped. Must be a non-negative integer less than 100.

protHits_IsNTerm

Logical describing whether or not the proteomics hits come from N-terminal proteomics. Default value is false.

related_KMerLen

The k-mer length to be used when measuring distances between the central genome and related genomes. Default value is 8. Recommended to use the default value.

related_MinDist

The minimum fractional distance required for a related genome to be used in finding evolutionary conservation. Used to prevent the inclusion of related genomes that are too similar to the central genome. Default value is 0.01. Recommended to use the default value.

related_MaxDistantN

The maximum number of related genomes to use in finding evolutionary conservation after the related genomes have been sorted from most distantly related to most closely related in relation to the central genome. Default value is 1000.

startCodons

A character vector consisting of three-letter DNA strings to use as the start codons when finding evolutionarily conserved starts.

ema_AlphaVal

The alpha value to use when calculating the exponential moving average over an alignment derived from a synteny map. Default value is 0.1. Recommended to use the default value.

ema_MinVal

The minimum exponential moving average value required for an alignment position to be incorporated into the conservation vectors. Default value is 0.6. Recommended to use the default value.

useProt

Logical indicating whether or not proteomics evidence should be mapped to the genome. Default value is true. Cannot be false if useCons is false.

useCons

Logical indicating whether or not evolutionary conservation evidence should be mapped to the genome. Default value is true. Cannot be false if useProt is false.

processors

Number describing how many processors to use with DECIPHER functions. Should be either a positive integer that gives the number of processors to use or NULL to detect and use all available processors.

verbose

Logical indicating whether or not to display progress and status messages.

Details

MapAssessmentData maps the given data (either proteomics data, evolutionary conservation data, or both) to the given central genome and stores those mappings in the object outputted by the function. The object that is outputted can then be used to assess the quality of genes predicted for that same central genome.

All genomes used inside this function, including the central genome, must be inside the specified table of the specified database. If the central genome is not found, the function returns an error. Please see the Using AssessORF vignette for details on how to populate a database with genomic sequences.

Information on the proteomics hits is primarily given by protHits_Seqs and protHits_Scores. The sequences (protHits_Seqs) are mapped to the six-frame translations of the central genome, and the scores (protHits_Scores) are used in thresholding and plotting the proteomics hits.

protHits_Scores can be a single number. In that case, that number is used as the score for all proteomics hits. Otherwise, protHits_Scores must be of the same length as protHits_Seqs.

Only proteomics hits whose score is greater than the percentile that corresponds to protHits_Threshold are kept; the rest of the hits are dropped. If all the proteomics hits have the same score or if protHits_Threshold is zero, no thresholding occurs and no hits are dropped.
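The score handling described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the helper name thresholdHits is hypothetical, and the use of quantile() with its default interpolation is an assumption about how the percentile is computed.

```r
## Hypothetical helper illustrating the thresholding rules described above.
## Assumption: the percentile is taken with stats::quantile() defaults.
thresholdHits <- function(seqs, scores, threshold) {
  ## A single score is recycled to every hit.
  if (length(scores) == 1L) scores <- rep(scores, length(seqs))
  stopifnot(length(scores) == length(seqs), all(scores >= 0))

  ## No thresholding when the threshold is zero or all scores are equal.
  if (threshold == 0 || length(unique(scores)) == 1L) return(seqs)

  ## Keep only hits scoring strictly above the percentile cutoff.
  cutoff <- quantile(scores, probs = threshold / 100, names = FALSE)
  seqs[scores > cutoff]
}

## Made-up peptide strings and scores, for illustration only.
exampleSeqs <- c("MKV", "LLA", "GHT", "PQR", "STV")
exampleScores <- c(10, 20, 30, 40, 50)
keptSeqs <- thresholdHits(exampleSeqs, exampleScores, threshold = 40)
```

With these made-up scores the 40th percentile falls between 20 and 30, so the two lowest-scoring hits are dropped and the remaining three are kept.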

Please note that the logical parameter protHits_IsNTerm has no effect on how the proteomics evidence is mapped to the central genome, but it can affect how genes are assessed and categorized in AssessGenes. The NTermProteomics item in the outputted object is set to the value of protHits_IsNTerm (TRUE or FALSE). Users then have the option of requiring that AssessGenes specifically perform N-terminal proteomics assessment when categorizing genes via the useNTermProt parameter of the AssessGenes function. To summarize, the protHits_IsNTerm parameter of the MapAssessmentData function and the useNTermProt parameter of the AssessGenes function must both be set to TRUE in order to perform N-terminal proteomics assessment. See AssessGenes for more details.

Evolutionarily conserved starts and conserved stops are found by first measuring how far the related genomes are from the central genome using k-mer frequencies. Next, synteny is mapped between the central genome and each of the most distant related genomes, and alignments are built from those synteny maps. An exponential moving average (EMA) is calculated over each alignment (based on whether the central genome is identical to the related genome at each position) to filter out areas of poor alignment. The synteny maps and filtered alignments provide information on how often each position in the central genome is covered by syntenic matches to related genomes (coverage), how often those positions correspond to start codons in both genomes (start codon conservation), and how often those positions correspond to stop codons in related genomes (stop codon conservation). A ratio of conservation to coverage is used in downstream functions to measure the strength of both conserved starts and conserved stops.
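The EMA filtering step can be illustrated with a small base R sketch. It is a minimal example under assumptions: the helper name emaFilter is hypothetical, and the starting value and scan direction are guesses rather than a description of what the package does internally. Only the roles of alpha (ema_AlphaVal) and the minimum value (ema_MinVal) come from the text above.

```r
## Hypothetical helper: EMA over a 0/1 identity vector for one alignment.
## Assumptions: left-to-right scan, EMA initialised at `start`.
emaFilter <- function(isIdentical, alpha = 0.1, minVal = 0.6, start = 1) {
  ema <- numeric(length(isIdentical))
  prev <- start
  for (i in seq_along(isIdentical)) {
    ## Standard EMA update: weight alpha on the new observation.
    prev <- alpha * isIdentical[i] + (1 - alpha) * prev
    ema[i] <- prev
  }
  ## Positions whose EMA is at least minVal are retained.
  list(ema = ema, keep = ema >= minVal)
}

## Made-up identity vector (1 = central and related genome agree).
## alpha = 0.5 is used here only to keep the arithmetic easy to follow.
emaRes <- emaFilter(c(1, 1, 0, 0, 1), alpha = 0.5, minVal = 0.6)
```

In this toy run the EMA values are 1, 1, 0.5, 0.25, and 0.625, so the third and fourth positions fall below the 0.6 cutoff and would be excluded.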

Related genomes should be from species that are closely related to the given strain. related_IDs specifies the identifiers for the sequences of the related genomes inside the database. A related genome identifier (each element of related_IDs) is considered invalid and is not used when finding evolutionary conservation if it is not found in the database. Please note that the function raises an error only when none of the related genomes are found.
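The identifier check described above amounts to the following logic, shown here as a base R sketch. The helper name validateRelated and the dbIDs argument (the identifiers present in the database table) are hypothetical and used only for illustration.

```r
## Hypothetical helper: drop identifiers missing from the database and
## fail only when nothing usable remains.
validateRelated <- function(related_IDs, dbIDs) {
  valid <- related_IDs[related_IDs %in% dbIDs]
  if (length(valid) == 0L) {
    stop("None of the related genomes were found in the database.")
  }
  valid
}

## Identifier "99" is absent from this made-up table, so it is ignored.
validIDs <- validateRelated(c("2", "3", "99"), dbIDs = as.character(1:13))
```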

If there are fewer valid related genomes in the sequence database than the value of related_MaxDistantN, all valid related genomes will be used in finding evolutionary conservation.
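How related_MinDist and related_MaxDistantN interact can be sketched in base R, given a named vector of distances from the central genome. The helper name selectRelated is hypothetical, the distances are made up, and treating the minimum distance as an inclusive bound is an assumption.

```r
## Hypothetical helper: choose which related genomes to use.
## Assumption: a genome is eligible when its distance is >= minDist.
selectRelated <- function(dists, minDist = 0.01, maxDistantN = 1000) {
  eligible <- dists[dists >= minDist]
  ## Sort from most distantly related to most closely related.
  ordered <- sort(eligible, decreasing = TRUE)
  ## head() returns everything when fewer than maxDistantN are available.
  names(head(ordered, maxDistantN))
}

## Made-up distances keyed by genome identifier.
exampleDists <- c("2" = 0.005, "3" = 0.20, "4" = 0.08, "5" = 0.35)
topTwo <- selectRelated(exampleDists, maxDistantN = 2)
allValid <- selectRelated(exampleDists)
```

Here genome "2" is excluded for being too similar to the central genome, topTwo holds the two most distant genomes, and allValid holds all three eligible genomes because fewer than 1000 are available.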

The logical flag useProt is used to indicate whether or not proteomics evidence has been provided and should be mapped to the genome. Error checking will not occur for any arguments that involve proteomics if it is false.

The logical flag useCons is used to indicate whether or not evolutionary conservation evidence has been provided and should be mapped to the genome. Error checking will not occur for any arguments that involve evolutionary conservation if it is false.

Value

An object of class Assessment and subclass DataMap

See Also

Assessment-class

Examples

## Example showing the minimum number of arguments that need to be specified
## to map both proteomics and evolutionary conservation data:

## Not run: 
myMapObj <- MapAssessmentData(myDBFile, central_ID = "1",
                              related_IDs = as.character(2:1001),
                              protHits_Seqs = myProtSeqs)

## End(Not run)


## Runnable example that uses evolutionary conservation data only:
## Human adenovirus 1 is the strain of interest, and the set of Adenoviridae
## genomes will serve as the set of genomes. The central genome, that is,
## the genome of human adenovirus 1, is at identifier 1. The related genomes
## are at identifiers 2 - 13.

myMapObj <- MapAssessmentData(system.file("extdata",
                                          "Adenoviridae.sqlite",
                                          package = "AssessORF"),
                              central_ID = "1",
                              related_IDs = as.character(2:13),
                              speciesName = "Human adenovirus 1",
                              useProt = FALSE)

AssessORF documentation built on Nov. 8, 2020, 4:52 p.m.