Description Usage Arguments Details Value Author(s) References See Also Examples
View source: R/CorrectFrameshifts.R
Corrects the reading frame to mitigate the impact of frameshift errors caused by insertions or deletions in unaligned nucleotide sequences.
1 2 3 4 5 6 7 8 9 10 11 12 13 | CorrectFrameshifts(myXStringSet,
myAAStringSet,
type = "indels",
acceptDistance = 0.01,
rejectDistance = 0.60,
maxComparisons = 10,
gapOpening = -13,
gapExtension = -1,
frameShift = -15,
geneticCode = GENETIC_CODE,
substitutionMatrix = "PFASUM50",
verbose = TRUE,
processors = 1)
|
myXStringSet |
A |
myAAStringSet |
An |
type |
Character string indicating the type of result desired. This should be (an abbreviation of) one of |
acceptDistance |
Numeric giving the maximum distance from a reference sequence that is acceptable to skip any remaining comparisons. |
rejectDistance |
Numeric giving the maximum distance from a reference sequence that is allowed when correcting frameshifts. Sequences in |
maxComparisons |
The number of reference comparisons to make before stopping the search for a closer reference sequence. |
gapOpening |
Numeric giving the cost for opening a gap between the query and reference sequences. |
gapExtension |
Numeric giving the cost for extending an open gap between the query and reference sequences. |
frameShift |
Numeric giving the cost for shifting between frames of the query sequence. |
geneticCode |
Named character vector in the same format as |
substitutionMatrix |
Either a substitution matrix representing the substitution scores for matching two amino acids or the name of the amino acid substitution matrix. The latter may be one of the following: “BLOSUM45”, “BLOSUM50”, “BLOSUM62”, “BLOSUM80”, “BLOSUM100”, “PAM30”, “PAM40”, “PAM70”, “PAM120”, “PAM250”, “PFASUM50” (the default), or “MIQS”. |
verbose |
Logical indicating whether to display progress. |
processors |
The number of processors to use, or |
Accurate translation of protein coding sequences can be greatly disrupted by one or two nucleotide phase shifts that occasionally occur during DNA sequencing. These frameshift errors can potentially be corrected through comparison with other unshifted protein sequences. This function uses a set of reference amino acid sequences (AAStringSet
) to find and correct frameshift errors in a set of nucleotide sequences (myXStringSet
). First, three frame translation of the nucleotide sequences is performed, and the nearest reference sequence is selected. Then the optimal reading frame at each position is determined based on a variation of the Guan & Uberbacher (1996) method. Putative insertions and/or deletions (indels) are returned in the result, typically with close proximity to the true indel locations. For a comparison of this method to others, see Wang et al. (2013).
If type
is "sequences"
or "both"
, then frameshifts are corrected by adding N
's and/or removing nucleotides. Note that this changes the nucleotide sequence, and the new sequence often has minor errors because the exact location of the indel(s) cannot be determined. However, the original frameshifts that disrupted the entire downstream sequence are reduced to local perturbations. All of the returned nucleotide sequences will have a reading frame starting from the first position. This allows direct translation, and in practice works well if there is a similar reference myAAStringSet
with the correct reading frame. Hence it is more important that myAAStringSet
contain a wide variety of sequences than it is that it contain a lot of sequences.
Multiple inputs control the time required for frameshift correction. The number of sequences in the reference set (myAAStringSet
) will affect the speed of the first search for similar sequences. Assessing frameshifts in the second step requires order N*M
time, where N
and M
are the lengths of the query (myXStringSet
) and reference sequences. Two parameters control the number of assessments that are made for each sequence: (1) maxComparisons
determines the maximum number of reference sequences to compare to each query sequence, and (2) acceptDist
defines the maximum distance between a query and reference that is acceptable before continuing to the next query sequence. A lower value for maxComparisons
or a higher value for acceptDist
will accelerate frameshift correction, potentially at the expense of some accuracy.
If type
is "indels"
then the returned object is a list with the same length as myXStringSet
. Each element is a list with four components:
"insertions" |
Approximate positions of inserted nucleotides, which could be removed to correct the reading frame, or excess nucleotides at the 3'-end that make the length longer than a multiple of three. |
"deletions" |
Approximate positions of deleted nucleotides, which could be added back to correct the reading frame. |
"distance" |
The amino acid distance from the nearest reference sequence, between 0 and 1. |
"index" |
The integer index of the reference sequence that was used for frame correction, or |
Note that positions in insertions
and deletions
are sometimes repeated to indicate that the same position needs to be shifted successively more than once to correct the reading frame.
If type
is "sequences"
then the returned object is an XStringSet
of the same type as the input (myXStringSet
). Nucleotides are added or deleted as necessary to correct for frameshifts. The returned sequences all have a reading frame starting from position 1, so that they can be translated directly.
If type
is "both"
then the returned object is a list with two components: one for the "indels"
and the other for the "sequences"
.
Erik Wright eswright@pitt.edu
Guan, X., & Uberbacher, E. C. (1996). Alignments of DNA and protein sequences containing frameshift errors. Computer Applications in the Biosciences : CABIOS, 12(1), 31-40.
Wang, Q., et al. (2013). Ecological Patterns of nifH Genes in Four Terrestrial Climatic Zones Explored with Targeted Metagenomics Using FrameBot, a New Informatics Tool. mBio, 4(5), e00592-13-e00592-13.
AlignTranslation
, OrientNucleotides
, PFASUM
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
# introduce artificial indels
n_ins <- 2 # insertions per sequence
shifted <- replaceAt(dna,
lapply(width(dna),
sample,
n_ins),
sample(DNA_BASES,
n_ins,
replace=TRUE))
n_dels <- 1 # deletions per sequence
shifted <- replaceAt(shifted,
as(lapply(width(shifted),
function(x) {
IRanges(sample(x,
n_dels),
width=1)
}), "IRangesList"))
# to make frameshift correction more challenging,
# only supply 20 reference amino acid sequences
s <- sample(length(dna), 20)
x <- CorrectFrameshifts(shifted,
translate(dna[s]),
type="both")
# there was a wide range of distances
# to the nearest reference sequence
quantile(unlist(lapply(x[[1]], `[`, "distance")))
# none of the sequences were > rejectDistance
# from the nearest reference sequence
length(which(unlist(lapply(x[[1]], `[`, "index"))==0))
# the number of indels was generally correct
table(unlist(lapply(x[[1]], function(x) {
length(x$insertions)})))/length(shifted)
table(unlist(lapply(x[[1]], function(x) {
length(x$deletions)})))/length(shifted)
# align and display the translations
AA <- AlignTranslation(x$sequences,
readingFrame=1,
type="AAStringSet")
BrowseSeqs(AA)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.