CorrectFrameshifts: Corrects Frameshift Errors In Protein Coding Sequences
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description Usage Arguments Details Value Author(s) References See Also Examples

Corrects the reading frame to mitigate the impact of frameshift errors caused by insertions or deletions in unaligned nucleotide sequences.

CorrectFrameshifts(myXStringSet,
                   myAAStringSet,
                   type = "indels",
                   acceptDistance = 0.01,
                   rejectDistance = 0.60,
                   maxComparisons = 10,
                   gapOpening = -13,
                   gapExtension = -1,
                   frameShift = -15,
                   geneticCode = GENETIC_CODE,
                   substitutionMatrix = "PFASUM50",
                   verbose = TRUE,
                   processors = 1)

`myXStringSet`	A `DNAStringSet` or `RNAStringSet` of unaligned protein coding sequences in 5' to 3' orientation.
`myAAStringSet`	An `AAStringSet` of reference sequences having accurate translations.
`type`	Character string indicating the type of result desired. This should be (an abbreviation of) one of `"indels"`, `"sequences"`, or `"both"`. (See details section below.)
`acceptDistance`	Numeric giving the maximum distance from a reference sequence that is acceptable to skip any remaining comparisons.
`rejectDistance`	Numeric giving the maximum distance from a reference sequence that is allowed when correcting frameshifts. Sequences in `myXStringSet` that are greater than `rejectDistance` from the nearest reference sequence will only have their length trimmed from the 3'-end to a multiple of three nucleotides without any frameshift correction.
`maxComparisons`	The number of reference comparisons to make before stopping the search for a closer reference sequence.
`gapOpening`	Numeric giving the cost for opening a gap between the query and reference sequences.
`gapExtension`	Numeric giving the cost for extending an open gap between the query and reference sequences.
`frameShift`	Numeric giving the cost for shifting between frames of the query sequence.
`geneticCode`	Named character vector in the same format as `GENETIC_CODE` (the default), which represents the standard genetic code.
`substitutionMatrix`	Either a substitution matrix representing the substitution scores for matching two amino acids or the name of the amino acid substitution matrix. The latter may be one of the following: “BLOSUM45”, “BLOSUM50”, “BLOSUM62”, “BLOSUM80”, “BLOSUM100”, “PAM30”, “PAM40”, “PAM70”, “PAM120”, “PAM250”, “PFASUM50” (the default), or “MIQS”.
`verbose`	Logical indicating whether to display progress.
`processors`	The number of processors to use, or `NULL` to automatically detect and use all available processors.

Accurate translation of protein coding sequences can be greatly disrupted by one or two nucleotide phase shifts that occasionally occur during DNA sequencing. These frameshift errors can potentially be corrected through comparison with other unshifted protein sequences. This function uses a set of reference amino acid sequences (AAStringSet) to find and correct frameshift errors in a set of nucleotide sequences (myXStringSet). First, three frame translation of the nucleotide sequences is performed, and the nearest reference sequence is selected. Then the optimal reading frame at each position is determined based on a variation of the Guan & Uberbacher (1996) method. Putative insertions and/or deletions (indels) are returned in the result, typically with close proximity to the true indel locations. For a comparison of this method to others, see Wang et al. (2013).

If type is "sequences" or "both", then frameshifts are corrected by adding N's and/or removing nucleotides. Note that this changes the nucleotide sequence, and the new sequence often has minor errors because the exact location of the indel(s) cannot be determined. However, the original frameshifts that disrupted the entire downstream sequence are reduced to local perturbations. All of the returned nucleotide sequences will have a reading frame starting from the first position. This allows direct translation, and in practice works well if there is a similar reference myAAStringSet with the correct reading frame. Hence it is more important that myAAStringSet contain a wide variety of sequences than it is that it contain a lot of sequences.

Multiple inputs control the time required for frameshift correction. The number of sequences in the reference set (myAAStringSet) will affect the speed of the first search for similar sequences. Assessing frameshifts in the second step requires order N*M time, where N and M are the lengths of the query (myXStringSet) and reference sequences. Two parameters control the number of assessments that are made for each sequence: (1) maxComparisons determines the maximum number of reference sequences to compare to each query sequence, and (2) acceptDist defines the maximum distance between a query and reference that is acceptable before continuing to the next query sequence. A lower value for maxComparisons or a higher value for acceptDist will accelerate frameshift correction, potentially at the expense of some accuracy.

If type is "indels" then the returned object is a list with the same length as myXStringSet. Each element is a list with four components:

`"insertions"`	Approximate positions of inserted nucleotides, which could be removed to correct the reading frame, or excess nucleotides at the 3'-end that make the length longer than a multiple of three.
`"deletions"`	Approximate positions of deleted nucleotides, which could be added back to correct the reading frame.
`"distance"`	The amino acid distance from the nearest reference sequence, between 0 and 1.
`"index"`	The integer index of the reference sequence that was used for frame correction, or `0` if no reference sequence was within `rejectDistance`.

Note that positions in insertions and deletions are sometimes repeated to indicate that the same position needs to be shifted successively more than once to correct the reading frame.

If type is "sequences" then the returned object is an XStringSet of the same type as the input (myXStringSet). Nucleotides are added or deleted as necessary to correct for frameshifts. The returned sequences all have a reading frame starting from position 1, so that they can be translated directly.

If type is "both" then the returned object is a list with two components: one for the "indels" and the other for the "sequences".

Erik Wright eswright@pitt.edu

Guan, X., & Uberbacher, E. C. (1996). Alignments of DNA and protein sequences containing frameshift errors. Computer Applications in the Biosciences : CABIOS, 12(1), 31-40.

Wang, Q., et al. (2013). Ecological Patterns of nifH Genes in Four Terrestrial Climatic Zones Explored with Targeted Metagenomics Using FrameBot, a New Informatics Tool. mBio, 4(5), e00592-13-e00592-13.

AlignTranslation, OrientNucleotides, PFASUM

fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)

# introduce artificial indels
n_ins <- 2 # insertions per sequence
shifted <- replaceAt(dna,
	lapply(width(dna),
		sample,
		n_ins),
	sample(DNA_BASES,
		n_ins,
		replace=TRUE))
n_dels <- 1 # deletions per sequence
shifted <- replaceAt(shifted,
	as(lapply(width(shifted),
		function(x) {
			IRanges(sample(x,
					n_dels),
				width=1)
		}), "IRangesList"))

# to make frameshift correction more challenging,
# only supply 20 reference amino acid sequences
s <- sample(length(dna), 20)
x <- CorrectFrameshifts(shifted,
	translate(dna[s]),
	type="both")

# there was a wide range of distances
# to the nearest reference sequence
quantile(unlist(lapply(x[[1]], `[`, "distance")))

# none of the sequences were > rejectDistance
# from the nearest reference sequence
length(which(unlist(lapply(x[[1]], `[`, "index"))==0))

# the number of indels was generally correct
table(unlist(lapply(x[[1]], function(x) {
	length(x$insertions)})))/length(shifted)
table(unlist(lapply(x[[1]], function(x) {
	length(x$deletions)})))/length(shifted)

# align and display the translations
AA <- AlignTranslation(x$sequences,
	readingFrame=1,
	type="AAStringSet")
BrowseSeqs(AA)

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.

DECIPHER index

Package overview Classify Sequences Design Group-Specific FISH Probes Design Group-Specific Primers Design Microarray Probes Design Primers That Yield Group-Specific Signatures Finding Chimeric Sequences Getting Started DECIPHERing The Art of Multiple Sequence Alignment in R The Magic of Gene Finding

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

DECIPHER
Tools for curating, analyzing, and manipulating biological sequences

CorrectFrameshifts: Corrects Frameshift Errors In Protein Coding Sequences
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Related to CorrectFrameshifts in DECIPHER...

R Package Documentation

Browse R Packages

We want your feedback!

DECIPHER Tools for curating, analyzing, and manipulating biological sequences

CorrectFrameshifts: Corrects Frameshift Errors In Protein Coding Sequences In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Related to CorrectFrameshifts in DECIPHER...

R Package Documentation

Browse R Packages

We want your feedback!

DECIPHER
Tools for curating, analyzing, and manipulating biological sequences

CorrectFrameshifts: Corrects Frameshift Errors In Protein Coding Sequences
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences