AlignSeqs: Align a Set of Unaligned Sequences

Description

Performs profile-to-profile alignment of multiple unaligned sequences following a guide tree.

Usage

AlignSeqs(myXStringSet,
         guideTree = NULL,
         iterations = 2,
         refinements = 1,
         gapOpening = c(-18, -16),
         gapExtension = c(-2, -1),
         useStructures = TRUE,
         structures = NULL,
         FUN = AdjustAlignment,
         levels = c(0.9, 0.7, 0.7, 0.4, 10, 5, 5, 2),
         alphabet = AA_REDUCED[[1]],
         processors = 1,
         verbose = TRUE,
         ...)

Arguments

myXStringSet

An AAStringSet, DNAStringSet, or RNAStringSet object of unaligned sequences.

guideTree

Either NULL or a dendrogram giving the ordered tree structure in which to align profiles. If NULL then a guide tree will be automatically constructed based on the order of shared k-mers.

iterations

Number of iteration steps to perform. During each iteration step the guide tree is regenerated based on the alignment and the sequences are realigned.

refinements

Number of refinement steps to perform. During each refinement step, groups of sequences are realigned to the rest of the sequences, and the better of the two alignments (before and after realignment) is kept.

gapOpening

Single numeric giving the cost for opening a gap in the alignment, or two numbers giving the minimum and maximum costs. In the latter case the cost will be varied depending upon whether the groups of sequences being aligned are nearly identical or maximally distant.

gapExtension

Single numeric giving the cost for extending an open gap in the alignment, or two numbers giving the minimum and maximum costs. In the latter case the cost will be varied depending upon whether the groups of sequences being aligned are nearly identical or maximally distant.

useStructures

Logical indicating whether to use secondary structure predictions during alignment. If TRUE (the default), secondary structure probabilities will be automatically calculated for amino acid and RNA sequences if they are not provided (i.e., when structures is NULL).

structures

Either a list of secondary structure probabilities matching the structureMatrix, such as that output by PredictHEC or PredictDBN, or NULL to generate the structures automatically. Only applicable if myXStringSet is an AAStringSet or RNAStringSet.

FUN

A function to be applied after each profile-to-profile alignment. (See details section below.)

levels

Numeric with eight elements specifying the levels at which to trigger events. (See details section below.)

alphabet

Character vector of amino acid groupings used to reduce the 20 standard amino acids into smaller groups. Alphabet reduction helps to find more distant homologies between sequences. A non-reduced amino acid alphabet can be used by setting alphabet equal to AA_STANDARD. Only applicable if myXStringSet is an AAStringSet.

processors

The number of processors to use, or NULL to automatically detect and use all available processors.

verbose

Logical indicating whether to display progress.

...

Further arguments to be passed directly to AlignProfiles, including perfectMatch, misMatch, gapPower, terminalGap, restrict, anchor, normPower, substitutionMatrix, and structureMatrix.
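Several of the arguments above can be set together in one call. The following is a minimal sketch, not a recommended configuration: it assumes the example file 50S_ribosomal_protein_L2.fas shipped in the package's extdata directory, translates a small subset of its coding sequences, and aligns them with single (non-varying) gap costs and the non-reduced amino acid alphabet. The subset size and the specific cost values are arbitrary choices made for illustration.

```r
library(DECIPHER)

# Example coding sequences distributed with DECIPHER (assumed to be in frame)
fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)

# Translate a small subset; 20 sequences is an arbitrary size to keep this quick
aa <- translate(dna[1:20])

# Single gap costs instead of (minimum, maximum) pairs,
# and the standard 20-letter alphabet instead of AA_REDUCED[[1]]
alignedAA <- AlignSeqs(aa,
                       gapOpening = -16,
                       gapExtension = -1,
                       alphabet = AA_STANDARD,
                       processors = 1,
                       verbose = FALSE)

# All aligned sequences should share one width
unique(width(alignedAA))
```

Supplying two values for gapOpening or gapExtension, as in the defaults, lets the cost vary with the distance between the groups being aligned.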

Details

The profile-to-profile method aligns a sequence set by merging profiles along a guide tree until all the input sequences are aligned. This process has three main steps: (1) If guideTree=NULL, an initial single-linkage guide tree is constructed based on a distance matrix of shared k-mers. Alternatively, a dendrogram can be provided as the initial guideTree. (2) If iterations is greater than zero, then a UPGMA guide tree is built based on the initial alignment and the sequences are re-aligned along this tree. This process is repeated iterations times or until convergence. (3) If refinements is greater than zero, then subsets of the alignment are re-aligned to the remainder of the alignment. This process generates two alignments, the better of which is chosen based on its sum-of-pairs score. This refinement process is repeated refinements times, or until convergence.
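The three steps can be switched on one at a time through the iterations and refinements arguments. The sketch below assumes the Bacteria_175seqs.sqlite example database used in the Examples section and restricts it to an arbitrary subset of 20 sequences so that it runs quickly; the variable names are illustrative only.

```r
library(DECIPHER)

# Load the example sequences without gap characters
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna <- SearchDB(db, removeGaps="all", verbose=FALSE)
dna <- dna[1:20]  # arbitrary subset, only to keep the sketch fast

# Step 1 only: align along the initial k-mer guide tree
initial <- AlignSeqs(dna, iterations=0, refinements=0, verbose=FALSE)

# Steps 1 and 2: also rebuild the guide tree and realign (up to 2 times)
iterated <- AlignSeqs(dna, iterations=2, refinements=0, verbose=FALSE)

# Steps 1 to 3: the default settings, including one refinement pass
refined <- AlignSeqs(dna, iterations=2, refinements=1, verbose=FALSE)

# Compare the alignment widths produced at each stage
sapply(list(initial=initial, iterated=iterated, refined=refined),
       function(x) unique(width(x)))
```

Setting both arguments to zero is the fastest option and can be a reasonable first pass on large inputs, at the likely cost of some alignment quality.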

The purpose of levels is to speed up the alignment process by skipping time-consuming steps when they are unlikely to change the outcome. The first four levels control when refinements occur and when the function FUN is run on the alignment. The default levels specify that these events should happen when above 0.9 (AA; levels[1]) or 0.7 (DNA/RNA; levels[3]) average dissimilarity on the initial tree, when above 0.7 (AA; levels[2]) or 0.4 (DNA/RNA; levels[4]) average dissimilarity on the iterative tree(s), and after every tenth improvement made during refinement (levels[5]). The sixth element of levels (levels[6]) prevents FUN from being applied at any point to fewer than 5 sequences.
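Because levels is a positional vector, it can help to label its elements before changing one. The sketch below uses only base R; the labels are hypothetical names chosen here for readability and are not part of the AlignSeqs interface, which takes a plain numeric vector of length eight.

```r
# Default value of the levels argument, copied from the Usage section
default_levels <- c(0.9, 0.7, 0.7, 0.4, 10, 5, 5, 2)

# Hypothetical labels summarizing the roles described in the Details section
labelled <- setNames(default_levels,
                     c("AA_initial_tree",      # levels[1]
                       "AA_iterative_tree",    # levels[2]
                       "NT_initial_tree",      # levels[3]
                       "NT_iterative_tree",    # levels[4]
                       "refinement_interval",  # levels[5]
                       "min_seqs_for_FUN",     # levels[6]
                       "RNA_initial_length",   # levels[7]
                       "RNA_later_length"))    # levels[8]
print(labelled)

# Example modification: only apply FUN when there are at least 10 sequences
custom_levels <- default_levels
custom_levels[6] <- 10

# The unnamed vector would then be passed as, for example:
# aligned <- AlignSeqs(dna, levels = custom_levels)
```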

The FUN function is always applied just before returning the alignment so long as there are at least levels[6] sequences. The default FUN is AdjustAlignment, but FUN can be any function that takes in an XStringSet as its first argument, as well as weights, processors, and substitutionMatrix as optional arguments. For example, the default FUN could be altered to not perform any changes by setting it equal to function(x, ...) return(x), where x is an XStringSet.
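As a sketch of the identity function mentioned above, the following call disables post-processing and compares the result with the default AdjustAlignment behavior. It assumes the Bacteria_175seqs.sqlite example database and an arbitrary 20-sequence subset; whether the two results differ depends on the input.

```r
library(DECIPHER)

db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna <- SearchDB(db, removeGaps="all", verbose=FALSE)
dna <- dna[1:20]  # arbitrary subset, only to keep the sketch fast

# Identity function: accepts the optional arguments via ... and ignores them
noAdjust <- AlignSeqs(dna, FUN = function(x, ...) return(x), verbose=FALSE)

# Default behavior, where FUN is AdjustAlignment
adjusted <- AlignSeqs(dna, verbose=FALSE)

# Check whether the adjustment changed any aligned sequence
any(as.character(noAdjust) != as.character(adjusted))
```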

Secondary structures are automatically computed for amino acid and RNA sequences unless structures are provided or useStructures is FALSE. The default structureMatrix is used unless an alternative is provided. For RNA sequences, secondary structures are only computed when the total length of the initial guide tree is at least 5 (levels[7]) or the length of subsequent trees is at least 2 (levels[8]).
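The three ways of handling secondary structure for amino acid input can be sketched as follows. This assumes the example file 50S_ribosomal_protein_L2.fas from the package's extdata directory, an arbitrary subset of 20 translated sequences, and that PredictHEC with type="probabilities" returns a list in the form expected by structures under the default structureMatrix.

```r
library(DECIPHER)

# Example coding sequences distributed with DECIPHER (assumed to be in frame)
fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
aa <- translate(readDNAStringSet(fas)[1:20])

# Precompute secondary structure probabilities, one list element per sequence
hec <- PredictHEC(aa, type="probabilities")

# Supply the precomputed structures explicitly
withProvided <- AlignSeqs(aa, structures=hec, verbose=FALSE)

# Let AlignSeqs compute the structures itself (the default)
withAutomatic <- AlignSeqs(aa, verbose=FALSE)

# Ignore secondary structure entirely
withoutStructures <- AlignSeqs(aa, useStructures=FALSE, verbose=FALSE)
```

Precomputing the structures is mainly useful when the same sequences are aligned repeatedly with different settings, since the prediction does not need to be redone for each call.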

Value

An XStringSet of aligned sequences.

Author(s)

Erik Wright eswright@pitt.edu

References

Wright, E. S. (2015). DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinformatics, 16, 322. http://doi.org/10.1186/s12859-015-0749-z

Wright, E. S. (2020). RNAconTest: comparing tools for noncoding RNA multiple sequence alignment based on structural consistency. RNA, 26, 531-540.

See Also

AdjustAlignment, AlignDB, AlignProfiles, AlignSynteny, AlignTranslation, IdClusters, ReadDendrogram, StaggerAlignment

Examples

library(DECIPHER)

# load the example sequences and remove all gap characters
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna <- SearchDB(db, removeGaps="all")
alignedDNA <- AlignSeqs(dna)
BrowseSeqs(alignedDNA, highlight=1)

# use secondary structure with RNA sequences
alignedRNA <- AlignSeqs(RNAStringSet(dna))
BrowseSeqs(alignedRNA, highlight=1)

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.