AlignProfiles: Align Two Sets of Aligned Sequences
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description Usage Arguments Details Value Author(s) References See Also Examples

Aligns two sets of one or more aligned sequences by first generating representative profiles, then aligning the profiles with dynamic programming, and finally merging the two aligned sequence sets.

AlignProfiles(pattern,
              subject,
              p.weight = 1,
              s.weight = 1,
              p.struct = NULL,
              s.struct = NULL,
              perfectMatch = 5,
              misMatch = 0,
              gapOpening = -13,
              gapExtension = -1,
              gapPower = -0.5,
              terminalGap = -5,
              restrict = c(-1000, 2, 10),
              anchor = 0.7,
              normPower = c(1, 0),
              substitutionMatrix = NULL,
              structureMatrix = NULL,
              processors = 1)

`pattern`	An `AAStringSet`, `DNAStringSet`, or `RNAStringSet` object of aligned sequences to use as the pattern.
`subject`	A `XStringSet` object of aligned sequences to use as the subject. Must match the type of the pattern.
`p.weight`	A numeric vector of weights for each sequence in the pattern to use in generating a profile, or a single number implying equal weights.
`s.weight`	A numeric vector of weights for each sequence in the subject to use in generating a profile, or a single number implying equal weights.
`p.struct`	Either `NULL` (the default), a matrix, or a list of matrices with one list element per sequence in the pattern. (See details section below.)
`s.struct`	Either `NULL` (the default), a matrix, or a list of matrices with one list element per sequence in the subject. (See details section below.)
`perfectMatch`	Numeric giving the reward for aligning two matching nucleotides in the alignment. Only applicable for `DNAStringSet` or `RNAStringSet` inputs.
`misMatch`	Numeric giving the cost for aligning two mismatched nucleotides in the alignment. Only applicable for `DNAStringSet` or `RNAStringSet` inputs.
`gapOpening`	Numeric giving the cost for opening a gap in the alignment.
`gapExtension`	Numeric giving the cost for extending an open gap in the alignment.
`gapPower`	Numeric specifying the exponent to use in the gap cost function. (See details section below.)
`terminalGap`	Numeric giving the cost for allowing leading and trailing gaps ("-" or "." characters) in the alignment. Either two numbers, the first for leading gaps and the second for trailing gaps, or a single number for both.
`restrict`	Numeric vector of length three controlling the degree of restriction around ridge lines in the dynamic programming matrix. The first element determines the span of the region around a ridge that is considered during alignment. The default (`-1000`) will align most inputs that can reasonably be globally aligned without any loss in accuracy. Input sequences with high similarity could be more restricted (e.g., `-500`), whereas a `pattern` and `subject` with little overlap could be less restricted (e.g., `-10000`). The second element sets the minimum slope to either side of a ridge that is required to allow restriction at any point. The third element sets the minimum duration of the ridge required to begin restricting the matrix around the ridge. The duration of the ridge is defined as the number of consecutive positions meeting the first two conditions for restriction. (See details section below.)
`anchor`	Numeric giving the fraction of sequences with identical k-mers required to become an anchor point, or `NA` to not use anchors. Alternatively, a matrix specifying anchor regions. (See details section below.)
`normPower`	Numeric giving the exponent that controls the degree of normalization applied to scores by column occupancy. If two numerics are provided, the first controls the normalization power of terminal gaps, while the second controls that of internal gaps. A `normPower` of `0` does not normalize the scores, which results in all columns of the profiles being weighted equally, and is the optimal value for aligning fragmentary sequences. A `normPower` of `1` normalizes the score for aligning two positions by their column occupancy (1 - fraction of gaps). A `normPower` greater than `1` more strongly discourages aligning with “gappy” regions of the alignment. (See details section below.)
`substitutionMatrix`	Either a substitution matrix representing the substitution scores for an alignment or the name of the amino acid substitution matrix to use in alignment. The latter may be one of the following: “BLOSUM45”, “BLOSUM50”, “BLOSUM62”, “BLOSUM80”, “BLOSUM100”, “PAM30”, “PAM40”, “PAM70”, “PAM120”, “PAM250”, or “MIQS”. The default (NULL) will use the `perfectMatch` and `misMatch` penalties for DNA/RNA or `PFASUM50` for AA. (See examples section below.)
`structureMatrix`	A structure matrix if `p.struct` and `s.struct` are supplied, or `NULL` otherwise.
`processors`	The number of processors to use, or `NULL` to automatically detect and use all available processors.

Profiles are aligned using dynamic programming, a variation of the Needleman-Wunsch algorithm for global alignment. The dynamic programming method requires order N*M time and memory space where N and M are the width of the pattern and subject. This method works by filling in a matrix of the possible “alignment space” by considering all matches, insertions, and deletions between two sequence profiles. The highest scoring alignment is then used to add gaps to each of the input sequence sets.

Heuristics can be useful to improve performance on long input sequences. The restrict parameter can be used to dynamically constrain the possible “alignment space” to only paths that will likely include the final alignment, which in the best case can improve the speed from quadratic time to nearly linear time. The degree of restriction is important, and the default value of restrict is reasonable in the vast majority of cases. It is also possible to prevent restriction by setting restrict to such extreme values that these requirements will never be met (e.g., c(-1e10, 1e10, 1e10)).

The argument anchor can be used to split the global alignment into multiple sub-alignments. This can greatly decrease the memory requirement for long sequences when appropriate anchor points can be found. Anchors are 15-mer (for DNA/RNA) or 7-mer (for AA) subsequences that are shared between at least anchor fraction of pattern(s) and subject(s). Anchored ranges are extended along the length of each sequence in a manner designed to split the alignment into sub-alignments that can be separately solved. For most input sequences, the default anchoring has no effect on accuracy, but anchoring can be disabled by setting anchor=NA.

Alternatively, anchor can be a matrix with 4 rows and one column per anchor. The first two rows correspond to the anchor start and end positions in the pattern sequence(s), and the second two rows are the equivalent anchor region in the subject sequence(s). Anchors specified in this manner must be strictly increasing (non-overlapping) in both sequences, and have an anchor width of at least two positions. Note that the anchors do not have to be equal length, in which case the anchor regions will also be aligned. Manually splitting the alignment into more subtasks can sometimes make it more efficient, but typically automatic anchoring is effective.

The argument normPower determines how the distribution of information is treated during alignment. Higher values of normPower encourage alignment between columns with higher occupancy (1 - fraction of gaps), and de-emphasize the alignment of columns containing many gaps. A normPower of 0 will treat all columns equally regardless of occupancy, which can be useful when the pattern or subject contain many incomplete (fragment) sequences. For example, normPower should be set to 0 when aligning many short reads to a longer reference sequence.

The arguments p.struct and s.struct may be used to provide secondary structure probabilities in the form of a list containing matrices or a single matrix. If the input is a list, then each list element must contain a matrix with dimensions q*w, where q is the number of possible secondary structure states, and w is the width of the unaligned pattern sequence. Values in each matrix represent the probability of the given state at that position in the sequence. Alternatively, a single matrix can be used as input if w is the width of the entire pattern or subject alignment. A structureMatrix must be supplied along with the structures. The functions PredictHEC and PredictDBN can be used to predict secondary structure probabilities in the format required by AlignProfiles (for amino acid and RNA sequences, respectively).

The gap cost function is based on the observation that gap lengths are best approximated by a Zipfian distribution (Chang & Benner, 2004). The cost of inserting a gap of length L is equal to: gapOpening + gapExtension*sum(seq_len(L - 1)^gapPower) when L > 1, and gapOpen when L = 1. This function effectively penalizes shorter gaps significantly more than longer gaps when gapPower < 0, and is equivalent to the affine gap penalty when gapPower is 0.

An XStringSet of aligned sequences.

Erik Wright eswright@pitt.edu

Chang, M. S. S., & Benner, S. A. (2004). Empirical Analysis of Protein Insertions and Deletions Determining Parameters for the Correct Placement of Gaps in Protein Sequence Alignments. Journal of Molecular Biology, 341(2), 617-631.

Needleman, S., & Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.

Wright, E. S. (2015). DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinformatics, 16, 322. http://doi.org/10.1186/s12859-015-0749-z

Wright, E. S. (2020). RNAconTest: comparing tools for noncoding RNA multiple sequence alignment based on structural consistency. RNA 2020, 26, 531-540.

Yu, Y.-K., et al. (2015). Log-odds sequence logos. Bioinformatics, 31(3), 324-331. http://doi.org/10.1093/bioinformatics/btu634

AlignDB, AlignSeqs, AlignSynteny, AlignTranslation, PFASUM, MIQS

# align two sets of sequences
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna1 <- SearchDB(db, remove="common", limit=100) # the first 100 sequences
dna2 <- SearchDB(db, remove="common", limit="100,100") # the rest
alignedDNA <- AlignProfiles(dna1, dna2)
BrowseSeqs(alignedDNA, highlight=1)

# specify a DNA substitution matrix
subMatrix <- matrix(0,
                    nrow=4, ncol=4,
                    dimnames=list(DNA_BASES, DNA_BASES))
diag(subMatrix) <- 5 # perfectMatch
alignedDNA.defaultSubM <- AlignProfiles(dna1, dna2, substitutionMatrix=subMatrix)
all(alignedDNA.defaultSubM==alignedDNA)

# specify a different DNA substitution matrix
subMatrix2 <- matrix(c(12, 3, 5, 3, 3, 12, 3, 6, 5, 3, 11, 3, 3, 6, 3, 9),
                    nrow=4, ncol=4,
                    dimnames=list(DNA_BASES, DNA_BASES))
alignedDNA.alterSubM <- AlignProfiles(dna1, dna2, substitutionMatrix=subMatrix2)
all(alignedDNA.alterSubM==alignedDNA)

# anchors are found automatically by default, but it is also
# possible to specify anchor regions between the sequences
anchors <- matrix(c(774, 788, 752, 766), nrow=4)
anchors
subseq(dna1, anchors[1, 1], anchors[2, 1])
subseq(dna2, anchors[3, 1], anchors[4, 1])
alignedDNA2 <- AlignProfiles(dna1, dna2, anchor=anchors)

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.

DECIPHER index

Package overview Classify Sequences Design Group-Specific FISH Probes Design Group-Specific Primers Design Microarray Probes Design Primers That Yield Group-Specific Signatures Finding Chimeric Sequences Getting Started DECIPHERing The Art of Multiple Sequence Alignment in R The Magic of Gene Finding

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

DECIPHER
Tools for curating, analyzing, and manipulating biological sequences

AlignProfiles: Align Two Sets of Aligned Sequences
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Related to AlignProfiles in DECIPHER...

R Package Documentation

Browse R Packages

We want your feedback!

DECIPHER Tools for curating, analyzing, and manipulating biological sequences

AlignProfiles: Align Two Sets of Aligned Sequences In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Related to AlignProfiles in DECIPHER...

R Package Documentation

Browse R Packages

We want your feedback!

DECIPHER
Tools for curating, analyzing, and manipulating biological sequences

AlignProfiles: Align Two Sets of Aligned Sequences
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences