Description Usage Arguments Details Value Author(s) References See Also Examples
Predicts a consensus RNA secondary structure from a multiple sequence alignment using mutual information.
1 2 3 4 5 6 7 8 9 10 11 12 | PredictDBN(myXStringSet,
type = "states",
minOccupancy = 0.5,
impact = c(1, 1.2, 0.4, -1),
avgProdCorr = 1,
slope = 2,
shift = 1.3,
threshold = 0.7,
pseudoknots = 1,
weight = NA,
processors = 1,
verbose = TRUE)
|
myXStringSet |
A |
type |
Character string indicating the type of results desired. This should be (an unambiguous abbreviation of) one of |
minOccupancy |
Numeric specifying the minimum occupancy (1 - fraction of gaps) required to include a column of the alignment in the prediction. |
impact |
A vector with four elements giving the weights of A/U, G/C, G/U, and other pairings, respectively. The last element of |
avgProdCorr |
Numeric specifying the weight of the average product correction (APC) term, as described in Buslje et al. (2009). |
slope |
Numeric giving the slope of the sigmoid used to convert mutual information values to scores ranging from zero to one. |
shift |
Numeric giving the relative shift of the sigmoid used to convert mutual information values to scores ranging from zero to one. |
threshold |
Numeric specifying the score threshold at which to consider positions for pairing. Only applicable if |
pseudoknots |
Integer indicating the maximum order of pseudoknots that are acceptable. A value of |
weight |
Either a numeric vector of weights for each sequence, a single number implying equal weights, or |
processors |
The number of processors to use, or |
verbose |
Logical indicating whether to display progress. |
PredictDBN
employs an extension of the method described by Freyhult et al. (2005) for determining a consensus RNA secondary structure. It uses the mutual information (H) measure to find covarying positions in a multiple sequence alignment. The original method is modified by the addition of different weights for each type of base pairing and each input sequence. The formula for mutual information between positions i and j then becomes:
H(i,j) = ∑_{XY \in bp}^{} ≤ft( impact(XY) \cdot f_{i,j}(XY) \cdot \log_2 ≤ft( \frac{f_{i,j}(XY)}{f_{i}(X) \cdot f_{j}(Y)} \right) \right)
where, bp denotes the base pairings A/U, C/G, and G/U; impact
is their weight; f is the frequency of single bases or pairs weighted by the corresponding weight
of each sequence.
A penalty is then added for bases that are inconsistent with pairing:
H_{mod}(i,j) = H(i,j) + ∑_{XY \notin bp}^{} \Big( impact(XY) \cdot f_{i,j}(XY) \Big)
Next an average product correction (Buslje et al., 2009) is applied to the matrix H:
H_{APC}(i,j) = H_{mod}(i,j) - avgProdCorr \cdot \frac{\overline{H_{mod}(i,.)} \cdot \overline{H_{mod}(.,j)}}{\overline{H_{mod}(.,.)}}
The mutual information values are then rescaled between 0
and 1
by applying a sigmoidal transformation, which is controlled by shift
and slope
:
H_{final}(i,j) = ≤ft( 1 + \exp ≤ft( slope \cdot log_e ≤ft( \frac{H_{APC}(i,j)}{shift \cdot H_{APC}[n]} \right) \right) \right)^{-1}
where, n is the number of positions having minOccupancy
divided by two (i.e., the maximum possible number of paired positions) and H_{APC}[n] denotes the n^{th} highest value in the matrix H_{APC}.
If type
is "states"
or "pairs"
, the secondary structure is determined using a variant of the Nussinov algorithm similar to that described by Venkatachalam et al. (2014). Pairings with a score below threshold
are not considered during the traceback. If psuedoknots
is greater than 0
, paired positions are removed from consideration and the method is applied again to find pseudoknots.
In practice the secondary structure prediction is most accurate when the input alignment is of high quality, contains a wide diversity of sequences, the number of sequences is large, no regions are completely conserved across all sequences, and most of the sequences span the entire alignment (i.e., there are few partial/incomplete sequences).
If type
is "states"
(the default), then the output is a character vector with the predicted secondary structure assignment for each position in myXStringSet
. Standard dot-bracket notation (DBN) is used, where “.” signifies an unpaired position, “(” and “)” a paired position, and successive “[]”, “{}”, and “<>” indicate increasing order pseudoknots. Columns below minOccupancy
are denoted by the “-” character to indicate that they contained too many gaps to be included in the consensus structure.
If type
is "pairs"
, then a matrix is returned with one row for each base pairing and three columns giving the positions of the paired bases and their pseudoknot
order.
If type
is "evidence"
, then a matrix is returned with one row for each base pairing and three columns giving the positions of the paired bases and their respective scores (greater than or equal to threshold
). This differs from type
"pairs"
in that "evidence"
does not perform a traceback. Therefore, it is possible to have conflicting evidence where a single base has evidence for pairing with multiple other bases.
If type
is "scores"
, then a matrix of three rows is returned, where the values in a column represent the maximum score for a state in each position. Columns sum to 1
if the position was above minOccupancy
and 0
otherwise.
If type
is "structures"
, then the output is a list with one element for each sequence in myXStringSet
. Each list element contains a matrix of dimension 3 (each state) by the number of nucleotides in the sequence. Columns of the matrix sum to zero where the nucleotide was located in a position that was below minOccupancy
. Otherwise, positions are considered paired if they are consistent with pairing (i.e., A/U, C/G, or G/U) in the consensus secondary structure.
If type
is "search"
then an attempt is made to find additional secondary structure beyond positions exhibiting covariation. First, anchors are identified as pairs of covarying positions with their score above threshold
. Next, the regions between anchors are searched for previously unidentified stem loops. Finally, any helices are assigned a score according to their length, i.e. one minus the probability of finding that many consecutive pairs within the anchor boundaries by chance. Hence, output type
"search"
will find secondary structure outside of the consensus structure shared by most sequences, and can identify secondary structure in conserved alignment regions.
Erik Wright eswright@pitt.edu
Buslje, C., et al. (2009). Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics, 25(9), 1125-1131.
Freyhult, E., et al. (2005). Predicting RNA Structure Using Mutual Information. Applied Bioinformatics, 4(1), 53-59.
Venkatachalam, B., et al. (2014). Faster algorithms for RNA-folding using the Four-Russians method. Algorithms for Molecular Biology : AMB, 9(1), 1-12.
Wright, E. S. (2020). RNAconTest: comparing tools for noncoding RNA multiple sequence alignment based on structural consistency. RNA 2020, 26, 531-540.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | # load the example non-coding RNA sequences
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
rna <- SearchDB(db, type="RNAStringSet")
# predict the secondary structure in dot-bracket notation (dbn)
p <- PredictDBN(rna, "states") # predict the secondary structure in dbn
p # pairs are denoted by (), and (optionally) pseudoknots by [], {}, and <>
# convert the dot-bracket notation into pairs of positions within the alignment
p <- PredictDBN(rna, "pairs") # paired positions in the alignment
head(p) # matrix giving the pairs and their pseudoknot order (when > 0)
# plot an arc diagram with the base pairings
plot(NA, xlim=c(0, 1), ylim=c(0, 1),
xaxs="i", yaxs="i",
xlab="Alignment position", ylab="",
bty="n", xaxt="n", yaxt="n")
ticks <- pretty(seq_len(width(rna)[1]))
axis(1, ticks/width(rna)[1], ticks)
rs <- c(seq(0, pi, len=100), NA)
r <- (p[, 2] - p[, 1] + 1)/width(rna)[1]/2
r <- rep(r, each=101)
x <- (p[, 1] + p[, 2])/2/width(rna)[1]
x <- rep(x, each=101) + r*cos(rs)
y <- r*sin(rs)/max(r, na.rm=TRUE)
lines(x, y, xpd=TRUE)
# show all available evidence of base pairing
p <- PredictDBN(rna, "evidence") # all pairs with scores >= threshold
head(p) # matrix giving the pairs and their scores
# determine the score at every alignment position
p <- PredictDBN(rna, "scores") # score in the alignment
p["(", 122] # score for left-pairing at alignment position 122
p[")", 260] # score for right-pairing at alignment position 260
# find the scores individually for every sequence in the alignment
p <- PredictDBN(rna, "structures") # scores per sequence
p[[1]][, 1] # the scores for the first position in the first sequence
p[[2]][, 10] # the scores for the tenth position in the second sequence
# these positional scores can be used as shades of red, green, and blue:
BrowseSeqs(rna, patterns=p) # red = unpaired, green = left-pairing, blue = right
# positions in black are not part of the consensus secondary structure
# search for additional secondary structure between the consensus pairs
p <- PredictDBN(rna, "search") # scores per sequence after searching
BrowseSeqs(rna, patterns=p) # red = unpaired, green = left-pairing, blue = right
# note that "search" identified many more pairings than "structures"
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.