Description Usage Arguments Details Value Author(s) References See Also Examples

Predicts a consensus RNA secondary structure from a multiple sequence alignment using mutual information.

1 2 3 4 5 6 7 8 9 10 11 12 | ```
PredictDBN(myXStringSet,
type = "states",
minOccupancy = 0.5,
impact = c(1, 1.2, 0.4, -1),
avgProdCorr = 1,
slope = 2,
shift = 1.3,
threshold = 0.7,
pseudoknots = 1,
weight = 1,
processors = 1,
verbose = TRUE)
``` |

`myXStringSet` |
A |

`type` |
Character string indicating the type of results desired. This should be (an unambiguous abbreviation of) one of |

`minOccupancy` |
Numeric specifying the minimum occupancy (1 - fraction of gaps) required to include a column of the alignment in the prediction. |

`impact` |
A vector with four elements giving the weights of A/U, G/C, G/U, and other pairings, respectively. The last element of |

`avgProdCorr` |
Numeric specifying the weight of the average product correction (APC) term, as described in Buslje et al. (2009). |

`slope` |
Numeric giving the slope of the sigmoid used to convert mutual information values to scores ranging from zero to one. |

`shift` |
Numeric giving the relative shift of the sigmoid used to convert mutual information values to scores ranging from zero to one. |

`threshold` |
Numeric specifying the score threshold at which to consider positions for pairing. Only applicable if |

`pseudoknots` |
Integer indicating the maximum order of pseudoknots that are acceptable. A value of |

`weight` |
A numeric vector of weights for each sequence, or a single number implying equal weights. |

`processors` |
The number of processors to use, or |

`verbose` |
Logical indicating whether to display progress. |

`PredictDBN`

employs an extension of the method described by Freyhult et al. (2005) for determining a consensus RNA secondary structure. It uses the mutual information (*H*) measure to find covarying positions in a multiple sequence alignment. The original method is modified by the addition of different weights for each type of base pairing and each input sequence. The formula for mutual information between positions *i* and *j* then becomes:

*H(i,j) = ∑_{XY \in bp}^{} ≤ft( impact(XY) \cdot f_{i,j}(XY) \cdot \log_2 ≤ft( \frac{f_{i,j}(XY)}{f_{i}(X) \cdot f_{j}(Y)} \right) \right)*

where, *bp* denotes the base pairings A/U, C/G, and G/U; `impact`

is their weight; *f* is the frequency of single bases or pairs weighted by the corresponding `weight`

of each sequence.

A penalty is then added for bases that are inconsistent with pairing:

*H_{mod}(i,j) = H(i,j) + ∑_{XY \notin bp}^{} \Big( impact(XY) \cdot f_{i,j}(XY) \Big)*

Next an average product correction (Buslje et al., 2009) is applied to the matrix *H*:

*H_{APC}(i,j) = H_{mod}(i,j) - avgProdCorr \cdot \frac{\overline{H_{mod}(i,.)} \cdot \overline{H_{mod}(.,j)}}{\overline{H_{mod}(.,.)}}*

The mutual information values are then rescaled between `0`

and `1`

by applying a sigmoidal transformation, which is controlled by `shift`

and `slope`

:

*H_{final}(i,j) = ≤ft( 1 + \exp ≤ft( slope \cdot log_e ≤ft( \frac{H_{APC}(i,j)}{shift \cdot H_{APC}[n]} \right) \right) \right)^{-1}*

where, *n* is the number of positions having `minOccupancy`

divided by two (i.e., the maximum possible number of paired positions) and *H_{APC}[n]* denotes the *n^{th}* highest value in the matrix *H_{APC}*.

If `type`

is `"states"`

or `"pairs"`

, the secondary structure is determined using a variant of the Nussinov algorithm similar to that described by Venkatachalam et al. (2014). Pairings with a score below `threshold`

are not considered during the traceback. If `psuedoknots`

is greater than `0`

, paired positions are removed from consideration and the method is applied again to find pseudoknots.

In practice the secondary structure prediction is most accurate when the input alignment is of high quality, contains a wide diversity of sequences, the number of sequences is large, no regions are completely conserved across all sequences, and most of the sequences span the entire alignment (i.e., there are few partial/incomplete sequences).

If `type`

is `"states"`

(the default), then the output is a character vector with the predicted secondary structure assignment for each position in `myXStringSet`

. Standard dot-bracket notation (DBN) is used, where “.” signifies an unpaired position, “(” and “)” a paired position, and successive “[]”, “{}”, and “<>” indicate increasing order pseudoknots. Columns below `minOccupancy`

are denoted by the “-” character to indicate that they contained too many gaps to be included in the consensus structure.

If `type`

is `"pairs"`

, then a matrix is returned with one row for each base pairing and three columns giving the positions of the paired bases and their `pseudoknot`

order.

If `type`

is `"evidence"`

, then a matrix is returned with one row for each base pairing and three columns giving the positions of the paired bases and their respective scores (greater than or equal to `threshold`

). This differs from `type`

`"pairs"`

in that `"evidence"`

does not perform a traceback. Therefore, it is possible to have conflicting evidence where a single base has evidence for pairing with multiple other bases.

If `type`

is `"scores"`

, then a matrix of three rows is returned, where the values in a column represent the maximum score for a state in each position. Columns sum to `1`

if the position was above `minOccupancy`

and `0`

otherwise.

If `type`

is `"structures"`

, then the output is a list with one element for each sequence in `myXStringSet`

. Each list element contains a matrix of dimension 3 (each state) by the number of nucleotides in the sequence. Columns of the matrix sum to zero where the nucleotide was located in a position that was below `minOccupancy`

. Otherwise, positions are considered paired if they are consistent with pairing (i.e., A/U, C/G, or G/U) in the consensus secondary structure.

If `type`

is `"search"`

then an attempt is made to find additional secondary structure beyond positions exhibiting covariation. First, anchors are identified as pairs of covarying positions with their score above `threshold`

. Next, the regions between anchors are searched for previously unidentified stem loops. Finally, any helices are assigned a score according to their length, i.e. one minus the probability of finding that many consecutive pairs within the anchor boundaries by chance. Hence, output `type`

`"search"`

will find secondary structure outside of the consensus structure shared by most sequences, and can identify secondary structure in conserved alignment regions.

Erik Wright [email protected]

Buslje, C., et al. (2009). Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics, **25(9)**, 1125-1131.

Freyhult, E., et al. (2005). Predicting RNA Structure Using Mutual Information. Applied Bioinformatics, **4(1)**, 53-59.

Venkatachalam, B., et al. (2014). Faster algorithms for RNA-folding using the Four-Russians method. Algorithms for Molecular Biology : AMB, **9(1)**, 1-12.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | ```
# load the example non-coding RNA sequences
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
rna <- SearchDB(db, type="RNAStringSet")
# predict the secondary structure in dot-bracket notation (dbn)
p <- PredictDBN(rna, "states") # predict the secondary structure in dbn
p # pairs are denoted by (), and (optionally) pseudoknots by [], {}, and <>
# convert the dot-bracket notation into pairs of positions within the alignment
p <- PredictDBN(rna, "pairs") # paired positions in the alignment
head(p) # matrix giving the pairs and their pseudoknot order (when > 0)
# show all available evidence of base pairing
p <- PredictDBN(rna, "evidence") # all pairs with scores >= threshold
head(p) # matrix giving the pairs and their scores
# determine the score at every alignment position
p <- PredictDBN(rna, "scores") # score in the alignment
p["(", 122] # score for left-pairing at alignment position 122
p[")", 260] # score for right-pairing at alignment position 260
# find the scores individually for every sequence in the alignment
p <- PredictDBN(rna, "structures") # scores per sequence
p[[1]][, 1] # the scores for the first position in the first sequence
p[[2]][, 10] # the scores for the tenth position in the second sequence
# these positional scores can be used as shades of red, green, and blue:
BrowseSeqs(rna, patterns=p) # red = unpaired, green = left-pairing, blue = right
# positions in black are not part of the consensus secondary structure
# search for additional secondary structure between the consensus pairs
p <- PredictDBN(rna, "search") # scores per sequence after searching
BrowseSeqs(rna, patterns=p) # red = unpaired, green = left-pairing, blue = right
# note that "search" identified many more pairings than "structures"
``` |

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.