# Predict RNA Secondary Structure in Dot-Bracket Notation

### Description

Predicts a consensus RNA secondary structure from a multiple sequence alignment using mutual information.

### Usage

1 2 3 4 5 6 7 8 9 10 11 12 | ```
PredictDBN(myXStringSet,
type = "states",
minOccupancy = 0.5,
impact = c(1, 1.2, 0.4, -1),
avgProdCorr = 1,
slope = 2,
shift = 1.3,
threshold = 0.5,
pseudoknots = 1,
weight = 1,
processors = 1,
verbose = TRUE)
``` |

### Arguments

`myXStringSet` |
A |

`type` |
Character string indicating the type of results desired. This should be (an unambiguous abbreviation of) one of |

`minOccupancy` |
Numeric specifying the minimum occupancy (1 - fraction of gaps) required to include a column of the alignment in the prediction. |

`impact` |
A vector with four elements giving the weights of A/U, G/C, G/U, and other pairings, respectively. The last element of |

`avgProdCorr` |
Numeric specifying the weight of the average product correction (APC) term, as described in Buslje et al. (2009). |

`slope` |
Numeric giving the slope of the sigmoid used to convert mutual information values to scores ranging from zero to one. |

`shift` |
Numeric giving the relative shift of the sigmoid used to convert mutual information values to scores ranging from zero to one. |

`threshold` |
Numeric specifying the score threshold at which to consider positions for pairing. Only applicable if |

`pseudoknots` |
Integer indicating the maximum order of pseudoknots that are acceptable. A value of |

`weight` |
A numeric vector of weights for each sequence, or a single number implying equal weights. |

`processors` |
The number of processors to use, or |

`verbose` |
Logical indicating whether to display progress. |

### Details

`PredictDBN`

employs an extension of the method described by Freyhult et al. (2005) for determining a consensus RNA secondary structure. It uses the mutual information (*H*) measure to find covarying positions in a multiple sequence alignment. The original method is modified by the addition of different weights for each type of base pairing and each input sequence. The formula for mutual information between positions *i* and *j* then becomes:

*H(i,j) = ∑_{XY \in bp}^{} ≤ft( impact(XY) \cdot f_{i,j}(XY) \cdot \log_2 ≤ft( \frac{f_{i,j}(XY)}{f_{i}(X) \cdot f_{j}(Y)} \right) \right)*

where, *bp* denotes the base pairings A/U, C/G, and G/U; `impact`

is their weight; *f* is the frequency of single bases or pairs weighted by the corresponding `weight`

of each sequence.

A penalty is then added for bases that are inconsistent with pairing:

*H_{mod}(i,j) = H(i,j) + ∑_{XY \notin bp}^{} \Big( impact(XY) \cdot f_{i,j}(XY) \Big)*

Next an average product correction (Buslje et al., 2009) is applied to the matrix *H*:

*H_{APC}(i,j) = H_{mod}(i,j) - avgProdCorr \cdot \frac{\overline{H_{mod}(i,.)} \cdot \overline{H_{mod}(.,j)}}{\overline{H_{mod}(.,.)}}*

The mutual information values are then rescaled between `0`

and `1`

by applying a sigmoidal transformation, which is controlled by `shift`

and `slope`

:

*H_{final}(i,j) = ≤ft( 1 + \exp ≤ft( slope \cdot log_e ≤ft( \frac{H_{APC}(i,j)}{shift \cdot H_{APC}[n]} \right) \right) \right)^{-1}*

where, *n* is the number of positions having `minOccupancy`

divided by two (i.e., the maximum possible number of paired positions) and *H_{APC}[n]* denotes the *n^{th}* highest value in the matrix *H_{APC}*.

If `type`

is `"states"`

or `"pairs"`

, the secondary structure is determined using a variant of the Nussinov algorithm similar to that described by Venkatachalam et al. (2014). Pairings with a score below `threshold`

are not considered during the traceback. If `psuedoknots`

is greater than `0`

, paired positions are removed from consideration and the method is applied again to find pseudoknots.

In practice the secondary structure prediction is most accurate when the input alignment is of high quality, contains a wide diversity of sequences, the number of sequences is large, no regions are completely conserved across all sequences, and most of the sequences span the entire alignment (i.e., there are few partial/incomplete sequences).

### Value

If `type`

is `"states"`

(the default), then the output is a character vector with the predicted secondary structure assignment for each position in `myXStringSet`

. Standard dot-bracket notation (DBN) is used, where “.” signifies an unpaired position, “(” and “)” a paired position, and successive “[]”, “{}”, and “<>” indicate increasing order pseudoknots.

If `type`

is `"pairs"`

, then a matrix is returned with one row for each base pairing and three columns giving the positions of the paired bases and their `pseudoknot`

order.

If `type`

is `"scores"`

, then a matrix of three rows is returned, where the values in a column represent the maximum score for a state in each position. Columns sum to `1`

if the position was above `minOccupancy`

and `0`

otherwise.

If `type`

is `"structures"`

, then the output is a list with one element for each sequence in `myXStringSet`

. Each list element contains a matrix of dimension 3 (each state) by the number of nucleotides in the sequence. Columns of the matrix sum to zero where the nucleotide was located in a position that was below `minOccupancy`

. Otherwise, positions are considered paired if they are consistent with pairing (i.e., A/U, C/G, or G/U) in the consensus secondary structure.

### Author(s)

Erik Wright DECIPHER@cae.wisc.edu

### References

Buslje, C., et al. (2009). Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics, **25(9)**, 1125-1131.

Freyhult, E., et al. (2005). Predicting RNA Structure Using Mutual Information. Applied Bioinformatics, **4(1)**, 53-59.

Venkatachalam, B., et al. (2014). Faster algorithms for RNA-folding using the Four-Russians method. Algorithms for Molecular Biology : AMB, **9(1)**, 1-12.

### See Also

`PredictHEC`

### Examples

1 2 3 4 5 6 7 8 9 10 11 12 | ```
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
rna <- SearchDB(db, type="RNAStringSet")
p <- PredictDBN(rna, "states")
p
# color paired bases in the sequences
w <- which(strsplit(p, "")[[1]] != ".")
BrowseSeqs(c(BStringSet(p), BStringSet(rna)),
colorPatterns=rep(w, each=2),
patterns=RNA_BASES)
PredictDBN(rna, "pairs") # paired positions
``` |