# PredictDBN: Predict RNA Secondary Structure in Dot-Bracket Notation In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

## Description

Predicts a consensus RNA secondary structure from a multiple sequence alignment using mutual information.

## Usage

  1 2 3 4 5 6 7 8 9 10 11 12 PredictDBN(myXStringSet, type = "states", minOccupancy = 0.5, impact = c(1, 1.2, 0.4, -1), avgProdCorr = 1, slope = 2, shift = 1.3, threshold = 0.7, pseudoknots = 1, weight = 1, processors = 1, verbose = TRUE) 

## Arguments

 myXStringSet A DNAStringSet or RNAStringSet object containing aligned sequences. type Character string indicating the type of results desired. This should be (an unambiguous abbreviation of) one of "states", "pairs", "evidence", "scores", "structures", or "search". (See value section below.) minOccupancy Numeric specifying the minimum occupancy (1 - fraction of gaps) required to include a column of the alignment in the prediction. impact A vector with four elements giving the weights of A/U, G/C, G/U, and other pairings, respectively. The last element of impact is the penalty for pairings that are inconsistent with two positions being paired (e.g., A/- or A/C). avgProdCorr Numeric specifying the weight of the average product correction (APC) term, as described in Buslje et al. (2009). slope Numeric giving the slope of the sigmoid used to convert mutual information values to scores ranging from zero to one. shift Numeric giving the relative shift of the sigmoid used to convert mutual information values to scores ranging from zero to one. threshold Numeric specifying the score threshold at which to consider positions for pairing. Only applicable if type is "states" or "pairs". pseudoknots Integer indicating the maximum order of pseudoknots that are acceptable. A value of 0 will prevent pseudoknots in the structure, whereas 1 (the default) will search for first-order psuedoknots. Only used if type is "states" or "pairs". weight A numeric vector of weights for each sequence, or a single number implying equal weights. processors The number of processors to use, or NULL to automatically detect and use all available processors. verbose Logical indicating whether to display progress.

## Details

PredictDBN employs an extension of the method described by Freyhult et al. (2005) for determining a consensus RNA secondary structure. It uses the mutual information (H) measure to find covarying positions in a multiple sequence alignment. The original method is modified by the addition of different weights for each type of base pairing and each input sequence. The formula for mutual information between positions i and j then becomes:

H(i,j) = ∑_{XY \in bp}^{} ≤ft( impact(XY) \cdot f_{i,j}(XY) \cdot \log_2 ≤ft( \frac{f_{i,j}(XY)}{f_{i}(X) \cdot f_{j}(Y)} \right) \right)

where, bp denotes the base pairings A/U, C/G, and G/U; impact is their weight; f is the frequency of single bases or pairs weighted by the corresponding weight of each sequence.

A penalty is then added for bases that are inconsistent with pairing:

H_{mod}(i,j) = H(i,j) + ∑_{XY \notin bp}^{} \Big( impact(XY) \cdot f_{i,j}(XY) \Big)

Next an average product correction (Buslje et al., 2009) is applied to the matrix H:

H_{APC}(i,j) = H_{mod}(i,j) - avgProdCorr \cdot \frac{\overline{H_{mod}(i,.)} \cdot \overline{H_{mod}(.,j)}}{\overline{H_{mod}(.,.)}}

The mutual information values are then rescaled between 0 and 1 by applying a sigmoidal transformation, which is controlled by shift and slope:

H_{final}(i,j) = ≤ft( 1 + \exp ≤ft( slope \cdot log_e ≤ft( \frac{H_{APC}(i,j)}{shift \cdot H_{APC}[n]} \right) \right) \right)^{-1}

where, n is the number of positions having minOccupancy divided by two (i.e., the maximum possible number of paired positions) and H_{APC}[n] denotes the n^{th} highest value in the matrix H_{APC}.

If type is "states" or "pairs", the secondary structure is determined using a variant of the Nussinov algorithm similar to that described by Venkatachalam et al. (2014). Pairings with a score below threshold are not considered during the traceback. If psuedoknots is greater than 0, paired positions are removed from consideration and the method is applied again to find pseudoknots.

In practice the secondary structure prediction is most accurate when the input alignment is of high quality, contains a wide diversity of sequences, the number of sequences is large, no regions are completely conserved across all sequences, and most of the sequences span the entire alignment (i.e., there are few partial/incomplete sequences).

## Value

If type is "states" (the default), then the output is a character vector with the predicted secondary structure assignment for each position in myXStringSet. Standard dot-bracket notation (DBN) is used, where “.” signifies an unpaired position, “(” and “)” a paired position, and successive “[]”, “{}”, and “<>” indicate increasing order pseudoknots. Columns below minOccupancy are denoted by the “-” character to indicate that they contained too many gaps to be included in the consensus structure.

If type is "pairs", then a matrix is returned with one row for each base pairing and three columns giving the positions of the paired bases and their pseudoknot order.

If type is "evidence", then a matrix is returned with one row for each base pairing and three columns giving the positions of the paired bases and their respective scores (greater than or equal to threshold). This differs from type "pairs" in that "evidence" does not perform a traceback. Therefore, it is possible to have conflicting evidence where a single base has evidence for pairing with multiple other bases.

If type is "scores", then a matrix of three rows is returned, where the values in a column represent the maximum score for a state in each position. Columns sum to 1 if the position was above minOccupancy and 0 otherwise.

If type is "structures", then the output is a list with one element for each sequence in myXStringSet. Each list element contains a matrix of dimension 3 (each state) by the number of nucleotides in the sequence. Columns of the matrix sum to zero where the nucleotide was located in a position that was below minOccupancy. Otherwise, positions are considered paired if they are consistent with pairing (i.e., A/U, C/G, or G/U) in the consensus secondary structure.

If type is "search" then an attempt is made to find additional secondary structure beyond positions exhibiting covariation. First, anchors are identified as pairs of covarying positions with their score above threshold. Next, the regions between anchors are searched for previously unidentified stem loops. Finally, any helices are assigned a score according to their length, i.e. one minus the probability of finding that many consecutive pairs within the anchor boundaries by chance. Hence, output type "search" will find secondary structure outside of the consensus structure shared by most sequences, and can identify secondary structure in conserved alignment regions.

## Author(s)

Erik Wright [email protected]

## References

Buslje, C., et al. (2009). Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics, 25(9), 1125-1131.

Freyhult, E., et al. (2005). Predicting RNA Structure Using Mutual Information. Applied Bioinformatics, 4(1), 53-59.

Venkatachalam, B., et al. (2014). Faster algorithms for RNA-folding using the Four-Russians method. Algorithms for Molecular Biology : AMB, 9(1), 1-12.

PredictHEC
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 # load the example non-coding RNA sequences db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER") rna <- SearchDB(db, type="RNAStringSet") # predict the secondary structure in dot-bracket notation (dbn) p <- PredictDBN(rna, "states") # predict the secondary structure in dbn p # pairs are denoted by (), and (optionally) pseudoknots by [], {}, and <> # convert the dot-bracket notation into pairs of positions within the alignment p <- PredictDBN(rna, "pairs") # paired positions in the alignment head(p) # matrix giving the pairs and their pseudoknot order (when > 0) # show all available evidence of base pairing p <- PredictDBN(rna, "evidence") # all pairs with scores >= threshold head(p) # matrix giving the pairs and their scores # determine the score at every alignment position p <- PredictDBN(rna, "scores") # score in the alignment p["(", 122] # score for left-pairing at alignment position 122 p[")", 260] # score for right-pairing at alignment position 260 # find the scores individually for every sequence in the alignment p <- PredictDBN(rna, "structures") # scores per sequence p[[1]][, 1] # the scores for the first position in the first sequence p[[2]][, 10] # the scores for the tenth position in the second sequence # these positional scores can be used as shades of red, green, and blue: BrowseSeqs(rna, patterns=p) # red = unpaired, green = left-pairing, blue = right # positions in black are not part of the consensus secondary structure # search for additional secondary structure between the consensus pairs p <- PredictDBN(rna, "search") # scores per sequence after searching BrowseSeqs(rna, patterns=p) # red = unpaired, green = left-pairing, blue = right # note that "search" identified many more pairings than "structures"