add_multifreq: Add multi-letter information to a motif.

View source: R/add_multifreq.R

add_multifreqR Documentation

Add multi-letter information to a motif.

Description

If the original sequences are available for a particular motif, then they can be used to generate higher-order PPM matrices. See the "Motif import, export, and manipulation" vignette for more information.

Usage

add_multifreq(motif, sequences, add.k = 2:3, RC = FALSE,
  threshold = 0.001, threshold.type = "pvalue", motifs.perseq = 1,
  add.bkg = FALSE)

Arguments

motif

See convert_motifs() for acceptable formats. If the motif is not a universalmotif motif, then it will be converted.

sequences

XStringSet The alphabet must match that of the motif. If these sequences are all the same length as the motif, then they are all used to generate the multi-freq matrices. Otherwise scan_sequences() is first run to find the best sequence stretches within these.

add.k

numeric(1) The k-let lengths to add.

RC

logical(1) If TRUE, check reverse complement of the input sequences. Only available for DNA/RNA.

threshold

numeric(1) See details.

threshold.type

character(1) One of c('pvalue', 'qvalue', 'logodds', 'logodds.abs'). See details.

motifs.perseq

numeric(1) If scan_sequences() is run, then this indicates how many hits from each sequence is to be used.

add.bkg

logical(1) Indicate whether to add corresponding higher order background information to the motif. Can sometimes be detrimental when the input consists of few short sequences, which can increase the likelihood of adding zero or near-zero probabilities.

Details

See scan_sequences() for more info on scanning parameters.

At each position in the motif, then the probability of each k-let covering from the initial position to ncol - 1 is calculated. Only positions within the motif are considered: this means that the final k-let probability matrix will have ncol - 1 fewer columns. Calculating k-let probabilities for the missing columns would be trivial however, as you would only need the background frequencies. Since these would not be useful for scan_sequences() though, they are not calculated.

Currently add_multifreq() does not try to stay faithful to the default motif matrix when generating multifreq matrices. This means that if the sequences used for training are completely different from the actual motif, the multifreq matrices will be as well. However this is only really a problem if you supply add_multifreq() with a set of sequences of the same length as the motif. In this case add_multifreq() is forced to create the multifreq matrices from these sequences. Otherwise add_multifreq() will scan the input sequences for the motif and use the best matches to construct the multifreq matrices.

This 'multifreq' representation is only really useful within the universalmotif environment. Despite this, if you wish it can be preserved in text using write_motifs().

A note on motif size

The number of rows for each k-let matrix is n^k, with n being the number of letters in the alphabet being used. This means that the size of the k-let matrix can become quite large as k increases. For example, if one were to wish to represent a DNA motif of length 10 as a 10-let, this would require a matrix with 1,048,576 rows (though at this point if what you want is to search for exact sequence matches, the motif format itself is not very useful).

Value

A universalmotif object with filled multifreq slot. The bkg slot is also expanded with corresponding higher order probabilities if add.bkg = TRUE.

Author(s)

Benjamin Jean-Marie Tremblay, benjamin.tremblay@uwaterloo.ca

See Also

scan_sequences(), convert_motifs(), write_motifs()

Examples

sequences <- create_sequences(seqlen = 10)
motif <- create_motif()
motif.trained <- add_multifreq(motif, sequences, add.k = 2:4)
## peek at the 2-let matrix:
motif.trained["multifreq"]$`2`


bjmt/universalmotif documentation built on Nov. 16, 2024, 7:38 a.m.