shuffle_sequences: Shuffle input sequences.

View source: R/shuffle_sequences.R

shuffle_sequencesR Documentation

Shuffle input sequences.

Description

Given a set of input sequences, shuffle the letters within those sequences with any k-let size.

Usage

shuffle_sequences(sequences, k = 1, method = "euler", nthreads = 1,
  rng.seed = sample.int(10000, 1), window = FALSE, window.size = 0.1,
  window.overlap = 0.01)

Arguments

sequences

XStringSet Set of sequences to shuffle. Works with any set of characters.

k

numeric(1) K-let size.

method

character(1) One of c('euler', 'markov', 'linear'). Only relevant if k > 1. See details.

nthreads

numeric(1) Run shuffle_sequences() in parallel with nthreads threads. nthreads = 0 uses all available threads. Note that no speed up will occur for jobs with only a single sequence.

rng.seed

numeric(1) Set random number generator seed. Since shuffling can occur simultaneously in multiple threads using C++, it cannot communicate with the regular R random number generator state and thus requires an independent seed. Each individual sequence in an XStringSet object will be given the following seed: rng.seed * index. The default is to pick a random number as chosen by sample(), which effectively is making shuffle_sequences() dependent on the R RNG state.

window

logical(1) Shuffle sequences iteratively over windows instead of all at once.

window.size

numeric(1) Window size. Can be a fraction less than one, or an integer representing the actual window size.

window.overlap

numeric(1) Overlap between windows. Can be a fraction less than one, or an integer representing the actual overlap size.

Details

markov method

If method = 'markov', then the Markov model is used to generate sequences which will maintain (on average) the k-let frequencies. Please note that this method is not a 'true' shuffling, and for short sequences (e.g. <100bp) this can result in slightly more dissimilar sequences versus true shuffling. See Fitch (1983) for a discussion on the topic.

euler method

If method = 'euler', then the sequence shuffling method proposed by Altschul and Erickson (1985) is used. As opposed to the 'markov' method, this one preserves exact k-let frequencies. This is done by creating a k-let edge graph, then following a random Eulerian walk through the graph. Not all walks will use up all available letters however, so the cycle-popping algorithm proposed by Propp and Wilson (1998) is used to find a random Eulerian path. A side effect of using this method is that the starting and ending sequence letters will remain unshuffled.

linear method

If method = 'linear', then the input sequences are split linearly every k letters. For example, for k = 3 'ACAGATAGACCC' becomes 'ACA GAT AGA CCC'; after which these 3-lets are shuffled randomly.

Single-letter shuffling

Do note however, that the method parameter is only relevant for k > 1. For k = 1, a simple shuffling is performed using the shuffle function from the C++ standard library.

Value

XStringSet The input sequences will be returned with identical names and lengths.

Author(s)

Benjamin Jean-Marie Tremblay, benjamin.tremblay@uwaterloo.ca

References

Altschul SF, Erickson BW (1985). “Significance of Nucleotide Sequence Alignments: A Method for Random Sequence Permutation That Preserves Dinucleotide and Codon Usage.” Molecular Biology and Evolution, 2, 526-538.

Fitch WM (1983). “Random sequences.” Journal of Molecular Biology, 163, 171-176.

Propp JG, Wilson DW (1998). “How to get a perfectly random sample from a generic markov chain and generate a random spanning tree of a directed graph.” Journal of Algorithms, 27, 170-217.

See Also

create_sequences(), scan_sequences(), enrich_motifs(), shuffle_motifs()

Examples

if (R.Version()$arch != "i386") {
sequences <- create_sequences()
sequences.shuffled <- shuffle_sequences(sequences, k = 2)
}


bjmt/universalmotif documentation built on Nov. 16, 2024, 7:38 a.m.