get_bkg: Calculate sequence background.

View source: R/get_bkg.R

get_bkgR Documentation

Calculate sequence background.

Description

For a set of input sequences, calculate the overall sequence background for any k-let size. For very large sequences DNA and RNA sequences (in the billions of bases), please be aware of the much faster and more efficient Biostrings::oligonucleotideFrequency(). get_bkg() can still be used in these cases, though it may take several seconds or minutes to calculate the results (depending on requested k-let sizes).

Usage

get_bkg(sequences, k = 1:3, as.prob = NULL, pseudocount = 0,
  alphabet = NULL, to.meme = NULL, RC = FALSE, list.out = NULL,
  nthreads = 1, merge.res = TRUE, window = FALSE, window.size = 0.1,
  window.overlap = 0)

Arguments

sequences

XStringSet Input sequences. Note that if multiple sequences are present, the results will be combined into one (unless merge.res = FALSE).

k

integer Size of k-let. Background can be calculated for any k-let size.

as.prob

Deprecated.

pseudocount

integer(1) Add a count to each possible k-let. Prevents any k-let from having 0 or 1 probabilities.

alphabet

character(1) Provide a custom alphabet to calculate a background for. If NULL, then standard letters will be assumed for DNA, RNA and AA sequences, and all unique letters found will be used for BStringSet type sequences. Note that letters which are not a part of the standard DNA/RNA/AA alphabets or in the provided alphabet will not be counted in the totals during probability calculations.

to.meme

If not NULL, then get_bkg() will return the sequence background in MEME Markov Background Model format. Input for this argument will be used for cat(..., file = to.meme) within get_bkg(). See http://meme-suite.org/doc/bfile-format.html for a description of the format.

RC

logical(1) Calculate the background of the reverse complement of the input sequences as well. Only valid for DNA/RNA.

list.out

Deprecated.

nthreads

numeric(1) Run get_bkg() in parallel with nthreads threads. nthreads = 0 uses all available threads. Note that no speed up will occur for jobs with only a single sequence.

merge.res

logical(1) Whether to merge results from all sequences or return background data for individual sequences.

window

logical(1) Determine background in windows.

window.size

numeric Window size. If a number between 0 and 1 is provided, the value is calculated as the number multiplied by the sequence length.

window.overlap

numeric Overlap between windows. If a number between 0 and 1 is provided, the value is calculated as the number multiplied by the sequence length.

Value

If to.meme = NULL, a DataFrame with columns klet, count, and probability. If merge.res = FALSE, there will be an additional sequence column. If window = TRUE, there will be an additional start and stop columns.

If to.meme is not NULL, then NULL is returned, invisibly.

Author(s)

Benjamin Jean-Marie Tremblay, benjamin.tremblay@uwaterloo.ca

References

Bailey TL, Elkan C (1994). “Fitting a mixture model by expectation maximization to discover motifs in biopolymers.” Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 2, 28-36.

See Also

create_sequences(), scan_sequences(), shuffle_sequences()

Examples

## Compare to Biostrings version
library(Biostrings)
seqs.DNA <- create_sequences()
bkg.DNA <- get_bkg(seqs.DNA, k = 3)
bkg.DNA2 <- oligonucleotideFrequency(seqs.DNA, 3, 1, as.prob = FALSE)
bkg.DNA2 <- colSums(bkg.DNA2)
all(bkg.DNA$count == bkg.DNA2)

## Create a MEME background file
get_bkg(seqs.DNA, k = 1:3, to.meme = stdout(), pseudocount = 1)

## Non-DNA/RNA/AA alphabets
seqs.QWERTY <- create_sequences("QWERTY")
bkg.QWERTY <- get_bkg(seqs.QWERTY, k = 1:2)


bjmt/universalmotif documentation built on Nov. 16, 2024, 7:38 a.m.