findamotif: Find the single most enriched motif in a set of DNA sequences
In MyersGroup/MotifFinder: A Bayesian, De Novo, Joint Motif Finder


Find the single most enriched motif in a set of DNA sequences

findamotif(seqs, len, scores = NULL, nits = 50, scoring_its = 5,
  n_for_refine = 1000, prior = NULL, updateprior = 1, plen = 0.9,
  seed = NULL, verbosity = 1, motif_rank = 1,
  motif_blacklist = NULL, range = 50, stranded_prior = F,
  motif_seed = "central", conv_t = 0, conv_n = 200)

`seqs`	a vector of strings giving the DNA sequences in which to find a motif
`len`	length of motif to find (min=4)
`scores`	a set of regional scores giving weights; e.g. ChIP-Seq enrichment values
`nits`	number of iterations used for motif refinement
`n_for_refine`	only the top n_for_refine scoring regions are used for motif refinement
`prior`	a vector of 10 probabilities giving the initial probability of a motif being found in each part of the sequence, from start to end. If left unspecified, the initial prior is uniform and the algorithm tries to learn where motifs are, e.g. whether they are centrally enriched.
`updateprior`	a flag: should the algorithm update (learn) the prior on where the motifs occur within the DNA sequences? (default is 1)
`plen`	a parameter setting the geometric prior on how long each motif found should be. plen=0.05 corresponds to a mean length of 20bp; note that the usage above shows a default of 0.9. A larger plen penalises longer motifs more strongly.
`seed`	integer; seed for random number generation, set this for exactly reproducible results.
`verbosity`	integer; How verbose should this function be? 0=silent, 3=everything.
`motif_rank`	integer; which rank of seed motif to use (1st seed motif, 2nd etc.)
`motif_blacklist`	character vector; motifs not to use as the seed motif
`range`	integer; range around center to check for central enrichment
`motif_seed`	string; "central", "modal", "random", or a string e.g. "ACGTGAC"
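
The link between plen and the expected motif length can be checked with a line of arithmetic. This is a minimal sketch assuming the standard geometric parameterisation, in which the mean is 1/plen (consistent with the 20bp quoted for plen=0.05); how the package applies the prior internally is not shown here, and `mean_len` is a hypothetical helper, not part of MotifFinder.

```r
# Hypothetical helper: mean of a geometric distribution with parameter plen,
# assuming the parameterisation in which the mean is 1 / plen.
mean_len <- function(plen) 1 / plen

mean_len(0.05)  # 20, matching the 20bp quoted for plen = 0.05
mean_len(0.9)   # about 1.11: a large plen strongly favours short motifs
```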

This function identifies a single position weight matrix (PWM) using the iterative Gibbs sampler described in Altemose et al. eLife 2017. Function 2 can refine multiple motifs further, jointly.

The user must input a set of DNA sequences, a score for each sequence (e.g. an enrichment value or any other score), and a length for an initial motif (e.g. 8 bp) used to seed the algorithm.
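
As an illustration of these inputs, the sketch below simulates random sequences, plants a made-up motif near the centre of half of them, and passes them to findamotif using only arguments listed in the usage above. The simulated data, the planted motif "ACGTGACC", the score values and the variable names are all assumptions for illustration; the call runs only if the MotifFinder package is installed.

```r
# Simulate 200 random 100bp sequences and plant a hypothetical motif
# (chosen only for illustration) at the centre of the first half.
set.seed(1)
bases <- c("A", "C", "G", "T")
n_seqs <- 200
width <- 100
planted <- "ACGTGACC"

seqs <- vapply(seq_len(n_seqs), function(i) {
  s <- paste(sample(bases, width, replace = TRUE), collapse = "")
  if (i <= n_seqs / 2) {
    start <- (width - nchar(planted)) %/% 2 + 1
    substr(s, start, start + nchar(planted) - 1) <- planted
  }
  s
}, character(1))

# Made-up scores: higher for the sequences carrying the planted motif.
scores <- c(rep(2, n_seqs / 2), rep(1, n_seqs / 2))

# Run only if the package is installed; arguments follow the usage above.
if (requireNamespace("MotifFinder", quietly = TRUE)) {
  fit <- MotifFinder::findamotif(seqs, len = 8, scores = scores,
                                 nits = 50, seed = 42, verbosity = 0)
  str(fit$scoremat)
}
```

Setting seed makes the run reproducible, as noted in the argument list; len = 8 matches the example seed length mentioned above.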

There are additional optional parameters.

The program outputs a list of results, including information on the inferred PWM (i.e. the motif found), a probabilistic output of which regions contain this motif, and posterior distributions of the other parameters.

A list with the following items:
Details of input data given:

seqs: the vector of input sequences searched for motifs
trimmedseqs: the same input sequences after trimming to shorten long sequences

Details of overall fitted model:

scoremat: a matrix giving the pwm (log-scale) for the identified motif after iteration
scorematdim: the length of the identified motif; scoremat has dimension scorematdim x 4
prior: a vector of 10 probabilities giving the inferred probability of a motif being found in each part of the sequence, from start to end.
alpha: a vector of probabilities giving the inferred probability of the motif being found within each input region
bindmat: a version of scoremat accounting for the background sequence composition
background: the inferred background model

Details of output for given data:

regprobs, regprob: in this case identical vectors giving the probability of the motif occurring in each input sequence
bestpos: a vector giving the best match to the motif in each input sequence
whichregs: a vector showing which input sequences had motifs identified in the final round of Gibbs sampling
whichpos: for motifs identified in the regions listed in whichregs, the start positions of those motifs in the final round of Gibbs sampling
whichmot: not needed in this case
whichstrand: for motifs identified in the regions listed in whichregs, the strand of each motif in the final round of Gibbs sampling, relative to the input sequence
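
To show how these items might be used, the sketch below turns a log-scale scoremat into per-position base probabilities, reads off a consensus, and picks out likely motif-containing sequences from regprobs. It works on a mock list with made-up values rather than real findamotif output. The natural-log scale, the A, C, G, T column order and the 0.5 cut-off are assumptions, not documented behaviour.

```r
# Mock result standing in for the list returned by findamotif();
# only the fields used below are filled in, with made-up values.
probs_true <- matrix(c(0.7, 0.1, 0.1, 0.1,
                       0.1, 0.7, 0.1, 0.1,
                       0.1, 0.1, 0.7, 0.1,
                       0.1, 0.1, 0.1, 0.7),
                     ncol = 4, byrow = TRUE)
fit <- list(scoremat = log(probs_true),
            scorematdim = 4,
            regprobs = c(0.95, 0.10, 0.80))

# Assumption: natural-log scale, columns ordered A, C, G, T.
pwm <- exp(fit$scoremat)
pwm <- pwm / rowSums(pwm)  # renormalise each position (row) to sum to 1
colnames(pwm) <- c("A", "C", "G", "T")

# Consensus: the most probable base at each motif position.
consensus <- paste(colnames(pwm)[max.col(pwm, ties.method = "first")],
                   collapse = "")

# Input sequences whose motif probability exceeds an arbitrary 0.5 cut-off.
hits <- which(fit$regprobs > 0.5)
```

With real output, it is worth checking that nrow(fit$scoremat) equals fit$scorematdim before interpreting the rows as motif positions.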