getmotifs: Jointly call and refine a set of seed motifs provided by the...
In MyersGroup/MotifFinder: A Bayesian, De Novo, Joint Motif Finder


Jointly call and refine a set of seed motifs provided by the findamotif function

getmotifs(scorematset, dimvec, seqs, maxwidth = 800, alpha = 0.5,
  incprob = 0.99999, maxits = 30, plen = 0.05, updatemot = 1,
  updatealpha = 1, ourprior = NULL, updateprior = 1, bg = -1,
  dt = T, allowinf = FALSE, seed = NULL, verbosity = 1,
  stranded_prior = F, conv_t = 0.05, conv_n = 200)

`scorematset`	a set of matrices, row-concatenated, giving PWMs (log-scale) used to initialise the algorithm. scorematset has `sum(dimvec)` rows and 4 columns: the first `dimvec[1]` rows give the PWM for the first motif, the next `dimvec[2]` rows the second motif, and so on
`dimvec`	gives the lengths of each of the initial motifs. If dimvec is of length n_motifs, motif k is of length `dimvec[k]`
`seqs`	a vector of input sequences in which to search for motifs. Lower-case bases are ignored (masked), which is useful if, for example, repeats are an issue. In some cases it may be helpful NOT to mask repeats that may contain motif matches
`maxwidth`	the length that elements of "seqs" will be trimmed to (around their centre). Run times depend roughly linearly on this parameter
`alpha`	a vector of initial assumed probabilities that each motif is present in a sequence
`incprob`	can usually be left as default value
`maxits`	the number of iterations (if no motif is found the algorithm could terminate early)
`plen`	a parameter setting the geometric prior on how long each motif found should be. plen=0.05 corresponds to a mean length of 20bp and is the default. Larger values of plen penalise longer motifs more strongly
`updatemot`	a flag - should the algorithm update (learn) the initial motifs (default is 1)
`updatealpha`	a flag - should the algorithm update (learn) alpha (default is 1)
`ourprior`	a vector of 10 probabilities giving the initial probability of a motif being found across different parts of the sequence, from start to end. If left unspecified, the initial prior is set to uniform and the algorithm tries to learn where motifs are, e.g. whether they are centrally enriched.
`updateprior`	a flag - should the algorithm update (learn) the prior on where the motifs occur within the DNA sequences (default is 1)
`bg`	should be left at default value normally (technical parameter setting background model)
`dt`	logical; should a data table of the results be returned
`allowinf`	a flag - should infinite values be allowed in scoremat (not recommended, default is FALSE).
`seed`	integer; seed for random number generation, set this for exactly reproducible results.
`verbosity`	integer; How verbose should this function be? 0=silent, 3=everything.
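
The row-concatenated layout of `scorematset` and `dimvec` described above can be sketched as follows. This is only an illustration on toy data: the two PWMs are made up, the column order A, C, G, T is an assumption, and so is the use of the natural logarithm for "log-scale".

```r
# Two toy PWMs as probability matrices (rows = motif positions, 4 columns).
# ASSUMPTION: columns are ordered A, C, G, T; this page does not state the order.
pwm1 <- matrix(c(0.7, 0.1, 0.1, 0.1,
                 0.1, 0.7, 0.1, 0.1,
                 0.1, 0.1, 0.7, 0.1),
               ncol = 4, byrow = TRUE)
pwm2 <- matrix(c(0.25, 0.25, 0.25, 0.25,
                 0.1,  0.1,  0.1,  0.7),
               ncol = 4, byrow = TRUE)

motifs <- list(pwm1, pwm2)

# Row-concatenate the matrices and move to log scale.
# ASSUMPTION: "log-scale" is taken to mean natural log here.
scorematset <- log(do.call(rbind, motifs))

# One length per motif, in the same order as the stacked rows.
dimvec <- vapply(motifs, nrow, integer(1))

# The layout requirement stated in the argument description.
stopifnot(nrow(scorematset) == sum(dimvec), ncol(scorematset) == 4)
```

Keeping the motifs in a list and deriving both objects from it makes it harder for `dimvec` and the stacked rows to fall out of step.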

Given a user-supplied set of initial PWMs and a set of input sequences in which to identify motifs, run a Gibbs sampler to update these motifs and output the results.

The user can also optionally supply priors on the fraction of sequences containing motifs, the likely length of motifs, and the positional distribution of motifs within the sequences.
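
As a rough illustration of the length prior, the sketch below assumes a geometric distribution on lengths 1, 2, 3, ... with success probability `plen`, whose mean is 1/plen. This matches the stated correspondence between plen=0.05 and a mean length of 20bp, but the exact parameterisation used inside the package is an assumption, and the helper `geom_pmf` is hypothetical.

```r
# ASSUMPTION: P(length = k) = plen * (1 - plen)^(k - 1) for k = 1, 2, ...
# geom_pmf is a hypothetical helper, not part of MotifFinder.
geom_pmf <- function(k, plen) plen * (1 - plen)^(k - 1)

plen_default <- 0.05
mean_len <- 1 / plen_default   # mean of this geometric distribution

# Relative prior weight of a 30bp motif versus a 10bp motif.
# The ratio simplifies to (1 - plen)^20, so it shrinks as plen grows.
ratio_default <- geom_pmf(30, 0.05) / geom_pmf(10, 0.05)
ratio_large   <- geom_pmf(30, 0.20) / geom_pmf(10, 0.20)

print(mean_len)
print(c(plen_0.05 = ratio_default, plen_0.20 = ratio_large))
```

Under this assumed form, raising plen from 0.05 to 0.2 makes a 30bp motif much less favoured relative to a 10bp one, which is the sense in which a large plen penalises longer motifs.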

User-supplied information can either be updated (the default) by the algorithm, or fixed at the input values

The program outputs a list of results, including information on inferred PWMs (i.e. motifs found), as well as a probabilistic output of which regions contain which motifs, and posterior distributions of the other parameters

If you use this program, please cite Altemose et al. eLife 2017

The code returns detailed output as a list, whose elements are as follows (access these using commands like outputlist$scoremat)

Details of input data given:

seqs: the vector of input sequences used for finding motifs within
trimmedseqs: the vector of input sequences used for finding motifs within, after trimming to shorten long input sequences

Details of overall fitted model:

scoremat: a matrix made up of matrices, row-concatenated, giving pwms (log-scale) for the identified motifs after iteration
scorematdim: the lengths of each of the identified motifs. If scorematdim is of length n_motifs, motif k is of length scorematdim[k]. scoremat is of dimension sum(scorematdim) by 4: the first scorematdim[1] rows of this matrix give the pwm for the first motif, the next scorematdim[2] rows the second motif, and so on
prior: a vector of 10 probabilities giving the inferred probability of a motif being found across different parts of the sequence, from start to end.
alpha: a vector of probabilities giving the inferred probability of each motif being found within a single input region
bindmat: a version of scoremat accounting for the background sequence composition
background: the inferred background model
seed: random number generation seed used
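
Because scoremat stacks all motifs in one matrix, it is often convenient to split it back into one matrix per motif using scorematdim. The sketch below uses a mock list standing in for real getmotifs output; the values are placeholders, not results.

```r
# Mock stand-in for the list returned by getmotifs (placeholder values only).
outputlist <- list(
  scoremat    = log(matrix(0.25, nrow = 5, ncol = 4)),
  scorematdim = c(3L, 2L)
)

# Last and first row of each motif within the stacked matrix.
ends   <- cumsum(outputlist$scorematdim)
starts <- ends - outputlist$scorematdim + 1

# One matrix per motif; drop = FALSE keeps a matrix even for a length-1 motif.
pwms <- lapply(seq_along(ends), function(k) {
  outputlist$scoremat[starts[k]:ends[k], , drop = FALSE]
})

print(vapply(pwms, nrow, integer(1)))
```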

Details of output for given data:

regprobs: a matrix giving the probability of each motif occurring in each given input sequence
regprob: a vector giving the overall probability of any motif occurring in each given input sequence
bestpos: a matrix giving the best match for each motif in each given input sequence
whichregs: a vector showing which input sequences had motifs identified in the final round of sampling of the Gibbs sampler
whichpos: for motifs identified in regions described in whichregs, the start positions of motifs identified in the final round of sampling of the Gibbs sampler
whichmot: for motifs identified in regions described in whichregs, the type (an integer in the range 1 to length(scorematdim)) of motifs identified in the final round of sampling of the Gibbs sampler
whichstrand: for motifs identified in regions described in whichregs, the strand associated with motifs identified in the final round of sampling of the Gibbs sampler, relative to the input sequence
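
The final-round calls can be gathered into one table for inspection. The sketch below assumes that the region, position and motif-type vectors are parallel (element i of each describes the same call), which the descriptions above suggest but do not state outright. The list is a mock with made-up values, and whichstrand is left out because its encoding is not given here.

```r
# Mock stand-in for getmotifs output (made-up values, not real results).
# ASSUMPTION: the three vectors are parallel, one element per called motif.
outputlist <- list(
  whichregs   = c(1L, 3L, 4L),
  whichpos    = c(12L, 40L, 7L),
  whichmot    = c(1L, 2L, 1L),
  scorematdim = c(3L, 2L)
)

calls <- data.frame(
  region = outputlist$whichregs,
  start  = outputlist$whichpos,
  motif  = outputlist$whichmot
)

# Count calls per motif type, including types with no calls.
n_motifs <- length(outputlist$scorematdim)
counts <- table(factor(calls$motif, levels = seq_len(n_motifs)))

print(calls)
print(counts)
```

These are samples from a single round of the sampler, so for a summary across the whole run the regprobs matrix is likely the more stable quantity to threshold.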