phyloFit: Fit a Phylogenetic model to an alignment...
In rphast: Interface to 'PHAST' Software for Comparative Genomics

Description Usage Arguments Value Parameter boundaries Note Author(s) Examples

Fit a Phylogenetic model to an alignment

phyloFit(msa, tree = NULL, subst.mod = "REV", init.mod = NULL,
  no.opt = c("backgd"), init.backgd.from.data = ifelse(is.null(init.mod),
  TRUE, FALSE), features = NULL, scale.only = FALSE, scale.subtree = NULL,
  nrates = NULL, alpha = 1, rate.constants = NULL, selection = NULL,
  init.random = FALSE, init.parsimony = FALSE, clock = FALSE,
  EM = FALSE, max.EM.its = NULL, precision = "HIGH", ninf.sites = 50,
  quiet = FALSE, bound = NULL, log.file = FALSE)

`msa`	An alignment object. May be altered if passed in as a pointer to C memory (see Note).
`tree`	A character string containing a Newick formatted tree defining the topology. Required if the number of species > 3, unless init.mod is specified. The topology must be rooted, although the root is ignored if the substitution model is reversible.
`subst.mod`	The substitution model to use. Some possible models include "REV", "JC69", "K80", "F81", "HKY85", "R2", "U2". Run `subst.mods()` for a full list; some models are experimental.
`init.mod`	An object of class `tm` used to initialize the model.
`no.opt`	A character vector indicating which parameters NOT to optimize (instead hold constant at their initial values). By default, the equilibrium frequencies (backgd) are not optimized. Other parameters that may be indicated here are "ratematrix" for the entire rate matrix, "kappa" for models with transition/transversion ratios, "branches" to hold all branch lengths constant, "ratevar" for rate variation parameters, "scale" for the tree scaling factor, and "scale_sub" for the subtree scaling factor. This argument does NOT apply to parameters of a lineage-specific model created with `add.ls.mod`, though such parameters can be held constant by using appropriate arguments when the model is created. See `add.ls.mod` for more details about lineage-specific models.
`init.backgd.from.data`	A logical value; can be `FALSE` only if init.mod is provided. If `TRUE`, use observed base frequencies in data to initialize equilibrium frequencies. Otherwise use the values from init.mod. By default uses init.mod values if provided.
`features`	An object of type `feat`. If given, a separate model will be estimated for each feature type.
`scale.only`	A logical value. If `TRUE`, estimate only the scale of the tree. Branches will be held at initial values. Useful in conjunction with init.mod.
`scale.subtree`	A character string giving the name of a node in a tree. This option implies scale.only=TRUE. If given, estimate separate scale factors for subtree beneath identified node and the rest of the tree. The branch leading to the subtree is included in the subtree.
`nrates`	An integer. The number of rate categories to use. Specifying a value greater than one causes the discrete gamma model for rate variation to be used, unless rate constants are specified. The default value `NULL` implies a single rate category.
`alpha`	A numeric value > 0, for use with "nrates". Initial value for alpha, the shape parameter of the gamma distribution.
`rate.constants`	A numeric vector. Implies `nrates = length(rate.constants)`. Also implies `EM=TRUE`. Uses a non-parametric mixture model for rates, instead of a gamma distribution. The weight associated with each rate will be estimated. alpha may still be used to initialize these weights.
`selection`	A numeric value. If provided, use selection in the model. The value given will be the initial value for selection. If `NULL`, selection will not be used unless init.mod is provided and indicates a model with selection. selection scales the rate matrix by s/(1-exp(-s)). Selection is applied after the rate matrix is scaled so that the expected number of substitutions per unit time is 1. When using codon models, selection only scales nonsynonymous substitutions.
`init.random`	A logical value. If `TRUE`, parameters will be initialized randomly.
`init.parsimony`	A logical value. If `TRUE`, branch lengths will be estimated based on parsimony counts for the alignments. Currently only works for models of order0.
`clock`	A logical value. If `TRUE`, assume a molecular clock in estimation.
`EM`	A logical value. If `TRUE`, the model is fit using EM rather than the default BFGS quasi-Newton algorithm. Not available for all models/options.
`max.EM.its`	An integer value; only applies if `EM==TRUE`. The maximum number of EM iterations to perform. The EM algorithm may quit earlier if other convergence criteria are met.
`precision`	A character vector, one of "HIGH", "MED", or "LOW", denoting the level of precision to use in estimating model parameters. Affects convergence criteria for iterative algorithms: higher precision means more iterations and longer execution time.
`ninf.sites`	An integer. Require at least this many "informative" sites in order to estimate a model. An informative site as an alignment column with at least two non-gap and non-missing-data characers.
`quiet`	A logical value. If `TRUE`, do not report progress to screen.
`bound`	Defines boundaries for parameters (see Details below).
`log.file`	If TRUE, write log of optimization to the screen. If a character string, write log of optimization to the named file. Otherwise write no optimization log.

An object of class tm (tree model), or (if several models are computed, as is possible with the features or windows options), a list of objects of class tm.

Boundaries can be set for some parameters using the bound argument. The bound argument should be a vector of character strings, each element defines the boundaries for a single parameter. The boundaries are best explained by example. A value of c("scale[0,1]", "scale_sub[1,]", "kappa[,3]") would imply to keep the scale between 0 and 1, the subtree scale between 1 and infinity, and kappa between 0 and 3. The blank entries in the subtree_scale upper bound and kappa's lower bound indicate not to set this boundary, in which case the normal default boundary will be used for that parameter. (Most parameters are defined between 0 and infinity). Most of the parameters listed in the description of no.opt can also have their boundaries set in this way.

If msa or features object are passed in as pointers to C memory, they may be altered by this function! Use copy.msa(msa) or copy.feat(features) to avoid this behavior!

Melissa J. Hubisz and Adam Siepel

exampleArchive <- system.file("extdata", "examples.zip", package="rphast")
files <- c("ENr334-test.maf", "ENr334-100k.fa", "ENr334-test.gff", "rev.mod")
unzip(exampleArchive, files)
m <- read.msa("ENr334-test.maf")
mod <- phyloFit(m, tree="((hg18, (mm9, rn4)), canFam2)")
mod
phyloFit(m, init.mod=mod)
likelihood.msa(m, mod)
mod$likelihood
print(mod$likelihood, digits=10)
f <- read.feat("ENr334-test.gff")
mod <- phyloFit(m, tree="((hg18, (mm9, rn4)), canFam2)",
                features=f, quiet=TRUE)
names(mod)
mod$other
mod[["5'flank"]]
phyloFit(m, init.mod=mod$AR, nrates=3, alpha=4.0)
phyloFit(m, init.mod=mod$AR, rate.constants=c(10, 5, 1))
# background frequencies options

# this should use the background frequencies from the initial mod
phyloFit(m, init.mod=mod$AR, quiet=TRUE)$backgd
mod$AR$backgd

# this should use the background frequencies from the data
phyloFit(m, init.mod=mod$AR, init.backgd.from.data=TRUE, quiet=TRUE)$backgd
mod$AR$backgd

# this should optimize the background frequencies
phyloFit(m, init.mod=mod$AR, no.opt=NULL, quiet=TRUE)$backgd
mod$AR$backgd

unlink(files)

Extracting sufficient statistics ...
Compacting sufficient statistics ...
Fitting tree model to alignment using REV ...
numpar = 11
Done.  log(likelihood) = -149094.320182 numeval=331
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
TRAINING_LNL: -149094.320182
BACKGROUND: 0.269027 0.248505 0.227938 0.254530 
RATE_MAT:
  -0.942590    0.176447    0.624375    0.141767 
   0.191019   -0.996540    0.151541    0.653980 
   0.736926    0.165214   -1.105011    0.202871 
   0.149842    0.638500    0.181676   -0.970018 
TREE: ((hg18:0.112696,(mm9:0.0657572,rn4:0.0698817):0.292789):0.105601,canFam2:0.105601);
Extracting sufficient statistics ...
Compacting sufficient statistics ...
Fitting tree model to alignment using REV ...
numpar = 11
Done.  log(likelihood) = -149094.320181 numeval=29
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
TRAINING_LNL: -149094.320181
BACKGROUND: 0.269027 0.248505 0.227938 0.254530 
RATE_MAT:
  -0.942590    0.176447    0.624375    0.141768 
   0.191019   -0.996540    0.151541    0.653980 
   0.736926    0.165214   -1.105011    0.202871 
   0.149842    0.638500    0.181676   -0.970018 
TREE: ((hg18:0.112696,(mm9:0.0657604,rn4:0.0698818):0.29279):0.105599,canFam2:0.105599);
[1] -149094.3
[1] -149094.3
[1] -149094.3202
Skipping alignment (category 7); insufficient informative sites ...
[1] "background" "AR"         "other"      "5'flank"    "5'UTR"     
[6] "CDS"        "intron"     "3'flank"    "phastCons" 
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
TRAINING_LNL: -75146.750332
BACKGROUND: 0.254259 0.252749 0.251373 0.241619 
RATE_MAT:
  -1.001616    0.188908    0.669705    0.143004 
   0.190036   -0.969975    0.153607    0.626332 
   0.677394    0.154448   -1.023594    0.191752 
   0.150484    0.655184    0.199493   -1.005161 
TREE: ((hg18:0.118571,(mm9:0.0677486,rn4:0.0729959):0.347226):0.112422,canFam2:0.112422);
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
TRAINING_LNL: -408.370000
BACKGROUND: 0.249330 0.276139 0.321716 0.152815 
RATE_MAT:
  -1.024082    0.194429    0.763515    0.066138 
   0.175552   -0.988782    0.074163    0.739067 
   0.591724    0.063656   -0.717861    0.062481 
   0.107909    1.335508    0.131538   -1.574955 
TREE: ((hg18:1.38418e-16,(mm9:0.148352,rn4:0.142487):0.148352):0.188016,canFam2:0.188016);
Extracting sufficient statistics ...
Compacting sufficient statistics ...
Fitting tree model to alignment using REV (with rate variation) ...
numpar = 14
Done.  log(likelihood) = 593439.111981 numeval=1591
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
NRATECATS: 3
RATE_CONSTS: 1.000000 0.000000 0.000000 
RATE_WEIGHTS: 0.374809 0.311231 0.313960 
TRAINING_LNL: 593439.111981
BACKGROUND: 0.328209 0.212004 0.190436 0.269352 
RATE_MAT:
  -1.440369    0.176546    0.425040    0.838783 
   0.273316   -0.353648    0.045212    0.035120 
   0.732539    0.050333   -0.836580    0.053707 
   1.022068    0.027643    0.037972   -1.087682 
TREE: ((hg18:1.30764e+06,(mm9:3.89557e-13,rn4:635199):214870):813996,canFam2:813996);
Extracting sufficient statistics ...
Compacting sufficient statistics ...
Fitting tree model to alignment using REV (with rate variation) ...
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
NRATECATS: 3
RATE_CONSTS: 10.000000 5.000000 1.000000 
RATE_WEIGHTS: -0.000246 0.030657 0.969589 
TRAINING_LNL: -149906.317945
BACKGROUND: 0.328209 0.212004 0.190436 0.269352 
RATE_MAT:
  -0.800149    0.144451    0.523022    0.132676 
   0.223629   -1.135865    0.156648    0.755589 
   0.901408    0.174389   -1.306774    0.230977 
   0.161667    0.594717    0.163304   -0.919688 
TREE: ((hg18:0.114939,(mm9:0.0654673,rn4:0.0691082):0.298064):0.10845,canFam2:0.10845);
[1] 0.3282085 0.2120040 0.1904358 0.2693517
[1] 0.3282085 0.2120040 0.1904358 0.2693517
[1] 0.2690269 0.2485048 0.2279384 0.2545299
[1] 0.3282085 0.2120040 0.1904358 0.2693517
[1] 0.2762342 0.2419623 0.2250745 0.2567290
[1] 0.3282085 0.2120040 0.1904358 0.2693517