phyloFit: Fit a Phylogenetic model to an alignment...

Description Usage Arguments Value Parameter boundaries Note Author(s) Examples

View source: R/phyloFit.R

Description

Fit a Phylogenetic model to an alignment

Usage

1
2
3
4
5
6
7
phyloFit(msa, tree = NULL, subst.mod = "REV", init.mod = NULL,
  no.opt = c("backgd"), init.backgd.from.data = ifelse(is.null(init.mod),
  TRUE, FALSE), features = NULL, scale.only = FALSE, scale.subtree = NULL,
  nrates = NULL, alpha = 1, rate.constants = NULL, selection = NULL,
  init.random = FALSE, init.parsimony = FALSE, clock = FALSE,
  EM = FALSE, max.EM.its = NULL, precision = "HIGH", ninf.sites = 50,
  quiet = FALSE, bound = NULL, log.file = FALSE)

Arguments

msa

An alignment object. May be altered if passed in as a pointer to C memory (see Note).

tree

A character string containing a Newick formatted tree defining the topology. Required if the number of species > 3, unless init.mod is specified. The topology must be rooted, although the root is ignored if the substitution model is reversible.

subst.mod

The substitution model to use. Some possible models include "REV", "JC69", "K80", "F81", "HKY85", "R2", "U2". Run subst.mods() for a full list; some models are experimental.

init.mod

An object of class tm used to initialize the model.

no.opt

A character vector indicating which parameters NOT to optimize (instead hold constant at their initial values). By default, the equilibrium frequencies (backgd) are not optimized. Other parameters that may be indicated here are "ratematrix" for the entire rate matrix, "kappa" for models with transition/transversion ratios, "branches" to hold all branch lengths constant, "ratevar" for rate variation parameters, "scale" for the tree scaling factor, and "scale_sub" for the subtree scaling factor. This argument does NOT apply to parameters of a lineage-specific model created with add.ls.mod, though such parameters can be held constant by using appropriate arguments when the model is created. See add.ls.mod for more details about lineage-specific models.

init.backgd.from.data

A logical value; can be FALSE only if init.mod is provided. If TRUE, use observed base frequencies in data to initialize equilibrium frequencies. Otherwise use the values from init.mod. By default uses init.mod values if provided.

features

An object of type feat. If given, a separate model will be estimated for each feature type.

scale.only

A logical value. If TRUE, estimate only the scale of the tree. Branches will be held at initial values. Useful in conjunction with init.mod.

scale.subtree

A character string giving the name of a node in a tree. This option implies scale.only=TRUE. If given, estimate separate scale factors for subtree beneath identified node and the rest of the tree. The branch leading to the subtree is included in the subtree.

nrates

An integer. The number of rate categories to use. Specifying a value greater than one causes the discrete gamma model for rate variation to be used, unless rate constants are specified. The default value NULL implies a single rate category.

alpha

A numeric value > 0, for use with "nrates". Initial value for alpha, the shape parameter of the gamma distribution.

rate.constants

A numeric vector. Implies nrates = length(rate.constants). Also implies EM=TRUE. Uses a non-parametric mixture model for rates, instead of a gamma distribution. The weight associated with each rate will be estimated. alpha may still be used to initialize these weights.

selection

A numeric value. If provided, use selection in the model. The value given will be the initial value for selection. If NULL, selection will not be used unless init.mod is provided and indicates a model with selection. selection scales the rate matrix by s/(1-exp(-s)). Selection is applied after the rate matrix is scaled so that the expected number of substitutions per unit time is 1. When using codon models, selection only scales nonsynonymous substitutions.

init.random

A logical value. If TRUE, parameters will be initialized randomly.

init.parsimony

A logical value. If TRUE, branch lengths will be estimated based on parsimony counts for the alignments. Currently only works for models of order0.

clock

A logical value. If TRUE, assume a molecular clock in estimation.

EM

A logical value. If TRUE, the model is fit using EM rather than the default BFGS quasi-Newton algorithm. Not available for all models/options.

max.EM.its

An integer value; only applies if EM==TRUE. The maximum number of EM iterations to perform. The EM algorithm may quit earlier if other convergence criteria are met.

precision

A character vector, one of "HIGH", "MED", or "LOW", denoting the level of precision to use in estimating model parameters. Affects convergence criteria for iterative algorithms: higher precision means more iterations and longer execution time.

ninf.sites

An integer. Require at least this many "informative" sites in order to estimate a model. An informative site as an alignment column with at least two non-gap and non-missing-data characers.

quiet

A logical value. If TRUE, do not report progress to screen.

bound

Defines boundaries for parameters (see Details below).

log.file

If TRUE, write log of optimization to the screen. If a character string, write log of optimization to the named file. Otherwise write no optimization log.

Value

An object of class tm (tree model), or (if several models are computed, as is possible with the features or windows options), a list of objects of class tm.

Parameter boundaries

Boundaries can be set for some parameters using the bound argument. The bound argument should be a vector of character strings, each element defines the boundaries for a single parameter. The boundaries are best explained by example. A value of c("scale[0,1]", "scale_sub[1,]", "kappa[,3]") would imply to keep the scale between 0 and 1, the subtree scale between 1 and infinity, and kappa between 0 and 3. The blank entries in the subtree_scale upper bound and kappa's lower bound indicate not to set this boundary, in which case the normal default boundary will be used for that parameter. (Most parameters are defined between 0 and infinity). Most of the parameters listed in the description of no.opt can also have their boundaries set in this way.

Note

If msa or features object are passed in as pointers to C memory, they may be altered by this function! Use copy.msa(msa) or copy.feat(features) to avoid this behavior!

Author(s)

Melissa J. Hubisz and Adam Siepel

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
exampleArchive <- system.file("extdata", "examples.zip", package="rphast")
files <- c("ENr334-test.maf", "ENr334-100k.fa", "ENr334-test.gff", "rev.mod")
unzip(exampleArchive, files)
m <- read.msa("ENr334-test.maf")
mod <- phyloFit(m, tree="((hg18, (mm9, rn4)), canFam2)")
mod
phyloFit(m, init.mod=mod)
likelihood.msa(m, mod)
mod$likelihood
print(mod$likelihood, digits=10)
f <- read.feat("ENr334-test.gff")
mod <- phyloFit(m, tree="((hg18, (mm9, rn4)), canFam2)",
                features=f, quiet=TRUE)
names(mod)
mod$other
mod[["5'flank"]]
phyloFit(m, init.mod=mod$AR, nrates=3, alpha=4.0)
phyloFit(m, init.mod=mod$AR, rate.constants=c(10, 5, 1))
# background frequencies options

# this should use the background frequencies from the initial mod
phyloFit(m, init.mod=mod$AR, quiet=TRUE)$backgd
mod$AR$backgd

# this should use the background frequencies from the data
phyloFit(m, init.mod=mod$AR, init.backgd.from.data=TRUE, quiet=TRUE)$backgd
mod$AR$backgd

# this should optimize the background frequencies
phyloFit(m, init.mod=mod$AR, no.opt=NULL, quiet=TRUE)$backgd
mod$AR$backgd

unlink(files)

Example output

Extracting sufficient statistics ...
Compacting sufficient statistics ...
Fitting tree model to alignment using REV ...
numpar = 11
Done.  log(likelihood) = -149094.320182 numeval=331
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
TRAINING_LNL: -149094.320182
BACKGROUND: 0.269027 0.248505 0.227938 0.254530 
RATE_MAT:
  -0.942590    0.176447    0.624375    0.141767 
   0.191019   -0.996540    0.151541    0.653980 
   0.736926    0.165214   -1.105011    0.202871 
   0.149842    0.638500    0.181676   -0.970018 
TREE: ((hg18:0.112696,(mm9:0.0657572,rn4:0.0698817):0.292789):0.105601,canFam2:0.105601);
Extracting sufficient statistics ...
Compacting sufficient statistics ...
Fitting tree model to alignment using REV ...
numpar = 11
Done.  log(likelihood) = -149094.320181 numeval=29
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
TRAINING_LNL: -149094.320181
BACKGROUND: 0.269027 0.248505 0.227938 0.254530 
RATE_MAT:
  -0.942590    0.176447    0.624375    0.141768 
   0.191019   -0.996540    0.151541    0.653980 
   0.736926    0.165214   -1.105011    0.202871 
   0.149842    0.638500    0.181676   -0.970018 
TREE: ((hg18:0.112696,(mm9:0.0657604,rn4:0.0698818):0.29279):0.105599,canFam2:0.105599);
[1] -149094.3
[1] -149094.3
[1] -149094.3202
Skipping alignment (category 7); insufficient informative sites ...
[1] "background" "AR"         "other"      "5'flank"    "5'UTR"     
[6] "CDS"        "intron"     "3'flank"    "phastCons" 
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
TRAINING_LNL: -75146.750332
BACKGROUND: 0.254259 0.252749 0.251373 0.241619 
RATE_MAT:
  -1.001616    0.188908    0.669705    0.143004 
   0.190036   -0.969975    0.153607    0.626332 
   0.677394    0.154448   -1.023594    0.191752 
   0.150484    0.655184    0.199493   -1.005161 
TREE: ((hg18:0.118571,(mm9:0.0677486,rn4:0.0729959):0.347226):0.112422,canFam2:0.112422);
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
TRAINING_LNL: -408.370000
BACKGROUND: 0.249330 0.276139 0.321716 0.152815 
RATE_MAT:
  -1.024082    0.194429    0.763515    0.066138 
   0.175552   -0.988782    0.074163    0.739067 
   0.591724    0.063656   -0.717861    0.062481 
   0.107909    1.335508    0.131538   -1.574955 
TREE: ((hg18:1.38418e-16,(mm9:0.148352,rn4:0.142487):0.148352):0.188016,canFam2:0.188016);
Extracting sufficient statistics ...
Compacting sufficient statistics ...
Fitting tree model to alignment using REV (with rate variation) ...
numpar = 14
Done.  log(likelihood) = 593439.111981 numeval=1591
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
NRATECATS: 3
RATE_CONSTS: 1.000000 0.000000 0.000000 
RATE_WEIGHTS: 0.374809 0.311231 0.313960 
TRAINING_LNL: 593439.111981
BACKGROUND: 0.328209 0.212004 0.190436 0.269352 
RATE_MAT:
  -1.440369    0.176546    0.425040    0.838783 
   0.273316   -0.353648    0.045212    0.035120 
   0.732539    0.050333   -0.836580    0.053707 
   1.022068    0.027643    0.037972   -1.087682 
TREE: ((hg18:1.30764e+06,(mm9:3.89557e-13,rn4:635199):214870):813996,canFam2:813996);
Extracting sufficient statistics ...
Compacting sufficient statistics ...
Fitting tree model to alignment using REV (with rate variation) ...
ALPHABET: A C G T 
ORDER: 0
SUBST_MOD: REV
NRATECATS: 3
RATE_CONSTS: 10.000000 5.000000 1.000000 
RATE_WEIGHTS: -0.000246 0.030657 0.969589 
TRAINING_LNL: -149906.317945
BACKGROUND: 0.328209 0.212004 0.190436 0.269352 
RATE_MAT:
  -0.800149    0.144451    0.523022    0.132676 
   0.223629   -1.135865    0.156648    0.755589 
   0.901408    0.174389   -1.306774    0.230977 
   0.161667    0.594717    0.163304   -0.919688 
TREE: ((hg18:0.114939,(mm9:0.0654673,rn4:0.0691082):0.298064):0.10845,canFam2:0.10845);
[1] 0.3282085 0.2120040 0.1904358 0.2693517
[1] 0.3282085 0.2120040 0.1904358 0.2693517
[1] 0.2690269 0.2485048 0.2279384 0.2545299
[1] 0.3282085 0.2120040 0.1904358 0.2693517
[1] 0.2762342 0.2419623 0.2250745 0.2567290
[1] 0.3282085 0.2120040 0.1904358 0.2693517

rphast documentation built on May 1, 2019, 9:26 p.m.