estimate.evolutionary.model  R Documentation 
Wrapper function to find best (out of BM, OU, OUOU, OUBM) fitting evolutionary model and
estimate its parameters.
Description
The estimate.evolutionary.model
function calls the
BrownianMotionModel
, ouchModel
and mvslouchModel
functions with different classes of evolutionary model parameters.
It then compares the resulting estimates by the AICc (or BIC if AICc fails) and
returns the best overall model. The user is recommended to install the suggested package
PCMBaseCpp which significantly speeds up the calculations (see Details).
Usage
estimate.evolutionary.model(phyltree, mData, regimes = NULL,
root.regime = NULL, M.error = NULL, repeats = 5, model.setups = NULL,
predictors = NULL, kY = NULL, doPrint = FALSE, pESS=NULL,
estimate.root.state=FALSE, min_bl = 0.0003, maxiter=c(10,50,100))
Arguments
phyltree 
The phylogeny in phylo format. The tree can be obtained from e.g. a nexus file
by the read.nexus() function from the ape package. The "standard" ape node
indexing is assumed: for a tree with n tips, the tips should have indices 1:n
and the root index n+1 .

mData 
A matrix with the rows corresponding to the tip species while the columns correspond to the traits.
The rows should be named by species (field phyltree$tip.label ), if not, then a warning
is thrown and the order of the species is assumed to be the same as the order in which the
species are in the phylogeny (i.e. correspond to the node indices 1:n ,
where n is the number of tips). The columns should be named by traits, otherwise
a warning is thrown and generic names are generated.

regimes 
A vector or list of regimes. If vector then each entry corresponds to each of phyltree 's
branches, i.e. to each row of phyltree$edge .If list then each list entry corresponds to a tip node
and is a vector for regimes on that lineage. If NULL , then a constant regime is assumed on
the whole tree.

root.regime 
The regime at the root of the tree. If not given, then it is taken as the regime that is present
on the root's daughter lineages and is the most frequent one in the regimes vector. If more
than one regime has the same maximum frequency, then alphabetically first one of the maximum ones
is taken.

M.error 
An optional measurement error covariance structure. The measurement errors between species are assumed
independent. The program tries to recognize the structure of the passed matrix and accepts the following
possibilities :
a single number that is a common measurement error for all tips and species,
a m element vector with each value corresponding to a variable, measurement errors are independent
between variables and each species is assumed to have the same measurement errors,
a m x m ((number of variables) x (number of variables)) matrix, all species will
have the same measurement error,
a list of length n (number of species), each list element is the covariance structure
for the appropriate (numbering according to tree) species, either a single number
(each variable has same variance), vector (of length m for each variable), or m x m matrix,
the order of the list has to correspond to the order of the nodes in the phyltree object,
NULL no measurement error.
From version 2.0.0 of mvSLOUCH it is impossible to pass a single joint measurement error matrix
for all the species and traits.

repeats 
How many starting points for the numerical maximum likelihood procedure should be tried for
each model setup. On the first repeat for OUOU and OUBM modes the functions takes as the starting
point (for A and Syy ) values based on the sample covariance matrix estimate,
motivated by Bartoszek \& Sagitov (2015)'s univariate results.

model.setups 
What models to try when searching for the best evolutionary model.
This field may remain NULL , in this situation the function
generates using
.generate.basic.model.setups() a basic list of models.
Allowed values are
"basic" A basic list of models to try out is generated, defined using
.generate.basic.model.setups() . This list should be usually enough.
"fundamental" A slightly extended list of models to try out is generated,
defined using .generate.fund.model.setups() . Compared to "basic" a few more
models are added.
"extended" An extension of the "fundamental" list of models to try out.
Defined using .generate.ext.model.setups() which at the moment calls
generate.model.setups() .
"all" All possible models are generated, using
.generate.all.model.setups() . Running it will take an intolerable amount of time.
Alternatively the user is also free to provide their own list of models in this variable.
Each element of the list is a list with fields.
evolmodel The evolutionary model, it may take one of the three values "bm"
(Brownian motion model), "ouch" (OUOU model), "mvslouch" (OUBM model).
Atype The class of the A matrix, ignored if evolmodel equals
"bm" . Otherwise it can take one of the following values:
"SingleValueDiagonal" , "Diagonal" ,
"UpperTri" , "LowerTri" , "SymmetricPositiveDefinite" ,
"Symmetric" , "DecomposablePositive" , "DecomposableNegative" ,
"DecomposableReal" , "Invertible" , "Any" .
Syytype The class of the Syy matrix, ignored if evolmodel equals
"bm" . Otherwise it can take one of the following values:
"SingleValueDiagonal" , "Diagonal" ,
"UpperTri" , "LowerTri" , "Symmetric" , "Any" .
diagA Should the diagonal of A be forced to be positive ("Positive" ),
negative ("Negative" ) or the sign free to vary (NULL ). However, setting this to
a nonNULL value when evolmodel is "mvslouch" might be
(but simulations concerning this are not conclusive) slightly detrimental to the optimization process
if Atype is "DecomposablePositive" , "DecomposableNegative" , or
"DecomposableReal" . In these cases A is parametrized by its eigendecomposition.
Additional exponentiation of the diagonal, to ensure positivity, could (but this is uncertain)
make the exploration of the likelihood surface more difficult. The user is advised to
also try diag=NULL . In the case of Atype being "SymmetricPositiveDefinite" ,
the diagonal is always guaranteed to be positive.
signsA WARNING: ONLY use this if you know what you are doing. Ignored if evolmodel
equals "bm" . This allows the user to specify which elements of A are to be positive, negative
or equal to specific values.See ouchModel and mvslouchModel for a more specific
description and important warnings.
signsSyy WARNING: ONLY use this if you know what you are doing. Ignored if evolmodel
equals "bm" . This allows the user to specify which elements of Syy are to be positive, negative
or equal to specific values. See ouchModel and mvslouchModel for a more specific
description and important warnings.
signsB WARNING: ONLY use this if you know what you are doing. Ignored if evolmodel
does not equals "mvslouch" . This allows the user to specify which elements of B are to be
positive, negative or equal to specific values. See mvslouchModel
for a more specific description and important warnings.
signsmPsi WARNING: ONLY use this if you know what you are doing. Ignored if evolmodel
equals "bm" . This allows the user to specify which elements of mPsi are to be positive, negative
or equal to specific values.See ouchModel and mvslouchModel for a more specific
description and important warnings.
signsvY0 WARNING: ONLY use this if you know what you are doing. Ignored if evolmodel
equals "bm" . This allows the user to specify which elements of vY0 are to be positive, negative
or equal to specific values.See ouchModel and mvslouchModel for a more specific
description and important warnings.
 start_point_for_optim

A named list with starting parameters for of the parameters for be optimized by optim() , currently
only A and Syy for evolmodel equalling "ouch" or "mvslouch" .
One may provide both or only one of them.
Make sure that the parameter is consistent with the other parameter restrictions as no check is done and this
can result in undefined behaviour. For example one may provide this as (provided dimensions and other
parameter restrictions agree)
start_point_for_optim=list(A=rbind(c(2,0),(0,4)),
Syy=rbind(c(1,0.5),c(0,2))).
 parscale

A vector to calculate the parscale argument for optim . It is a named
vector with 3 entries, e.g. c("parscale_A"=3,"logparscale_A"=5,"logparscale_other"=1) .
The entry parscale_A is the scale for entries of the A matrix,
logparscale_A is the scale for entries of the A matrix that are
optimized over on the logarithmic scale, e.g. if eigenvalues are assumed to be positive,
then optimization is done over log(eigenvalue) for A 's eigendecomposition
and logparscale_other is the scale for entries other then of A
that are done on the logarithmic scale (e.g. Syy 's diagonal, or other entries
indicated as positive via parameter_signs ). If not provided (or if a name of the vector
is misspelled), then made equal to the example value provided above. For other elements, then
mentioned above, that are optimized over by optim() , 1 is used for optim() 's
parscale . It is advised that the user experiments with a couple of different values and
reads optim 's man page.
 estimateBmethod
Only relevant for the OUBM models (optional), should B be estimated by maximum likelihood
(default if not provided) value "ML" or generalized least squares (value "GLS" ).
A minimum example list is list(list(evolmodel="bm")) . The functions that automatically generate
different types of models do NOT use any of the "signs" parameters. Hence, in these models
all parameters (under the appropriate parametrization) will be free to vary.

predictors 
A vector giving the numbers of the columns from
dfdata which are to be considered predictor ones, i.e. conditioned
on in the program output.
A vector giving the numbers of the columns from mData matrix which are to be considered predictor ones,
i.e. conditioned on in the program output. If not provided then in for the OUBM model the columns (kY+1):ncol(mData) , i.e. the
"BM" ones, are treated as predictors. Otherwise, none will be considered to be predictors.

kY 
Number of "Y" (response) variables, for the OUBM models.
The first kY
columns of mY are the "OU" ones, while the rest the "BM" ones.
In more detail this value determines the number of columns of the mData matrix to treat as response variables ("OU" ones). For example, a
value of 1 means that only the first column is treated as a response variable, while a value of 3 means the first three
columns are treated as response variables. Any predictor variables ("BM" ones) the user is interested in setting for a particular model
should therefore be placed in the final columns of the mData matrix, allowing for selecting select kY columns before
this as response variables ("OU" ones).

doPrint 
Should the function print out information on what it is doing (TRUE) or keep silent
(default FALSE).

pESS 
Should the function also find the best model taking into account the phylogenetic effective sample size
and it so what method. If NULL , then do not take this into account. Otherwise one
of "reg" ("regression" effective sample size that takes into account all of the
correlations between species explicitly),
"mean" (mean value effective sample size 1^{T}R^{1}1 , where R
is the interspecies correlation matrix),
"MI" (mutual information effective sample size),
"mvreg" (multivariate version of "regression" effective sample size when each species is
described by a suite of traits) ,
"mvMI" (multivariate mutual information effective sample size when each species is described by
a suite of traits) indicating
the way to calculate the pESS. The default (NULL ) is not to do any pESS calculations as these will be slow.
They require the construction of the betweenspeciesbetweentraits variance covariance matrix and
hence do not fully take advantage of the speedup offered by PCMBase. If pESS="only_calculate" ,
then all possible pESS values are calculated but no model selection is done based on them.

estimate.root.state 
Should the root state be estimate TRUE (not recommended) or set at the optimum
FALSE (recommended). Root state estimation is usually unreliable hence
if fossil measurements are available prediction based on them and the estimated
model will probably be more accurate. If there is only one regime, then
estimation of the root state separately is impossible and will not be allowed.

min_bl 
Value to which PCMBase's PCMBase.Threshold.Skip.Singular should be set.
It indicates that branches of length shorter than min_bl should be skipped in likelihood
calculations. Short branches can result in singular covariance matrices for the transition
density along a branch. The user should adjust this value if a lot of warnings are raised by
PCMBase about singularities during the likelihood calculations.
Furthermore, mvSLOUCH sets all branches in the tree shorter than min_bl to min_bl .
However, this does not concern tip branchesthese cannot be skipped and hence should be long
enough so that numerical issues are not raised.

maxiter 
The maximum number of iterations for different components of the estimation
algorithm. A vector of three integers. The first is the number of iterations for phylogenetic
GLS evaluations, i.e. conditional on the other parameters, the regime optima, perhaps B , and perhaps
initial state are estimated by a phylogenetic GLS procedure. After this the other (except of B
in OUBM model case) parameters are optimized over by optim() . This first
entry controls the number of iterations of this procedure. The second is the number of iterations inside
the iterated GLS for the OUBM model. In the first step regime optima and B (and perhaps initial state)
are estimated conditional on the other parameters and current estimate of B , then the estimate of B
is update and the same phylogenetic GLS is repeated (second entry of maxiter number of times).
Finally, the third is the value of maxiter passed to optim() , apart from the optimization
in the Brownian motion and measurement error case.

Details
The likelihood calculations are done by the PCMBase package. However, there is a
C++ backend, PCMBaseCpp. If it is not available, then the likelihood is calculated
slower using pure R. However, with the calculations in C++ up to a 100fold increase in
speed is possible (more realistically 1020 times). The PCMBaseCpp package is available
from https://github.com/venelin/PCMBaseCpp.
The setting Atype="Any"
means that one assumes the matrix A
is eigendecomposable.
If the estimation algorithm hits a defective A
, then it sets the loglikelihood at
the minimum value and will try to get out of this dip.
If model.setups
is left at the default value the
function will take a long time to run, as it performs estimation for each model
(generate.model.setups
generates 90 setups) times the value in repeats.
Therefore if the user has particular hypotheses in mind then it is advisable to
prepare their own list.
If the Syy
matrix is assumed to be uppertriangular and the starting conditions
based on Bartoszek \&
Sagitov (2015)'s results are used then the factorization
of \Sigma=\Sigma_{yy} \Sigma_{yy}^{T}
into \Sigma_{yy}
is
done using the procedure described in
https://math.stackexchange.com/questions/2039477/choleskydecompostionuppertriangularorlowertriangular.
From version 2.0.0
of mvSLOUCH the data has to be passed as a matrix.
To underline this the data parameter's name has been changed to mData
.
If AICc fails, then the function will use BIC to select between models. This is extremely
unlikely essentially only when AICc is infinite, i.e. the model is saturated
(number of observations equals number of data points).
Value
A list is returned that describes the results of the search. See the help for
BrownianMotionModel
, ouchModel
and
mvslouchModel
for the description of the lower level entries.
The elements of this list are the following
BestModel 
The resulting best model found. Included are the model parameters,
a "firstglance" qualitative description of the model, the most important parameters of the
process (halflives and regressions in the case of OU models) and what to call to obtain
standard errors. It takes a long time to obtain them so calculating them is not part of the
standard procedure.

BestModelESS 
Only if pESS was TRUE . The resulting best model found
taking into account the phylogenetic essential sample size. Included are the model parameters,
a "firstglance" qualitative description of the model, the most important parameters of the
process (halflives and regressions in the case of OU models) and what to call to obtain
bootstrap confidence intervals. It takes a long time to obtain them so calculating them is not part of the
standard procedure.

testedModels 
A list of results for each tried model.

model.setups 
A list of models tried.

repeats 
How many starting points were tried per model.

Note
The engine behind the likelihood calculations is called from PCMBase.
The slouch
package is a recommended alternative if one has a OUBM models and
only a single response (Y) trait.
The mvMORPH
, ouch
and Rphylpars
packages consider multivariate OU models
and looking at them could be helpful.
Author(s)
Krzysztof Bartoszek
References
Ane, C. (2008)
Analysis of comparative data with hierarchical autocorrelation.
Annals of Applied Statistics 2:10781102.
Bartoszek, K. (2016)
Phylogenetic effective sample size.
Journal of Theoretical Biology 407:371386.
Bartoszek, K. and FuentesGonzalez, J. and Mitov, V. and Pienaar, J. and Piwczynski, M.
and Puchalka, R. and Spalik, K. and Voje, K. L. (2023)
Model Selection Performance in Phylogenetic Comparative Methods Under Multivariate
OrnsteinUhlenbeck Models of Trait Evolution, Systematic Biology 72(2):275293.
Bartoszek, K. and Pienaar, J. and Mostad. P. and Andersson, S. and Hansen, T. F. (2012)
A phylogenetic comparative method for studying multivariate adaptation.
Journal of Theoretical Biology 314:204215.
Bartoszek, K. and Sagitov, S. (2015)
Phylogenetic confidence intervals for the optimal trait value.
Journal of Applied Probability 52(4):11151132.
Butler, M.A. and A.A. King (2004)
Phylogenetic comparative analysis: a modeling approach for adaptive evolution.
American Naturalist 164:683695.
Hansen, T.F. and Pienaar, J. and Orzack, S.H. (2008)
A comparative method for studying adaptation to randomly evolving environment.
Evolution 62:19651977.
Mitov, V. and Bartoszek, K. and Asimomitis, G. and Stadler, T. (2020)
Fast likelihood calculation for multivariate Gaussian phylogenetic models with shifts
Theoretical Population Biology 131:6678.
Xiao, H and Bartoszek, K. and Lio P. (2018)
Multiâ€“omic analysis of signalling factors in inflammatory comorbidities.
BMC Bioinformatics, Proceedings from the 12th International BBCC conference
19:439.
See Also
brown
, mvBM
BrownianMotionModel
, SummarizeBM
,
simulBMProcPhylTree
, hansen
, mvOU
,
ouchModel
,
SummarizeOUCH
, simulOUCHProcPhylTree
,
slouch::model.fit
, PCMLik
,
mvslouchModel
,
SummarizeMVSLOUCH
, simulMVSLOUCHProcPhylTree
,
parametric.bootstrap
,
optim
Examples
RNGversion(min(as.character(getRversion()),"3.6.1"))
set.seed(12345, kind = "MersenneTwister", normal.kind = "Inversion")
### We will first simulate a small phylogenetic tree using functions from ape.
### For simulating the tree one could also use alternative functions, e.g. sim.bd.taxa
### from the TreeSim package
phyltree<ape::rtree(4)
## The line below is not necessary but advisable for speed
phyltree<phyltree_paths(phyltree)
### Define a vector of regimes.
regimes<c("small","small","large","small","large","small")
### Define SDE parameters to be able to simulate data under the OUOU model.
OUOUparameters<list(vY0=matrix(c(1,1,0.5),nrow=3,ncol=1),
A=rbind(c(9,0,0),c(0,5,0),c(0,0,1)),mPsi=cbind("small"=c(1,1,0.5),"large"=c(1,1,0.5)),
Syy=rbind(c(1,0.25,0.3),c(0,1,0.2),c(0,0,1)))
### Now simulate the data.
OUOUdata<simulOUCHProcPhylTree(phyltree,OUOUparameters,regimes,NULL)
OUOUdata<OUOUdata[phyltree$tip.label,,drop=FALSE]
## set up for a trivial, single model setup case (for running time)
## in a real analysis you should carefully choose between what models
## you want to do model selection
model_setups<list(list(evolmodel="bm"))
### Try to recover the parameters of the OUOU model.
### maxiter here set to minimal working possibility, in reality it should be larger
### e.g. default of c(10,50,100)
estimResults<estimate.evolutionary.model(phyltree,OUOUdata,regimes=regimes,
root.regime="small",M.error=NULL,repeats=1,model.setups=model_setups,predictors=c(3),
kY=2,doPrint=TRUE,pESS=NULL,maxiter=c(1,1,1))
### After this step you can look at the best estimated model and use the
### parametric.bootstrap() function to obtain bootstrap confidence intervals
RNGversion(as.character(getRversion()))
## Not run: ##It takes too long to run this
## take a less trivial setup
phyltree<ape::rtree(5)
## The line below is not necessary but advisable for speed
phyltree<phyltree_paths(phyltree)
### Define a vector of regimes.
regimes<c("small","small","large","small","small","large","large","large")
### Define SDE parameters to be able to simulate data under the OUOU model.
OUOUparameters<list(vY0=matrix(c(1,1,0.5),nrow=3,ncol=1),
A=rbind(c(9,0,0),c(0,5,0),c(0,0,1)),mPsi=cbind("small"=c(1,1,0.5),"large"=c(1,1,0.5)),
Syy=rbind(c(1,0.25,0.3),c(0,1,0.2),c(0,0,1)))
### Now simulate the data.
OUOUdata<simulOUCHProcPhylTree(phyltree,OUOUparameters,regimes,NULL)
OUOUdata<OUOUdata[phyltree$tip.label,,drop=FALSE]
## set up for two very simple (for example usage) models to compare between
## in a real analysis you should carefully choose between what models
## you want to do model selection, the default
## model_setups<NULL provides a wide selection of models
model_setups<list(list(evolmodel="bm"),list(evolmodel="ouch",
"Atype"="SingleValueDiagonal","Syytype"="SingleValueDiagonal","diagA"="Positive"))
### Try to recover the parameters of the OUOU model.
estimResults<estimate.evolutionary.model(phyltree,OUOUdata,regimes=regimes,
root.regime="small",M.error=NULL,repeats=3,model.setups=model_setups,predictors=c(3),
kY=2,doPrint=TRUE,pESS=NULL,maxiter=c(10,50,100))
## End(Not run)