estimate.evolutionary.model: Wrapper function to find best (out of BM, OU, OUOU, OUBM)...

View source: R/evolmodelest.R

estimate.evolutionary.modelR Documentation

Wrapper function to find best (out of BM, OU, OUOU, OUBM) fitting evolutionary model and estimate its parameters.

Description

The estimate.evolutionary.model function calls the BrownianMotionModel, ouchModel and mvslouchModel functions with different classes of evolutionary model parameters. It then compares the resulting estimates by the AICc (or BIC if AICc fails) and returns the best overall model. The user is recommended to install the suggested package PCMBaseCpp which significantly speeds up the calculations (see Details).

Usage

estimate.evolutionary.model(phyltree, mData, regimes = NULL, 
root.regime = NULL, M.error = NULL, repeats = 5, model.setups = NULL, 
predictors = NULL, kY = NULL, doPrint = FALSE, pESS=NULL, 
estimate.root.state=FALSE, min_bl = 0.0003, maxiter=c(10,50,100))

Arguments

phyltree

The phylogeny in phylo format. The tree can be obtained from e.g. a nexus file by the read.nexus() function from the ape package. The "standard" ape node indexing is assumed: for a tree with n tips, the tips should have indices 1:n and the root index n+1.

mData

A matrix with the rows corresponding to the tip species while the columns correspond to the traits. The rows should be named by species
(field phyltree$tip.label), if not, then a warning is thrown and the order of the species is assumed to be the same as the order in which the species are in the phylogeny (i.e. correspond to the node indices 1:n, where n is the number of tips). The columns should be named by traits, otherwise a warning is thrown and generic names are generated.

regimes

A vector or list of regimes. If vector then each entry corresponds to each of phyltree's branches, i.e. to each row of phyltree$edge.If list then each list entry corresponds to a tip node and is a vector for regimes on that lineage. If NULL, then a constant regime is assumed on the whole tree.

root.regime

The regime at the root of the tree. If not given, then it is taken as the regime that is present on the root's daughter lineages and is the most frequent one in the regimes vector. If more than one regime has the same maximum frequency, then alphabetically first one of the maximum ones is taken.

M.error

An optional measurement error covariance structure. The measurement errors between species are assumed independent. The program tries to recognize the structure of the passed matrix and accepts the following possibilities :

  • a single number that is a common measurement error for all tips and species,

  • a m element vector with each value corresponding to a variable, measurement errors are independent between variables and each species is assumed to have the same measurement errors,

  • a m x m ((number of variables) x (number of variables)) matrix, all species will have the same measurement error,

  • a list of length n (number of species), each list element is the covariance structure for the appropriate (numbering according to tree) species, either a single number (each variable has same variance), vector (of length m for each variable), or m x m matrix, the order of the list has to correspond to the order of the nodes in the phyltree object,

  • NULL no measurement error.

From version 2.0.0 of mvSLOUCH it is impossible to pass a single joint measurement error matrix for all the species and traits.

repeats

How many starting points for the numerical maximum likelihood procedure should be tried for each model setup. On the first repeat for OUOU and OUBM modes the functions takes as the starting point (for A and Syy) values based on the sample covariance matrix estimate, motivated by Bartoszek \& Sagitov (2015)'s univariate results.

model.setups

What models to try when searching for the best evolutionary model. This field may remain NULL, in this situation the function generates using

.generate.basic.model.setups() a basic list of models. Allowed values are

"basic"

A basic list of models to try out is generated, defined using

.generate.basic.model.setups(). This list should be usually enough.

"fundamental"

A slightly extended list of models to try out is generated, defined using .generate.fund.model.setups(). Compared to "basic" a few more models are added.

"extended"

An extension of the "fundamental" list of models to try out. Defined using .generate.ext.model.setups() which at the moment calls
generate.model.setups().

"all"

All possible models are generated, using
.generate.all.model.setups(). Running it will take an intolerable
amount of time.

Alternatively the user is also free to provide their own list of models in this variable. Each element of the list is a list with fields.

evolmodel

The evolutionary model, it may take one of the three values "bm" (Brownian motion model), "ouch" (OUOU model), "mvslouch"
(OUBM model).

Atype

The class of the A matrix, ignored if evolmodel equals "bm". Otherwise it can take one of the following values: "SingleValueDiagonal", "Diagonal", "UpperTri", "LowerTri", "SymmetricPositiveDefinite", "Symmetric", "DecomposablePositive", "DecomposableNegative",
"DecomposableReal", "Invertible", "Any".

Syytype

The class of the Syy matrix, ignored if evolmodel equals "bm". Otherwise it can take one of the following values: "SingleValueDiagonal", "Diagonal", "UpperTri", "LowerTri", "Symmetric", "Any".

diagA

Should the diagonal of A be forced to be positive ("Positive"), negative ("Negative") or the sign free to vary (NULL). However, setting this to a non-NULL value when evolmodel is "mvslouch" might be (but simulations concerning this are not conclusive) slightly detrimental to the optimization process if Atype is "DecomposablePositive", "DecomposableNegative", or "DecomposableReal". In these cases A is parametrized by its eigendecomposition. Additional exponentiation of the diagonal, to ensure positivity, could (but this is uncertain) make the exploration of the likelihood surface more difficult. The user is advised to also try diag=NULL. In the case of Atype being "SymmetricPositiveDefinite", the diagonal is always guaranteed to be positive.

signsA

WARNING: ONLY use this if you know what you are doing. Ignored if evolmodel equals "bm". This allows the user to specify which elements of A are to be positive, negative or equal to specific values.See ouchModel and mvslouchModel for a more specific description and important warnings.

signsSyy

WARNING: ONLY use this if you know what you are doing. Ignored if evolmodel equals "bm". This allows the user to specify which elements of Syy are to be positive, negative or equal to specific values. See ouchModel and mvslouchModel for a more specific description and important warnings.

signsB

WARNING: ONLY use this if you know what you are doing. Ignored if evolmodel does not equals "mvslouch". This allows the user to specify which elements of B are to be positive, negative or equal to specific values. See mvslouchModel for a more specific description and important warnings.

signsmPsi

WARNING: ONLY use this if you know what you are doing. Ignored if evolmodel equals "bm". This allows the user to specify which elements of mPsi are to be positive, negative or equal to specific values.See ouchModel and mvslouchModel for a more specific description and important warnings.

signsvY0

WARNING: ONLY use this if you know what you are doing. Ignored if evolmodel equals "bm". This allows the user to specify which elements of vY0 are to be positive, negative or equal to specific values.See ouchModel and mvslouchModel for a more specific description and important warnings.

start_point_for_optim

A named list with starting parameters for of the parameters for be optimized by optim(), currently only A and Syy for
evolmodel equalling "ouch" or "mvslouch". One may provide both or only one of them. Make sure that the parameter is consistent with the other parameter restrictions as no check is done and this can result in undefined behaviour. For example one may provide this as (provided dimensions and other parameter restrictions agree)

start_point_for_optim=list(A=rbind(c(2,0),(0,4)), 
Syy=rbind(c(1,0.5),c(0,2))).
parscale

A vector to calculate the parscale argument for optim. It is a named vector with 3 entries, e.g.
c("parscale_A"=3,"logparscale_A"=5,"logparscale_other"=1).
The entry parscale_A is the scale for entries of the A matrix,
logparscale_A is the scale for entries of the A matrix that are optimized over on the logarithmic scale, e.g. if eigenvalues are assumed to be positive, then optimization is done over log(eigenvalue) for A's eigendecomposition and logparscale_other is the scale for entries other then of A that are done on the logarithmic scale (e.g. Syy's diagonal, or other entries indicated as positive via parameter_signs). If not provided (or if a name of the vector is misspelled), then made equal to the example value provided above. For other elements, then mentioned above, that are optimized over by optim(), 1 is used for optim()'s parscale. It is advised that the user experiments with a couple of different values and reads optim's man page.

estimateBmethod

Only relevant for the OUBM models (optional), should B be estimated by maximum likelihood (default if not provided) value "ML" or generalized least squares (value "GLS").

A minimum example list is list(list(evolmodel="bm")). The functions that automatically generate different types of models do NOT use any of the "signs" parameters. Hence, in these models all parameters (under the appropriate parametrization) will be free to vary.

predictors

A vector giving the numbers of the columns from dfdata which are to be considered predictor ones, i.e. conditioned on in the program output. A vector giving the numbers of the columns from mData matrix which are to be considered predictor ones, i.e. conditioned on in the program output. If not provided then in for the OUBM model the columns (kY+1):ncol(mData), i.e. the "BM" ones, are treated as predictors. Otherwise, none will be considered to be predictors.

kY

Number of "Y" (response) variables, for the OUBM models. The first kY columns of mY are the "OU" ones, while the rest the "BM" ones. In more detail this value determines the number of columns of the mData matrix to treat as response variables ("OU" ones). For example, a value of 1 means that only the first column is treated as a response variable, while a value of 3 means the first three columns are treated as response variables. Any predictor variables ("BM" ones) the user is interested in setting for a particular model should therefore be placed in the final columns of the mData matrix, allowing for selecting select kY columns before this as response variables ("OU" ones).

doPrint

Should the function print out information on what it is doing (TRUE) or keep silent (default FALSE).

pESS

Should the function also find the best model taking into account the phylogenetic effective sample size and it so what method. If NULL, then do not take this into account. Otherwise one of "reg" ("regression" effective sample size that takes into account all of the correlations between species explicitly), "mean" (mean value effective sample size 1^{T}R^{-1}1, where R is the interspecies correlation matrix), "MI" (mutual information effective sample size), "mvreg"(multivariate version of "regression" effective sample size when each species is described by a suite of traits) , "mvMI" (multivariate mutual information effective sample size when each species is described by a suite of traits) indicating the way to calculate the pESS. The default (NULL) is not to do any pESS calculations as these will be slow. They require the construction of the between-species-between-traits variance covariance matrix and hence do not fully take advantage of the speed-up offered by PCMBase. If pESS="only_calculate", then all possible pESS values are calculated but no model selection is done based on them.

estimate.root.state

Should the root state be estimate TRUE (not recommended) or set at the optimum FALSE (recommended). Root state estimation is usually unreliable hence if fossil measurements are available prediction based on them and the estimated model will probably be more accurate. If there is only one regime, then estimation of the root state separately is impossible and will not be allowed.

min_bl

Value to which PCMBase's PCMBase.Threshold.Skip.Singular should be set. It indicates that branches of length shorter than min_bl should be skipped in likelihood calculations. Short branches can result in singular covariance matrices for the transition density along a branch. The user should adjust this value if a lot of warnings are raised by PCMBase about singularities during the likelihood calculations. Furthermore, mvSLOUCH sets all branches in the tree shorter than min_bl to min_bl. However, this does not concern tip branches-these cannot be skipped and hence should be long enough so that numerical issues are not raised.

maxiter

The maximum number of iterations for different components of the estimation algorithm. A vector of three integers. The first is the number of iterations for phylogenetic GLS evaluations, i.e. conditional on the other parameters, the regime optima, perhaps B, and perhaps initial state are estimated by a phylogenetic GLS procedure. After this the other (except of B in OUBM model case) parameters are optimized over by optim(). This first entry controls the number of iterations of this procedure. The second is the number of iterations inside the iterated GLS for the OUBM model. In the first step regime optima and B (and perhaps initial state) are estimated conditional on the other parameters and current estimate of B, then the estimate of B is update and the same phylogenetic GLS is repeated (second entry of maxiter number of times). Finally, the third is the value of maxiter passed to optim(), apart from the optimization in the Brownian motion and measurement error case.

Details

The likelihood calculations are done by the PCMBase package. However, there is a C++ backend, PCMBaseCpp. If it is not available, then the likelihood is calculated slower using pure R. However, with the calculations in C++ up to a 100-fold increase in speed is possible (more realistically 10-20 times). The PCMBaseCpp package is available from https://github.com/venelin/PCMBaseCpp.

The setting Atype="Any" means that one assumes the matrix A is eigendecomposable. If the estimation algorithm hits a defective A, then it sets the log-likelihood at the minimum value and will try to get out of this dip.

If model.setups is left at the default value the function will take a long time to run, as it performs estimation for each model (generate.model.setups generates 90 setups) times the value in repeats. Therefore if the user has particular hypotheses in mind then it is advisable to prepare their own list. If the Syy matrix is assumed to be upper-triangular and the starting conditions based on Bartoszek \& Sagitov (2015)'s results are used then the factorization of \Sigma=\Sigma_{yy} \Sigma_{yy}^{T} into \Sigma_{yy} is done using the procedure described in https://math.stackexchange.com/questions/2039477/cholesky-decompostion-upper-triangular-or-lower-triangular.

From version 2.0.0 of mvSLOUCH the data has to be passed as a matrix. To underline this the data parameter's name has been changed to mData.

If AICc fails, then the function will use BIC to select between models. This is extremely unlikely essentially only when AICc is infinite, i.e. the model is saturated (number of observations equals number of data points).

Value

A list is returned that describes the results of the search. See the help for BrownianMotionModel, ouchModel and mvslouchModel for the description of the lower level entries. The elements of this list are the following

BestModel

The resulting best model found. Included are the model parameters, a "first-glance" qualitative description of the model, the most important parameters of the process (half-lives and regressions in the case of OU models) and what to call to obtain standard errors. It takes a long time to obtain them so calculating them is not part of the standard procedure.

BestModelESS

Only if pESS was TRUE. The resulting best model found taking into account the phylogenetic essential sample size. Included are the model parameters, a "first-glance" qualitative description of the model, the most important parameters of the process (half-lives and regressions in the case of OU models) and what to call to obtain bootstrap confidence intervals. It takes a long time to obtain them so calculating them is not part of the standard procedure.

testedModels

A list of results for each tried model.

model.setups

A list of models tried.

repeats

How many starting points were tried per model.

Note

The engine behind the likelihood calculations is called from PCMBase. The slouch package is a recommended alternative if one has a OUBM models and only a single response (Y) trait. The mvMORPH, ouch and Rphylpars packages consider multivariate OU models and looking at them could be helpful.

Author(s)

Krzysztof Bartoszek

References

Ane, C. (2008) Analysis of comparative data with hierarchical autocorrelation. Annals of Applied Statistics 2:1078-1102.

Bartoszek, K. (2016) Phylogenetic effective sample size. Journal of Theoretical Biology 407:371-386.

Bartoszek, K. and Fuentes-Gonzalez, J. and Mitov, V. and Pienaar, J. and Piwczynski, M. and Puchalka, R. and Spalik, K. and Voje, K. L. (2023) Model Selection Performance in Phylogenetic Comparative Methods Under Multivariate Ornstein-Uhlenbeck Models of Trait Evolution, Systematic Biology 72(2):275-293.

Bartoszek, K. and Pienaar, J. and Mostad. P. and Andersson, S. and Hansen, T. F. (2012) A phylogenetic comparative method for studying multivariate adaptation. Journal of Theoretical Biology 314:204-215.

Bartoszek, K. and Sagitov, S. (2015) Phylogenetic confidence intervals for the optimal trait value. Journal of Applied Probability 52(4):1115-1132.

Butler, M.A. and A.A. King (2004) Phylogenetic comparative analysis: a modeling approach for adaptive evolution. American Naturalist 164:683-695.

Hansen, T.F. and Pienaar, J. and Orzack, S.H. (2008) A comparative method for studying adaptation to randomly evolving environment. Evolution 62:1965-1977.

Mitov, V. and Bartoszek, K. and Asimomitis, G. and Stadler, T. (2020) Fast likelihood calculation for multivariate Gaussian phylogenetic models with shifts Theoretical Population Biology 131:66-78.

Xiao, H and Bartoszek, K. and Lio P. (2018) Multi–omic analysis of signalling factors in inflammatory comorbidities. BMC Bioinformatics, Proceedings from the 12th International BBCC conference 19:439.

See Also

brown, mvBMBrownianMotionModel, SummarizeBM, simulBMProcPhylTree, hansen, mvOU,
ouchModel, SummarizeOUCH, simulOUCHProcPhylTree, slouch::model.fit, PCMLik,
mvslouchModel, SummarizeMVSLOUCH, simulMVSLOUCHProcPhylTree,
parametric.bootstrap, optim

Examples

RNGversion(min(as.character(getRversion()),"3.6.1"))
set.seed(12345, kind = "Mersenne-Twister", normal.kind = "Inversion")
### We will first simulate a small phylogenetic tree using functions from ape.
### For simulating the tree one could also use alternative functions, e.g. sim.bd.taxa 
### from the TreeSim package
phyltree<-ape::rtree(4)

## The line below is not necessary but advisable for speed
phyltree<-phyltree_paths(phyltree)

### Define a vector of regimes.
regimes<-c("small","small","large","small","large","small")

### Define SDE parameters to be able to simulate data under the OUOU model.
OUOUparameters<-list(vY0=matrix(c(1,-1,0.5),nrow=3,ncol=1),
A=rbind(c(9,0,0),c(0,5,0),c(0,0,1)),mPsi=cbind("small"=c(1,-1,0.5),"large"=c(-1,1,0.5)),
Syy=rbind(c(1,0.25,0.3),c(0,1,0.2),c(0,0,1)))

### Now simulate the data.
OUOUdata<-simulOUCHProcPhylTree(phyltree,OUOUparameters,regimes,NULL)
OUOUdata<-OUOUdata[phyltree$tip.label,,drop=FALSE]

## set up for a trivial, single model setup case (for running time)
## in a real analysis you should carefully choose between what models
## you want to do model selection
model_setups<-list(list(evolmodel="bm"))

### Try to recover the parameters of the OUOU model.
### maxiter here set to minimal working possibility, in reality it should be larger
### e.g. default of c(10,50,100)
estimResults<-estimate.evolutionary.model(phyltree,OUOUdata,regimes=regimes,
root.regime="small",M.error=NULL,repeats=1,model.setups=model_setups,predictors=c(3),
kY=2,doPrint=TRUE,pESS=NULL,maxiter=c(1,1,1))

### After this step you can look at the best estimated model and use the
### parametric.bootstrap() function to obtain bootstrap confidence intervals
RNGversion(as.character(getRversion()))

## Not run:  ##It takes too long to run this
## take a less trivial setup
phyltree<-ape::rtree(5)

## The line below is not necessary but advisable for speed
phyltree<-phyltree_paths(phyltree)

### Define a vector of regimes.
regimes<-c("small","small","large","small","small","large","large","large")

### Define SDE parameters to be able to simulate data under the OUOU model.
OUOUparameters<-list(vY0=matrix(c(1,-1,0.5),nrow=3,ncol=1),
A=rbind(c(9,0,0),c(0,5,0),c(0,0,1)),mPsi=cbind("small"=c(1,-1,0.5),"large"=c(-1,1,0.5)),
Syy=rbind(c(1,0.25,0.3),c(0,1,0.2),c(0,0,1)))

### Now simulate the data.
OUOUdata<-simulOUCHProcPhylTree(phyltree,OUOUparameters,regimes,NULL)
OUOUdata<-OUOUdata[phyltree$tip.label,,drop=FALSE]

## set up for two very simple (for example usage) models to compare between
## in a real analysis you should carefully choose between what models
## you want to do model selection, the default 
## model_setups<-NULL provides a wide selection of models

model_setups<-list(list(evolmodel="bm"),list(evolmodel="ouch",
"Atype"="SingleValueDiagonal","Syytype"="SingleValueDiagonal","diagA"="Positive"))

### Try to recover the parameters of the OUOU model.
estimResults<-estimate.evolutionary.model(phyltree,OUOUdata,regimes=regimes,
root.regime="small",M.error=NULL,repeats=3,model.setups=model_setups,predictors=c(3),
kY=2,doPrint=TRUE,pESS=NULL,maxiter=c(10,50,100))

## End(Not run)

mvSLOUCH documentation built on Nov. 21, 2023, 1:08 a.m.