corHMMDredgeBase: Hidden Rates Model with regularization with Penalization...

View source: R/corHMMDredge.R

corHMMDredgeBaseR Documentation

Hidden Rates Model with regularization with Penalization Options

Description

corHMMDredgeBase fits a hidden Markov model (HMM) to a given phylogenetic tree and character data. It offers additional options for penalization and optimization compared to the standard corHMM function.

Usage

corHMMDredgeBase(phy, data, rate.cat, root.p="maddfitz", pen.type = "l1", lambda = 1, 
rate.mat=NULL, node.states = "marginal", fixed.nodes=FALSE, ip=NULL, nstarts=0, n.cores=1, 
get.tip.states = FALSE,lewis.asc.bias = FALSE, collapse = FALSE, lower.bound = 1e-10, 
upper.bound = 100, opts=NULL, p=NULL, grad=FALSE)

Arguments

phy

An object of class phylo, representing the phylogenetic tree.

data

A data frame containing character states for the tips of the phylogeny. The first column should match the tip labels in phy, and the second onwards column should contain the observed states.

rate.cat

An integer specifying the number of rate categories in the model.

root.p

A vector of probabilities for the root states or a method to estimate them. The default is the "maddfitz" method.

pen.type

The type of penalization applied to the model. Options include "l1", "l2", "er", and "unreg". See Details.

lambda

A hyper-parameter that adjusts the severity of the penalty, ranging from 0 (no regularization) to 1 (full penalization). Default is 1.

rate.mat

A user-supplied matrix containing indexes of parameters to be optimized. If NULL, an all rates different model is estimated.

node.states

A method for estimating node states. Options include "marginal", "joint", and "none".

fixed.nodes

Specifies that states for nodes in the phylogeny are assumed fixed. These are supplied as node labels in the "phylo" object. Default is FALSE.

ip

Initial values used for the likelihood search. Can be a single value or a vector of unique values for each parameter. The default is ip=1.

nstarts

The number of random restarts to be performed. The default is nstarts=0.

n.cores

The number of processor cores to spread out the random restarts.

get.tip.states

Logical value indicating whether tip reconstructions should be output. The default is FALSE.

lewis.asc.bias

Logical value indicating whether to correct for observing a dataset that is not univariate. The default is FALSE

collapse

A logical value indicating whether to collapse branches with no variation in states. Default is FALSE.

lower.bound

The lower bound for the rate parameters during optimization. Default is 1e-10.

upper.bound

The upper bound for the rate parameters during optimization. Default is 100.

opts

A list of options to pass to nloptr.

p

A vector of transition rates. Allows the user to calculate the likelihood given a specified set of parameter values to specified as fixed and calculate the likelihood.

grad

A logical value indicating whether to use gradient-based optimization. Default is FALSE.

Details

There are 3 penalty types available for users (though this may be expanded in the future). They are l1, l2, and er. The first two penalty types are analagous to lasso and ridge regression. Whereas the er penalization is analagous to Zhou et al.'s (2024) McLasso and penalizes the distance between rate values (i.e., it prefers rate matrices that are closer to an equal-rates model).

Under an l1 regularization scheme, the likelihood is penalized by the mean transition rate. The mean is used instead of the sum because a sum would overly penalize more complex models by virtue of them having more parameters. This leads to scenarios where simpler models have much better likelihoods than more complex models

Under an l2 regularization scheme, the likleihood is penalized by the squared mean transition rate.

Under an er regularization scheme, the likleihood is penalized by the average distance between parameters. If all the parameters were the same (an equal rates model), the penalty would be 0.

A user can also set pen.type = 'unreg' for an unpenalized likelihood which would be identical to the original implementation of corHMM. More work needs to be done to determine when to use each type of penalization, but generally using any penalization will be good for smaller datasets which tend to be high variance. l2 is the most aggresive penalization, shrinking paramaters more quickly than other methods and leading to more dropped (and potentially finding more unecessary) parameters. er is the most similar to an unregularized model as it does not necessarily penalize high parameter values. It will however penalize a model that has one parameter that is much higher than the rest unless there is significant evidence that this outlier parameter is needed. In practice, er, behaves very similarly to unregularized models. l1 regularization is an intermediate penalization between l2 and er.

The grad option employs a numerical gradient for the optimization. This is a particularly inefficient way to find a gradient as it will require at least k iterations per likelihood calculation. However, I have found that this improves the search stability and speed as the number of iterations is greatly reduced when a gradient is provided. This is also important in cases where there are a lot of parameters (k is large). In these cases the parameter space is so highly dimensional that many search algorithms struggle. In the future I hope to implement a more efficient gradient calculation and combine a gradient based local optimizaiton with a global optimization scheme.

*Note: Many details of corHMMDredgeBase and corHMM are the same and will not be repeated here. If an arguement is unclear check the Details section of corHMM. The focus of these details will be on the unique aspects of the dredge fitting approach.

Value

corHMM returns an object of class corHMM. This is a list with elements:

$loglik

the maximum negative log-likelihood.

$AIC

Akaike information criterion.

$AICc

Akaike information criterion corrected for sample size.

$rate.cat

The number of rate categories specified.

$solution

a matrix containing the maximum likelihood estimates of the transition rates. Note that the rate classes are ordered from slowest (R1) to fastest (Rn) with respect to state 0.

$index.mat

The indices of the parameters being estimated are returned. This also is a way to allow the estimation of transition rates for parameters not oberved in the dataset. Say you have 2 traits X and Y, where the combinations 00, 01, and 11 are observed (10 is not). A 4 by 4 index matrix could be used to force 10 into the model.

$data

User-supplied dataset.

$data.legend

User-supplied dataset with an extra column of trait values corresponding to how corHMM calls the user data.

$phy

User-supplied tree.

$states

The likeliest states at each internal node. The state and rates reconstructed at internal nodes are in the order of the column headings of the rates matrix.

$tip.states

The likeliest state at each tip. The state and rates reconstructed at the tips are in the order of the column headings of the rates matrix.

$states.info

a vector containing the amount of information (in bits) that the tip states and model gives to each node.

$iterations

The number of iterations used by the optimization routine.

$root.p

The root prior used in model estimation.

Author(s)

James D. Boyko

References

Boyko, J. D. 2024. Automatic Discovery of Optimal Discrete Character Models through Regularization. In prep.

Zhou, Y., Gao, M., Chen, Y., Shi, X., 2024. Adaptive Penalized Likelihood method for Markov Chains.

Examples


data(primates)
phy <- multi2di(primates[[1]])
data <- primates[[2]]
MK_3state <- corHMMDredgeBase(phy = phy, data = data, rate.cat = 1, pen.type = "l1", 
	root.p = "maddfitz", lambda = 1)
MK_3state


thej022214/corHMM documentation built on April 13, 2025, 9:37 a.m.