R/likelihood_bdRho.R

Defines functions likelihood_bdRho

Documented in likelihood_bdRho

#' Likelihood birth-death rho function
#'
#' @description Computes the log likelihood of a rooted ultrametric phylogeny under a constant-time homogeneous birth-death model and Bernoulli sampling. The birth-death process is conditioned on the starting time of the process \code{tottime} and the survival of the process at present time. The log likelihood can be calculated specifying the sampling probability or integrating over it according to a specified sampling probability distribution (either uniform: \code{unif = TRUE} or beta distribution: \code{beta = TRUE}). This function can computes the log likelihood on a stem or crown phylogeny or a set of phylogenies assuming common or specific diversification rates. It is by default parametrised on the net diversification rate and the turnover rate but can be reparametrised on the product \eqn{y*\lambda} and the net diversification rate \eqn{r}. This function is specifically adapted for diversification analysis on phylogenies on which the sampling probability is unknown.
#'
#' @param tottime Numeric vector. The stem or crown age (also called MRCA) of the phylogenie(s) depending on the conditioning of the process specified (see \code{root} argument) and the phylogenie(s) used. The length of the numeric vector equals the number of phylogenie(s) used. It is of length 1 if a unique phylogeny is used. The stem age of the phylogeny can be computed using max(TreeSim::getx(phylo))+phylo$root.edge (note that the phylo$root.edge needs to be known) and the crown age of the phylogeny can be computed using max(TreeSim::getx(phylo)). If multiple phylogenies are used the following can be used for calculating the stem age: \code{sapply(seq_along(multiPhylo), function(i) max(TreeSim::getx(multiPhylo[[i]]))+multiPhylo[[i]]$root.edge)} and the crown age: \code{sapply(seq_along(multiPhylo), function(i) max(TreeSim::getx(multiPhylo[[i]])))}. Note that if multiple phylogenies are used, the phylogenies do not need to have the same root age but they need to be conditioned the same way (for all phylogenies either their stem or crown age).
#' @param nbtips Integer.  The number of extant sampled tips in the phylogeny typically noted \code{k}.
#' @param tj List of numeric vector(s). A list of atomic numeric vectors where each vector is specifying the node depths for each phylogeny. If only one phylogeny is used, the list is of length 1. The node depths of a phylogeny can be computed using \code{TreeSim::getx(phylo)[!TreeSim::getx(phylo)==tottime]} and using lapply(seq_along(multiPhylo), function(i) TreeSim::getx(multiPhylo[[i]])[!TreeSim::getx(multiPhylo[[i]])==tottime[i]]) for an object of class \code{multiPhylo}. This code works both for stem and crown phylogenies. Note that the \code{tottime} is not contained in node depths here since it is used for conditioning the process. Since there are \eqn{k-1} internal nodes in a phylogeny and that the MRCA is not included in the node depths when conditioning in crown, then each numeric vector is of length \eqn{k-2} if the phylogeny is in crown and of length \eqn{k-1} if it is in stem with \code{k} being the number of extant sampled tips in the phylogeny.
#' @param  yj Numeric vector. The sampling probabilitie(s) typically calculated as \eqn{k/N} where \code{k} is the number of extant sampled tips and \eqn{N} is the global diversity of the clade. The length of the numeric vector equals the number of phylogenie(s) used. If \code{NULL} and \code{reparam = FALSE}, the sampling probabilitie(s) are integrated according to the specified sampling probability distribution (corresponding to the model \eqn{birth-death∫\rho} in Lambert et al. 2022).
#' @param reparam Logical. If \code{FALSE} (the default option), the log likelihood is calculated using the parameters \eqn{r} and \eqn{\epsilon}. If \code{TRUE} and \code{yj = NULL}, the log likelihood is parametrised using the product \eqn{y*\lambda} and the net diversification rate \eqn{r}.
#' @param beta Logical. This argument is only used if \code{yj = NULL} and \code{reparam = FALSE}. If \code{TRUE} a beta distribution is assumed on the sampling probabilitie(s). Note that the parameters of the beta distribution can be fixed or inferred and can be specified in the function generated by the present function \code{UDivEvo::likelihood_bdRho} using the arguments \code{a} for the parameter \eqn{\alpha} (\eqn{\alpha>0}) and \code{b} for the parameter \eqn{\beta} (\eqn{\beta>0}). Check the \code{Value} section for more details.
#' @param unif Logical. This argument is only used if \code{yj = NULL} and \code{reparam = FALSE}. If \code{TRUE} a uniform distribution is assumed on the sampling probabilitie(s). Note that the parameters of the uniform distribution can be fixed or inferred and can be specified in the function generated by the present function \code{UDivEvo::likelihood_bdRho} using the arguments \code{a} for the lower bound of the uniform distribution (\eqn{0\lea<1}) and \code{b} for the higher bound of the uniform distribution (\eqn{0<b\le1} and \eqn{b>a}). Check the \code{Value} section for more details.
#' @param root Integer. Specifying the conditioning of the birth-death process. \code{root} takes the value \eqn{0} if the phylogeny used is a stem phylogeny to condition the process on the stem age. It takes the value \eqn{1} if the phylogeny is a crown phylogeny to condition the process on the crown age.
#' @param dt Numeric. This argument is only used if \code{yj = NULL} and \code{reparam = FALSE}. If \code{dt = 0}, the integral on the sampling probabilitie(s) is computed using the R \code{stats::integrate} function. If \eqn{dt\ge0}, the integral of the sampling probabilitie(s) is performed manually using a piece-wise constant approximation. \code{dt} represents the length of the interval on which the function integrated is assumed to be constant. For manual integral, advised value of \code{dt} are \eqn{1e-3} to \eqn{1e-5}.
#' @param rel.tol Numeric. This argument is only used if \code{yj = NULL}, \code{reparam = FALSE} and \code{dt = 0}. This represents the relative accuracy requested when the integral is performed using \code{stats::integrate} function. Typically \code{.Machine$double.eps^0.25} is used but a value of \eqn{1e-10} has been tested and performs well.
#' @param tuned_dichotomy Logical. This argument is only used if \code{yj = NULL} and \code{reparam = FALSE}. If \code{TRUE}, when the log likelihood is equal to non finite value due to approximations, a dichotomy search is performed to find a tuning parameter that will be used for getting a finite value of the log likelihood. If \code{TRUE}, the log likelihood will take longer to calculate. Else if \code{FALSE}, no dichotomy search is performed; if the log likelihood is equal to non finite value due to approximations, the function will return this non finite value.
#' @param brk Numeric. This argument is only used if \code{yj = NULL}, \code{reparam = FALSE} and \code{tuned_dichotomy = TRUE}. The number of steps used in the dichotomy search. Typically the value \eqn{200} is sufficient to avoid non finite values. In some case if the log likelihood is still equal to non finite value, the \code{brk} value \eqn{2000} will be required for more tuning but it will rarely take a larger value.
#'
#' @details This function is a closure, it takes all of the above as arguments and creates another function with arguments depending on the birth-death model chosen:
#' \itemize{
#'   \item If a unique phylogeny is used and the corresponding sampling probability is given in \code{yj}, then the classical birth-death-sampling model is used and the function created takes \code{div}: the net diversification rate and \code{turn}: the turnover rate as arguments.
#'   \item If a unique phylogeny is used, \code{yj = NULL} and the log likelihood is set for being reparametrised \code{reparam = TRUE}, then the reparametrised birth-death-sampling model is used and the function created takes \code{div}: the net diversification rate and \code{ylamb}: the product of the sampling probability and speciation rate \eqn{y*\lambda} as arguments.
#'   \item If a unique phylogeny is used, \code{yj = NULL} and \code{reparam = FALSE}, then the \eqn{birth-death∫\rho} model is used and the function created takes \code{tun.init}: the initial tuning parameters values and \code{seqphy.init}: the phylogenies number for which to use the corresponding tuning parameter (\eqn{1} for a unique phylogeny). This former function creates another function that takes as argument \code{div}: the net diversification rate, \code{turn}: the turnover rate, and hyperparameters of the sampling probabilitie(s) distribution \code{a}: the first argument of the sampling probabilitie(s) distribution chosen and \code{b}: the second argument of the sampling probabilitie(s) distribution chosen (see \code{\link{phi}} for more details).
#'   \item If a set of phylogenies are used for computing the log likelihood then all the parameters (except the one relatives to the hyperparameters of the sampling probabilitie(s) distribution described above): \code{div}, \code{turn}, \code{ylamb}, \code{tun.init} and \code{seqphy.init} have to be of length the number of phylogenies used.
#' }
#' @details This function is specifically intended to be used on phylogenies with unknown or highly uncertain global diversity estimates (the sampling probability is not known with accuracy). Note that the sampling probability is never estimated and that this function is not able to evaluate negative rates.
#'
#' @return Returns an object of class \code{function}. This function can take the following arguments (as described above depending on the chosen model) and will return the value of the log likelihood alone or in a list together with the tuning parameters values (for the \eqn{birth-death∫\rho} model):
#' \describe{
#'   \item{div}{Numeric vector. The net diversification rate(s) also called \eqn{r}. The length of the numeric vector equals the number of phylogenie(s) used.}
#'   \item{turn}{Numeric vector. The turnover rate(s) also called \eqn{\epsilon}. The length of the numeric vector equals the number of phylogenie(s) used.}
#'   \item{ylamb}{Numeric vector. The product of the sampling probabilitie(s) and speciation rates \eqn{y*\lambda} also called \eqn{c} in Stadler 2009. The length of the numeric vector equals the number of phylogenie(s) used.}
#'   \item{tun.init}{Numeric vector. The initial tuning parameters values. Typically, it will take the value \eqn{log(1)}. The length of the numeric vector equals the number of phylogenie(s) used.}
#'   \item{seqphy.init}{Integer vector. A regular sequence of the number of phylogenie(s) used, i.e. if \eqn{10} phylogenies are used \code{seqphy.init = c(1:10)}}
#'   \item{a}{Numeric. The value of \eqn{\alpha} or \eqn{a} respectively for the beta or the uniform distribution. This value cannot be negative for both distributions and cannot exceed \eqn{b} and \eqn{1} for the uniform distribution.}
#'   \item{b}{Numeric. The value of \eqn{\beta} or \eqn{b} respectively for the beta or the uniform distribution. This value cannot be negative for both distributions, cannot be inferior to \eqn{a} and cannot exceed 1 for the uniform distribution.}
#' }
#' @references Stadler, T. (2009). On incomplete sampling under birth–death models and connections to the sampling-based coalescent. Journal of theoretical biology, 261(1), 58-66.
#'
#' @author Sophia Lambert
#'
#' @export
#'
#' @seealso \code{\link{fitMCMC_bdRho}} and \code{\link{likelihood_bdK}}

likelihood_bdRho <- function(tottime, nbtips, tj, yj,
                             reparam = FALSE, beta, unif,
                             root, dt, rel.tol,
                             tuned_dichotomy,
                             brk){

  seqphy <- 1:length(nbtips)

  # The birth-death-sampling (Bernoulli sampling) model when the sampling probability is known.

  if(class(yj)=="numeric"){
    function(div, turn){
      if(any(div < 0) | any(turn < 0)){loglik = -Inf}
      else{
        loglik <- sapply(seqphy, function(i){
          log(Pcond(t = tottime[i], r = div[i], epsi = turn[i], y = yj[i]))*(root+1) + # age + survival conditioning
            sum(log(Pnd(tj[[i]], r = div[i], epsi = turn[i], y = yj[i])))})
      }
      reslogLik <- sum(loglik)
      return(reslogLik)
    }
  }

  # The birth-death-sampling model reparametrised on the net diversification rate and the product of the sampling probability and the speciation rate. The sampling probability is not known.

  else if(reparam == TRUE){
    function(div, ylamb){
      if(any(ylamb < 0) | any(div < 0)){loglik = -Inf}
      else{
        loglik <- sapply(seqphy, function(i){
          log(Pcond_reparam(t = tottime[i], yl = ylamb[i], r = div[i]))*(root+1) + # age + survival conditioning
            sum(log(Pnd_reparam(t = tj[[i]], yl = ylamb[i], r = div[i])))})
      }
      reslogLik <- sum(loglik)
      return(reslogLik)
    }
  }

  # The birth-death∫rho (Bernoulli sampling) model when the sampling probability is not known.

  else{
    prPhi <- phi(beta = beta, unif = unif)
    integr_Rho <- int_Rho(dt = dt)
    function(tun.init, seqphy.init){
      function(div, turn, a, b){

        tun = tun.init # the tun.init has to be a vector of length seqphy
        seqphy = seqphy.init

        if(any(div < 0) | any(turn < 0) | a < 0 | b < 0 | # should we add a -inf value if a is bigger than b ????
           integr_Rho(function(yj)sapply(yj, function (yj){prPhi(x = yj, a = a, b = b)}), 0, 1)==0){rv = -Inf}
        else{
          like_tuning <- function(tun, seqphy){
            integr_loglik <- sapply(seqphy, function(i){
              log(integr_Rho(function(yj) sapply(yj, function (yj){
                exp(log(Pcond(t = tottime[i], r = div[i], epsi = turn[i], y = yj))*(root+1) +
                      tottime[i]*div[i]*2 + # age + survival conditioning + tuning cond
                      sum(log(Pnd(tj[[i]], r = div[i], epsi = turn[i], y = yj)) +
                            tj[[i]]*div[i]-log(div[i])) - tun[i]*(nbtips[i]-(1+root)) +  # adding the first tuning and the tuning flex
                      log(prPhi(x = yj, a = a, b = b)))
              }), 0, 1, rel.tol = rel.tol))-
                tottime[i]*div[i]*2 - sum(tj[[i]]*div[i]-log(div[i])) + tun[i]*(nbtips[i]-(1+root))
            })
            return(integr_loglik)
          }

          rv = like_tuning(tun = tun, seqphy = seqphy) # seqphy change for faster inference when tuning

          # Dichotomy search

          if(tuned_dichotomy == TRUE){
            try_int = 0
            tun1 <- 10^308/(nbtips-(1+root)) # will be the -Inf bound and to not have NaN
            tun2 <- -10^308/(nbtips-(1+root)) # normally gives infinity in our function, will be the Inf bound and to not have NaN
            while (any(is.infinite(rv))&!any(is.na(rv))){
              rin  = rv == Inf
              rmin = rv == -Inf
              tun1[rmin] <- tun[rmin]
              tun2[rin] <- tun[rin]
              tun <- (tun1+tun2)/2
              if(any(round(tun1, digits = 10) == round(tun2, digits = 10))){ # already tested previously with 100 and does not seem to change much
                break
                warning("tuning parameters too close to each other, the likelihood is hard to calculate")}

              wtr = rin | rmin
              seqphy <- which(wtr==TRUE)
              rv[wtr] = like_tuning(tun, seqphy)

              try_int = try_int + 1
              if(try_int > brk) {
                break
                warning('could not calculate the likelihood with the bisection method up to ', brk,' splits \n')}
            }
          }
          else(rv)
        }

        if(any(is.na(rv))){
          print("NA detected probably because prPhi(x = yj, a = a, b = b) == 0 and the equation before equals Inf and once passed in log this equals NaN")
          rv <- -Inf
        }
        else if(is.na(sum(rv))){
          print("NA detected probably because at least one log likelihood phylo equals Inf and another one equals -Inf thus the sum is NaN")
          rv <- -Inf
        }
        resLH <- list(logLik = sum(rv), tuning = tun)
        return(resLH)
      }
    }
  }
}
sophia-lambert/UDivEvo documentation built on Sept. 27, 2022, 11:05 p.m.