Phylogenetic Tree Selection

\fontsize{6}{8} \selectfont

knitr::opts_chunk$set(fig.width=7, fig.height=7, out.width = '40%', fig.align = "center") 

Phylogenetic Analysis of Mammal Dataset

Load the package scaleboot.

library(scaleboot)

The methods are explained in Shimodaira and Terada (2019). The theory for the selective inference behind the methods is given in Terada and Shimodaira (2017).

Hidetoshi Shimodaira and Yoshikazu Terada. Selective Inference for Testing Trees and Edges in Phylogenetics. 2019.

Yoshikazu Terada and Hidetoshi Shimodaira. Selective inference for the problem of regions via multiscale bootstrap. arXiv:1711.00949, 2017.

Phylogenetic Analysis of 105 trees of 6 taxa

As a working example, we estimate the phylogenetic tree from the same dataset previously analyzed in Shimodaira and Hasegawa (1999), Shimodaira (2001, 2002) using the same model of evolution. The dataset consists of mitochondrial protein sequences of six mammalian species with $n=3414$ amino acids The taxa are Homo sapiens (human), Phoca vitulina (seal), Bos taurus (cow), Oryctolagus cuniculus (rabbit), Mus musculus (mouse), and Didelphis virginiana (opossum). The software package PAML (Yang 1997) was used to calculate the site-wise log-likelihoods for the trees. The mtREV model (Adachi and Hasgawa 1996) was used for amino acid substitutions, and the site-heterogeneity was modeled by the discrete-gamma distribution (Yang 1996).

We first run a phylogenetic package, such as PAML, to calculate site-wise log-likelihood for trees. The tree topology file is mam105.tpl, and the site-wise log-likelhiood file is mam105.mt. The mam105.mt file is converted from mam105.lnf (output from PAML) by seqmt program in CONSEL. We also run treeass in CONSEL to get mam105.ass and mam105.log from mam105.tpl. We use CONSEL only for preparing mt and ass files. All these files are found in mam15 folder.

Instead of using the program consel in CONSEL to compute p-values, we use scaleboot here. First, read the following two files. Then run relltest (internally calling scaleboot function) to perform multiscale bootstrap resampling.

### dont run
nb.rell = 100000
nb.pvclust = 10000
library(parallel)
length(cl <- makeCluster(detectCores()))
mam105.mt <- read.mt("mam15-files/mam105.mt")
mam105.ass <- read.ass("mam15-files/mam105.ass")
sa <- 9^seq(-1,1,length=13) # specify scales for multiscale bootstrap
mam105.relltest <- relltest(mam105.mt,nb=nb.rell,sa=sa,ass=mam105.ass,cluster=cl)

We have run the above command in makedata.R preveously. To get the results, simply do below, which will also load other objects.

data(mam15) # load mam15, mam26, mam105
ls() # look at the objects

The output of relltest includes the results of trees and edges. We separate them, and also reorder the trees and edges in decreasing order of likelhiood values below.

mam105 <- sbphylo(mam105.relltest, mam105.ass)

This includes the multiscale bootstrap probability. The order can be checked as follows. T1, T2, T3, ... are sorted tree (in decreasing order of likelhiood). t1, t2, t3, ... are the original order of trees. E1, E2, E3, ... are sorted edges, and e1, e2, e3, ... are the original order of edges.

mam105$order.tree  # sorted tree to original tree
mam105$invorder.tree  # original tree to sorted tree
mam105$order.edge # sorted edge to original edge
mam105$invorder.edge # original edge to sorted edge

The $p$-values are calculated by the summary method.

mam105.pv <- summary(mam105)
mam105.pv$tree$value[1:5,] # p-values of the best 5 trees
mam105.pv$edge$value[1:5,] # p-values of the best 5 edges

We also have formatted results.

mam105.pv$tree$character[1:5,] # p-values of the best 5 trees
mam105.pv$edge$character[1:5,] # p-values of the best 5 edges

The formatted table can be used for prepare latex table.

table2latex <- function(x) {
  rn <- rownames(x)
  cn <- colnames(x); cl <- length(cn)
  cat("\n\\begin{tabular}{",paste(rep("c",cl+1),collapse=""),"}\n",sep="")
  cat("\\hline\n")
  cat("&",paste(cn,collapse=" & "),"\\\\\n")
  for(i in seq(along=rn)) {
    cat(rn[i],"&",paste(x[i,],collapse=" & "),"\\\\\n")
  }
  cat("\\hline\n")
  cat("\\end{tabular}\n")  
}

In the tree table below, we omitted stat (log-likelihood difference), shtest (Shimodaira-Hasegawa test $p$-value). The other values are: k.1 (BP, bootstrap probability), k.2 (AU, approximately unbiased $p$-value), sk.2 (SI, selective inference $p$-value), beta0 ($\beta_0$, signed distance), beta1 ($\beta_1$, mean curvature), edge (the associated edges).

table2latex(mam105.pv$tree$character[1:20,-(1:2)]) # the best 20 trees

In the edge table below, we omitted tree (associated trees). The other values are: k.1 (BP, bootstrap probability), k.2 (AU, approximately unbiased $p$-value), sk.2 (SI, selective inference $p$-value), beta0 ($\beta_0$, signed distance), beta1 ($\beta_1$, mean curvature).

table2latex(mam105.pv$edge$character[,-6]) # all the 25 edges

We have auxiliary information in mam105.aux. The topologies are in the order of mam105.tpl (the same order as mam105.mt). The edges are in the order of mam105.cld (extracted from mam105.log, which is the log file of treeass).

names(mam105.aux)
mam105.aux$tpl[1:3] # topologies (the first three trees, in the order of mam105.tpl file)
mam105.aux$cld[1:3] # edges  (the first three edges, in the order of  mam105.cld file)
mam105.aux$tax # taxa, the order corresponds to the positions of + and - in the clade pattern.

We can specify these auxiliary information in sbphylo.

mam105 <- sbphylo(mam105.relltest, mam105.ass, treename=mam105.aux$tpl,edgename=mam105.aux$cld,taxaname=mam105.aux$tax)

The fomatted tables are now accampanied by tree topology and clade pattern.

mam105.pv <- summary(mam105)
mam105.pv$tree$character[1:5,] # p-values of the best 5 trees
mam105.pv$edge$character[1:5,] # p-values of the best 5 edges

Geometric Quantities

The two geometric quantities play important roles in our theory of multiscale bootstrap. They are signed distance ($\beta_0$) and mean curvature ($\beta_1$). We look at estimated values of $(\beta_0, \beta_1)$ for trees and edges.

a1 <- attr(summary(mam105$trees,k=2),"table") # extract (beta0,beta1) for trees
a2 <- attr(summary(mam105$edges,k=2),"table") # extract (beta0,beta1) for edges
beta <- rbind(a1$value,a2$value)[,c("beta0","beta1")]
sbplotbeta(beta,col=rgb(0,0,0,alpha=0.7))

Diagnostics of multiscale bootstrap

In scaleboot, $p$-values are computed by multiscale bootstrap. We compute bootstrap probabilities at several scales, and fit models of scaling-law to them. We look at the model fitting for diagnostics.

tree T1

Look at the model fitting of tree T1. Candidate models are used for fitting, and sorted by AIC values. Model parameters ($\beta_0, \beta_1, \beta_2$) are estimated by the maximum likelihood method. Models are sorted by AIC. We also plot $\psi(\sigma^2)$ function. It is defined as $$ \psi(\sigma^2) = \Phi^{-1} ( 1 - \text{BP}(\sigma^2) ),\quad \sigma^2 = \frac{n}{n'} $$ for the sample size of dataset $n$, and that of bootstrap replicates $n'$. We compute bootstrap probabilities (BP) for several $n' = n/\sigma^2$ values. Then fitting parametric models to $\psi(\sigma^2)$. The most standard model is poly.2 $$ \text{poly.2}(\sigma^2) = \beta_0 + \beta_1 \sigma^2, $$ and its generalization $$ \text{poly.}k(\sigma^2) = \sum_{i=0}^{k-1} \beta_i \sigma^{2i}, $$ for $k=1,2,3$. Also considered is the singular model $$ \text{sing.3} = \beta_0 + \frac{\beta_1 \sigma^2}{1 + \beta_2(\sigma-1)}. $$ The result is as follows. The best fitting model is poly.3.

(f <- mam105$trees$T1) # the list of fitted models (MLE and AIC)
plot(f,legend="topleft",pch=16,cex=1.5,lwd=2) # fitting curves

$p$-values are computed using the fitted models. We extrapolate $\psi(\sigma^2)$ to $\sigma^2=0$ and $\sigma^2=-1$, and these values are used in AU and SI. On the other hand BP is computed as $1-\Phi(\psi(1))$; this improves the raw value of $\text{BP}(1)$ in terms of standard error. For each model, we extrapoloate $\psi(\sigma^2)$. We consider the Taylor expansion of $\psi(\sigma^2)$ at $\sigma^2=1$, and extrapolate $\psi(\sigma^2)$ by polynomial of degree $k-1$. In the below, we use $k=1,2,3$ for the Taylor expansion. $k=1$ is used for BP. $k=2,3$ can be used for AU and SI. The default value of $k$ in sbphylo is $k=2$.

(g <- summary(mam105$trees$T1,k=1:3))
plot(g,legend="topleft",pch=16,cex=1.5,lwd=2)

In the table, k.1 is BP. k.2 or k.3 is used for AU. sk.2 or sk.3 is used for SI. beta0 and beta1 are estimated values of $\beta_0$ and $\beta_1$, obtained as the tangent line at $\sigma^2=1$. Thus these $\beta_0$ and $\beta_1$ correspond to the Talor expansion with $k=2$.

In sbphylo, you can replace $k=2$ by $k=3$ (or you could specify $k=4$) as follows. This may improve the accuracy of AU and SI when $\psi(\sigma^2)$ deviates from the linear model poly.2. There is a trade-off between the accuracy and stability, so $k=2$ or $k=3$ would be a good choice, instead of using larger values such as $k=4$.

mam105.pv3 <- summary(mam105,k=2:3) # simply specify k=3 is also fine
mam105.pv3$tree$value[1:5,] # p-values of the best 5 trees
mam105.pv3$edge$value[1:5,] # p-values of the best 5 edges

trees T1, T2, T3, T4

The fitting and $p$-values can be seen for several trees at the same time. Look at the results for the best 4 trees.

(f <- mam105$trees[1:4]) 
plot(f,legend="topleft",pch=16,cex=1,lwd=1,cex.legend=0.5) # fitting curves
(g <- summary(mam105$trees[1:4],k=1:3))
plot(g,legend="topleft",pch=16,cex=1,lwd=1,cex.legend=0.5) # extrapolation

edges E1, E2, E3, E4

Look at the results for the best 4 edges.

(f <- mam105$edges[1:4]) 
plot(f,legend="topleft",pch=16,cex=1,lwd=1,cex.legend=0.5) # fitting curves
(g <- summary(mam105$edges[1:4],k=1:3))
plot(g,legend="topleft",pch=16,cex=1,lwd=1,cex.legend=0.5) # extrapolation


Try the scaleboot package in your browser

Any scripts or data that you put into this service are public.

scaleboot documentation built on Dec. 4, 2019, 5:07 p.m.