Model Map in Phylogenetics
In scaleboot: Approximately Unbiased P-Values via Multiscale Bootstrap

\fontsize{6}{8} \selectfont

knitr::opts_chunk$set(fig.width=7, fig.height=7, out.width = '40%', fig.align = "center")

Visualization of Pametric Models

Load the package scaleboot.

library(scaleboot)

The methods are explained in Shimodaira and Terada (2019).

Hidetoshi Shimodaira and Yoshikazu Terada. Selective Inference for Testing Trees and Edges in Phylogenetics. 2019.

Phylogenetic Analysis of 15 trees of 6 taxa

As a working example, we estimate the phylogenetic tree from the same dataset previously analyzed in Shimodaira and Hasegawa (1999), Shimodaira (2001, 2002) using the same model of evolution. The dataset consists of mitochondrial protein sequences of six mammalian species with $n=3414$ amino acids The taxa are Homo sapiens (human), Phoca vitulina (seal), Bos taurus (cow), Oryctolagus cuniculus (rabbit), Mus musculus (mouse), and Didelphis virginiana (opossum). The software package PAML (Yang 1997) was used to calculate the site-wise log-likelihoods for the trees. The mtREV model (Adachi and Hasgawa 1996) was used for amino acid substitutions, and the site-heterogeneity was modeled by the discrete-gamma distribution (Yang 1996).

We first perform phylogentic tree selection. Look at the other RMarkdown document phylo.Rmd for the details. Here, we consider only 15 trees of 6 taxa, by fixing the clade (seal, cow).

We first run a phylogenetic package, such as PAML, to calculate site-wise log-likelihood for trees. The tree topology file is mam15.tpl, and the site-wise log-likelhiood file is mam15.mt. The mam15.mt file is converted from mam15.lnf (output from PAML) by seqmt program in CONSEL. We also run treeass in CONSEL to get mam15.ass and mam15.log from mam15.tpl. We use CONSEL only for preparing mt and ass files. All these files are found in mam15 folder.

Instead of using the program consel in CONSEL to compute p-values, we use scaleboot here. First, read the following two files. Then run relltest (internally calling scaleboot function) to perform multiscale bootstrap resampling.

### dont run
nb.rell = 100000
nb.pvclust = 10000
library(parallel)
length(cl <- makeCluster(detectCores()))
mam15.mt <- read.mt("mam15-files/mam15.mt")
mam15.ass <- read.ass("mam15-files/mam15.ass")
sa <- 9^seq(-1,1,length=13) # specify scales for multiscale bootstrap
mam15.relltest <- relltest(mam15.mt,nb=nb.rell,sa=sa,ass=mam15.ass,cluster=cl)

We have run the above command in makedata.R preveously. To get the results, simply do below, which will also load other objects.

data(mam15) # load mam15, mam26, mam105
ls() # look at the objects

We have auxiliary information in mam15.aux. The topologies are in the order of mam15.tpl (the same order as mam15.mt). The edges are in the order of mam15.cld (extracted from mam15.log, which is the log file of treeass).

names(mam15.aux)
mam15.aux$tpl[1:3] # topologies (the first three trees, in the order of mam15.tpl file)
mam15.aux$cld[1:3] # edges  (the first three edges, in the order of  mam15.cld file)
mam15.aux$tax # taxa, the order corresponds to the positions of + and - in the clade pattern.

The output of relltest includes the results of trees and edges. We separate them, and also reorder the trees and edges in decreasing order of likelhiood values below. We can also specify the auxiliary information in sbphylo.

mam15 <- sbphylo(mam15.relltest, mam15.ass, treename=mam15.aux$tpl,edgename=mam15.aux$cld,taxaname=mam15.aux$tax)

This includes the multiscale bootstrap probability. The order can be checked as follows. T1, T2, T3, ... are sorted tree (in decreasing order of likelhiood). t1, t2, t3, ... are the original order of trees. E1, E2, E3, ... are sorted edges, and e1, e2, e3, ... are the original order of edges.

mam15$order.tree  # sorted tree to original tree
mam15$invorder.tree  # original tree to sorted tree
mam15$order.edge # sorted edge to original edge
mam15$invorder.edge # original edge to sorted edge

The $p$-values are calculated by the summary method.

mam15.pv <- summary(mam15)
mam15.pv$tree$value[1:5,] # p-values of the best 5 trees
mam15.pv$edge$value[1:5,] # p-values of the best 5 edges

We make latex tables by the following code.

table2latex <- function(x) {
  rn <- rownames(x)
  cn <- colnames(x); cl <- length(cn)
  cat("\n\\begin{tabular}{",paste(rep("c",cl+1),collapse=""),"}\n",sep="")
  cat("\\hline\n")
  cat("&",paste(cn,collapse=" & "),"\\\\\n")
  for(i in seq(along=rn)) {
    cat(rn[i],"&",paste(x[i,],collapse=" & "),"\\\\\n")
  }
  cat("\\hline\n")
  cat("\\end{tabular}\n")  
}

In the tree table below, the first two columns are computed by sbphylo: stat (log-likelihood difference), shtest (Shimodaira-Hasegawa test $p$-value). The other values are computed by the summary method: k.1 (BP, bootstrap probability), k.2 (AU, approximately unbiased $p$-value), sk.2 (SI, selective inference $p$-value), beta0 ($\beta_0$, signed distance), beta1 ($\beta_1$, mean curvature), edge (the associated edges).

table2latex(mam15.pv$tree$character) # all the 15 trees

table2latex(mam15.pv$edge$character) # all the 10 edges

Visualization of the 15 trees (simple method)

We use the following mam15.mt which is already used for computing the $p$-values for the 15 trees. It is a matrix of size $n\times m$, where $n=3414$ is sample size and $m=15$ is the number of bifurcating trees. We sort trees by the likelhiood value.

mt15 <- mam15.mt[,mam15$order.tree] # Now T1, T2, ...
colnames(mt15) <- names(mam15$order.tree)
dim(mt15)

The matrix consists of the site-wise log-likelihood of tree $i$ at site $t$ as $\xi_{ti} = \log p_i(\boldsymbol{x}t | \boldsymbol{\hat\theta}_i)$, $t=1,\ldots,n$, $i=1,\ldots, m$. The matrix is written as $(\boldsymbol{\xi}_1,\ldots,\boldsymbol{\xi}_m)$ for the vectors $\boldsymbol{\xi}_i \in \mathbb{R}^n$, $i=1,\ldots, m$. We compute the mean vector as $\boldsymbol{\bar\xi} = \sum{i=1}^m \boldsymbol{\xi}_i / m$. Then models are represented by $\boldsymbol{a}_i := \boldsymbol{\xi}_i - \boldsymbol{\bar\xi}$. We apply PCA to $\boldsymbol{A} = (\boldsymbol{a}_1, \ldots, \boldsymbol{a}_m)$ and biolot for visualization.

mt0 <- apply(mt15,1,mean) # using the mean vector
xx <- mt15 - mt0 # centering for rows
f <- prcomp(xx,center=FALSE,scale=FALSE)
f$x[,1] <- -f$x[,1];f$rotation[,1] <- -f$rotation[,1] # change the sign of PC1
biplot(f,cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5)) ) # (PC1, PC2)
biplot(f,cex=c(0.4,0.8),choices=c(3,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5)) ) # (PC3, PC2)

Reconstructiong the full model X from the submodels (using 10+1 trees)

We use mam26.mt instead of mam15.mt. First we check the association mapping mam26.ass.

names(mam26.ass) # trees and edges
attr(mam26.ass,"trees") # trees
attr(mam26.ass,"edges") # edges

t1, t2, ..., t15 are the 15 bifurcaing trees. t0 is the star topology. ta, tb, ..., tj are partially resolved trees. I want to know the correspondance between ta, tb, ..., tj and e1, e2, ..., e10.

(e <- attr(mam26.ass,"edges")) # edge names
mam26.ass[e]  # only values > 15 are for ta, ..., tj
(b <- sapply(mam26.ass[e], function(a) a[a>15])) # extract values > 15
e2t <- names(mam26.ass)[b] # the correspondance: e1, e2, ... -> ta, tb, ...
names(e2t) <- e
e2t
E2t <- e2t[mam15$order.edge ] # the correspondance: E1, E2, ... -> ta, tb, ...
names(E2t) <- names(mam15$order.edge)
E2t

Extract the site-wise log-likelihoods for trees. Here mt15 is the matrix for the 15 bifurcating trees ($\boldsymbol{\xi}1,\ldots,\boldsymbol{\xi}{15}$). In the below, we treat the clade (cow, seal) as a leaf of the tree. mt0 is the vector for the star topology ($\boldsymbol{\eta}0$). mt10 is for the matrix for 10 partially resolved trees; they correspond to the 10 internal edges ($\boldsymbol{\eta}_1,\ldots, \boldsymbol{\eta}{10}$). We need E2t for getting the correct mapping from the edges to the partially observed trees.

mt15<- mam26.mt[,mam15$order.tree]
(colnames(mt15) <- names(mam15$order.tree)) # Now T1, T2, ...
mt0 <- mam26.mt[,"t0"] # star topology
mt10<- mam26.mt[,E2t] # partialy resolved trees
(colnames(mt10) <- names(E2t)) # Now E1, E2, ...

We subtract the vector of star-topology from all the other vectors of trees. $\boldsymbol{a}i := \boldsymbol{\xi}_i - \boldsymbol{\eta}_0$, $i=1,\ldots, 15$. $\boldsymbol{b}_i := \boldsymbol{\eta}_i - \boldsymbol{\eta}_0$, $i=1,\ldots, 10$. Then the vector for the full model $X$ is obtained as $$ \boldsymbol{a}_X := \boldsymbol{B}(\boldsymbol{B}^T \boldsymbol{B})^{-1} \boldsymbol{d}, $$ where $\boldsymbol{B} = (\boldsymbol{b}_1,\ldots,\boldsymbol{b}{10})$ and $\boldsymbol{d} = ( \| \boldsymbol{b}1 \|^2 ,\ldots, \| \boldsymbol{b}{10}\|^2)^T$. This is computed by the following code.

## Shimodaira 2001 Comm Stat A
fullmodel <- function(B,dim=NULL) {
  d <- apply(B,2,function(a) sum(a*a))
  s <- svd (B)
  if(is.null(dim)) {
    dd <- diag(1/s$d)
  } else {
    dd <- diag(c(1/s$d[1:dim],rep(0,length(s$d)-dim)))
  }
  x <- s$u %*% dd %*% t(s$v) %*% d
  colnames(x) <- "X"
  x
}

Then simply run below.

ax <- fullmodel(mt10 - mt0)

Now look at model map.

xx <- cbind(ax,mt15-mt0,mt10-mt0)
f <- prcomp(xx,center=FALSE,scale=FALSE)
biplot(f,cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5)) ) # (PC1, PC2)
biplot(f,cex=c(0.4,0.8),choices=c(3,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5)) ) # (PC3, PC2)

3d plot of the model map

We need rgl packages for visualization in 3d.

library(rgl)
#knitr::knit_hooks$set(rgl = hook_webgl)
rgl::setupKnitr()

Also prepare in-house version of pca and biplot.

## singular value decomposition with dimnames
mysvd <- function(x) {
  s <- svd(x)
  napc <- paste("PC",seq(along=s$d))
  dimnames(s$u) <- list(dimnames(x)[[1]],napc)
  names(s$d) <- napc
  dimnames(s$v) <- list(dimnames(x)[[2]],napc)
  s
}

## principal component analysis
mypca <- function(dat) {
  s <- mysvd(dat)
  n <- dim(s$u)[1]
  m <- dim(s$v)[1]
  k <- length(s$d)
#  x <- s$u %*% diag(s$d)
#  y <- s$v %*% diag(s$d)
  x <- s$u * rep(s$d,rep(n,k))
  y <- s$v * rep(s$d,rep(m,k))
#  sdev <- s$d/sqrt(n)
#  loadings <- s$v 
#  scores <- x
#  shuseibun.fuka = y
#  shuseibun.tokuten = s$u
  list(x=x,y=y,d=s$d,u=s$u,v=s$v)
}

## biplot
## (example)
## p <- mypca(scale(x)) or mypca(x)
## mybiplot(p$u,p$y) # biplot(scale=1) default
## mybiplot(p$x,p$v) # biplot(scale=0)
mybiplot <- function(x,y,choices=1:2,mag=c(1,1),
                     col.arg=c(1,2),cex.arg=c(1,1),lim.mag=1,
                     xadj.arg=c(0.5,0.5),yadj.arg=c(0.5,0.5),
                     arrow.len=0.1,xnames=NULL,ynames=NULL) {
  if(length(choices) != 2) stop("choices must be length 2")
  if(length(mag) != 2) stop("mag must be length 2")
 # x <- x[,choices] %*% diag(mag)
 # y <- y[,choices] %*% diag(1/mag)
  x <- x[,choices] * rep(mag,rep(dim(x)[1],2))
  y <- y[,choices] * rep(1/mag,rep(dim(y)[1],2))
  if(is.null(xnames)) nx <- dimnames(x)[[1]]
  else nx <- as.character(xnames)
  if(is.null(ynames)) ny <- dimnames(y)[[1]]
  else ny <- as.character(ynames)
  nd <- dimnames(x)[[2]]
  rx <- range(x)
  ry <- range(y)
  oldpar <- par(pty="s")
  a <- min(rx/ry)
  yy <- y*a
  plot(x,xlim=rx*lim.mag,ylim=rx*lim.mag,type="n",xlab=nd[1],ylab=nd[2])
  ly <- pretty(rx/a)
  ly[abs(ly) < 1e-10] <- 0
  axis(3,at = ly*a ,labels = ly)
  axis(4,at = ly*a ,labels = ly)
  text(x,nx,col=col.arg[1],cex=cex.arg[1],adj=yadj.arg)
  text(yy,ny,col=col.arg[2],cex=cex.arg[2],adj=yadj.arg)
  arrows(0,0,yy[,1]*0.8,yy[,2]*0.8,col=col.arg[2],length=arrow.len)
  par(oldpar)
  invisible(list(x=x,y=y))
}
## biplot3d
## requires(plot3d)
## (example)
## p <- mypca(scale(x)) or mypca(x)
## mybiplot3d(p$u,p$y) # biplot(scale=1) default
## mybiplot3d(p$x,p$v) # biplot(scale=0)
mybiplot3d <- function(x,y,choices=1:3,col.arg=c(1,2),alpha.arg=c(1,1),cex.arg=c(1,1),lwd=1) {
  x <- x[,choices]
  y <- y[,choices]
  xn <- rownames(x)
  if(is.null(xn)) xn <- seq(length=nrow(x))

  plot3d(rbind(x,y),type="n")

  save <- material3d("color")
  material3d(col.arg[1],alpha.arg[1])
  text3d(x,texts=xn,col=col.arg[1],cex=cex.arg[1])
  material3d(col.arg[2],alpha.arg[2])
  coords <- NULL
  for (i in 1:nrow(y)) {
    coords <- rbind(coords, rbind(c(0,0,0),y[i,]))
  }
  lines3d(coords, col=col.arg[2],cex=cex.arg[2], lwd=lwd)

  text3d(1.1*y, texts=rownames(y), col=col.arg[2],cex=cex.arg[2])
  material3d(save)
}

We check my in-house version pca produces the same output as the original pca.

p <- mypca(xx) # in-house version of pca
f <- prcomp(xx,center=FALSE,scale=FALSE)  # standard pca
mybiplot(p$u,p$y, cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # same as default of biplot (scale=1)
biplot(f, cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # same as above
mybiplot(p$x,p$v, cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # same as biplot with scale=0
biplot(f, scale=0,cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # same as above

## 3d plotting
um=structure(c(0.352101027965546, 0.606027662754059, 0.713270783424377, 
0, 0.931618392467499, -0.153592959046364, -0.329387068748474, 
0, -0.0900644063949585, 0.780474007129669, -0.618666768074036, 
0, 0, 0, 0, 1), .Dim = c(4L, 4L))
mybiplot3d(p$u*p$d[1],p$y,lwd=2,cex=c(0.5,1),alpha=c(0.5,0.7))
#um = par3d("userMatrix"); dput(um)
par3d(userMatrix=um)
par3d(windowRect=c(0,45,600,600))

plot only 15 trees and full model

xx <- cbind(ax,mt15-mt0)
p <- mypca(xx) # in-house version of pca
mybiplot(p$u,p$y, cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # (PC1, PC2)
mybiplot(p$u,p$y, cex=c(0.4,0.8),choices=c(3,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # (PC3, PC2)

## 3d plotting
um=structure(c(0.228986993432045, 0.835189700126648, 0.500022709369659, 
0, 0.972387254238129, -0.172497615218163, -0.157184407114983, 
0, -0.0450262203812599, 0.522209227085114, -0.851627767086029, 
0, 0, 0, 0, 1), .Dim = c(4L, 4L))
mybiplot3d(p$u*p$d[1]*1.4,p$y,lwd=2,cex=c(0.5,1),alpha=c(0.5,0.7))
#um = par3d("userMatrix"); dput(um)
par3d(userMatrix=um)
par3d(windowRect=c(0,45,600,600))
#rgl.postscript("20190114_map3d.pdf", fmt='pdf')
#rgl.close()

top-view of the 3d plot

The top-view of the pca in 3d is computed by projecting the points $\boldsymbol{a}_i$ to the space orthogonal to $\boldsymbol{a}_X$.

## projection to the orthogonal space
topview <- function(ax,A) {
  ax <- drop(ax) # strip dim
  ux <- ax/sqrt(sum(ax^2)) # unit vector 
  pp <- ux %*% (t(ux) %*% A) # projection to ux
  A - pp
}

Now draw top-view of the previous pca3d.

yy <- topview(ax,mt15-mt0)
p <- mypca(yy) # in-house version of pca
mybiplot(p$u,p$y, cex=c(0.4,0.8),choices=c(1,2), mag=c(-1,1),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # (PC1, PC2)

Reconstructiong the full model X from the submodels (using 15+1 trees)

In the previous reconstruction, we used 10+1 trees. The MLEs for the 10 partially resolved trees are usually not computed for model comparisons. Instead, here we use the 15 bifurcating trees, for which MLEs are already computed for tree selection. But we need the MLE for the star topology, anyway.

Since we know the dimension spanned by the 10 trees is 10, so specify dim=10 below.

ax2 <- fullmodel(mt15 - mt0, dim=10)

Draw model map of the 15 trees and the full model.

xx <- cbind(ax2,mt15-mt0)
p <- mypca(xx) # in-house version of pca

pca3d

## 3d plotting
um=structure(c(0.228986993432045, 0.835189700126648, 0.500022709369659, 
0, 0.972387254238129, -0.172497615218163, -0.157184407114983, 
0, -0.0450262203812599, 0.522209227085114, -0.851627767086029, 
0, 0, 0, 0, 1), .Dim = c(4L, 4L))
mybiplot3d(p$u*p$d[1]*1.4,p$y,lwd=2,cex=c(0.5,1),alpha=c(0.5,0.7))
#um = par3d("userMatrix"); dput(um)
par3d(userMatrix=um)
par3d(windowRect=c(0,45,600,600))

Now draw top-view of the previous pca3d.

yy <- topview(ax2,mt15-mt0)
p <- mypca(yy) # in-house version of pca
mybiplot(p$u,p$y, cex=c(0.4,0.8),choices=c(1,2), mag=c(-1,1),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # (PC1, PC2)