\fontsize{6}{8} \selectfont
knitr::opts_chunk$set(fig.width=7, fig.height=7, out.width = '40%', fig.align = "center")
Load the package scaleboot.
library(scaleboot)
The methods are explained in Shimodaira and Terada (2019).
Hidetoshi Shimodaira and Yoshikazu Terada. Selective Inference for Testing Trees and Edges in Phylogenetics. 2019.
As a working example, we estimate the phylogenetic tree from the same dataset previously analyzed in Shimodaira and Hasegawa (1999), Shimodaira (2001, 2002) using the same model of evolution. The dataset consists of mitochondrial protein sequences of six mammalian species with $n=3414$ amino acids The taxa are Homo sapiens (human), Phoca vitulina (seal), Bos taurus (cow), Oryctolagus cuniculus (rabbit), Mus musculus (mouse), and Didelphis virginiana (opossum). The software package PAML (Yang 1997) was used to calculate the site-wise log-likelihoods for the trees. The mtREV model (Adachi and Hasgawa 1996) was used for amino acid substitutions, and the site-heterogeneity was modeled by the discrete-gamma distribution (Yang 1996).
We first perform phylogentic tree selection. Look at the other RMarkdown document phylo.Rmd for the details. Here, we consider only 15 trees of 6 taxa, by fixing the clade (seal, cow).
We first run a phylogenetic package, such as PAML, to calculate site-wise log-likelihood for trees. The tree topology file is mam15.tpl, and the site-wise log-likelhiood file is mam15.mt. The mam15.mt file is converted from mam15.lnf (output from PAML) by seqmt program in CONSEL. We also run treeass in CONSEL to get mam15.ass and mam15.log from mam15.tpl. We use CONSEL only for preparing mt and ass files. All these files are found in mam15 folder.
Instead of using the program consel in CONSEL to compute p-values, we use scaleboot here. First, read the following two files. Then run relltest (internally calling scaleboot function) to perform multiscale bootstrap resampling.
### dont run nb.rell = 100000 nb.pvclust = 10000 library(parallel) length(cl <- makeCluster(detectCores())) mam15.mt <- read.mt("mam15-files/mam15.mt") mam15.ass <- read.ass("mam15-files/mam15.ass") sa <- 9^seq(-1,1,length=13) # specify scales for multiscale bootstrap mam15.relltest <- relltest(mam15.mt,nb=nb.rell,sa=sa,ass=mam15.ass,cluster=cl)
We have run the above command in makedata.R preveously. To get the results, simply do below, which will also load other objects.
data(mam15) # load mam15, mam26, mam105 ls() # look at the objects
We have auxiliary information in mam15.aux. The topologies are in the order of mam15.tpl (the same order as mam15.mt). The edges are in the order of mam15.cld (extracted from mam15.log, which is the log file of treeass).
names(mam15.aux) mam15.aux$tpl[1:3] # topologies (the first three trees, in the order of mam15.tpl file) mam15.aux$cld[1:3] # edges (the first three edges, in the order of mam15.cld file) mam15.aux$tax # taxa, the order corresponds to the positions of + and - in the clade pattern.
The output of relltest includes the results of trees and edges. We separate them, and also reorder the trees and edges in decreasing order of likelhiood values below. We can also specify the auxiliary information in sbphylo.
mam15 <- sbphylo(mam15.relltest, mam15.ass, treename=mam15.aux$tpl,edgename=mam15.aux$cld,taxaname=mam15.aux$tax)
This includes the multiscale bootstrap probability. The order can be checked as follows. T1, T2, T3, ... are sorted tree (in decreasing order of likelhiood). t1, t2, t3, ... are the original order of trees. E1, E2, E3, ... are sorted edges, and e1, e2, e3, ... are the original order of edges.
mam15$order.tree # sorted tree to original tree mam15$invorder.tree # original tree to sorted tree mam15$order.edge # sorted edge to original edge mam15$invorder.edge # original edge to sorted edge
The $p$-values are calculated by the summary method.
mam15.pv <- summary(mam15) mam15.pv$tree$value[1:5,] # p-values of the best 5 trees mam15.pv$edge$value[1:5,] # p-values of the best 5 edges
We make latex tables by the following code.
table2latex <- function(x) { rn <- rownames(x) cn <- colnames(x); cl <- length(cn) cat("\n\\begin{tabular}{",paste(rep("c",cl+1),collapse=""),"}\n",sep="") cat("\\hline\n") cat("&",paste(cn,collapse=" & "),"\\\\\n") for(i in seq(along=rn)) { cat(rn[i],"&",paste(x[i,],collapse=" & "),"\\\\\n") } cat("\\hline\n") cat("\\end{tabular}\n") }
In the tree table below, the first two columns are computed by sbphylo: stat (log-likelihood difference), shtest (Shimodaira-Hasegawa test $p$-value). The other values are computed by the summary method: k.1 (BP, bootstrap probability), k.2 (AU, approximately unbiased $p$-value), sk.2 (SI, selective inference $p$-value), beta0 ($\beta_0$, signed distance), beta1 ($\beta_1$, mean curvature), edge (the associated edges).
table2latex(mam15.pv$tree$character) # all the 15 trees
table2latex(mam15.pv$edge$character) # all the 10 edges
We use the following mam15.mt which is already used for computing the $p$-values for the 15 trees. It is a matrix of size $n\times m$, where $n=3414$ is sample size and $m=15$ is the number of bifurcating trees. We sort trees by the likelhiood value.
mt15 <- mam15.mt[,mam15$order.tree] # Now T1, T2, ... colnames(mt15) <- names(mam15$order.tree) dim(mt15)
The matrix consists of the site-wise log-likelihood of tree $i$ at site $t$ as $\xi_{ti} = \log p_i(\boldsymbol{x}t | \boldsymbol{\hat\theta}_i)$, $t=1,\ldots,n$, $i=1,\ldots, m$. The matrix is written as $(\boldsymbol{\xi}_1,\ldots,\boldsymbol{\xi}_m)$ for the vectors $\boldsymbol{\xi}_i \in \mathbb{R}^n$, $i=1,\ldots, m$. We compute the mean vector as $\boldsymbol{\bar\xi} = \sum{i=1}^m \boldsymbol{\xi}_i / m$. Then models are represented by $\boldsymbol{a}_i := \boldsymbol{\xi}_i - \boldsymbol{\bar\xi}$. We apply PCA to $\boldsymbol{A} = (\boldsymbol{a}_1, \ldots, \boldsymbol{a}_m)$ and biolot for visualization.
mt0 <- apply(mt15,1,mean) # using the mean vector xx <- mt15 - mt0 # centering for rows f <- prcomp(xx,center=FALSE,scale=FALSE) f$x[,1] <- -f$x[,1];f$rotation[,1] <- -f$rotation[,1] # change the sign of PC1 biplot(f,cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5)) ) # (PC1, PC2) biplot(f,cex=c(0.4,0.8),choices=c(3,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5)) ) # (PC3, PC2)
We use mam26.mt instead of mam15.mt. First we check the association mapping mam26.ass.
names(mam26.ass) # trees and edges attr(mam26.ass,"trees") # trees attr(mam26.ass,"edges") # edges
t1, t2, ..., t15 are the 15 bifurcaing trees. t0 is the star topology. ta, tb, ..., tj are partially resolved trees. I want to know the correspondance between ta, tb, ..., tj and e1, e2, ..., e10.
(e <- attr(mam26.ass,"edges")) # edge names mam26.ass[e] # only values > 15 are for ta, ..., tj (b <- sapply(mam26.ass[e], function(a) a[a>15])) # extract values > 15 e2t <- names(mam26.ass)[b] # the correspondance: e1, e2, ... -> ta, tb, ... names(e2t) <- e e2t E2t <- e2t[mam15$order.edge ] # the correspondance: E1, E2, ... -> ta, tb, ... names(E2t) <- names(mam15$order.edge) E2t
Extract the site-wise log-likelihoods for trees. Here mt15 is the matrix for the 15 bifurcating trees ($\boldsymbol{\xi}1,\ldots,\boldsymbol{\xi}{15}$). In the below, we treat the clade (cow, seal) as a leaf of the tree. mt0 is the vector for the star topology ($\boldsymbol{\eta}0$). mt10 is for the matrix for 10 partially resolved trees; they correspond to the 10 internal edges ($\boldsymbol{\eta}_1,\ldots, \boldsymbol{\eta}{10}$). We need E2t for getting the correct mapping from the edges to the partially observed trees.
mt15<- mam26.mt[,mam15$order.tree] (colnames(mt15) <- names(mam15$order.tree)) # Now T1, T2, ... mt0 <- mam26.mt[,"t0"] # star topology mt10<- mam26.mt[,E2t] # partialy resolved trees (colnames(mt10) <- names(E2t)) # Now E1, E2, ...
We subtract the vector of star-topology from all the other vectors of trees. $\boldsymbol{a}i := \boldsymbol{\xi}_i - \boldsymbol{\eta}_0$, $i=1,\ldots, 15$. $\boldsymbol{b}_i := \boldsymbol{\eta}_i - \boldsymbol{\eta}_0$, $i=1,\ldots, 10$. Then the vector for the full model $X$ is obtained as $$ \boldsymbol{a}_X := \boldsymbol{B}(\boldsymbol{B}^T \boldsymbol{B})^{-1} \boldsymbol{d}, $$ where $\boldsymbol{B} = (\boldsymbol{b}_1,\ldots,\boldsymbol{b}{10})$ and $\boldsymbol{d} = ( \| \boldsymbol{b}1 \|^2 ,\ldots, \| \boldsymbol{b}{10}\|^2)^T$. This is computed by the following code.
## Shimodaira 2001 Comm Stat A fullmodel <- function(B,dim=NULL) { d <- apply(B,2,function(a) sum(a*a)) s <- svd (B) if(is.null(dim)) { dd <- diag(1/s$d) } else { dd <- diag(c(1/s$d[1:dim],rep(0,length(s$d)-dim))) } x <- s$u %*% dd %*% t(s$v) %*% d colnames(x) <- "X" x }
Then simply run below.
ax <- fullmodel(mt10 - mt0)
Now look at model map.
xx <- cbind(ax,mt15-mt0,mt10-mt0) f <- prcomp(xx,center=FALSE,scale=FALSE) biplot(f,cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5)) ) # (PC1, PC2) biplot(f,cex=c(0.4,0.8),choices=c(3,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5)) ) # (PC3, PC2)
We need rgl packages for visualization in 3d.
library(rgl) #knitr::knit_hooks$set(rgl = hook_webgl) rgl::setupKnitr()
Also prepare in-house version of pca and biplot.
## singular value decomposition with dimnames mysvd <- function(x) { s <- svd(x) napc <- paste("PC",seq(along=s$d)) dimnames(s$u) <- list(dimnames(x)[[1]],napc) names(s$d) <- napc dimnames(s$v) <- list(dimnames(x)[[2]],napc) s } ## principal component analysis mypca <- function(dat) { s <- mysvd(dat) n <- dim(s$u)[1] m <- dim(s$v)[1] k <- length(s$d) # x <- s$u %*% diag(s$d) # y <- s$v %*% diag(s$d) x <- s$u * rep(s$d,rep(n,k)) y <- s$v * rep(s$d,rep(m,k)) # sdev <- s$d/sqrt(n) # loadings <- s$v # scores <- x # shuseibun.fuka = y # shuseibun.tokuten = s$u list(x=x,y=y,d=s$d,u=s$u,v=s$v) } ## biplot ## (example) ## p <- mypca(scale(x)) or mypca(x) ## mybiplot(p$u,p$y) # biplot(scale=1) default ## mybiplot(p$x,p$v) # biplot(scale=0) mybiplot <- function(x,y,choices=1:2,mag=c(1,1), col.arg=c(1,2),cex.arg=c(1,1),lim.mag=1, xadj.arg=c(0.5,0.5),yadj.arg=c(0.5,0.5), arrow.len=0.1,xnames=NULL,ynames=NULL) { if(length(choices) != 2) stop("choices must be length 2") if(length(mag) != 2) stop("mag must be length 2") # x <- x[,choices] %*% diag(mag) # y <- y[,choices] %*% diag(1/mag) x <- x[,choices] * rep(mag,rep(dim(x)[1],2)) y <- y[,choices] * rep(1/mag,rep(dim(y)[1],2)) if(is.null(xnames)) nx <- dimnames(x)[[1]] else nx <- as.character(xnames) if(is.null(ynames)) ny <- dimnames(y)[[1]] else ny <- as.character(ynames) nd <- dimnames(x)[[2]] rx <- range(x) ry <- range(y) oldpar <- par(pty="s") a <- min(rx/ry) yy <- y*a plot(x,xlim=rx*lim.mag,ylim=rx*lim.mag,type="n",xlab=nd[1],ylab=nd[2]) ly <- pretty(rx/a) ly[abs(ly) < 1e-10] <- 0 axis(3,at = ly*a ,labels = ly) axis(4,at = ly*a ,labels = ly) text(x,nx,col=col.arg[1],cex=cex.arg[1],adj=yadj.arg) text(yy,ny,col=col.arg[2],cex=cex.arg[2],adj=yadj.arg) arrows(0,0,yy[,1]*0.8,yy[,2]*0.8,col=col.arg[2],length=arrow.len) par(oldpar) invisible(list(x=x,y=y)) } ## biplot3d ## requires(plot3d) ## (example) ## p <- mypca(scale(x)) or mypca(x) ## mybiplot3d(p$u,p$y) # biplot(scale=1) default ## mybiplot3d(p$x,p$v) # biplot(scale=0) mybiplot3d <- function(x,y,choices=1:3,col.arg=c(1,2),alpha.arg=c(1,1),cex.arg=c(1,1),lwd=1) { x <- x[,choices] y <- y[,choices] xn <- rownames(x) if(is.null(xn)) xn <- seq(length=nrow(x)) plot3d(rbind(x,y),type="n") save <- material3d("color") material3d(col.arg[1],alpha.arg[1]) text3d(x,texts=xn,col=col.arg[1],cex=cex.arg[1]) material3d(col.arg[2],alpha.arg[2]) coords <- NULL for (i in 1:nrow(y)) { coords <- rbind(coords, rbind(c(0,0,0),y[i,])) } lines3d(coords, col=col.arg[2],cex=cex.arg[2], lwd=lwd) text3d(1.1*y, texts=rownames(y), col=col.arg[2],cex=cex.arg[2]) material3d(save) }
We check my in-house version pca produces the same output as the original pca.
p <- mypca(xx) # in-house version of pca f <- prcomp(xx,center=FALSE,scale=FALSE) # standard pca mybiplot(p$u,p$y, cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # same as default of biplot (scale=1) biplot(f, cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # same as above mybiplot(p$x,p$v, cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # same as biplot with scale=0 biplot(f, scale=0,cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # same as above
## 3d plotting um=structure(c(0.352101027965546, 0.606027662754059, 0.713270783424377, 0, 0.931618392467499, -0.153592959046364, -0.329387068748474, 0, -0.0900644063949585, 0.780474007129669, -0.618666768074036, 0, 0, 0, 0, 1), .Dim = c(4L, 4L)) mybiplot3d(p$u*p$d[1],p$y,lwd=2,cex=c(0.5,1),alpha=c(0.5,0.7)) #um = par3d("userMatrix"); dput(um) par3d(userMatrix=um) par3d(windowRect=c(0,45,600,600))
xx <- cbind(ax,mt15-mt0) p <- mypca(xx) # in-house version of pca mybiplot(p$u,p$y, cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # (PC1, PC2) mybiplot(p$u,p$y, cex=c(0.4,0.8),choices=c(3,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # (PC3, PC2)
## 3d plotting um=structure(c(0.228986993432045, 0.835189700126648, 0.500022709369659, 0, 0.972387254238129, -0.172497615218163, -0.157184407114983, 0, -0.0450262203812599, 0.522209227085114, -0.851627767086029, 0, 0, 0, 0, 1), .Dim = c(4L, 4L)) mybiplot3d(p$u*p$d[1]*1.4,p$y,lwd=2,cex=c(0.5,1),alpha=c(0.5,0.7)) #um = par3d("userMatrix"); dput(um) par3d(userMatrix=um) par3d(windowRect=c(0,45,600,600)) #rgl.postscript("20190114_map3d.pdf", fmt='pdf') #rgl.close()
The top-view of the pca in 3d is computed by projecting the points $\boldsymbol{a}_i$ to the space orthogonal to $\boldsymbol{a}_X$.
## projection to the orthogonal space topview <- function(ax,A) { ax <- drop(ax) # strip dim ux <- ax/sqrt(sum(ax^2)) # unit vector pp <- ux %*% (t(ux) %*% A) # projection to ux A - pp }
Now draw top-view of the previous pca3d.
yy <- topview(ax,mt15-mt0) p <- mypca(yy) # in-house version of pca mybiplot(p$u,p$y, cex=c(0.4,0.8),choices=c(1,2), mag=c(-1,1),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # (PC1, PC2)
In the previous reconstruction, we used 10+1 trees. The MLEs for the 10 partially resolved trees are usually not computed for model comparisons. Instead, here we use the 15 bifurcating trees, for which MLEs are already computed for tree selection. But we need the MLE for the star topology, anyway.
Since we know the dimension spanned by the 10 trees is 10, so specify dim=10 below.
ax2 <- fullmodel(mt15 - mt0, dim=10)
Draw model map of the 15 trees and the full model.
xx <- cbind(ax2,mt15-mt0) p <- mypca(xx) # in-house version of pca
pca3d
## 3d plotting um=structure(c(0.228986993432045, 0.835189700126648, 0.500022709369659, 0, 0.972387254238129, -0.172497615218163, -0.157184407114983, 0, -0.0450262203812599, 0.522209227085114, -0.851627767086029, 0, 0, 0, 0, 1), .Dim = c(4L, 4L)) mybiplot3d(p$u*p$d[1]*1.4,p$y,lwd=2,cex=c(0.5,1),alpha=c(0.5,0.7)) #um = par3d("userMatrix"); dput(um) par3d(userMatrix=um) par3d(windowRect=c(0,45,600,600))
Now draw top-view of the previous pca3d.
yy <- topview(ax2,mt15-mt0) p <- mypca(yy) # in-house version of pca mybiplot(p$u,p$y, cex=c(0.4,0.8),choices=c(1,2), mag=c(-1,1),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5))) # (PC1, PC2)
As you seen, visualization with ax2 (reconstruction by 15+1 trees) is almost identical to that of ax (reconstruction by 10+1 trees).
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.