In USCbiostats/phylogenetic: Statistical Inference and Prediction of Annotations in Phylogenetic Trees

knitr::knit_hooks$set(smallsize = function(before, options, envir) {
    if (before) {
        "\\footnotesize\n\n"
    } else {
        "\n\\normalsize\n\n"
    }
})

options(digits = 4)

knitr::opts_chunk$set(echo = FALSE, smallsize=TRUE,
                      out.width = ".4\\linewidth",
                      fig.width = 7, fig.height = 5, fig.align = 'center'
                      )

library(aphylo)

Agenda

\tableofcontents{}

On Genes and Trees

Agenda

\tableofcontents[currentsection]

Overview

A GO annotation is an association between a gene and a GO (Gene Ontology) term describing its function, e.g: A gene can be annotated with the GO term GO:0016049, which denotes cellular growth.\pause
Functional knowledge (e.g. Gene Ontology (GO) terms annotations) for human genes is very incomplete.\pause
Increase in association detection power using prior biological knowledge depends strongly on annotation completeness.\pause
Phylogenetic inference of annotations (i.e. using evolutionary trees) allows vast experimental knowledge in model systems (e.g. mouse, fruit fly, yeast) to augment human gene annotations.

Overview (cont.)

Manual curation of GO terms is good but infeasible, e.g.: \pause 1 to 2 people annotating ~4,000 families took roughly 4 years (6 years of FTE).\pause
Today, we present a model that uses both: \pause
- Existing gene functional annotations, and
- Phylogenetic trees
to infer annotations on un-annotated genes in a probabilistic way (so it is not a 0/1 prediction). \pause
This predicted functional information will serve as prior covariates in Projects 1 and 3.

Model

Agenda {.t}

\tableofcontents[currentsection]

Some definitions {.t label=definitions}

\begincols

\begincol{.38\linewidth}

\includetikz{simple_tree_names.tex}{.5}

\endcol

\begincol{.58\linewidth}

\footnotesize

Where

$$ \annObs_{lp} = \left{ \begin{array}{ll} 1 & \mbox{if the function }\mbox{ is believed to be present}\ 0 & \mbox{if the function }\mbox{ is believed to be absent}\ 9 & \mbox{if we don't have information for this node } \end{array}\right. $$

\normalsize

\hyperlink{formaldef}{\beamerbutton{more details}}

\endcol

\endcols

A probabilistic model of function propagation

For any given node, we can write down the probability of observing a \emph{functional state} as a function of some model parameters and its offspring. \pause
This version of our model has five parameters (probabilities): \pause

a. Root node had a function: $\pi$, b. Gain of function: $\mu_0$, c. Loss of function: $\mu_1$. d. Misclassification of: - A missing function as present, $\psi_0$, and - A present function as missing, $\psi_1$ \pause

All five parameters are assumed to be equal across functions, this is, $\pi, \mu_0, \mu_1, \psi_0$, and $\psi_1$ are assumed to be independent of the functions that are analyzed.\pause
In this presentation, we will focus on the case that we are dealing with a single function.

Peeling algorithm

Agenda {.t}

\tableofcontents[currentsection]s

Peeling (pruning) phylogenies (Felsenstein, 1973, 1981) {.t label=peelingalgorithm}

Given an experimentally annotated phylogenetic tree, the likelihood computation on a single function is as follows. \pause

\def\probmat{\mbox{P}{}}

Create an matrix $\probmat$ of size $|N|\times 2$, \pause

For node $n \in {\mbox{peeling sequence}}$ (from leafs to root) do: \pause

a. For $\ann_n\in {0,1}$ do:

Set \color{teal} $\probmat_{n, \ann_n} = \left\{\begin{array}{ll}
\Prcond{\Ann_n = \ann_n}{\AnnObs_n = \AnnObs_n} & \mbox{If }$n$\mbox{ is a leaf} \\
\Likecond{\Ann_n = \ann_n}{\aphyloObs_n} & \mbox{otherwise}
\end{array} 
\right.$ \color{black}

b. Next $n$ \pause

At this point the matrix $\probmat$ should be completely filled, we can compute

$$ \likelihood{\psi, \mu, \pi}{\aphyloObs} = \pi\Likecond{\Ann_0=1}{\aphyloObs_0} + (1 - \pi)\Likecond{\Ann_0=0}{\aphyloObs_0} $$

\pause

Let's see an example! \hyperlink{leafnodesprob}{\beamerbutton{more details}}

Peeling algorithm {.t}

\begincols

\begincol{.28\textwidth}

\includetikz{simple_tree.tex}{.5}

\endcol

\begincol{.68\textwidth}

Let's calculate the likelihood of observing this tree with the following parameters:

psi0 <- .05
psi1 <- .01
mu0  <- .04
mu1  <- .01
Pi   <- .05

$$ \begin{aligned} \mbox{Mislabeling a 0} &: \psi_0 &= r psi0 \ \mbox{Mislabeling a 1} &:\psi_1 &= r psi1 \ \mbox{Functional gain} &:\mu_0 &= r mu0 \ \mbox{Functional loss} &:\mu_1 &= r mu1 \ \mbox{Root node has the function} &: \pi &= r Pi \ \end{aligned} $$

\endcol

\endcols

dat <- matrix(
  as.integer(c(0,1,0,3,1,2,3,4,3,5)), ncol=2, byrow=TRUE)
dat <- aphylo:::new_po_tree(
  dat,
  labels      = as.character(0:5),
  edge.length = c(2, 5, 3, 6, 4)
  )

ann <- cbind(
  Function  = c(NA, NA, 0, NA, 1, 0)
  )
dat <- aphylo:::as_aphylo(ann, dat)

LL <- LogLike(
  dat$annotations,
  attr(dat$edges, "offspring"),
  c(psi0, psi1), c(mu0, mu1), Pi, verb_ans = TRUE)

dimnames(LL$Pr) <- list(0:5, paste("State", 0:1))

Peeling algorithm (cont. 1) {.t}

\footnotesize

$$ \psi_0 = r psi0 \qquad \psi_1 = r psi1 \qquad \mu_0 = r mu0 \qquad \mu_1 = r mu1 \qquad \pi = r Pi $$

\normalsize

\begincols

\begincol{.28\textwidth}

\mode{ \only<1>{\includetikz{simple_tree.tex}{.5}} \only<2-3>{\includetikz{simple_tree_leaf2.tex}{.5}} \only<4-5>{\includetikz{simple_tree_leaf4.tex}{.5}} \only<6-7>{\includetikz{simple_tree_leaf5.tex}{.5}} }

\mode{ \includetikz{simple_tree.tex}{.5} }

\endcol

\begincol{.68\textwidth}

# Printing the colored version of the matrix
Pr <- LL$Pr
Pr[] <- sprintf("%.4f",Pr[])
Pr[c(1,2,4),] <- ""

Pr[3,1] <- paste0("\\onslide<2->{\\color{blue}", Pr[3,1], "}")
Pr[3,2] <- paste0("\\onslide<3->{\\color{red}", Pr[3,2], "}")
Pr[5,1] <- paste0("\\onslide<4->{\\color{orange}", Pr[5,1], "}")
Pr[5,2] <- paste0("\\onslide<5->{\\color{olive}", Pr[5,2], "}")

Pr[6,1] <- paste0("\\onslide<6->{\\color{brown}", Pr[6,1], "}")
Pr[6,2] <- paste0("\\onslide<7->{\\color{gray}", Pr[6,2], "}")

print(xtable::xtable(Pr), sanitize.text.function=function(x) x, comment=FALSE)

\footnotesize

$$ \begin{aligned} \onslide<2->{\Prcond{\AnnObs_2=0}{\Ann_2=0} & = 1 - \psi_0 & = \color{blue}{r 1-psi0}} \ \onslide<3->{\Prcond{\AnnObs_2=0}{\Ann_2=1} & = \psi_1 &= \color{red}{r psi1}} \\ \onslide<4->{\Prcond{\AnnObs_4=1}{\Ann_4=0} & = \psi_0 & = \color{orange}{r psi0}} \ \onslide<5->{\Prcond{\AnnObs_4=1}{\Ann_4=1} & = 1 - \psi_1 &= \color{olive}{r 1-psi1}} \ \ \onslide<6->{\Prcond{\AnnObs_5=0}{\Ann_5=0} & = 1 - \psi_0 & = \color{brown}{r 1-psi0}} \ \onslide<7->{\Prcond{\AnnObs_5=0}{\Ann_5=1} & = \psi_1 &= \color{gray}{r psi1}} \end{aligned} $$

\normalsize

\endcol

\endcols

Peeling algorithm (cont. 2) {.t}

\footnotesize

$$ \psi_0 = r psi0 \qquad \psi_1 = r psi1 \qquad \mu_0 = r mu0 \qquad \mu_1 = r mu1 \qquad \pi = r Pi $$

\normalsize

\begincols

\begincol{.28\textwidth}

\mode{ \only<1>{\includetikz{simple_tree.tex}{.5}} \only<2-5>{\includetikz{simple_tree_node1.tex}{.5}} }

\mode{ \includetikz{simple_tree.tex}{.5} }

\endcol

\begincol{.68\textwidth}

# Printing the colored version of the matrix
Pr <- LL$Pr
Pr[] <- sprintf("%.4f",Pr[])
Pr[c(1,4),] <- ""

Pr[3,1] <- paste0("{\\color{blue}", Pr[3,1], "}")
Pr[3,2] <- paste0("{\\color{red}", Pr[3,2], "}")
Pr[2, 1] <- paste0("\\onslide<3->{\\color{violet}{", Pr[2, 1],"}}")
Pr[2, 2] <- paste0("\\onslide<5->{\\color{purple}{", Pr[2, 2],"}}")

print(xtable::xtable(Pr), sanitize.text.function=function(x) x, comment=FALSE)

\tiny

$$ \begin{aligned} \onslide<2->{\Likecond{\Ann_1=0}{\aphyloObs_1} & = {\color{blue} \Prcond{\AnnObs_2 = 0}{\Ann_2=0}}(1 - \mu_0) + {\color{red}\Prcond{\AnnObs_2 = 0}{\Ann_2=1}}\mu_0} \ \onslide<3->{& = {\color{blue} r Pr[3, 1]}\times r 1-mu0 + {\color{red} r Pr[3, 2]}\times r mu0 = \color{violet}{r LL$Pr[3, 1]* (1-mu0) + LL$Pr[3, 2]*mu0}}\\ % \onslide<4->{\Likecond{\Ann_1=1}{\aphyloObs_1} & = \Prcond{\Ann_2 = 0}{\AnnObs_2=0}\mu_1 + \Prcond{\Ann_2 = 1}{\AnnObs_2=0}(1-\mu_1)} \ \onslide<5->{& = r Pr[3, 1]\times r mu1 + r Pr[3, 2]\times r 1-mu1 = \color{purple}{r LL$Pr[3, 1]* mu1 + LL$Pr[3, 2]*(1-mu1)}} \end{aligned} $$

\normalsize

\endcol

\endcols

Peeling algorithm (cont. 3) {.t}

\footnotesize

$$ \psi_0 = r psi0 \qquad \psi_1 = r psi1 \qquad \mu_0 = r mu0 \qquad \mu_1 = r mu1 \qquad \pi = r Pi $$

\normalsize

\begincols

\begincol{.28\textwidth}

\mode{ \only<1>{\includetikz{simple_tree.tex}{.5}} \only<2-6>{\includetikz{simple_tree_node3.tex}{.5}} \only<7>{\includetikz{simple_tree_node0.tex}{.5}} \only<8->{\includetikz{simple_tree_likelihood.tex}{.5}} }

\mode{ \includetikz{simple_tree.tex}{.5} }

\endcol

\begincol{.68\textwidth}

# Printing the colored version of the matrix
Pr <- LL$Pr
Pr[] <- sprintf("%.4f",Pr[])

Pr[5,1] <- paste0("{\\color{orange}", Pr[5,1], "}")
Pr[5,2] <- paste0("{\\color{olive}", Pr[5,2], "}")

Pr[6,1] <- paste0("{\\color{brown}", Pr[6,1], "}")
Pr[6,2] <- paste0("{\\color{gray}", Pr[6,2], "}")

Pr[4, 1] <- paste0("\\onslide<5->{\\color{red}{", Pr[4, 1],"}}")
Pr[4, 2] <- paste0("\\onslide<6->{", Pr[4, 2],"}")
Pr[1, 1] <- paste0("\\onslide<7->{\\color{cyan}{", Pr[1, 1],"}}")
Pr[1, 2] <- paste0("\\onslide<7->{\\color{blue}{", Pr[1, 2],"}}")

print(xtable::xtable(Pr), sanitize.text.function=function(x) x, comment=FALSE)

\tiny

$$ \begin{aligned} \onslide<2->{\Likecond{\Ann_3 = 0}{\aphyloObs_3} & = % \prod_{m \in {4,5}} \sum_{\ann_m \in {0,1}} \Likecond{\Ann_m = \ann_m}{\aphyloObs_m} \Prcond{\Ann_{m} = \ann_{m}}{\Ann_3 = 0}} \ \onslide<3->{& = \left(% {\color{orange} r LL$Pr[5,1]} (1 - \mu_0) + {\color{olive} r LL$Pr[5,2] }\times \mu_0 \right)\times\left(% {\color{brown} r LL$Pr[6,1]} (1 - \mu_0) + {\color{gray} r LL$Pr[6,2] }\times \mu_0 \right) \} \onslide<4->{& = \left(% {\color{orange} r LL$Pr[5,1]} (1 - r mu0) + {\color{olive} r LL$Pr[5,2]}\times r mu0 \right)\times\left(% {\color{brown} r LL$Pr[6,1]} (1 - r mu0) + {\color{gray} r LL$Pr[6,2] }\times r mu0 \right) \} \onslide<5->{& = \color{red}{r (LL$Pr[5,1]*(1 - mu0) + LL$Pr[5,2]*mu0) * (LL$Pr[6,1]*(1 - mu0) + LL$Pr[6,2]*mu0)} \} \onslide<8->{ & \mbox{Finally, the likelihood of this tree is:} \ \likelihood{\psi, \mu, \Pi}{\aphyloObs} & = (1-\pi){\color{cyan}\Likecond{\Ann_0=0}{\aphyloObs_0}} + \pi{\color{blue}\Likecond{\Ann_0=1}{\aphyloObs_0}}} \ \onslide<9->{& = (1 - r Pi)\times {\color{cyan} r LL$Pr[1,1] } + r Pi\times {\color{blue} r LL$Pr[1,2] } = \mbox{\normalsize r exp(LL$ll)}} \end{aligned} $$

\normalsize

\endcol

\endcols

The `amcmc` R package

Agenda {.t}

\tableofcontents[currentsection]

Yet another MCMC package

You may be wondering why, well:

Allows running multiple chains simultaneously (parallel)
Overall faster than other Metrop MCMC algorithms (from our experience)
Planning to include other types of kernels (the Handbook of MCMC)
Implements reflective boundaries random-walk kernel

Example: MCMC {.t}

# Parameters
set.seed(1231)
n <- 1e3
pars <- c(mean = 2.6, sd = 3)

# Generating data and writing the log likelihood function
D <- rnorm(n, pars[1], pars[2])

# Loading the packages
library(amcmc)
library(coda) # coda: Output Analysis and Diagnostics for MCMC

# Defining the ll function (data was already defined)
ll <- function(x, D) {
  x <- log(dnorm(D, x[1], x[2]))
  sum(x)
}

ans <- MCMC(
  # Ll function and the starting parameters
  ll, c(mu=1, sigma=1),
  # How many steps, thinning, and burn-in
  nbatch = 1e4, thin=10, burnin = 1e3,
  # Kernel parameters
  scale = .1, ub = 10, lb = c(-10, 0),
  # How many parallel chains
  nchains = 4,
  # Further arguments passed to ll
  D=D
  )

Example: MCMC (cont. 1) {.t}

coda::gelman.plot(ans)

Example: MCMC (cont. 2) {.t}

oldpar <- par(no.readonly = TRUE)
plot(ans)
par(oldpar)

The `aphylo` R package

Agenda {.t}

\tableofcontents[currentsection]

`aphylo` in a nutshell

Provides a representation of annotated partially ordered trees. \pause
Interacts with the ape package (most used Phylogenetics R package with ~25K downloads/month) \pause
Implements the loglikelihood calculation of our model (with C++ under-the-hood).

`aphylo`: Simulating Trees {.t}

\begincols

\begincol{.48\textwidth}

set.seed(80)
tree <- sim_tree(5)
tree

\endcol

\begincol{.48\textwidth}

atree <- raphylo(
  tree = tree, P = 2,
  psi  = c(.05, .05),
  mu   = c(.2, .1),
  Pi   = .01
  )

atree

\endcol

\endcols

`aphylo`: Visualizing annotated data {.c}

\begincols

\begincol{.49\textwidth}

plot(atree)

\endcol

\begincol{.49\textwidth}

plot_LogLike(atree)

\endcol

\endcols

`aphylo`: Tree peeling {.t}

The peeling algorithm requires visiting all nodes in a tree.\pause
The fact is, we don't need to go through branches with no annotations, as these are uninformative. \pause So we can prune them, e.g.:\pause

set.seed(1)
x <- sim_tree(25)

# What am I peeling
x_pruned <- prune(x, c(3, 19))

ids        <- attr(x, "labels")
ids_pruned <- attr(x_pruned, "labels")

z   <- `attributes<-`(x, list(dim=dim(x)))
z[] <- ids[z[]+1]

z_pruned   <- `attributes<-`(x_pruned, list(dim=dim(x_pruned)))
z_pruned[] <- ids_pruned[z_pruned[]+1]

cols  <- apply(z, 1, function(w) any(w[1] == z_pruned[,1] & w[2] == z_pruned[,2]))
ltype <- c(2, 1)[cols + 1L] 
cols  <- c("tomato", "steelblue")[cols + 1L]

oldpar <- par(no.readonly=TRUE)
par(mfrow=c(1,2), cex=1, xpd=NA, mar=c(1,0,2,0))
plot(x, main="Full tree\n x", edge.color = cols, edge.width=2, edge.lty=ltype)
plot(x_pruned, main="Pruned tree\n prune(x, c(3, 19))", edge.color = "steelblue", edge.width=2)
par(oldpar)

`aphylo`: Reading PantherDB data

# Reading the data
path <- system.file("tree.tree", package="aphylo")
dat <- read_panther(path)

# The tree
dat$tree

`aphylo`: Reading PantherDB data (cont.)

# Extra annotations
head(dat$internal_nodes_annotations)

tree <- dat$tree

# Simulating a function
set.seed(123)
atree <- raphylo(
  tree= as_po_tree(tree),
  Pi = .05, mu = c(.1, .05), psi = c(.01, .02)
)

# Estimation
ans <- aphylo_mcmc(
  params  = rep(.05, 5),
  dat     = atree,
  # Passing a Beta prior
  priors  = function(p) dbeta(p, 2, 20),
  # Parameters for the MCMC
  control = list(nchain=4, nbatch=1e4, thin=20, burnin=1e3)
  )

`aphylo`: Predictions of the model {.t}

Posterior probability:

\begin{equation} \label{eq:impute2} \Prcond{\ann_{n} = 1}{\aphyloObs} = \frac{\Prcond{\aphyloObs}{\ann_{n} = 1}}{ \Prcond{\aphyloObs}{\ann_{n} = 1} + \Prcond{\aphyloObs}{\ann_{n} = 0} \frac{\left(1 - \Pr{\ann_{n} = 1}\right)}{\Pr{\ann_{n} = 1}} } \end{equation} \pause

Where

$$ \Pr{\ann_{n} = 1} = % \pi \Prcond{\ann_{n} = 1}{\ann_{0} = 1} + (1 - \pi) \Prcond{\ann_{n} = 1}{\ann_{0} = 0} $$

\pause And

$$ \left[\begin{array}{cc} \Prcond{\ann_n = 0}{\ann_0 = 0} & \Prcond{\ann_n = 1}{\ann_0 = 0} \ \Prcond{\ann_n = 0}{\ann_0 = 1} & \Prcond{\ann_n = 1}{\ann_0 = 1} \end{array} \right] \approx \left[\begin{array}{cc} 1 - \hat\mu_0 & \hat\mu_0 \ \hat\mu_1 & 1 - \hat\mu_1 \end{array} \right]^{dist_{0n}} $$

`aphylo`: Predictions of the model (cont.) {.t}

pred <- predict(ans, what="leafs")
set.seed(8);pred <- head(pred[order(runif(nrow(pred))),,drop=FALSE])
pred <- data.frame(Gene = rownames(pred), `Posterior Prob` = unname(pred), check.names = FALSE)
print(xtable::xtable(pred, caption="Predicted probabilities for a subset of leafs of a phylogenetic tree using the {\\tt predict()} function after estimating the model parameters. The function analized was simulated on a phylogenetic tree from PantherDB."), type="latex", comment = FALSE, include.rownames = FALSE)

`aphylo`: How good is our prediction {.t}

Quality of the prediction\pause

$$ \label{eq:delta1} \delta\left(\AnnObs_H, \AnnPred_H\right) = \sum_{h, u \in H}\left[(\annObs_{h} - \annPred_{h})^2(\annObs_{u} - \annPred_{u})^2\right]^{1/2}w_{hu} $$

\pause

Which, assuming $\annPred\sim \mbox{Bernoulli}(\alpha)$, has expected value

$$ \Expected{\delta} = \sum_{h, u \in H}w_{hu}\sum_{\annPred_h, \annPred_u \in {0,1}}\Pr{\annPred_h}\Pr{\annPred_u}\left[ (\annObs_{h} - \annPred_{h})^2(\annObs_{u} - \annPred_{u})^2\right]^{1/2} $$

r prediction_score(ans)

`aphylo`: How good is our prediction (cont. 1) {.t}

plot(prediction_score(ans), main="")

Preliminary Results

Agenda {.t}

\tableofcontents[currentsection]

A simulation study {.t label=sim-setup}

\framesubtitle{Setup}

Simulation study using ~13,000 families from PantherDB
Using a Beta 1/20 prior, we simulated annotations:
- Draw a set of the parameters ${\psi_0,\psi_1 ,\mu_0, \mu_1,\pi}$,
- Simulated annotations using our model's Data Generating Process,
- Randomly removed $p\in [.1, .5]$ proportion of annotations.
With that data, we did parameter estimation and computed prediction scores using
- MLE
- MCMC with the right prior (Beta 1/20), and
- MCMC with the wrong prior (Beta 1/10, twice the mean as the right prior).
Both MCMC algorithms ran for $5\times 10^5$ iterations, burn-in of $1\times 10^4$, thinning of 100, and 5 chains.

\hyperlink{sim-convergence}{\beamerbutton{more details}}

A simulation study {.t}

\framesubtitle{Bias}

\begin{figure} \centering \includegraphics[width=.5\linewidth]{bias_trees_of_size_[57,138).pdf} \end{figure}

A simulation study {.t}

\framesubtitle{Prediction scores}

\begin{figure} \centering \caption{Distribution of prediction scores. The random prediction scores were computed analytically with parameter $p=0.3$ (as resulting from the DGP).} \includegraphics[width=.45\linewidth, trim = {0 1cm 0 2cm}, clip]{mcmc_right_prior_prediction.pdf} \end{figure}

Concluding Remarks

A parsimonious model of gene functions: easy to apply on a large scale (we already ran some simulations using all 13,000 trees from PantherDB... and it took us less than 1 week with 10 processors only).\pause
Already implemented, we are currently in the stage of writing the paper and setting up the simulation study.\pause
For the next steps, we are evaluating whether to include or how to include:\pause
- Type of node: speciation, duplication, horizontal transfer.
- Branch lengths
- Correlation structure between functions
- Using Taxon Constraints to improve predictions
- Hierarchical model: Use fully annotated trees by curators as prior information.

\begin{center} \huge \color{USCCardinal}{\textbf{Thank you!}} \end{center}

\maketitle

\appendix

Formal definitions {.t label=formaldef}

\framesubtitle{\hyperlink{definitions}{\beamerbutton{go back}}}

Phylogenetic tree: In our case, we talk about \textcolor{red}{partially ordered} phylogenetic tree, in particular, $\phylo\equiv (N,E)$ is a tuple of nodes $N$, and edges

$$ E\equiv {(n, m) \in N\times N: n\mapsto m, \textcolor{red}{n < m}} $$
Offspring of $n$: $O(n)\equiv{m\in N: (n, m) \in E, n\in N}$
Parent node of $m$: $r(m) \equiv{n \in N: (n, m) \in E, m\in N}$
Leaf nodes: $\Leaf(\phylo)\equiv {m \in N: O(m)={\emptyset}}$
Annotations: Given $P$ functions, $\Ann \equiv {\ann_n \in {0,1}^P: n\in \Leaf(\phylo)}$
Annotated Phylogenetic Tree $\aphylo \equiv(\phylo, \Ann)$
Observed Annotated Annotations $\AnnObs = {\annObs_l}_{l\in \Leaf(\phylo)}$,
Experimentally Annotated Phylogenetic Tree $\aphyloObs\equiv(\phylo, \AnnObs)$

Leaf node probabilities {.t label=leafnodesprob}

\framesubtitle{\hyperlink{peelingalgorithm}{\beamerbutton{go back}}}

The probability of the leaf nodes having annotations $\ann_l$ conditional on the observed annotation is

\begin{equation} \label{eq:leaf1} \Prcond{\AnnObs_{l} = \annObs_{l}}{\Ann_{l} = \ann_{l}} = \left{ \begin{array}{ll} \psi &\mbox{if }\annObs_{l} \neq \ann_{l} \ 1 - \psi & \mbox{otherwise} \end{array} \right. \end{equation}

Where $\psi$ can be either $\psi_0$ (mislabelling a zero), or $\psi_1$ (mislabelling a one).

Internal node probabilities {.t label=internalnodeprob}

\framesubtitle{\hyperlink{peelingalgorithm}{\beamerbutton{go back}}}

In the case of the internal nodes, the probability of a given state is defined in terms of the gain/loss probabilities

$$ \Prcond{\Ann_{n} = \ann_{l}}{\Ann_{r(n)} = \ann_{r(n)}} = \left{ \begin{array}{ll} \mu & \mbox{if }\ann_{n} \neq \ann_{r(n)} \ 1 - \mu & \mbox{otherwise} \end{array} \right. $$

Where $\mu$ can be either $\mu_0$ (gain), or $\mu_1$ (loss).
Assuming independence accross offspring, we can write

\begin{multline} \label{eq:interior1} \Likecond{\Ann_n = \ann_n}{\aphyloObs_n} = % \prod_{m \in O(n)} \sum_{\ann_m \in {0,1}} \Likecond{\Ann_m = \ann_m}{\aphyloObs_m} \ \Prcond{\Ann_{m} = \ann_{m}}{\Ann_{n} = \ann_{n}} \end{multline}

Notice that if $m$ is a leaf node, then $\Likecond{\Ann_m = \ann_m}{\aphyloObs_m} = \Prcond{\Ann_m = \ann_m}{\AnnObs_m = \annObs_m}$.

Likelihood of the tree {.t label=likelihood}

\framesubtitle{\hyperlink{peelingalgorithm}{\beamerbutton{go back}}}

Once the computation reaches the root node, $n=0$, equations \eqref{eq:leaf1} and \eqref{eq:interior1}:

Allow us writing the likelihood of the entire tree

\begin{equation} \label{eq:l} \likelihood{\psi, \mu, \pi}{\aphyloObs} = \pi\Likecond{\Ann_0=1}{\aphyloObs_0} + (1 - \pi)\Likecond{\Ann_0=0}{\aphyloObs_0} \end{equation}

A simulation study {.t label=sim-convergence}

\framesubtitle{Convergence \hyperlink{sim-setup}{\beamerbutton{go back}}}

\begin{figure} \centering \includegraphics[width=.4\linewidth, trim={0 1.5cm 0 2cm},clip]{gelmans_right_prior.pdf} \caption{Gelman diagnostic for convergence. The closer to 1, the better the convergence. Rule of thumb: A chain has a reasonable convergence if it has a Potential Scale Reduction Factor (PSRF) below 1.10.} \end{figure}

USCbiostats/phylogenetic documentation built on Oct. 28, 2023, 7:23 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

USCbiostats/phylogenetic Statistical Inference and Prediction of Annotations in Phylogenetic Trees

In USCbiostats/phylogenetic: Statistical Inference and Prediction of Annotations in Phylogenetic Trees

Agenda

On Genes and Trees

Agenda

Overview

Overview (cont.)

Model

Agenda {.t}

Some definitions {.t label=definitions}

A probabilistic model of function propagation

Peeling algorithm

Agenda {.t}

Peeling (pruning) phylogenies (Felsenstein, 1973, 1981) {.t label=peelingalgorithm}

Peeling algorithm {.t}

Peeling algorithm (cont. 1) {.t}

Peeling algorithm (cont. 2) {.t}

Peeling algorithm (cont. 3) {.t}

The amcmc R package

Agenda {.t}

Yet another MCMC package

Example: MCMC {.t}

Example: MCMC (cont. 1) {.t}

Example: MCMC (cont. 2) {.t}

The aphylo R package

Agenda {.t}

aphylo in a nutshell

aphylo: Simulating Trees {.t}

aphylo: Visualizing annotated data {.c}

aphylo: Tree peeling {.t}

aphylo: Reading PantherDB data

aphylo: Reading PantherDB data (cont.)

aphylo: Predictions of the model {.t}

aphylo: Predictions of the model (cont.) {.t}

aphylo: How good is our prediction {.t}

aphylo: How good is our prediction (cont. 1) {.t}

Preliminary Results

Agenda {.t}

A simulation study {.t label=sim-setup}

A simulation study {.t}

A simulation study {.t}

Concluding Remarks

Concluding Remarks

Formal definitions {.t label=formaldef}

Leaf node probabilities {.t label=leafnodesprob}

Internal node probabilities {.t label=internalnodeprob}

Likelihood of the tree {.t label=likelihood}

A simulation study {.t label=sim-convergence}

R Package Documentation

Browse R Packages

We want your feedback!

USCbiostats/phylogenetic
Statistical Inference and Prediction of Annotations in Phylogenetic Trees

The `amcmc` R package

The `aphylo` R package

`aphylo` in a nutshell

`aphylo`: Simulating Trees {.t}

`aphylo`: Visualizing annotated data {.c}

`aphylo`: Tree peeling {.t}

`aphylo`: Reading PantherDB data

`aphylo`: Reading PantherDB data (cont.)

`aphylo`: Predictions of the model {.t}

`aphylo`: Predictions of the model (cont.) {.t}

`aphylo`: How good is our prediction {.t}

`aphylo`: How good is our prediction (cont. 1) {.t}