In USCbiostats/aphylo: Statistical Inference and Prediction of Annotations in Phylogenetic Trees

library(aphylo)
knitr::knit_hooks$set(smallsize = function(before, options, envir) {
    if (before) {
        "\\footnotesize\n\n"
    } else {
        "\n\\normalsize\n\n"
    }
})

options(digits = 4)
knitr::opts_chunk$set(
  echo = TRUE, warning = FALSE, message = FALSE, echo=FALSE,
  out.width = ".7\\linewidth", fig.align = "center", fig.width = 7, fig.height = 5)

library(aphylo)

# Setting the seed and simulating the data
set.seed(2)
x <- raphylo(40, P = 2)

# Droping some annotations
x <- rdrop_annotations(x, .5)

The problem

plot(x, main = "")

Last year's EAC

In brief

Prof. Suchard observations on our model.
In the summary, the EAC pointed out that taxon constraints should be included.
Also, we need to develop a strategy to raise awareness about our work: algorithms and software.
Finally, both Project 2 and Core B would benefit from reaching out with other experts on data and research groups working on nearby research areas.

Notation {.t}

\begincols

\begincol{.25\linewidth} \includegraphics[height=.8\textheight]{fig/annotated-tree.pdf} \endcol

\begincol{.74\linewidth}

\begin{table}[tb] \centering \begin{tabular}{lm{.7\linewidth}} \toprule Symbol & Description \ \midrule $\phylo \equiv (\nodes, \edges)$ & Phylogenetic Tree.\ $\parent{n}$ & Parent of node $n$. \ $\offspring{n}$ & Offspring of node $n$.\ $\Ann \equiv {\ann{n}}{n\in\nodes}$ & True annotations.\ $\AnnObs \equiv {\annObs{n}}{n\in\nodes}$ & Experimental annotations.\ $\aphylo \equiv (\phylo, \Ann)$ & Annotated Phylogenetic Tree.\ $\aphyloObs \equiv (\phylo, \AnnObs)$ & Experimentally Annotated Phylogenetic Tree.\ $\aphyloObs_n$ & Induced Experimentally Annotated Sub-tree of node $n$. \ $\aphyloObs_n^c$ & Complement of $\aphyloObs_n$. \ \bottomrule \end{tabular} \caption{Mathematical Notation\label{tab:notation}} \end{table}

\endcol \endcols

Recap: Model {.t}

\begincols

\begincol{.475\linewidth}

\begin{enumerate} \item A probabilistic model of gene function evolution,

\item The probability that the root node has the function is $\pi$,

\item Conditional on its parent state, the probabilities that any given node has to either gain or lose a function are $(\gain,\loss)$,

\item \only<1>{Finally}\only<2->{\sout{Finally}}, at the leaf node, the probability that a node with no function is mislabeled as having the function is $\misszero$. Conversely, the probability that a node with a function is mislabeled as not having the function is $\missone$.

\only<2>{\item Finally, curators will report their discovery of function \emph{present}/\emph{absent} with probability $\reportzero/\reportone$.} \end{enumerate}

\endcol

\begincol{.475\linewidth}

\begin{table}[tb] \centering \begin{tabular}{lm{.7\linewidth}} \toprule Parameter & Probability \ \midrule $\pi$ & The root node has the function \ $\gain$ & Gaining a function \ $\loss$ & losing a function \ $\misszero$ & Mislabeling a 0 \ $\missone$ & Mislabeling a 1 \ \only<2>{$\reportzero$} & \only<2>{Propensity to report a 0} \ \only<2>{$\reportone$} & \only<2>{Propensity to report a 1} \ \bottomrule \end{tabular} \caption{Model parameters\label{tab:parameters}} \end{table}

\endcol

\endcols

Changes from last year

From the formal (statistical) stand point

Prediction function: Right mathematical definition of the model prediction.
New set of parameters: Propensity to report a finding.
Flexible model specification: Definition of the likelihood function for different sets of parameters

By-products generated during the implementation

The sluRm R package: A light-weight interface to Slurm.
Improvements on the amcmc R package, notably: automatic stop.

Recap: The aphylo R package

Features:

Provides a representation of annotated partially ordered trees. \pause
Integrates the ape package (most used Phylogenetics R package with ~25K downloads/month) \pause
Implements the log-likelihood calculation of our model (with C++ under-the-hood).\pause

Some new features

Model specification via formula.
Added the $\eta$ parameter.
Two implementations of the prediction function (using a post-order algorithm as suggested by Prof. Suchard), and a brute force method... we use this for unit tests.
(in the amcmc R package) Convergence monitoring and automatic stop of the MCMC algorithm.

Nice visualizations

plot_logLik(x ~ mu + psi + Pi)

ans <- aphylo_mcmc(x ~ mu + psi + Pi, priors = bprior())
plot(prediction_score(ans), main = "", which.fun=1L)

Flexible model specification

Automatic specification of the likelihood function, e.g.

\small

x ~ mu baseline model
x ~ mu + psi + Pi model including mislabeling and root node probabilities
x ~ mu + Pi same as before, but excluding mislabeling
x ~ mu + psi(1) + Pi mislabeling of 1 is fixed
x ~ mu + psi(0, 1) + Pi mislabeling of 0s and 1s is fixed

\normalsize

Flexible model specification

\footnotesize

ans

\normalsize

Simulation study

Using the entire Panther data set (~13,000 families), we applied our model's data generating process to annotate trees.\pause

Four different scenarios:\pause

Gold standard: Estimation of the model on fully annotated trees\pause
Missing data: Estimation of the model with missing annotations [from 10% to 90% missigness]\pause
Propensity to report (a): Same data as scenario 2, but we drop more observations with probabilities $\reportzero, \reportone$. Estimation does not include $\eta$.\pause
Propensity to report (b): Sames as scenario 3, but we include $\eta$.

Gold standard: Bias (small trees) {.t}

\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/01-gold-standard-bias_plots_tree-size=small.pdf} \end{figure}

Gold standard: Bias (mid-small trees) {.t}

\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/01-gold-standard-bias_plots_tree-size=mid-small.pdf} \end{figure}

Gold standard: Bias (mid-large trees) {.t}

\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/01-gold-standard-bias_plots_tree-size=mid-large.pdf} \end{figure}

Gold standard: Bias (large trees) {.t}

\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/01-gold-standard-bias_plots_tree-size=large.pdf} \end{figure}

Gold standard: Prediction {.t}

\begin{figure} \includegraphics[height=.8\textheight]{fig/01-gold-standard-auc.pdf} \end{figure}

Gold standard: Convergence {.t}

\begin{figure} \includegraphics[height=.8\textheight]{fig/01-gold-standard-gelman.pdf} \end{figure}

Missing data: Bias (small trees) {.t}

\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/02-missing-bias_plots_tree-size=small.pdf} \end{figure}

Missing data: Bias (mid-small trees) {.t}

\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/02-missing-bias_plots_tree-size=mid-small.pdf} \end{figure}

Missing data: Bias (mid-large trees) {.t}

\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/02-missing-bias_plots_tree-size=mid-large.pdf} \end{figure}

Missing data: Bias (large trees) {.t}

\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/02-missing-bias_plots_tree-size=large.pdf} \end{figure}

Missing data: Prediction {.t}

\begin{figure} \includegraphics[height=.8\textheight]{fig/02-missing-auc.pdf} \end{figure}

Missing data: Convergence {.t}

\begin{figure} \includegraphics[height=.8\textheight]{fig/02-missing-gelman.pdf} \end{figure}

Does $\eta$ improves the model? Prediction

\begincols

\begincol{.49\linewidth}

\begin{figure}\centering \includegraphics[width=.65\textheight]{fig/03-pub-bias-auc.pdf} \caption{Misspecified model (does not include $\eta$)} \end{figure}

\endcol

\begincol{.49\linewidth}

\begin{figure}\centering \includegraphics[width=.65\textheight]{fig/04-full-model-auc.pdf} \caption{Correct specification (includes $\eta$)} \end{figure}

\endcol

\endcols

Status of the paper {.t}

\centering \includegraphics[width=1\linewidth]{pages.pdf}

Concluding remarks

A parsimonious model of gene functions: easy to apply on a large scale (we already ran some simulations using all 13,000 trees from PantherDB... and it took us less than 1 ~~week~~ hour with ~~10~~ 240 processors ~~only~~).\pause

Already implemented, we are currently in the stage of ~~writing the paper and setting up the simulation study~~ finishing and submitting the paper.\pause
For the next steps, we are evaluating whether to include or how to include:\pause
- Type of node: speciation, duplication, horizontal transfer.
- Branch lengths
- Correlation structure between functions
- ~~Using Taxon Constraints to improve predictions~~
- Hierarchical model: Use fully annotated trees by curators as prior information.
We are still unsure about how to proceed with the software: R journal? Journal of Open Source Software? Journal of Statistical Software? Bioinformatics? etc.

\begin{center} \huge \color{USCCardinal}{\textbf{Thank you!}} \end{center}

\maketitle

USCbiostats/aphylo documentation built on Oct. 28, 2023, 7:22 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

USCbiostats/aphylo
Statistical Inference and Prediction of Annotations in Phylogenetic Trees

In USCbiostats/aphylo: Statistical Inference and Prediction of Annotations in Phylogenetic Trees

The problem

Last year's EAC

Notation {.t}

Recap: Model {.t}

Changes from last year

Recap: The aphylo R package

Nice visualizations

Flexible model specification

Flexible model specification

Simulation study

Gold standard: Bias (small trees) {.t}

Gold standard: Bias (mid-small trees) {.t}

Gold standard: Bias (mid-large trees) {.t}

Gold standard: Bias (large trees) {.t}

Gold standard: Prediction {.t}

Gold standard: Convergence {.t}

Missing data: Bias (small trees) {.t}

Missing data: Bias (mid-small trees) {.t}

Missing data: Bias (mid-large trees) {.t}

Missing data: Bias (large trees) {.t}

Missing data: Prediction {.t}

Missing data: Convergence {.t}

Does $\eta$ improves the model? Prediction

Status of the paper {.t}

Concluding remarks

R Package Documentation

Browse R Packages

We want your feedback!

USCbiostats/aphylo Statistical Inference and Prediction of Annotations in Phylogenetic Trees

In USCbiostats/aphylo: Statistical Inference and Prediction of Annotations in Phylogenetic Trees

The problem

Last year's EAC

Notation {.t}

Recap: Model {.t}

Changes from last year

Recap: The aphylo R package

Nice visualizations

Flexible model specification

Flexible model specification

Simulation study

Gold standard: Bias (small trees) {.t}

Gold standard: Bias (mid-small trees) {.t}

Gold standard: Bias (mid-large trees) {.t}

Gold standard: Bias (large trees) {.t}

Gold standard: Prediction {.t}

Gold standard: Convergence {.t}

Missing data: Bias (small trees) {.t}

Missing data: Bias (mid-small trees) {.t}

Missing data: Bias (mid-large trees) {.t}

Missing data: Bias (large trees) {.t}

Missing data: Prediction {.t}

Missing data: Convergence {.t}

Does $\eta$ improves the model? Prediction

Status of the paper {.t}

Concluding remarks

R Package Documentation

Browse R Packages

We want your feedback!

USCbiostats/aphylo
Statistical Inference and Prediction of Annotations in Phylogenetic Trees