library(aphylo)
knitr::knit_hooks$set(smallsize = function(before, options, envir) {
  if (before) {
    "\\footnotesize\n\n"
  } else {
    "\n\\normalsize\n\n"
  }
})
options(digits = 4)
knitr::opts_chunk$set(
  warning = FALSE, message = FALSE, echo = FALSE,
  out.width = ".7\\linewidth", fig.align = "center",
  fig.width = 7, fig.height = 5
)
library(aphylo)

# Setting the seed and simulating the data
set.seed(2)
x <- raphylo(40, P = 2)

# Dropping some annotations
x <- rdrop_annotations(x, .5)
plot(x, main = "")
In brief
Prof. Suchard's observations on our model.
In the summary, the EAC pointed out that taxon constraints should be included.
Also, we need to develop a strategy to raise awareness about our work: algorithms and software.
Finally, both Project 2 and Core B would benefit from reaching out to other data experts and to research groups working in nearby research areas.
\begincols
\begincol{.25\linewidth} \includegraphics[height=.8\textheight]{fig/annotated-tree.pdf} \endcol
\begincol{.74\linewidth}
\begin{table}[tb]
\centering
\begin{tabular}{lm{.7\linewidth}}
\toprule
Symbol & Description \\
\midrule
$\phylo \equiv (\nodes, \edges)$ & Phylogenetic Tree. \\
$\parent{n}$ & Parent of node $n$. \\
$\offspring{n}$ & Offspring of node $n$. \\
$\Ann \equiv \{\ann{n}\}_{n\in\nodes}$ & True annotations. \\
$\AnnObs \equiv \{\annObs{n}\}_{n\in\nodes}$ & Experimental annotations. \\
$\aphylo \equiv (\phylo, \Ann)$ & Annotated Phylogenetic Tree. \\
$\aphyloObs \equiv (\phylo, \AnnObs)$ & Experimentally Annotated Phylogenetic Tree. \\
$\aphyloObs_n$ & Induced Experimentally Annotated Sub-tree of node $n$. \\
$\aphyloObs_n^c$ & Complement of $\aphyloObs_n$. \\
\bottomrule
\end{tabular}
\caption{Mathematical Notation\label{tab:notation}}
\end{table}
\endcol \endcols
\begincols
\begincol{.475\linewidth}
\begin{enumerate} \item A probabilistic model of gene function evolution,
\item The probability that the root node has the function is $\pi$,
\item Conditional on the state of its parent, the probabilities that a node gains or loses a function are $(\gain,\loss)$,
\item \only<1>{Finally}\only<2->{\sout{Finally}}, at the leaf node, the probability that a node with no function is mislabeled as having the function is $\misszero$. Conversely, the probability that a node with a function is mislabeled as not having the function is $\missone$.
\only<2>{\item Finally, curators will report their discovery of function \emph{absent}/\emph{present} with probability $\reportzero/\reportone$.} \end{enumerate}
\endcol
\begincol{.475\linewidth}
\begin{table}[tb]
\centering
\begin{tabular}{lm{.7\linewidth}}
\toprule
Parameter & Probability \\
\midrule
$\pi$ & The root node has the function \\
$\gain$ & Gaining a function \\
$\loss$ & Losing a function \\
$\misszero$ & Mislabeling a 0 \\
$\missone$ & Mislabeling a 1 \\
\only<2>{$\reportzero$} & \only<2>{Propensity to report a 0} \\
\only<2>{$\reportone$} & \only<2>{Propensity to report a 1} \\
\bottomrule
\end{tabular}
\caption{Model parameters\label{tab:parameters}}
\end{table}
\endcol
\endcols
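
The parameters listed above can also be written as conditional probabilities. The following is only a sketch in the notation of Table~\ref{tab:notation}; the symbol $r$ for the root node is introduced here for illustration and is not part of the original notation. The last line applies at the leaf nodes only.

\begin{align*}
\Pr(\ann{r} = 1) &= \pi, \\
\Pr(\ann{n} = 1 \mid \ann{\parent{n}} = 0) &= \gain, &
\Pr(\ann{n} = 0 \mid \ann{\parent{n}} = 1) &= \loss, \\
\Pr(\annObs{n} = 1 \mid \ann{n} = 0) &= \misszero, &
\Pr(\annObs{n} = 0 \mid \ann{n} = 1) &= \missone.
\end{align*}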
From the formal (statistical) standpoint
Prediction function: A correct mathematical definition of the model's prediction.
New set of parameters: Propensity to report a finding.
Flexible model specification: Definition of the likelihood function for different sets of parameters.
By-products generated during the implementation
The sluRm R package: A lightweight interface to Slurm.
Improvements to the amcmc R package, notably automatic stopping.
Features:
Provides a representation of annotated partially ordered trees. \pause
Integrates with the ape package (the most used phylogenetics R package, with ~25K downloads/month). \pause
Implements the log-likelihood calculation of our model (with C++ under-the-hood).\pause
Some new features
Model specification via formula.
Added the $\eta$ parameter.
Two implementations of the prediction function: one using a post-order algorithm (as suggested by Prof. Suchard) and one brute-force method, which we use for unit tests.
(In the amcmc R package) Convergence monitoring and automatic stopping of the MCMC algorithm.
plot_logLik(x ~ mu + psi + Pi)
ans <- aphylo_mcmc(x ~ mu + psi + Pi, priors = bprior())
plot(prediction_score(ans), main = "", which.fun = 1L)
Automatic specification of the likelihood function, e.g.
\small
x ~ mu
baseline model
x ~ mu + psi + Pi
model including mislabeling and root node probabilities
x ~ mu + Pi
same as before, but excluding mislabeling
x ~ mu + psi(1) + Pi
mislabeling of 1 is fixed
x ~ mu + psi(0, 1) + Pi
mislabeling of 0s and 1s is fixed
\normalsize
\footnotesize
ans
\normalsize
Using the entire Panther data set (~13,000 families), we applied our model's data generating process to annotate trees.\pause
Four different scenarios:\pause
Gold standard: Estimation of the model on fully annotated trees\pause
Missing data: Estimation of the model with missing annotations [from 10% to 90% missingness]\pause
Propensity to report (a): Same data as scenario 2, but we drop more observations with probabilities $\reportzero, \reportone$. Estimation does not include $\eta$.\pause
Propensity to report (b): Same as scenario 3, but we include $\eta$.
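
To make the scenarios above concrete, the data-generating process can be sketched in a few lines of base R. This is not the aphylo implementation: the function name `sim_annotations`, the parent-vector encoding of the tree, and the choice to apply the reporting probabilities to the (possibly mislabeled) leaf label are all assumptions made for illustration.

```r
# Hypothetical helper (not part of aphylo): simulate one function on a tree.
# `parent[i]` is the index of node i's parent (NA for the root); nodes are
# assumed to be ordered so that every parent appears before its offspring.
# mu  = c(gain, loss), psi = c(mislabel a 0, mislabel a 1),
# eta = c(report a 0, report a 1).
sim_annotations <- function(parent, pi, mu, psi, eta) {
  n     <- length(parent)
  state <- integer(n)

  # True states: root drawn with probability pi, then gain/loss along edges
  for (i in seq_len(n)) {
    if (is.na(parent[i])) {
      state[i] <- rbinom(1, 1, pi)
    } else if (state[parent[i]] == 0L) {
      state[i] <- rbinom(1, 1, mu[1])       # gain
    } else {
      state[i] <- 1L - rbinom(1, 1, mu[2])  # loss
    }
  }

  # Leaves are the nodes that are nobody's parent
  is_leaf <- !(seq_len(n) %in% parent)

  # Mislabeling: flip the true state with probability psi[1] or psi[2]
  flip     <- rbinom(n, 1, ifelse(state == 0L, psi[1], psi[2]))
  observed <- ifelse(flip == 1L, 1L - state, state)

  # Reporting (assumed to depend on the observed label): unreported
  # leaves and all interior nodes end up as missing annotations
  reported <- rbinom(n, 1, ifelse(observed == 0L, eta[1], eta[2])) == 1L
  observed[!is_leaf | !reported] <- NA_integer_

  list(state = state, observed = observed, is_leaf = is_leaf)
}

# Example: a five-node tree (node 1 is the root; nodes 3, 4, 5 are leaves)
set.seed(2)
parent <- c(NA, 1L, 1L, 2L, 2L)
sim <- sim_annotations(parent, pi = .5, mu = c(.1, .05),
                       psi = c(.05, .05), eta = c(.5, .9))
```

Setting `eta = c(1, 1)` switches off the reporting step, which corresponds to fully annotated leaves as in the gold-standard scenario; lowering either entry produces the extra missingness of scenarios 3 and 4.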
\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/01-gold-standard-bias_plots_tree-size=small.pdf} \end{figure}
\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/01-gold-standard-bias_plots_tree-size=mid-small.pdf} \end{figure}
\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/01-gold-standard-bias_plots_tree-size=mid-large.pdf} \end{figure}
\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/01-gold-standard-bias_plots_tree-size=large.pdf} \end{figure}
\begin{figure} \includegraphics[height=.8\textheight]{fig/01-gold-standard-auc.pdf} \end{figure}
\begin{figure} \includegraphics[height=.8\textheight]{fig/01-gold-standard-gelman.pdf} \end{figure}
\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/02-missing-bias_plots_tree-size=small.pdf} \end{figure}
\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/02-missing-bias_plots_tree-size=mid-small.pdf} \end{figure}
\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/02-missing-bias_plots_tree-size=mid-large.pdf} \end{figure}
\begin{figure}\centering \includegraphics[height=.8\textheight]{fig/02-missing-bias_plots_tree-size=large.pdf} \end{figure}
\begin{figure} \includegraphics[height=.8\textheight]{fig/02-missing-auc.pdf} \end{figure}
\begin{figure} \includegraphics[height=.8\textheight]{fig/02-missing-gelman.pdf} \end{figure}
\begincols
\begincol{.49\linewidth}
\begin{figure}\centering \includegraphics[width=.65\textheight]{fig/03-pub-bias-auc.pdf} \caption{Misspecified model (does not include $\eta$)} \end{figure}
\endcol
\begincol{.49\linewidth}
\begin{figure}\centering \includegraphics[width=.65\textheight]{fig/04-full-model-auc.pdf} \caption{Correct specification (includes $\eta$)} \end{figure}
\endcol
\endcols
\centering \includegraphics[width=1\linewidth]{pages.pdf}
A parsimonious model of gene functions: easy to apply on a large scale (we already ran some simulations using all 13,000 trees from PantherDB... and it took us less than 1 ~~week~~ hour with ~~10~~ 240 processors ~~only~~).\pause
Already implemented, we are currently in the stage of ~~writing the paper and setting up the simulation study~~ finishing and submitting the paper.\pause
For the next steps, we are evaluating whether and how to include:\pause
Type of node: speciation, duplication, horizontal transfer.
Branch lengths
Correlation structure between functions
~~Using Taxon Constraints to improve predictions~~
Hierarchical model: Use fully annotated trees by curators as prior information.
We are still unsure about how to proceed with the software: R journal? Journal of Open Source Software? Journal of Statistical Software? Bioinformatics? etc.
\begin{center} \huge \color{USCCardinal}{\textbf{Thank you!}} \end{center}
\maketitle