In markrobinsonuzh/treeAGG: Tree Aggregation

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Introduction

We would like to simulate a count table, where some rows have differential abundance between different phenotypic outcomes. The count table has entities in rows and samples in columns. The easy way to do it is to randomly select some rows and increase or decrease the counts of samples with certain phenotypic outcomes. However, this might cause issues in compositional data because the proportion of a row might change even though its abundance has not changed. To avoid such issues, instead of adding counts to specific rows, we swap abundances between groups of rows. The group is based on the hierarchical structure of the entities, and groups with different sizes represent different hierarchical level.

Method

```r Different simulated signal patters. The branches that have abundance phenotypic outcome associated are colored, and the others are shrinked to save space. The change direction is shown as color: blue (increase) and orange (decrease). The size of the points represents the amount of change (e.g., fold change) "}

suppressPackageStartupMessages({ library(treeAGG) library(ggtree) library(ggplot2) library(cowplot) })

data("exTree")

leaf1 <- findOS(tree = exTree, ancestor = 52, only.leaf = TRUE) l1 <- length(leaf1) leaf2 <- findOS(tree = exTree, ancestor = 76, only.leaf = TRUE) l2 <- length(leaf2) leaf3 <- c(4, 3, 6, 14, 10, 9)

fig1 <- treePlot(tree = exTree, branch.length = "none", branch = c(52, 76), col.branch = c("blue", "orange"), col.other = "grey30", zoomNode = c(66, 69, 95, 85), zoomScale = 1/20 ) + geom_point2(aes(subset = (node %in% leaf1)), color = "blue", size = 2) + geom_point2(aes(subset = (node %in% leaf2)), color = "orange", size = 2) + coord_flip() + scale_x_reverse()

fig2 <- treePlot(tree = exTree, branch.length = "none", branch = c(52, 76), col.branch = c("blue", "orange"), col.other = "grey30", zoomNode = c(66, 69, 95, 85), zoomScale = 1/20) + geom_point2(aes(subset = (node %in% leaf1)), color = "blue", size = runif(l1)3) + geom_point2(aes(subset = (node %in% leaf2)), color = "orange", size = runif(l2)3) + coord_flip() + scale_x_reverse()

fig3 <- treePlot(tree = exTree, branch.length = "none", branch = c(leaf3, 76), col.branch = c(rep("blue", length(leaf3)), "orange"), col.other = "grey30", zoomNode = c(66, 69, 95, 85), zoomScale = 1/20) + geom_point2(aes(subset = (node %in% leaf3)), color = "blue", size = 2) + geom_point2(aes(subset = (node %in% leaf2)), color = "orange", size = 2)+ coord_flip() + scale_x_reverse()

plot_grid(fig1, fig2, fig3, labels = c("S1", "S2", "S3"), nrow = 1)

The counts of entities in a sample are assumed to follow a Dirichlet Multinomial
distribution as has been suggested in several articles
[@Zhao2015,@Tang2016a,@Tang2016b,@Wu2016]:
$$\mathbf{x}=(x_1,\dots,x_K) \sim \text{dirmult}(n, \mathbf{\alpha})$$
\begin{equation}
\Pr(\mathbf{x}\mid\boldsymbol{\alpha})=
\frac{\left(n!\right)\Gamma\left(\alpha_0\right)}
{\Gamma\left(n+\alpha_0\right)}\prod_{k=1}^K
\frac{\Gamma(x_{k}+\alpha_{k})}{\left(x_{k}!\right)\Gamma(\alpha_{k})}
(\#eq:dirmult1)
\end{equation}

$\mathbf{x}=(x_1,\dots,x_K)$ are the counts of $K$ entities, $n$ is the
library size of a sample, $\mathbf{\alpha} = (\alpha_1, \alpha_2, ...,
\alpha_K)$ are the parameters for the distribution, and $\alpha_0 = \sum
\alpha_k$.

In the `r CRANpkg("dirmult")` package, the Dirichlet Multinomial distribution
(equation \@ref(eq:dirmult1)) is reparameterized with $\mathbf{\pi} = (\pi_1,
..., \pi_K)$ and $\theta$ so that $\alpha_0 = \frac{(1-\theta)}{\theta}$ and
$\alpha_k = \pi_k \alpha_0$. Here, $\pi_k$ is the expected proportion of entity
$k$ in a sample.

The technical details in the simulation procedure are as below.

1. **Estimate the parameters for simulation from a real data.** Estimate the
proportions $\mathbf{\hat{\pi}} = (\hat{\pi}_{1}, \hat{\pi}_{2}, ...,
\hat{\pi}_{m})$ and the overdispersion $\hat{\theta}$ from the real data
\cite{Charlson2010} via maximum likelihood using package `r CRANpkg("dirmult")`.

2. **Randomly select two branches with different proportions to swap.** The
selected branches are zoomed in (see Figure \@ref(fig:scenarioDiff)). We provide
3 simulation scenarios (S1, S2, and S3), which differ in how the swap is
performed.

+ S1 (the left plot in Figure \@ref(fig:scenarioDiff)): The proportions of
entities in sample $i$ are $\mathbf{p}^{G1}_{i}$ and $\mathbf{p}^{G2}_{i}$ for
the group G1 and G2 (the phenotypic outcome), respectively. The values of
$\mathbf{p}^{G1}_{i}$ and $\mathbf{p}^{G2}_{i}$ are randomly generated from the
Dirichlet distribution as below.
$$\mathbf{p}^{G1}_{i} \sim \text{Dirichlet}(\mathbf{\hat{\pi}}, \hat{\theta}).$$
$$\mathbf{p}^{G2}_{i} \sim \text{Dirichlet}(\mathbf{\hat{\pi}'}, \hat{\theta})$$
Compared with group G1, the proportions of entities on the blue and orange
branch have been changed correspondingly by a common factor $r$ or $\frac{1}{r}$
in group G2.
\begin{equation}
\left\{\begin{matrix}
 {\hat{\pi}}'_{k} =\hat{\pi}_{k}; &  k \notin \text{colored branches}\\
 {\hat{\pi}}'_{k} = r\hat{\pi}_{k}; &  k \in \text{ Blue branch } \\
 {\hat{\pi}}'_{k} =\frac{1}{r}\hat{\pi}_{k}; &  k \in \text{ Orange branch}
 \end{matrix}\right.
 (\#eq:s1)
\end{equation}
where $r = \frac{\sum_{k \in Orange} \hat{\pi}_{k}}{\sum_{k \in Blue}
\hat{\pi}_{k}}$ is the fold change.

+ S2 (the middle plot in Figure \@ref(fig:scenarioDiff)): The proportions of
entities in sample $i$ are $\mathbf{p}^{G1}_{i}$ and
$\mathbf{p}^{G2}_{i}$ for the group G1 and G1, respectively. They are randomly
generated from the Dirichlet distribution as below.
$$\mathbf{p}^{G1}_{i} \sim \text{Dirichlet}(\bf{\hat{\pi}}, \hat{\theta}).$$
$$\mathbf{p}^{G2}_{i} \sim \text{Dirichlet}(\hat{\pi}', \hat{\theta})$$
Compared with group G1, the proportions of entities on the colored branches have
changed by a common factor $r_k$ in group G2. $r_k$ varies among entities in the
same branch, but they are either all above one or all below one.
\begin{equation}
(\#eq:s2)
\left\{\begin{matrix}
 {\hat{\pi}}'_{k} =\hat{\pi}_{k}; &  k \notin \text{colored branches}\\
  {\hat{\pi}}'_{k} = r_k \hat{\pi}_{k}, 
  \quad r_k = 1 + \frac{\sum_{t \in Orange} \hat{\pi}_{t} -
  \sum_{t \in Blue} \hat{\pi}_{t}}{\sum_{k \in Blue} \hat{\pi}_{k} u_k} u_k;
  & k \in \text{ Blue branch } \\
  {\hat{\pi}}'_{k} = r_k \hat{\pi}_{k}, \quad r_k = 1 + \frac{\sum_{t \in Blue}
  \hat{\pi}_{t} - \sum_{t \in Orange} \hat{\pi}_{t}}{\sum_{k \in Orange} 
  \hat{\pi}_{k} u_k} u_k & k \in \text{ Orange branch}
 \end{matrix}\right.
\end{equation}
where $u_k$ is randomly generated from a uniformly distributed function ($u_k
\sim \text{U}(0, 1)$). The proof that the proportions of entities on the blue
branch is swapped with that on the orange branch is in the
[Appendix](#appendix).

+ S3 (the right plot in Figure \@ref(fig:scenarioDiff)): As shown in Figure
\@ref(fig:scenarioDiff), all entities in one of the selected branches are
colored as blue in S1, but only some entities from that branch are colored as
blue in S3. This means not all entities on that branch are differentially
abundant in S3. Technically, we could not say those entities with blue color in
Figure \@ref(fig:scenarioDiff) is a branch, but for the convenience in
explanation, they are refered as the blue branch. We allow users to adjust the
proportion of leaf nodes colored as blue in that zoom branch via an argument
`pct` in the `SimData` function.
$$\mathbf{p}^{G1}_{i} \sim \text{Dirichlet}(\mathbf{\hat{\pi}}, \hat{\theta}).$$
$$\mathbf{p}^{G2}_{i} \sim \text{Dirichlet}(\mathbf{\hat{\pi}}', \hat{\theta})$$
Compared with group G1, the proportions of entities on the orange branch have
changed by a common factor $c$ in group G2. The value $c$ could be chosen from 0
to 1. The entities on the blue branch have been changed by a common factor $r$
that depends on the value $c$.

\begin{equation}
\left\{\begin{matrix}
 {\hat{\pi}}'_{k} =\hat{\pi}_{k}; &  k \notin \text{colored branches}\\
 {\hat{\pi}}'_{k} = r \hat{\pi}_{k}; &  k \in \text{ Blue branch } \\
 {\hat{\pi}}'_{k} = c \hat{\pi}_{k}; &  k \in \text{ Orange branch}
 \end{matrix}\right.
 (\#eq:s3)
\end{equation}
where $r =  \frac{(1-c)\sum_{k \in Orange} \hat{\pi}_{k} + \sum_{k \in Blue}
\hat{\pi}_{k}}{\sum_{k \in Blue} \hat{\pi}_{k}}$ and $0 < c < 1$. 


To summarize, there are 3 simulation scenarios available in the function
`simData`. The differential abundance is achieved by swapping the proportions of
two selected branches in group G2. This would make the entities that are not
selected keep their abundance level across groups (phenotypic outcomes).

```r
knitr::include_graphics("scene.png")

In Figure \@ref(fig:scene), the height of the bar represent the proportion. We see the tree structure as three parts, the two selected branches and the rest. The phenotypic outcome is represented by the group (G1 or G2). The proportions of the orange and blue branches are swapped between different groups in scenario S1 and S2. The height of the orange bar in the group G2 is equal to the blue bar in the group G1, and the height of the blue bar in group G2 is the same as that of the orange bar in group G1. In S1, the increase or decrease in the proportion is evenly distributed to the entities on the same branch by a common factor $r$ (on blue branch) or $\frac{1}{r}$ (on orange branch). However, in S2, the change in the proportion is not evenly distributed and the factor $r_k$ varies on the entities. In S3, the swap could be partial and depends on the value $c$. That's why the height of the orange bar in group G2 could be different to that of the blue bar in group G1.

The orange and blue branches are also referred as the signal branches in this vignette.

Example

r Biocpkg("treeAGG") provides a function simData to simulate different abundance patterns on the tree by swapping the relative abundance of two branches. To simulate a count table, a hierarchical structure and a real count table of entities are required. Here, we use a real throat microbiome data set that includes a count table of 856 OTUs (operational taxonomic units) from 60 samples and a phylogenetic tree with 856 leaf nodes and 855 internal nodes [@Charlson2010]. More details about the data could be seen via ?GUniFrac::throat.tree and ?GUniFrac::throat.otu.tab.

We could load the real data as below.

library(GUniFrac)

# the hierarchical struture
data("throat.tree")
# the count table
data("throat.otu.tab")

Signal branches known

If we already know for which two branches we want to swap the relative abundance, we can feed simData the real data via two arguments tree and data, and the branches via from.A and from.B as below. How to manually decide branches to swap could be found in section \@ref(selectBranch). throat.otu.tab has the samples in the rows and entities in the columns, so we need to transpose it before input to simData. To get reproducible result, the random seed number is set.

# simulate count table
set.seed(1122)
datA <- simData(tree = throat.tree,
                data = as.matrix(t(throat.otu.tab)),
                from.A = 1678,
                from.B = 1613)
datA
class(datA)

The output datA is a leafSummarizedExperiment object. More details about the leafSummarizedExperiment class could be found in the help page ?leafSummarizedExperiment or in the vignette Introduction to leafSummarizedExperiment and treeSummarizedExperiment.

The tree and count table could also be combined into a leafSummarizedExperiment object before using as the input for simData.

lse <- leafSummarizedExperiment(tree = throat.tree, 
                                assays = list(t(throat.otu.tab)))
lse
set.seed(1122)
datB <- simData(obj = lse,
                from.A = 1678,
                from.B = 1613)

# Two results are exactly the same
all.equal(datA, datB)

The results, datA and datB, are exactly the same. Each includes a simulated count table in assays and the original input tree throat.tree as tree in metadata. Additional information, such as fold changes of entities on the leaf level (FC), the information of branches that have differential abundance (branch) and the simulation scenario (scenario), is also stored in metadata.

Estimating parameters from real data takes a while. In the vignette, we will do simulation several times using the same real data. To save time, the function parEstimate is used to do the estimation once and save results in the metadata of the leafSummarizedExperiment object for later use.

names(metadata(lse))

# the parameters are saved in the metadata as assays.par
lse <- parEstimate(data = lse)
names(metadata(lse))

The function viewSim can be used to view the differential abundance pattern on the tree structure. The simulation result is assigned to the obj. The tree could be shown in different shapes via layout. If the tree is large, we could shrink branches without differential abundance via zoomScale to save space.

viewSim(obj = datA, layout = "rectangular", zoomScale = 1/100)

If more simulated tables are desired with the same simulation setting, the argument n could be used.

# simulate 2 count tables with the same setting
set.seed(1122)
datC <- simData(obj = lse,
                from.A = 1678,
                from.B = 1613, 
                n = 2)

length(assays(datC))

Signal branches unknown

If we can't decide which branches to have the differential abundance pattern or only decide one of them, we could let simData randomly select the suitable branches. For example, we can select two branches where one of them has twice the abundance of the other (ratio = 2).

# set seed to have reproducible results
# branches are not given
set.seed(1122)
dat1 <- simData(obj = lse,
                ratio = 2)

The information about the selected branches is given in the branch slot of the metadata. A and B are the labels of nodes that connect the selected branches with the other part of the tree. The value in ratio is the ratio of the abundance of branch B to that of branch A. It doesn't have to be exactly the same as the value specified in the argument ratio. When there are not two branches with exactly this ratio availabe in the tree, branches with value closest to the value are found. The number of leaves in these two branches are given in A_tips and B_tips. The relative abundance of the two branches are in A_prop and B_prop in percentage.

metadata(dat1)$branch

Use viewSim to visualize the signal pattern on the tree structure.

viewSim(obj = dat1, layout = "rectangular", zoomScale = 1/50)

If we want to select branches from those with the number of leaves in a given interval, it can be achieved by changing the values of minTip.A, maxTip.A, minTip.B and maxTip.B. Here, one branch is limited to have at least 20 leaves and at most 40 leaves, the other is to have at least 20 and at most 100.

set.seed(1122)
dat2 <- simData(obj = lse, 
                minTip.A = 20, maxTip.A = 40,
                minTip.B = 20, maxTip.B = 100,
                ratio = 2)

viewSim(obj = dat2, layout = "rectangular", zoomScale = 1/100)

Similarly, if branches with a certain abundance level are desired, it can be done via minPr.A and maxPr.A. These two arguments restrict the count proportion level of branch A. One could combine using these two arguments with ratio to restrict the branch B.

set.seed(1122)
dat3 <- simData(obj = lse,
                minTip.A = 20, maxTip.A = 40,
                minTip.B = 20, maxTip.B = 100, 
                maxPr.A = 0.01,
                ratio = 2)
rb <- rbind(metadata(dat2)$branch, metadata(dat3)$branch)
rownames(rb) <- c("dat2", "dat3")
rb

The proportion of branch A is r rb[2, "A_prop"] in dat3. This fulfills the requirement that the relative abundance level of branch A isn't above $0.01$ (maxPr.A= 0.01).

viewSim(obj = dat3, layout = "rectangular", zoomScale = 1/100)

The ratio in dat3 is r metadata(dat3)$branch$ratio, which is different from the value given to the argument ratio = 2. It's because there is not any pair of branches which has ratio exactly equal to 2. The pair having the value closest to 2 will be returned.

If we want to fix one branch, from.A can be used.

# use the branch A selected in dat2.
set.seed(1122)
brA <- metadata(dat2)$branch$A
dat4 <- simData(from.A = brA,
                obj = lse, 
                minTip.A = 20, maxTip.A = 40,
                minTip.B = 20, maxTip.B = 100,
                ratio = 4)

# show the selected branches 
rb <- rbind(metadata(dat2)$branch, metadata(dat4)$branch)
rownames(rb) <- c("dat2", "dat4")
rb

Branch A in dat2 and dat4 are the same.

Different simulation scenarios

We can simulate different differential abundance patterns by changing the scenario. The default scenario is "S1".

# use the branches from dat4
# change scenario to S2
set.seed(1122)
dat5 <- simData(from.A = metadata(dat4)$branch$A,
                from.B = metadata(dat4)$branch$B,
                obj = lse, 
                scenario = "S2")

If scenario = "S3" is used, the argument adjB could also be used. It is the $c$ value in formula \@ref(eq:s3) and could be chosen between 0 and 1. If it's set as NULL, the swap is similar to S1 in that the proportions of the selected branches are swapped. Note that the blue branch here is not the branch specified by from.A. It should be only part of from.A as colored with blue in the S3 of Figure \@ref(fig:scenarioDiff). Users could use pct equal to a value between 0 and 1 to decide the percentage of leaves in from.A should be selected as the blue branch.

# use the branches from dat4
# change scenario to S3
set.seed(1122)
dat6 <- simData(from.A = metadata(dat4)$branch$A,
                from.B = metadata(dat4)$branch$B,
                obj = lse, pct = 0.6, adjB = NULL, 
                scenario = "S3")

If users are interested, the values of $c$ and $r$ in formula \@ref(eq:s3) are stored as FC in the metadata. The names of fc are the node labels and the numeric values are the fold changes of corresponding nodes. Each leaf node has a fold change value. Those have values above one ($r$) are in the blue branch and below one in the orange branch ($c$) (see S3 in Figure \@ref(fig:scenarioDiff))

fc <- metadata(dat6)$FC

p1 <- viewSim(obj = dat4, layout = "circular", zoomScale = 1/200, 
              legend.theme = list(legend.position = c(0.6, 0.4),
                                  legend.background = element_rect(
                                fill="transparent")), 
              legend.title = "S1")
p2 <- viewSim(obj = dat5, layout = "circular", zoomScale = 1/200, 
              legend.theme = list(legend.position = c(0.6, 0.4),
                                  legend.background = element_rect(
                                fill="transparent")),
              legend.title = "S2")
p3 <- viewSim(obj = dat6, layout = "circular", zoomScale = 1/200, 
              legend.theme = list(legend.position = c(0.6, 0.4),
                                  legend.background = element_rect(
                                fill="transparent")),
              legend.title = "S3")

library(cowplot)
plot_grid(p1, p3)
plot_grid(p2)

In scenario S3, points in a colored branch have the same color but different sizes. This means the fold changes of entities in that branch are different in values but same in the direction.

Select branches {#selectBranch}

It's quite normal that a pair of branches that exactly fulfill the requirements specified can't be found, especially when a small tree is provided. To efficiently select suitable branches, we could summarize the information of branches before the simulation using selNode and manually select one or two branches.

The function selNode could be used to list all branches (all = TRUE) that have the number of leaf nodes between 10 (minTip = 10) and 15 (maxTip = 15) and the relative abundance between 0.03 (minPr = 0.03) and 0.05 (maxPr = 0.05) as below.

(sel1 <- selNode(obj = lse, minTip = 11, maxTip = 12, 
                    minPr = 0.03, maxPr = 0.05,  all = TRUE))

The result has listed all branches that meet the requirements. The branch nodes of these branches are given in the column nodeNum (the node number), nodeLab (the node label), and nodeLab_alias (the alias of the node label). Here, the column nodeLab has NA value because the tree has no label for the nodes selected. The relative abundance of these branches, which is a value between 0 and 1, is given in the column proportion. The numbers of descendant leaves on those branches are in column numTip.

We could take one of the branches, for example, the branch on the first row, as the fixed branch. As a pair of branches is required in the simulation, the other branch could be selected randomly by simData or manually as below. Let's say the other branch should have roughly twice abundance as the currently selected branch (minPr = 0.06, maxPr = 0.07). Of course, we don't want a branch that has overlap with the currently selected one. For this reason, skip is used.

(sel2 <- selNode(obj = lse, minPr = 0.06, maxPr = 0.07, 
                 skip = sel1$nodeLab_alias[1],
                 all = TRUE))

With the information above, we could take a branch that meets the requirement of the abundance level and has propor size (number of desendant leaves).

If the sibling branch of a selected branch is need, the function findSibling could be used. The node that connects the sibling branch and the other part of the tree is found and the its node number is returned. The node number is named with the node label. If the tree has no label for the output node and use.alias = FALSE is specified, NA is used as the name; otherwise, an alias of the node label is used (use.alias = TRUE). An alias of a node label is created by prefixing "Node_" to the node number for internal nodes and "Leaf_" for leaf nodes.

findSibling(tree = metadata(lse)$tree, 
            input = sel1$nodeLab_alias[3],
            use.alias = TRUE)

If users are curious about the descendant leaves of the selected branch, the function findOS could be used. To see the descendants of a specific node, we could put the node number or label in ancestor.

findOS(tree = metadata(lse)$tree, ancestor = sel2$nodeNum[6])

Appendix {#appendix}

Proof: $$\sum_{g \in Blue} \hat{\pi}'{g} = \sum{g \in Orange} \hat{\pi}{g}$$ $$\begin{aligned} \sum{g \in Blue} \hat{\pi}'{g} &= \sum{g \in Blue} r_g \hat{\pi}{g} \ &= \sum{g \in Blue} \hat{\pi}{g} r_g \ &= \sum{g \in Blue} \hat{\pi}{g}(1 + \frac{\sum{t \in Orange} \hat{\pi}{t} - \sum{t \in Blue} \hat{\pi}{t}}{\sum{t \in Blue} \hat{\pi}{t} u_t} u_g) \ &= \sum{g \in Blue} \hat{\pi}{g}(1 + \frac{\sum{t \in Orange} \hat{\pi}{t} - \sum{t \in Blue} \hat{\pi}{t}}{\sum{t \in Blue} \hat{\pi}{t} u_t} u_g) \ &= \sum{g \in Blue} \hat{\pi}{g} + \sum{g \in Blue} \hat{\pi}{g} \frac{\sum{t \in Orange} \hat{\pi}{t} - \sum{t \in Blue} \hat{\pi}{t}}{\sum{t \in Blue} \hat{\pi}{t} u_t} u_g \ &= \sum{g \in Blue} \hat{\pi}{g} + (\sum{t \in Orange} \hat{\pi}{t} - \sum{t \in Blue} \hat{\pi}{t}) \frac{\sum{g \in Blue} \hat{\pi}{g} u_g}{\sum{t \in Blue} \hat{\pi}{t} u_t} \ & = \sum{t \in Orange} \hat{\pi}{t} \ & = \sum{g \in Orange} \hat{\pi}_{g} \end{aligned}$$