knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Introduction

The arrangement of hypotheses in a hierarchical structure appears in many research fields and often indicates different resolutions at which data can be viewed and results can be interpreted. Examples include studies of microbial abundance, where the OTUs (operational taxonomic units, often assumed to represent species) can be arranged as leaves in a phylogenetic tree, and single-cell studies, where the individual cells can be organized hierarchically, e.g. via clustering. In both these examples, a question arises as to the resolution level at which the signal in the data (e.g., differential abundance of features between groups) is best interpreted, or put another way, at which resolution the main 'driver' of the signal can be found. For example, in the microbial abundance data, a whole family of species may react similarly to a particular stimulus. In such situations, it would be more informative to summarize the differential abundance analysis at this level of resolution, rather than providing a long list of individual differentially abundant OTUs. An additional reason for aggregating data to higher levels of the phylogenetic tree is that the abundances of individual species or OTUs may be low, limiting the power of the statistical test to detect significant differences, whereas aggregating the signal from multiple related OTUs can increase the detection power. Similarly, for the single-cell data example, if a treatment truly acts at a high level, affecting the abundance of, e.g., all B-cells, it is helpful if this is picked up by the analysis pipeline, rather than returning a large number of significantly differentially abundant subclusters of the main B-cell cluster. The r Biocpkg("treeAGG") package is designed to analyze data that can be organized in a hierarchical structure, and to find the most informative resolution at which to interpret the signal of interest.

Methods {#flowchart}

The main structure of the pipeline provided in r Biocpkg("treeAGG")

knitr::include_graphics("Flowchart.png")

In the r Biocpkg("treeAGG") package, we provide an easy-to-use pipeline (Figure \@ref(fig:flowchart)) to find the most informative level of the hierarchical structure for interpretation. The starting point for the analysis is a hierarchical structure (e.g., a phylogenetic tree), a data matrix with values for all leaves in the hierarchical structure, and annotation tables for rows and columns of the matrix. From here, users can follow the standard five steps outlined in Figure \@ref(fig:flowchart) to find the main 'driver' of the signal. The details of the five steps are listed below, and an example of each step is given in Section \@ref(example).

To perform a customized analysis, the extended data matrix can be extracted after step 2, and the results of this analysis can be integrated back into the pipeline to carry out steps 4 and 5. Note that the extracted table has more rows than the original table because it includes data for the entities that represent the internal nodes of the tree. In each step, the available functions are listed in blue text.

We suggest that users come back to this workflow (Figure \@ref(fig:flowchart)) after reading Section \@ref(example).

More details {#details}

Consider the case where we have a table containing measurements of entities collected from samples with different phenotypic outcomes, and a tree that represents the hierarchical structure of the entities. Entities are in the rows of the table, and samples are in the columns. Each entity can be mapped to a node of the tree.

To arrange hypotheses in the hierarchical structure, each node of the tree must be mapped to a row of the table. However, in most cases, only the data of the entities corresponding to the leaf nodes of the tree can be observed. For the internal nodes, the data has to be generated from that of the leaf nodes. Users can decide how to reasonably create the data for the internal nodes. For abundance data, we suggest creating the value for an internal node by summing the counts of its descendant leaf nodes, as illustrated below.
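
A minimal sketch of this aggregation rule, using hypothetical counts and node names (not the toy data used later in this vignette):

# hypothetical counts for three leaf entities across two samples
leaf_counts <- matrix(c(10, 4, 6,
                        12, 3, 9),
                      nrow = 3,
                      dimnames = list(c("leafA", "leafB", "leafC"),
                                      c("sample1", "sample2")))

# an internal node whose descendant leaves are leafA and leafB gets the
# column-wise sum of their counts
internal_node <- colSums(leaf_counts[c("leafA", "leafB"), ])
internal_node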

With the data ready, we test at each node of the tree the null hypothesis (H0: there is no association between the abundance and the phenotypic outcome). Multiple testing correction methods, such as the Benjamini-Hochberg procedure, can then be applied to decide whether the null hypothesis at a node should be rejected.
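
To make the correction step concrete, here is a minimal sketch using base R's p.adjust; the p-values below are hypothetical and serve only as illustration:

# hypothetical raw p-values, one per node of the tree
p_raw <- c(node1 = 0.001, node2 = 0.04, node3 = 0.2, node4 = 0.8)

# Benjamini-Hochberg adjustment; reject H0 where the adjusted p-value
# falls below the chosen threshold (here 0.05)
p_adj <- p.adjust(p_raw, method = "BH")
p_adj < 0.05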

An internal node found to be significantly associated with the phenotypic outcome is likely driven by only some of its descendant nodes. In other words, only some of its descendant nodes are associated with the phenotypic outcome, and this signal is not sufficiently diluted by the other descendant nodes that are not associated with the outcome. In the tree aggregation step, we try to pinpoint the nodes that drive the association.

Examples {#example}

The packages below are required.

suppressPackageStartupMessages({
  library(treeAGG)
  library(edgeR)
  library(S4Vectors)
  library(ggtree)})

The standard part of the pipeline {#pline1}

In this section, a toy dataset is used to show how to perform the standard five steps outlined in Figure \@ref(fig:flowchart).

Step 1: Data preparation for the leaf nodes {#step1}

At the beginning, we have a tree structure (tinyTree), a table (toyData), and the annotation data for the rows and columns of the table (rowD and colD). Entities are in the rows of the table toyData and samples are in the columns. Each entity can be mapped to a leaf node of the tree. The first five samples belong to the group G1, and the other five to G2.

The table is generated such that only the first five entities have differential proportions between groups. The background of the branches that these five entities are on is colored cyan. Our goal is to find the optimal level on the tree, indicated by the violet squares, at which to interpret this difference.

# tree
data("tinyTree")
ggtree(tinyTree) + 
    geom_text2(aes(subset = isTip, label = label), hjust = -0.2) +
    geom_hilight(node = 18, fill = "cyan", alpha = 0.1) +
    geom_hilight(node = 13, fill = "cyan", alpha = 0.1) +
    geom_point2(aes(subset = (node %in% c(13, 18))), size = 4,
                color = "darkviolet", shape = 23, stroke = 3)
# table
# p1 and p2 differ in the first five values
p1 <- rep(0.1, 10)
p2 <- c(rep(0.05, 3), rep(0.175, 2), rep(0.1, 5) )

set.seed(1)
# randomly generate the table from multinomial distribution
toyData <- cbind(rmultinom(n = 5, size = 1000, prob = p1),
                 rmultinom(n = 5, size = 1000, prob = p2))

# annotation data for rows
# The true differential information between groups is in Diff
rowD <- DataFrame(nodeLab = tinyTree$tip.label,
                  Diff = rep(c(TRUE, FALSE), each = 5))

# annotation data for columns
colD <- DataFrame(trt = sample(letters[1:3], size = 10, 
                               replace = TRUE),
                  group = rep(c("G1", "G2"), each = 5))

All data above can be stored in a leafSummarizedExperiment container using the function leafSummarizedExperiment. More details about the leafSummarizedExperiment class can be found in the help page ?leafSummarizedExperiment or in the vignette Introduction to leafSummarizedExperiment and treeSummarizedExperiment.

lse <- leafSummarizedExperiment(assays = list(toyData), 
                                rowData = rowD,
                                colData = colD,
                                tree = tinyTree)

Step 2: Data generation for the internal nodes

The data in Section \@ref(step1) is at the level of the leaf nodes. To generate data for the internal nodes, the function nodeValue can be used. Users can decide how to generate the values for the internal nodes via the fun argument. Here, we consider the value at an internal node to be the sum of the values at its descendant leaf nodes (fun = sum).

tse <- nodeValue(data = lse, fun = sum)

The output tse is a treeSummarizedExperiment object. More details about the treeSummarizedExperiment class can be found in the help page ?treeSummarizedExperiment or in the vignette Introduction to leafSummarizedExperiment and treeSummarizedExperiment.
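
Other summary functions can be supplied via fun. The sketch below simply reuses the toy object lse to show the interface (for count data, the sum used above is the appropriate choice); it would store mean values at the internal nodes instead:

# aggregate by the mean of the descendant leaf values instead of the sum
tse_mean <- nodeValue(data = lse, fun = mean)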

Step 3: Data analysis

With the data ready, the statistical analysis can be performed using suitable software. Here, runEdgeR is shown as an example of differential abundance analysis. It is a wrapper around functions from r Biocpkg("edgeR"). The detailed analysis it performs is shown step by step in Section \@ref(pline2).

new_tse <- runEdgeR(obj = tse, use.assays = 1,
                 design = NULL, contrast = NULL, 
                 normalize = TRUE, method = "TMM", 
                 adjust.method = "BH")

To run the analysis, we assign tse to the obj argument. If there are multiple matrix-like elements in the assays of tse, we can use use.assays to indicate which elements are used for the analysis. Here, the first one is used (use.assays = 1). The design matrix and contrasts can be customized via design and contrast, and they will be saved in the metadata of new_tse for later inspection. If the design matrix is not given (design = NULL), colData is used to generate a design matrix. If the contrast is not provided (contrast = NULL), the last coefficient is tested in the glmLRT step (see ?edgeR::glmLRT and Section \@ref(pline2)).
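
As a sketch of how a customized design could be passed in (assuming only the group factor should be modelled; the object names design_grp and new_tse2 are chosen here purely for illustration):

# build a design matrix that only includes the group factor
design_grp <- model.matrix(~ group, data = as.data.frame(colData(tse)))

# rerun the analysis with the customized design; since contrast = NULL,
# the last coefficient (the group effect) is tested
new_tse2 <- runEdgeR(obj = tse, use.assays = 1,
                     design = design_grp, contrast = NULL,
                     normalize = TRUE, method = "TMM",
                     adjust.method = "BH")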

To print the result, topNodes can be used. The output is a list; more details about its structure can be found in Section \@ref(mtab).

(tN1 <- topNodes(data = new_tse))

Step 4: Tree aggregation {#step4}

The tree aggregation can be done in one step using treeAGG.

outP <- treeAGG(data = new_tse, agg.by = "PValue",
                sigf.by = "FDR", sigf.limit = 0.05,
                message = TRUE)

Based on the analysis result (data = new_tse), we specify the name of the column used for aggregation via agg.by, and the name of the column storing the adjusted p-values via sigf.by (both shown in tN1). The threshold for the adjusted p-value is set via sigf.limit. To print progress messages, message = TRUE is used.
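
If a stricter cutoff is wanted, the threshold can simply be lowered; the call below is an illustrative variant of the one above, not an additional required step:

# same aggregation, but with a stricter threshold on the adjusted p-value
outP_strict <- treeAGG(data = new_tse, agg.by = "PValue",
                       sigf.by = "FDR", sigf.limit = 0.01,
                       message = TRUE)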

Step 5: Result visualisation {#step5}

We can use topNodes again to print the result after aggregation. If some columns in rowData and linkData are of interest, they can be printed as well via col.rowData and col.linkData, respectively.

(tN2 <- topNodes(data = outP, 
                col.rowData = "Diff",
                col.linkData = "nodeNum"))

Compared to tN1, tN2 has three more columns: aggKeep, Diff and nodeNum. The column aggKeep is created in the tree aggregation step; the other two are from rowData and linkData, as specified via the arguments.

The next step is to check whether the optimal level for interpretation has been found. The true differential information is stored in the column Diff of rowData.

# Extract data having nodes that are truly differentially abundant
outP_sel <- outP[rowData(outP)$Diff %in% TRUE, ]

# the true optimal level of the signal 
# use signalNode to remove the redundant nodes
truth <- signalNode(tree = treeData(outP_sel),
                    node = linkData(outP_sel)$nodeNum)

The optimal level estimated by r Biocpkg("treeAGG") for interpreting the differential abundance pattern is extracted below.

# the estimated signal level
res <- tN2$result_assay1$contrastNULL
found <- res$nodeNum[res$aggKeep]

The node number (nodeNum) instead of the node label is extracted because internal nodes might not have labels in the tree provided.

p <- treePlot(tree = treeData(outP),
              branch = truth,
              point = found,
              layout = "rectangular")
# use the function geom_text2 from the ggtree package to add labels
p + geom_text2(aes(label = label), hjust = -0.3)

We use treePlot (?treePlot) to visualize the results. The branches with blue edges are truly differentially abundant. The results obtained from the minP aggregation algorithm implemented in r Biocpkg("treeAGG") are shown as orange points. Compared to Figure \@ref(fig:sigTree), the orange points are in exactly the same locations as the violet squares, which indicates that the true signal has been found correctly. More details about how to use r Biocpkg("ggtree") can be found in its documentation.

The customized part of the pipeline {#pline2}

Data analysis {#cDA}

As described in Figure \@ref(fig:flowchart), the table stored in assays can be extracted to perform customized analysis. Note: to make sure we know which row corresponds to which node of the tree, we need to use use.nodeLab = TRUE.

# extract the abundance table
dat <- assays(tse, use.nodeLab = TRUE)[[1]]

Users are free to choose any suitable software to perform analysis for their data. Here, we follow the routine steps of using the r Biocpkg("edgeR") package to analyze the toy data.

# calculate the total count for each sample
# The total count is the sum of the counts of the entities at the leaf level
# of the tree.
tip_tse <- tse[linkData(tse)$isLeaf, ]
tipDat <- assays(tip_tse, use.nodeLab = TRUE)[[1]]
libSize <- colSums(tipDat)

# create DGEList
y <- DGEList(counts = dat, lib.size = libSize,
             remove.zeros = FALSE)

# calculate normalisation factors
y <- calcNormFactors(object = y, method = "TMM")

# construct design matrix
sample_inf <- colData(tse)
design <- model.matrix(~ trt + group, data = sample_inf)

# estimate dispersion
y <- estimateDisp(y, design = design)

# fit the negative binomial GLMs
fit <- glmFit(y, design = design, prior.count = 0.125)

# run likelihood ratio tests 
# contrast is not specified here, so the last coefficient is tested.
lrt <- glmLRT(fit, contrast = NULL)

# Use the Benjamini-Hochberg method for multiple testing correction
# n is set to Inf below because we want the results for all nodes.
out <- topTags(lrt, n = Inf, adjust.method = "BH")$table
head(out)

The output out has five columns, one of which is the nominal p-value (PValue) and one is the adjusted p-value (FDR). These columns are required by the tree aggregation step.

If other software is used for the analysis, the output is expected to have at least two columns containing the nominal and adjusted p-values, with each row representing a node of the tree. The column names do not need to be PValue and FDR, as they can be specified in the function treeAGG (see Section \@ref(step4)).
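
For instance, if the chosen software reported columns named pval and padj, the aggregation call would point to them explicitly. The object updated_tse and the column names below are placeholders for illustration only:

# the column names depend on the software used for the analysis
outP_custom <- treeAGG(data = updated_tse, agg.by = "pval",
                       sigf.by = "padj", sigf.limit = 0.05,
                       message = TRUE)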

Back to the standard part of the pipeline

To prepare for the tree aggregation, we now put the analysis result in the rowData of tse via updateTSE.

outList <- list(assay1 = list(out))
new_tse1 <- updateTSE(result = outList, tse = tse, 
                      use.assays = 1, design = design, 
                      contrast = NULL, fit = list(fit))

The result out is wrapped in a list object outList before it is assigned to the argument result. To update the treeSummarizedExperiment object (tse = tse), the result (result = outList) and the information that was used in the analysis, such as the design matrix (design), the contrasts (contrast) and which matrix-like elements of assays were used (use.assays), should be provided. We can also optionally store the object fit created by the glmFit function for later use, for example, to quickly obtain new results when specifying a new contrast, as sketched below.
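
With the stored fit, results under a different contrast can be obtained without refitting the model. The contrast below (comparing the second and third coefficients of the design) is only illustrative:

# test a different contrast using the stored GLM fit; no refitting needed
lrt_new <- glmLRT(fit, contrast = c(0, 1, -1, 0))
out_new <- topTags(lrt_new, n = Inf, adjust.method = "BH")$table
head(out_new)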

The analysis performed by runEdgeR is exactly the same as the code in this section.

all.equal(new_tse, new_tse1)

Now, we are back to the standard part of the pipeline and could follow Section \@ref(step4) and Section \@ref(step5) to get the final results.

Additional

The application of the pipeline to multiple tables {#mtab}

It's possible to perform analysis and aggregation simultaneously on multiple elements of assays with different contrasts.

The data tseM, with two elements in the assays, is created below.

tseM <- treeSummarizedExperiment(assays = list(dat, 2*dat),
                                 tree = treeData(tse),
                                 rowData = rowData(tse),
                                 colData = colData(tse))

To use both elements, we specify use.assays = c(1, 2). Two different contrasts (contrast) are applied in the analysis of each element.

# data analysis
new_tseM <- runEdgeR(obj = tseM, use.assays = c(1, 2),
                 design = NULL, 
                 contrast = list(contrast1 = c(0, 1, -1, 0), 
                                 contrast2 = c(0, 0, 0, 1)), 
                 normalize = TRUE, method = "TMM", 
                 adjust.method = "BH")

If a customized analysis was performed, the results from multiple elements can be written back in one step.

To simplify the document, we directly use the output from Section \@ref(cDA) to show how to do it.

# assume two contrasts are used in the analysis
cList <- list(contrast1 = c(0, 1, -1, 0), 
              contrast2 = c(0, 0, 0, 1))

# the results from the first element of assays
# name the results after the contrasts in cList
res1 <- list(contrast1 = out,
             contrast2 = 2*out)

# the results from the second element of assays
# name the results after the contrasts in cList
res2 <- list(contrast1 = 0.5*out,
             contrast2 = 3*out)

# combine the results into a list
# name the results with 'assay' followed by the element number in assays
# e.g., if the first and the third tables in assays were used for the analysis,
# the results would be named assay1 and assay3
res <- list(assay1 = res1, 
            assay2 = res2)
# keep both results and data in the container
test_tseM <- updateTSE(result = res, tse = tseM, 
                       use.assays = c(1, 2),
                       design = NULL,
                       contrast = cList)

Now, the output can be obtained using topNodes as mentioned before. The output is a list.

rD4 <- topNodes(test_tseM)
class(rD4)

Each element of the list is the analysis result of a table in assays.

# Here, we have two elements: one is the result of the first table,
# and the other is the result of the second table
names(rD4)

Each sub-element is a list too. It stores the analysis result of a table in assays under a specific contrast.

class(rD4$result_assay1)
names(rD4$result_assay1)

The table below is exactly the data out that was put in above.

# There are five columns 
head(rD4$result_assay1$contrast1)

