In markrobinsonuzh/treeAGG: Tree Aggregation

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Introduction

The leafSummarizedExperiment and treeSummarizedExperiment classes are both extensions of the SummarizedExperiment class. They are used to store rectangular data of experimental results as in a SummarizedExperiment, and also support the storage of a hierarchical structure and its link information to the rectangular data. The leafSummarizedExperiment is intended to store experimental data and annotation information for the leaf nodes of a tree, while the treeSummarizedExperiment class contains experimental (or derived) data and annotation information also for the internal nodes of the tree.

The leafSummarizedExperiment class

The comparison between leafSummarizedExperiment and SummarizedExperiment

The leafSummarizedExperiment class has exactly the same structure as the SummarizedExperiment class, including assays, rowData, colData and metadata. More details about the SummarizedExperiment structure can be found in the SummarizedExperiment vignette, "SummarizedExperiment for coordinating experimental assays, samples, and regions of interest". The difference from SummarizedExperiment is that leafSummarizedExperiment places more restrictions on the data.

It requires a phylo object stored in metadata and named tree.
It requires that the phylo object has a unique label for each leaf node.
It requires a column nodeLab in rowData, or row names of rowData, to provide the label of the node that each row is mapped to.

The link between the assays data and the nodes of the tree structure is checked when the leafSummarizedExperiment object is built. Only rows that can be mapped to nodes of the tree are kept in the matrix-like elements of assays.
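
The check described above can be pictured with a few lines of base R. This is only a sketch of the idea under simplifying assumptions, not the code treeAGG runs; the names tip_labels, toy_counts, node_lab and kept_counts are hypothetical and introduced purely for illustration.

```r
# Hypothetical leaf labels (a stand-in for the tip.label of a phylo object)
tip_labels <- c("t1", "t2", "t3", "t4", "t5")

# Toy assay table with one row per entity
toy_counts <- matrix(1:12, nrow = 6,
                     dimnames = list(paste0("entity", 1:6), c("A_1", "A_2")))

# Label of the node that each row is mapped to; "t9" is not on the tree
node_lab <- c("t3", "t1", "t9", "t5", "t2", "t4")

# Keep only the rows whose label matches a leaf of the tree
keep <- node_lab %in% tip_labels
kept_counts <- toy_counts[keep, , drop = FALSE]

# Report how many rows were dropped
message(sum(!keep), " row(s) removed: no matching node on the tree")
kept_counts
```

In this sketch the row labelled "t9" is dropped because no leaf carries that label, which mirrors the filtering behaviour described for the constructor.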

The construction of leafSummarizedExperiment

We generate a toyTable with observations of 5 entities collected from 4 samples.

suppressPackageStartupMessages({
    library(treeAGG)
    library(S4Vectors)})

# assays data
set.seed(1)
toyTable <- matrix(rnbinom(20, size = 1, mu = 10), nrow = 5)
colnames(toyTable) <- paste(rep(LETTERS[1:2], each = 2), 
                            rep(1:2, 2), sep = "_")
rownames(toyTable) <- paste("entity", seq_len(5), sep = "")

toyTable

Descriptions of the 5 entities and 4 samples are given in rowInf and colInf, respectively.

# row data
rowInf <- DataFrame(var1 = sample(letters[1:2], 5, replace = TRUE),
                    var2 = sample(c(TRUE, FALSE), 5, replace = TRUE),
                    row.names = rownames(toyTable))
rowInf

# column data
colInf <- DataFrame(gg = c(1, 2, 3, 3),
                    group = rep(LETTERS[1:2], each = 2), 
                    row.names = colnames(toyTable))
colInf

The hierarchical structure of the 5 entities is denoted as toyTree. We create it with the function rtree from the package ape. It is a phylo object.

# create a toy tree
suppressPackageStartupMessages(library(ape))
toyTree <- rtree(5)
class(toyTree)

As we see, the phylo object is actually a list of four elements.

str(toyTree)
plot(toyTree)

To store the toy data, toyTable, rowInf, colInf, and toyTree, we could use a leafSummarizedExperiment container. The map between the rows of toyTable and the leaf nodes of toyTree is checked when creating a leafSummarizedExperiment object. If the object is built as below, we get a message that 5 rows of toyTable are removed because they can't be matched to any node of the tree. The assays table of the resulting test1 object consequently has no rows.

# use the toy data to create a leafSummarizedExperiment object.
test1 <- leafSummarizedExperiment(assays = list(toyTable), rowData = rowInf,
                                  colData = colInf, tree = toyTree)
assays(test1)[[1]]

To correctly store data, we need to additionally provide the link information between the rows of toyTable and the nodes of the toyTree via a column nodeLab in rowData or via the row names of rowData.

Option 1: Create a leafSummarizedExperiment object by changing the row names of rowData.

# change the row names of the row data
rowInf_1 <- rowInf
rownames(rowInf_1) <- toyTree$tip.label
toyTable_1 <- toyTable
rownames(toyTable_1) <- toyTree$tip.label
test2 <- leafSummarizedExperiment(assays = list(toyTable_1), 
                                   rowData = rowInf_1,
                                   colData = colInf, 
                                   tree = toyTree)
assays(test2)[[1]]

Option 2: Create a leafSummarizedExperiment object by adding a column nodeLab to the rowData.

# add a column (nodeLab) to the row data
rowInf$nodeLab <- toyTree$tip.label
lse <- leafSummarizedExperiment(assays = list(toyTable),
                                rowData = rowInf,
                                colData = colInf,
                                tree = toyTree)
assays(lse)[[1]]

Although the leafSummarizedExperiment object can be successfully created either way, the row names of the table in assays differ. We recommend option 2 because it keeps the original row names and works better, especially when multiple rows are mapped to the same leaf node of the tree.

The leafSummarizedExperiment only supports data on the level of the leaf nodes. If users have rows mapped to internal nodes of a tree, then the treeSummarizedExperiment class should be used instead. Technically, the treeSummarizedExperiment can also store data on the level of leaf nodes. Users can choose either leafSummarizedExperiment or treeSummarizedExperiment to store data on the level of leaf nodes, as shown in Section \@ref(sec:tse-build).

treeSummarizedExperiment {#tse-class}

Anatomy of treeSummarizedExperiment

knitr::include_graphics("tse.png")

Compared with the SummarizedExperiment class, there are two more slots, the tree data (treeData) and the link data (linkData), in the treeSummarizedExperiment class. Other slots, including assays, rowData, colData and metadata, are the same as those in SummarizedExperiment.

Here, we construct a treeSummarizedExperiment object using the toy example from the previous section. With the function nodeValue, data at all nodes of the tree is created from the data available on the leaf node level. The argument fun specifies how the value at an internal node is calculated from its descendant nodes. Here, the sum is used, so the value at each internal node equals the sum of the values at all its descendant leaves. To view the running process, message = TRUE is used.

tse <- nodeValue(data = lse, fun = sum, message = TRUE)
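
The aggregation idea can be sketched in base R on a hand-written tree. This is an illustration only and makes no claim about how nodeValue is implemented; the edge matrix, n_tip, leaf_value and descendant_leaves are hypothetical names, and the edge layout (parent, child) with leaves numbered 1 to 5 follows the convention of ape's phylo class.

```r
# Hypothetical tree: leaves 1..5, internal nodes 6..9, root = 6
edge <- matrix(c(6, 7,
                 7, 1,
                 7, 2,
                 6, 8,
                 8, 3,
                 8, 9,
                 9, 4,
                 9, 5), ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("parent", "child")))
n_tip <- 5

# Toy values observed at leaves 1..5 for one sample
leaf_value <- c(3, 1, 4, 1, 5)

# Recursively collect the leaves below a node (a leaf returns itself)
descendant_leaves <- function(node) {
    if (node <= n_tip) return(node)
    children <- edge[edge[, "parent"] == node, "child"]
    unlist(lapply(children, descendant_leaves))
}

# Value at every node = sum over its descendant leaves
all_nodes <- 1:9
node_value <- vapply(all_nodes,
                     function(nd) sum(leaf_value[descendant_leaves(nd)]),
                     numeric(1))
names(node_value) <- all_nodes
node_value
```

Leaves keep their own value, while each internal node collects the total of the leaves beneath it, so the root (node 6) holds the sum over all five leaves.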

The output tse is a treeSummarizedExperiment object. The class treeSummarizedExperiment is an extension of the SummarizedExperiment class.

class(tse)
showClass("treeSummarizedExperiment")

As shown, there are two more slots, treeData and linkData, in treeSummarizedExperiment compared with SummarizedExperiment.

tse

Assays

To extract a table in assays from a treeSummarizedExperiment object, we can use the assays accessor function. This is similar to the SummarizedExperiment class.

(aData <- assays(tse)[[1]])

We can use the node labels from the link data (see Section \@ref(sec:linkData)) as the row names via the argument use.nodeLab = TRUE. By default, the column nodeLab provides the row names. However, if it contains duplicated values, the column nodeLab_alias is used instead.

aData <- assays(tse, use.nodeLab = TRUE)[[1]]
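
The rule for picking row names can be written out as a small base R helper. This is a sketch of the rule as stated in the text, not the accessor's source code; choose_row_names is a hypothetical function, and the alias values below are made up (the real format of nodeLab_alias is not assumed here).

```r
# Hypothetical link-data columns; the alias values are invented for illustration
node_lab   <- c("t1", "t2", "t2", "t3")
node_alias <- c("alias_1", "alias_2a", "alias_2b", "alias_3")

# Use nodeLab unless it contains duplicates, otherwise fall back to the alias
choose_row_names <- function(lab, alias) {
    if (anyDuplicated(lab) > 0) alias else lab
}

# Labels repeat here, so the alias column is chosen
choose_row_names(node_lab, node_alias)

# Labels are unique here, so they are used directly
choose_row_names(c("t1", "t2", "t3"), node_alias[1:3])
```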

The value at each node (from sample A_1) can be visualized with the following figure. We see that the value at each internal node is the sum of those at its descendant leaves. More details about how to use ggtree to plot the tree can be found in the ggtree package documentation.

# load packages
suppressPackageStartupMessages({
    library(ggtree)  # to plot the tree
    })

# extract a sample column from assays 
ex1 <- assays(tse)[[1]][, 1, drop = FALSE]

# combine it with the data extracted from linkData
# rename column nodeNum as node
# optional : datF <- cbind.data.frame(ex1, linkData(tse)) %>%
# rename(node=nodeNum)
datF <- cbind.data.frame(ex1, linkData(tse)) 
colN <- colnames(datF)
colN[colN == "nodeNum"] <- "node"
colnames(datF) <- colN

# plot
ggtree(treeData(tse)) %<+% datF +
  geom_text2(aes(label = A_1), color = "brown1", size = 8)

The data datF is used to annotate the tree via %<+% (see ?"%<+%").

Row data, column data and metadata

The row data, column data, and metadata could be accessed exactly as in the SummarizedExperiment class.

(rData <- rowData(tse))

The column nodeLab is now moved to the linkData (see Section \@ref(sec:linkData)), and isn't in the rowData (rData) of treeSummarizedExperiment anymore.

(cData <- colData(tse))

# the metadata is empty
(mData <- metadata(tse))

Link data {#sec:linkData}

The linkData() accessor is used to view the link information between rows of matrix-like elements in the assays and nodes of the tree.

(linkD <- linkData(tse))

Rows of the link data are mapped one-to-one to rows of the matrix-like elements in assays, and their order is exactly the same. Each row of a matrix-like element in assays is mapped to a node of the tree, whose label and number are given in the columns nodeLab and nodeNum, respectively. The column isLeaf indicates whether a node is a leaf node. The column rowID contains the corresponding row number in assays.

If the labels of nodes are available and unique on the tree, the link data has 4 columns: nodeLab, nodeNum, isLeaf, and rowID. Otherwise, it has one additional column, nodeLab_alias.
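
To make the layout of the link data concrete, the following base R sketch builds a table with the four columns named above for a small hypothetical tree. It is an illustration of the structure only, not the output of linkData(); tip_label, row_node and link are hypothetical names, and the internal nodes are left unlabelled (NA) as in the toy tree.

```r
# Hypothetical tree with 3 leaves (nodes 1-3) and 2 unlabelled internal nodes (4-5)
tip_label <- c("t1", "t2", "t3")

# Node number that each row of a toy assay table is mapped to, in row order
row_node <- c(2, 1, 3, 5, 4)

link <- data.frame(
    nodeLab = c(tip_label, NA, NA)[row_node],  # internal nodes have no label here
    nodeNum = row_node,
    isLeaf  = row_node <= length(tip_label),   # leaves carry the smallest numbers
    rowID   = seq_along(row_node),             # row position in the assay table
    stringsAsFactors = FALSE
)
link
```

Each row of link describes one row of the assay table, in the same order, which is the one-to-one correspondence described above.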

As shown in Figure \@ref(fig:toyTREE), the tree we have used has no labels (orange text) for internal nodes.

Tree data

The tree structure could be accessed using treeData(). It is a phylo object.

treeD <- treeData(tse)
class(treeD)

The tree structure is shown in Figure \@ref(fig:toyTREE). The node label is given in orange text, and the node number is in blue.

ggtree(treeD) + 
    geom_text2(aes(label = label), color = "darkorange", 
               hjust = -0.1, vjust = -0.7, size = 6) +
    geom_text2(aes(label = node), color = "darkblue", 
               hjust = -0.5, vjust = 0.7, size = 5)

Constructing a treeSummarizedExperiment {#sec:tse-build}

A treeSummarizedExperiment object could be constructed using the function treeSummarizedExperiment.

tseA <- treeSummarizedExperiment(tree = treeD,  
                                 assays = list(aData), 
                                 rowData = rData,
                                 colData = cData, 
                                 metadata = mData)

If the provided tree has a unique label for each node, we could also construct a treeSummarizedExperiment without providing the link data.

We recreate a tree treeN based on the old tree by adding labels to the internal nodes.

treeN <- treeD
treeN$node.label <- paste("Node_", 6:9, sep = "")

In Figure \@ref(fig:newTREE), we see that all nodes have different labels in orange text.

# plot
ggtree(treeN) + 
    geom_text2(aes(label = label), color = "darkorange", 
               hjust = -0.05, vjust = -0.4, size = 6) +
    geom_text2(aes(label = node), color = "darkblue", 
               hjust = -0.5, vjust = 0.7, size = 5)

If the link data is not given when creating a treeSummarizedExperiment object, we need to provide the mapping information between rows and nodes of the tree. This can be done by adding a nodeLab column to the row data. Users need to make sure that the correct label is assigned to each row.

# give a column nodeLab in row data
# the linkData is created automatically
lab <- ifelse(is.na(linkD$nodeLab), linkD$nodeLab_alias, linkD$nodeLab)
lab
(nData <- cbind(nodeLab = lab, rData))

Create tseB without providing the link data.

tseB <- treeSummarizedExperiment(tree = treeN, 
                                 assays = list(aData),
                                 rowData = nData,
                                 colData = cData, 
                                 metadata = mData)

The link data has been generated automatically and has only four columns as described in Section \@ref(sec:linkData).

linkData(tseB)
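
Why unique labels are enough to derive the link data can be seen from a short base R sketch. It assumes the numbering convention of ape's phylo class (leaves are numbered 1 to n in the order of tip.label, internal nodes n+1 onwards in the order of node.label); it is not the constructor's actual code, and all object names are hypothetical.

```r
# Hypothetical labels of a fully labelled tree with 5 leaves and 4 internal nodes
tip_label  <- paste0("t", 1:5)
node_label <- paste0("Node_", 6:9)
all_label  <- c(tip_label, node_label)   # position in this vector = node number

# Labels supplied in a nodeLab column of the row data
node_lab <- c("t2", "Node_8", "t5", "Node_6")

# Unique labels allow a direct lookup of the node number
stopifnot(!anyDuplicated(all_label))
node_num <- match(node_lab, all_label)

# A four-column table analogous to the link data
data.frame(nodeLab = node_lab,
           nodeNum = node_num,
           isLeaf  = node_num <= length(tip_label),
           rowID   = seq_along(node_lab),
           stringsAsFactors = FALSE)
```

If two nodes shared a label, the lookup with match would silently return only the first of them, which is why duplicated labels call for an extra alias column.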

Data on the level of leaf nodes can be organized as either a leafSummarizedExperiment object (lseC) or a treeSummarizedExperiment object (tseC), as shown below.

lseC <- leafSummarizedExperiment(assays = list(toyTable),
                                 rowData = rowInf,
                                 colData = colInf,
                                 tree = toyTree)

tseC <- treeSummarizedExperiment(assays = list(toyTable),
                                 rowData = rowInf,
                                 colData = colInf, 
                                 tree = toyTree)

markrobinsonuzh/treeAGG documentation built on May 26, 2019, 9:32 a.m.
