library("knitr") library("ape") library("data.table") library("plyr") library("dplyr") library("treelapse")
treelapse
is an R package for interactive visualization of data that can be
arranged along trees. This interactivity is helpful simplifies navigation across
the tree (and the associated comparisons across data points).
Our motivating application is to the microbiome, where time series or multinomial counts of microbial abundances can be arranged along an evolutionary tree. This abstraction is useful in other domains, however. For example, a tree can be built from geographic distances, or using hierarchical clustering. Different applications are described in the other vignettes to this package -- this vignette is designed just to get you started.
This package is pretty simple, since we have tried to maintain focus on a few of the recurring questions in this problem domain. We consider three types of data, which usually come with corresponding, relatively generic questions.
To describe the required input format for these functions, we give a toy example on simulated data. For more interesting motivating applications, see the other vignettes.
In each method, we require a tree specified as an edgelist. Specifically, this
is a two column data.frame
, with two columns, parent
and child
, specifying
the edge structure in the tree. We simulate such an edgelist using the ape
package. The tip nodes are numbered 1 to 100, and all others are internal nodes.
The root is node 101.
n_tips <- 100 tree <- rtree(n_tips) edges <- tree$edge class(edges) <- "character" colnames(edges) <- c("parent", "child")
Now, we need to associate data with each node. In general, this will be a
data.frame
, with the two columns,
unit
: The name of the node to which the value is associated. The name must
be the same as the name in edges
.value
: The scalar value associated with the node.The DOI Sankeys require an extra column, group
, for each node, while the
time and treebox require an extra column time
. Hence, each value is associated
with either a (unit)
, (unit, group)
or (unit, time)
tuple (i.e., we expect
"tall" and not "wide" representations). This is illustrated below.
DOI trees expect just a scalar associated with each node. Here, we generate a random uniform variable for each node.
units <- unique(as.character(edges)) values <- data.frame( "unit" = units, "value" = runif(n_tips + tree$Nnode) )
The result is displayed below. Clicking on a node increases its "degree-of-interest" (DOI), and refocuses the display on that node. Further, previously hidden nodes are revealed depending on their proximity to this new focus node.
The parameters leaf_width
and size_min
control the spacing and size of
points -- leaf_width
controls the $x$ spacing between nodes, while size_min
specifies the radius of the smallest node.
doi_tree( values, edges, display_opts = list("leaf_width" = 50, "size_min" = 2) )
Suppose we didn't have data for each internal node. It might be interesting to
aggregate up the tree. This is what the tree_sum
function does. There are
analogous functions for computing averages and aggregate diversities.
# reduce to just tips tip_values <- values %>% filter(unit %in% 1:n_tips) tip_names <- tip_values$unit tip_values <- setNames(tip_values$value, tip_names) # aggregate tip values summed_values <- tree_sum(edges, tip_values) values <- data.frame( unit = names(summed_values), value = summed_values )
The resulting tree is somewhat more pleasing to look at, because width of an edge leading into a node is equal to the sum of the widths flowing out (since we aggregated values from the tips up).
doi_tree( values, edges, display_opts = list( "leaf_width" = 50, "size_min" = 2 ) )
For the DOI Sankey, we include a grouping variable. We will suppose there are
tip values for each of three groups, and then aggregate up the tree one group at
a time. The function below generates random tip values and does this
aggregation. Notice that we have to convert the unit and group columns to
characters: by default these will be integers, and doi_sankey
will fail.
make_grouped_data <- function(n_groups, n_tips, edges) { grouped_values <- matrix( runif(n_tips * n_groups), n_tips, n_groups, dimnames = list(seq_len(n_tips), seq_len(n_groups)) ) grouped_list <- list() for (i in seq_len(n_groups)) { grouped_list[[i]] <- tree_sum(edges, grouped_values[, i]) } grouped_values <- do.call(rbind, grouped_list) %>% melt(varnames = c("group", "unit")) grouped_values$unit <- as.character(grouped_values$unit) grouped_values$group <- as.character(grouped_values$group) grouped_values }
Given this data, we can create the sankey visualization. Clicking on edges will turn focus to the "node" directly above the click-on edge. Of course, here the groups all have about the same width, because the leaves have randomly generated values.
grouped_values <- make_grouped_data(3, n_tips, edges) doi_sankey(grouped_values, edges)
The input format for time and treeboxes is similar to that for the DOI sankey, where we have multiple measurements of the same unit across multiple timepoints (rather than across groups). In fact, we can use the same function to generate toy data.
n_times <- 100 timebox_values <- make_grouped_data(100, n_tips, edges) colnames(timebox_values) <- c("time", "unit", "value")
Below we display the timebox. Here, pressing "new box" will let you draw a brush across the time series, highlighting any that pass through the box. The corresponding tree nodes will also be highlighted, and their names are displayed on mouseover. Further pressing "new box" will let you draw another brush over the time series -- only time series passing through both boxes will be highlighted. The display in the top right is a miniature view of the main time series display that can be used for zooming in on certain time / value windows. Click and drag a brush over this sub-display to refocus the main time series display the associated time and value ranges. Finally, searching for a node name in the search box will highlight the associated series and node in red.
display_opts <- list("size_min" = 1) timebox_tree(timebox_values, edges, display_opts = display_opts)
Timeboxes only let you query based on time series shapes. Treeboxes are the natural converse -- they let you select nodes in the tree by drawing brushes over them. Other than this, the interaction is exactly the same.
treebox(timebox_values, edges, display_opts = display_opts)
Few of the ideas in this package are new. The key conceptual advances behind DOI trees, DOI sankeys, and time / treeboxes were introduced in [@heer2004doitrees], [@hochheiser2002visual], and [@guerra2013visualizing], respectively. Further, the implementation here would be impossible without the htmlwidgets package [@vaidyanathan2014htmlwidgets] -- it's only through this this package that we can link interactive data analysis in R with interactive visualization using D3.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.