library("knitr")
library("ape")
library("data.table")
library("plyr")
library("dplyr")
library("treelapse")

Introduction

treelapse is an R package for interactive visualization of data that can be arranged along trees. This interactivity is helpful simplifies navigation across the tree (and the associated comparisons across data points).

Our motivating application is to the microbiome, where time series or multinomial counts of microbial abundances can be arranged along an evolutionary tree. This abstraction is useful in other domains, however. For example, a tree can be built from geographic distances, or using hierarchical clustering. Different applications are described in the other vignettes to this package -- this vignette is designed just to get you started.

This package is pretty simple, since we have tried to maintain focus on a few of the recurring questions in this problem domain. We consider three types of data, which usually come with corresponding, relatively generic questions.

Examples

To describe the required input format for these functions, we give a toy example on simulated data. For more interesting motivating applications, see the other vignettes.

In each method, we require a tree specified as an edgelist. Specifically, this is a two column data.frame, with two columns, parent and child, specifying the edge structure in the tree. We simulate such an edgelist using the ape package. The tip nodes are numbered 1 to 100, and all others are internal nodes. The root is node 101.

n_tips <- 100
tree <- rtree(n_tips)
edges <- tree$edge
class(edges) <- "character"
colnames(edges) <- c("parent", "child")

Now, we need to associate data with each node. In general, this will be a data.frame, with the two columns,

The DOI Sankeys require an extra column, group, for each node, while the time and treebox require an extra column time. Hence, each value is associated with either a (unit), (unit, group) or (unit, time) tuple (i.e., we expect "tall" and not "wide" representations). This is illustrated below.

DOI Trees

DOI trees expect just a scalar associated with each node. Here, we generate a random uniform variable for each node.

units <- unique(as.character(edges))
values <- data.frame(
  "unit" = units,
  "value" = runif(n_tips + tree$Nnode)
)

The result is displayed below. Clicking on a node increases its "degree-of-interest" (DOI), and refocuses the display on that node. Further, previously hidden nodes are revealed depending on their proximity to this new focus node.

The parameters leaf_width and size_min control the spacing and size of points -- leaf_width controls the $x$ spacing between nodes, while size_min specifies the radius of the smallest node.

doi_tree(
  values,
  edges,
  display_opts = list("leaf_width" = 50, "size_min" = 2)
)

Suppose we didn't have data for each internal node. It might be interesting to aggregate up the tree. This is what the tree_sum function does. There are analogous functions for computing averages and aggregate diversities.

# reduce to just tips
tip_values <- values %>%
  filter(unit %in% 1:n_tips)
tip_names <- tip_values$unit
tip_values <- setNames(tip_values$value, tip_names)

# aggregate tip values
summed_values <- tree_sum(edges, tip_values)
values <- data.frame(
  unit = names(summed_values),
  value = summed_values
)

The resulting tree is somewhat more pleasing to look at, because width of an edge leading into a node is equal to the sum of the widths flowing out (since we aggregated values from the tips up).

doi_tree(
  values,
  edges,
  display_opts = list(
    "leaf_width" = 50,
    "size_min" = 2
  )
)

DOI Sankeys

For the DOI Sankey, we include a grouping variable. We will suppose there are tip values for each of three groups, and then aggregate up the tree one group at a time. The function below generates random tip values and does this aggregation. Notice that we have to convert the unit and group columns to characters: by default these will be integers, and doi_sankey will fail.

make_grouped_data <- function(n_groups, n_tips, edges) {
  grouped_values <- matrix(
    runif(n_tips * n_groups), n_tips, n_groups,
    dimnames = list(seq_len(n_tips), seq_len(n_groups))
  )

  grouped_list <- list()
  for (i in seq_len(n_groups)) {
    grouped_list[[i]] <- tree_sum(edges, grouped_values[, i])
  }

  grouped_values <- do.call(rbind, grouped_list) %>%
    melt(varnames = c("group", "unit"))
  grouped_values$unit <- as.character(grouped_values$unit)
  grouped_values$group <- as.character(grouped_values$group)
  grouped_values
}

Given this data, we can create the sankey visualization. Clicking on edges will turn focus to the "node" directly above the click-on edge. Of course, here the groups all have about the same width, because the leaves have randomly generated values.

grouped_values <- make_grouped_data(3, n_tips, edges)
doi_sankey(grouped_values, edges)

Time / Treeboxes

The input format for time and treeboxes is similar to that for the DOI sankey, where we have multiple measurements of the same unit across multiple timepoints (rather than across groups). In fact, we can use the same function to generate toy data.

n_times <- 100
timebox_values <- make_grouped_data(100, n_tips, edges)
colnames(timebox_values) <- c("time", "unit", "value")

Below we display the timebox. Here, pressing "new box" will let you draw a brush across the time series, highlighting any that pass through the box. The corresponding tree nodes will also be highlighted, and their names are displayed on mouseover. Further pressing "new box" will let you draw another brush over the time series -- only time series passing through both boxes will be highlighted. The display in the top right is a miniature view of the main time series display that can be used for zooming in on certain time / value windows. Click and drag a brush over this sub-display to refocus the main time series display the associated time and value ranges. Finally, searching for a node name in the search box will highlight the associated series and node in red.

display_opts <- list("size_min" = 1)
timebox_tree(timebox_values, edges, display_opts = display_opts)

Timeboxes only let you query based on time series shapes. Treeboxes are the natural converse -- they let you select nodes in the tree by drawing brushes over them. Other than this, the interaction is exactly the same.

treebox(timebox_values, edges, display_opts = display_opts)

Previous Work

Few of the ideas in this package are new. The key conceptual advances behind DOI trees, DOI sankeys, and time / treeboxes were introduced in [@heer2004doitrees], [@hochheiser2002visual], and [@guerra2013visualizing], respectively. Further, the implementation here would be impossible without the htmlwidgets package [@vaidyanathan2014htmlwidgets] -- it's only through this this package that we can link interactive data analysis in R with interactive visualization using D3.

References



krisrs1128/treelapse documentation built on Jan. 3, 2020, 6:29 p.m.