library(knitr)
opts_chunk$set(warning = FALSE, message = FALSE, dpi=150, fig.width=6, fig.height=3)

In this vignette, we demonstrate how to use the localBiplots package to make local biplots. The primary function in the package is local_biplot, and it can be used either with one of the built-in distances or with a distance function that the user provides. Both methods are demonstrated.

We start off by loading some additional packages that we need. Matrix, phyloseq, and ape are all related to the computations for the custom distance we will make. rmutil is used to create the simulated data, and the remaining packages are for plotting. Some extra functions for simulating data are in data-simulation-functions.R, and some functions related to the custom distances are in unifrac-distances-and-derivs.R.

library(localBiplots)
library(Matrix)
library(phyloseq)
library(ape)
library(rmutil)
library(ggtree)
library(viridis)
library(latex2exp)
library(ggplot2)
library(reshape2)
theme_set(theme_bw())
source("data-simulation-functions.R")
source("unifrac-distances-and-derivs.R")

Simulating the example data

We start off by making the example data. The dataset is supposed to mimic microbiome data, so we have an abundance matrix with $2^6$ variables and 20 samples. The samples fall into two groups of ten each. The variables are supposed to represent bacterial taxa, and we create a phylogenetic tree describing the relationship between the taxa.

set.seed(0)
example_data = make_example_data(n_per_group = 10, log_2_p = 6, deep_split_scaling = 3, overall_scaling = 1, s = .1)
example_data_phyloseq = make_phyloseq_from_example_data(example_data, mean_matrix = FALSE)

The idea behind the simulation is taxa in one half of the tree are over-represented in one group, and taxa in the other half of the tree are over-represented in the other group. In addition, the specific taxa represented in each group are non-overlapping: if a taxon is represented in one group, its sister taxon (the closest taxon on the phylogenetic tree) is not represented in that group but it is represented in the other group. This pattern can be seen more easily in a plot of the data matrix with the phylogenetic tree below.

variable_palette = viridis(4)[c(1,2,4,3)]
variable_labels = c(TeX("(\\textbf{1}_{deep})_i = 1, (\\textbf{1}_{shallow})_i = 1"),
                       TeX("(\\textbf{1}_{deep})_i = 1, (\\textbf{1}_{shallow})_i = 0"),
                       TeX("(\\textbf{1}_{deep})_i = 0, (\\textbf{1}_{shallow})_i = 1"),
                       TeX("(\\textbf{1}_{deep})_i = 0, (\\textbf{1}_{shallow})_i = 0"))
names(variable_palette) = names(variable_labels) = c("TRUETRUE", "TRUEFALSE", "FALSETRUE", "FALSEFALSE")
p_tree = ggtree(example_data_phyloseq) +
    geom_tippoint(aes(color=var_description), size = 1.5) +
    scale_color_manual(values = variable_palette, labels = variable_labels, "Variable Type")

tip_positions = unique(subset(p_tree$data, isTip)[,c("label", "y")])
example_data_melted = melt(example_data$C)
abundance_and_position = merge(example_data_melted,
                               tip_positions,
                               by.x = "Var2", by.y = "label",
                               all.x = TRUE)
## a hack to make the colors show up better, set the zero values to NA
abundance_and_position_NAs = within(abundance_and_position, value[value == 0] <- NA)
p_tree + 
    geom_tile(aes(x = Var1 / 2 + 6, y = y, fill = value), color = "darkgray", size = .1,
              data = abundance_and_position_NAs) +
    scale_fill_distiller(palette = "Spectral", na.value = "white", "Abundance") +
    theme(legend.position = "right") +
    coord_flip() + scale_x_reverse() +
    guides(color = guide_legend(order = 1),
           fill_distiller = guide_legend(order = 2))

Creating local biplots with a built-in distance

Next we make local biplots with the Manhattan distance, which is built in to the package. The first argument to the function is the data matrix, formatted as samples by variables. The second argument is the type of distance to use: in this case, we are using the Manhattan/l1 distance, and the positive version of local biplot axes. By default local biplot axes are computed at of the original samples, but this can be modified by the samples or new arguments.

sb_l1 = local_biplot(X = example_data$C, dist = "manhattan-pos")

The output from local_biplot is a data frame that gives information about the embedding of the samples with the specified distance and the local biplot axes at a collection of points. The Embedding1 and Embedding2 columns record where the corresponding sample embeds in the MDS space (or, if using a different set of embedding dimensions, up to Embeddingn). There is a good deal of replication in the Embedding columns: each sample shows up in multiple rows, and each row corresponding to the same sample will have the same values for Embedding1 and Embedding2.

The local biplot axes are given in the Axis1 and Axis2 columns (or, if using a different set of embedding dimensions, up to Axisn). Each local biplot axis corresponds to a variable (given in the variable column), and a vector in the input data space. The sample column records the position in the input data space the local biplot axes correspond to: Originaln means that the local biplot axis for that row corresponds to the nth sample in the input dataset. The function also has the capability to compute local biplot axes at other points in the data space, or at random points, in which case the sample column would be Newn, for the nth new sample.

head(sb_l1)

We can first plot the MDS embeddings of the samples using the data contained in sb_l1:

group_a = paste0("Original", 1:10)
group_labels = c("a","b")
names(group_labels) = c("TRUE", "FALSE")
ggplot(subset(sb_l1, variable == "t1")) +
    geom_point(aes(x = Embedding1, y = Embedding2, shape = sample %in% group_a)) +
    scale_shape(name = "Sample type", labels = group_labels) +
    ggtitle("Sample Embeddings\nMDS with Manhattan Distance") +
    xlab("Axis 1") + ylab("Axis 2") + coord_fixed() +
    theme(plot.title = element_text(size = 12))

In this particular example, we have more information about the variables in the example_data$variable_descriptions data frame. To plot this information along with the local biplot axes, we merge it with sb_l1.

head(example_data$variable_descriptions)
sb_and_var_descriptions_l1 = merge(sb_l1, example_data$variable_descriptions, by = "variable")

Then we can plot the local biplot axes. In this case, we plot the set of local biplot axes corresponding to one sample around the point where that sample embeds in MDS space. We also scale up the axes so that we can see them more clearly.

ggplot(sb_and_var_descriptions_l1,
       aes(x = Embedding1, y = Embedding2,
           xend = Axis1 * 30 + Embedding1,
           yend = Axis2 * 30 + Embedding2,
           color = var_description)) +
    geom_segment(alpha = 1) +
    geom_point(aes(shape = sample %in% group_a),
               data = sb_and_var_descriptions_l1, color = "black") + 
    scale_color_manual(values = variable_palette, labels = variable_labels, "Variable Type") +
    scale_shape("Sample Type", labels = group_labels) +
    ggtitle("Local biplot axes for Manhattan distance") +
    xlab("Axis 1") + ylab("Axis 2") + coord_fixed() + xlim(c(-590, 590))

Creating local biplots with a custom distance

We can also create local biplots with custom distances. To do this, we need to provide a function that computes the distance and a function that computes the gradient of the distance. The rules are:

In the example below, we create the distance function wuf_dist_ours and the gradient function wuf_dist_deriv_ours. These are based on functions that are defined in unifrac-distances-and-derivs.R. Those functions require a specific tree, and the definition below tailors the distance to the tree in our simulated dataset.

A = makeDescendantMatrix(example_data$tree)
L = getBranchLengths(example_data$tree)
wuf_dist_ours = function(X) wuf_dist(X, A, L)
wuf_dist_deriv_ours = function(x, y) wuf_dist_deriv(x, y, A, L)

When using the local_biplot function with a custom distance, we again provide the data matrix, but instead of specifying the distance by name, we provide the functions we created as values for dist and dist_deriv.

sb_wuf = local_biplot(X = example_data$C, dist = wuf_dist_ours, dist_deriv = wuf_dist_deriv_ours)

As before, we can use the output from the local_biplot function to plot the samples in the MDS embedding space.

ggplot(subset(sb_wuf, variable == "t1")) +
    geom_point(aes(x = Embedding1, y = Embedding2, shape = sample %in% group_a)) +
    xlab("Axis 1") + ylab("Axis 2") + coord_fixed() +
    scale_shape(name = "Sample type", labels = group_labels) +
    ggtitle("Sample Embeddings\nMDS with Weighted Unifrac") +
    theme(plot.title = element_text(size = 12))

We again add the extra information in example_data$variable_descriptions to the local biplot axes in sb_wuf.

sb_and_var_descriptions_wuf = merge(sb_wuf, example_data$variable_descriptions, by = "variable")

We plot the local biplot axes, again centering each set of local biplot axes at the position where the corresponding sample embeds in MDS space, and scaling up the axes so that they are visible.

ggplot(sb_and_var_descriptions_wuf,
       aes(x = Embedding1, y = Embedding2,
           xend = 30 * Axis1 + Embedding1,
           yend = 30 * Axis2 + Embedding2,
           color = var_description)) +
    geom_segment() +
    geom_point(aes(shape = sample %in% group_a),
               data = sb_and_var_descriptions_wuf, color = "black") + 
    scale_color_manual(values = variable_palette, labels = variable_labels, "Variable Type") +
    scale_shape("Sample Type", labels = group_labels) +
    ggtitle("Local biplot axes for weighted UniFrac") +
    xlab("Axis 1") + ylab("Axis 2") + coord_fixed()

Interactive version

In some cases, a large number of samples and variables make it hard to visualize the biplot axes. To help in this situation, we also provide a function that will create an interactive plot with the samples and the biplot axes.

The interactive_biplot function takes a data matrix, either a string describing the distance or a pair dist and dist_deriv (like the local_biplot) function. In addition, it has arguments describing addition data about the samples and variables, and how to plot the samples and the biplot axes.

In the code below, we are creating an interactive biplot with the same simulated data as before and the Manhattan distance. The var_data argument gives the extra information about the variables, and the var_mapping argument tells the function to plot points corresponding to the first and second local biplot axes, and to color the local biplot axes according to the value of the variable shallow in example_data$variable_descriptions.

interactive_biplot(X = example_data$C, dist = "manhattan-pos",
                  var_mapping = aes_string(x = "Axis1", y = "Axis2", color = "shallow"),
                  var_data = example_data$variable_descriptions)

Analogously, in the code below, we are asking for an interactive plot of the local biplot axes with the weighted UniFrac distance, and we want the local biplot axes to be colored according to the value of the deep column in example_data$variable_descriptions.

interactive_biplot(X = example_data$C, dist = wuf_dist_ours,
                  dist_deriv = wuf_dist_deriv_ours,
                  var_mapping = aes_string(x = "Axis1", y = "Axis2", color = "deep"),
                  var_data = example_data$variable_descriptions)

In both cases, running the code above will cause a browser window to open. The left panel will have the sample embeddings, and clicking on one of the sample points on the left plot will lead to the biplot axes corresponding to that point to show up on the right panel.



jfukuyama/localBiplots documentation built on Jan. 10, 2023, 3:33 a.m.