adaptiveGPCA vignette
In adaptiveGPCA: Adaptive Generalized PCA

library(knitr)
opts_chunk$set(fig.width = 6, fig.height = 4)

Overview

Adaptive gPCA

Adaptive gPCA (Fukuyama, J. (2017)) is a flexible method for incorporating side information about the variables into principal components analysis. The method has a single parameter, $r$, which controls how much weight to give to the side information: $r = 0$ corresponds to standard PCA (the side information is not used at all), and $r = 1$ corresponds to the maximum amount of weight on the side information.

The output from adaptive gPCA is analogous to that given by PCA: we get representations of the samples and the variables, which can be visualized as either independently or as a biplot. The difference with standard PCA is that, assuming $r > 0$, the variable loadings are regularized so that similar variables are encouraged to have similar loadings on the principal axes.

Package functionality

There are two main ways of using this package. The function adaptivegpca will choose the value of $r$ automatically, while the function gpcaFullFamily produces analogous output for the full range of values of $r$, which is then chosen manually. We will illustrate both methods in this vignette.

Although adaptive gPCA can be applied more generally, the method was developed to incorporate phylogenetic structure in microbiome data analysis, and so there are also functions to interface with phyloseq objects and functions to visualize adaptive gPCA output along with a phylogenetic tree.

Data

The data set we will use to illustrate the functionality of this package is an antibiotic time course study[^1]. In this experiment, three subjects were given two courses of antibiotics, and the abundances of bacteria in the gut microbiome were measured before, during, and after each course of antibiotics. The data is stored in a phyloseq object called AntibioticPhyloseq, which contains variance-stabilized and centered abundances on the otu_table slot, a phylogenetic tree describing the relationships between the bacterial species measured in the phy_tree slot, and information about the samples in the sample_data slot. We want a low-dimensional representation of the samples which takes into account the phylogenetic similarities between the bacteria, which is what adaptive gPCA will provide for us.

[^1]: Dethlefsen, L., and D. A. Relman. "Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation." Proc Natl Acad Sci USA 108. Suppl 1 (2010): 4554-4561.

Fitting an agpca model

Now that we have described the data we will be using, we will show how to use the package. The first step is to load the required packages and data.

library(adaptiveGPCA)
library(ggplot2)
library(phyloseq)
data(AntibioticPhyloseq)
theme_set(theme_bw())

Next, we create the inputs required for the adaptivegpca function. If we have a phyloseq object containing taxon abundances and a phylogenetic tree, the function processPhyloseq will create a list with the following elements:

A centered data matrix, stored as X in the output.
A similarity matrix based on the phylogeny, stored as Q in the output.
A vector of weights, stored as weights. The weights are only relevant for correspondence analysis, so if the function is invoked with ca = FALSE (the default), the weights vector will be a vector containing only 1's and can be ignored.

pp = processPhyloseq(AntibioticPhyloseq)

Next, we pass our data matrix and our similarity matrix to the adaptivegpca function. The arguments to adaptivegpca are

X: A centered data matrix.
Q: A symmetric, positive definite matrix describing the similarities between the variables.
k: The number of axes to keep, defaults to 2.
weights: A vector with sample weights, defaults to a vector of all 1's (equal weights on each sample).

The matrices X and Q will be generated from a phyloseq object using the OTU table and the phylogenetic tree with the processPhyloseq function. If you have another kind of data, you can generate these matrices yourself. Below we use the output from processPhyloseq to fit an adaptive gPCA model.

out.agpca = adaptivegpca(pp$X, pp$Q, k = 2)

The output from the adaptivegpca function is stored in an object of class adaptivegpca, which has some generic printing and plotting functions associated with it. As such, if we just print out the object, it will give us some basic information about the results, namely the number of axes, the value of $r$ chosen, and the fraction of the variance explained by the top axes.

out.agpca

The output is a list with several elements. These are:

U: The sample scores on the principal axes.
V: The variable loadings before rotation. These are not used for plotting --- they need to be rotated into the space of the samples for the biplot representation.
QV: The variable loadings rotated to be in the same space as the sample scores. These are the variable loadings that should be plotted for a biplot representation of the samples and the variables.
vars: The variances along the principal axes.
r: The value of $r$ chosen by adaptivegpca, corresponding to how much regularization to perform according to the structure of the variables. $r = 0$ is standard PCA (variable structure is not considered at all), and $r = 1$ is the maximal amount of regularization.
evals: The eigenvalues of the modified similarity matrix. This is described in more detail in the paper, but it is an alternate way of understanding how much regularization is performed by the method. The larger the top eigenvalues are compared with the subsequent ones, the more regularization there is toward the top eigenvectors of the similarity matrix.

Plotting the output

We can also plot the output from adaptivegpca using a generic plotting function. By default, this will produce a scree plot, as shown below:

plot(out.agpca)

The plotting function has a type argument, which can be either scree, samples, or variables. type = scree will give a plot showing the fraction of the variance explained by each of the axes, type = samples will show the sample scores along the principal axes, and type = variables will show the variable loadings on the principal axes. The axes argument allows the user to specify which axes to plot, by default these are axes 1 and 2.

plot(out.agpca, type = "samples", axes = c(1,2))
plot(out.agpca, type = "variables", axes = c(1,2))

Finally, the inspectTaxonomy function allows interactive inspection of the taxa or variables plot, assuming you started the analysis with a phyloseq object. This function will open a browser window with a representation of the phylogenetic tree and the taxon loadings on the adaptive gPCA axes. Taxa in this plot can be selected by "brushing" the plot (drawing a rectangular box), and the selected taxa will be highlighted on the phylogenetic tree and their taxonomic assignments will be printed below. An example call is given below:

inspectTaxonomy(out.agpca, AntibioticPhyloseq)

The arguments to inspectTaxonomy are

agpcafit: The output from adaptivegpca.
physeq: A phyloseq object with a taxonomy table and a phylogenetic tree. Should be the same one used for for the adaptive gPCA fit.
axes: Which axes to plot. Defaults to the first two axes.
br.length: Should the tree include branch lengths? The trees tend to look better when the branches are all plotted as being the same length, so this defaults to FALSE.
height: The height of the plotting area in pixels. Defaults to 600.

If you click the "done" button in the browser window, the app will close and the function will return a table describing the taxonomy, position along selected axes, and names of the selected taxa.

Interactive choice of $r$

We can also choose the value of $r$ manually. In the previous section, we described how to use the adaptivegpca function, which chooses the value of $r$ automatically, but if we would like to see the results for different values of $r$ and to choose one manually, we can first use the gpcaFullFamily function to create a full set of models and then use the visualizeFullFamily function to visualize the biplots for each value of $r$.

gpcaFullFamily takes the same arguments as adaptivegpca, along with some additional arguments describing the form of the output.

The visualizeFullFamily function is a shiny gadget. Calling this function will open a browser window where you can visualize the data set for a range of values of $r$. Clicking "done" in this window will give as output an object of the same format as that given by the adaptivegpca function, the difference being that the value of $r$ was chosen manually instead of automatically. The sample_data/sample_mapping and var_data/var_mapping arguments allow you to customize the visualization. Without these arguments, the first two axes will be plotted for both the samples and the variables. If you include these arguments, you can customize the plots by providing aesthetic mappings for ggplot. Note that the code below is only included as an example and not evaluated in the vignette.

out.ff = gpcaFullFamily(pp$X, pp$Q, k = 2)
out.agpca = visualizeFullFamily(out.ff,
                    sample_data = sample_data(AntibioticPhyloseq),
                    sample_mapping = aes(x = Axis1, y = Axis2, color = type),
                    var_data = tax_table(AntibioticPhyloseq),
                    var_mapping = aes(x = Axis1, y = Axis2, color = Phylum))

In either case, we can plot the results. This time we will do it manually, using ggplot. The sample scores on the principal axes are located in out.agpca$U and the loadings of the variables on the principal axes are located in out.agpca$QV. To plot the samples, we make a data frame that includes out.agpca$U and the sample data, and to plot the taxa we make a data frame containing out.agpca$QV and the taxonomy table.

ggplot(data.frame(out.agpca$U, sample_data(AntibioticPhyloseq))) +
    geom_point(aes(x = Axis1, y = Axis2, color = type, shape = ind)) +
    xlab("Axis 1") + ylab("Axis 2")

ggplot(data.frame(out.agpca$QV, tax_table(AntibioticPhyloseq))) +
    geom_point(aes(x = Axis1, y = Axis2, color = Phylum)) +
    xlab("Axis 1") + ylab("Axis 2")

Correspondence analysis

Finally, note that the processPhyloseq function also has an argument ca (for correspondence analysis). This should be used with phyloseq objects containing raw counts, and it will process a phyloseq object so as to do an adaptive gPCA version of correspondence analysis (this entails transforming counts to relative abunadnces, computing sample weights based on the overall counts for the samples, and finally doing a weighted centering of the relative abundances).