library(knitr) opts_chunk$set(fig.width = 6, fig.height = 4)
Adaptive gPCA (Fukuyama, J. (2017)) is a flexible method for incorporating side information about the variables into principal components analysis. The method has a single parameter, $r$, which controls how much weight to give to the side information: $r = 0$ corresponds to standard PCA (the side information is not used at all), and $r = 1$ corresponds to the maximum amount of weight on the side information.
The output from adaptive gPCA is analogous to that given by PCA: we get representations of the samples and the variables, which can be visualized as either independently or as a biplot. The difference with standard PCA is that, assuming $r > 0$, the variable loadings are regularized so that similar variables are encouraged to have similar loadings on the principal axes.
There are two main ways of using this package. The function
adaptivegpca
will choose the value of $r$ automatically, while the
function gpcaFullFamily
produces analogous output for the full range
of values of $r$, which is then chosen manually. We will illustrate
both methods in this vignette.
Although adaptive gPCA can be applied more generally, the method was developed to incorporate phylogenetic structure in microbiome data analysis, and so there are also functions to interface with phyloseq objects and functions to visualize adaptive gPCA output along with a phylogenetic tree.
The data set we will use to illustrate the functionality of this
package is an antibiotic time course study[^1]. In this experiment,
three subjects were given two courses of antibiotics, and the
abundances of bacteria in the gut microbiome were measured before,
during, and after each course of antibiotics. The data is stored in a
phyloseq object called AntibioticPhyloseq
, which contains
variance-stabilized and centered abundances on the otu_table
slot, a
phylogenetic tree describing the relationships between the bacterial
species measured in the phy_tree
slot, and information about the
samples in the sample_data
slot. We want a low-dimensional
representation of the samples which takes into account the
phylogenetic similarities between the bacteria, which is what adaptive
gPCA will provide for us.
[^1]: Dethlefsen, L., and D. A. Relman. "Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation." Proc Natl Acad Sci USA 108. Suppl 1 (2010): 4554-4561.
Now that we have described the data we will be using, we will show how to use the package. The first step is to load the required packages and data.
library(adaptiveGPCA) library(ggplot2) library(phyloseq) data(AntibioticPhyloseq) theme_set(theme_bw())
Next, we create the inputs required for the adaptivegpca
function. If we have a phyloseq
object containing taxon abundances
and a phylogenetic tree, the function processPhyloseq
will create a
list with the following elements:
A centered data matrix, stored as X
in the output.
A similarity matrix based on the phylogeny, stored as Q
in the
output.
A vector of weights, stored as weights
. The weights are only
relevant for correspondence analysis, so if the function is invoked
with ca = FALSE
(the default), the weights vector will be a vector
containing only 1's and can be ignored.
pp = processPhyloseq(AntibioticPhyloseq)
Next, we pass our data matrix and our similarity matrix to the
adaptivegpca
function. The arguments to adaptivegpca
are
X
: A centered data matrix.
Q
: A symmetric, positive definite matrix describing the
similarities between the variables.
k
: The number of axes to keep, defaults to 2.
weights
: A vector with sample weights, defaults to a vector of all
1's (equal weights on each sample).
The matrices X
and Q
will be generated from a phyloseq object
using the OTU table and the phylogenetic tree with the
processPhyloseq
function. If you have another kind of data, you can
generate these matrices yourself. Below we use the output from
processPhyloseq
to fit an adaptive gPCA model.
out.agpca = adaptivegpca(pp$X, pp$Q, k = 2)
The output from the adaptivegpca
function is stored in an object of
class adaptivegpca
, which has some generic printing and plotting
functions associated with it. As such, if we just print out the
object, it will give us some basic information about the results,
namely the number of axes, the value of $r$ chosen, and the fraction
of the variance explained by the top axes.
out.agpca
The output is a list with several elements. These are:
U
: The sample scores on the principal axes.
V
: The variable loadings before rotation. These are not used for
plotting --- they need to be rotated into the space of the samples
for the biplot representation.
QV
: The variable loadings rotated to be in the same space as the
sample scores. These are the variable loadings that should be
plotted for a biplot representation of the samples and the
variables.
vars
: The variances along the principal axes.
r
: The value of $r$ chosen by adaptivegpca, corresponding to how
much regularization to perform according to the structure of the
variables. $r = 0$ is standard PCA (variable structure is not
considered at all), and $r = 1$ is the maximal amount of
regularization.
evals
: The eigenvalues of the modified similarity matrix. This is
described in more detail in the paper, but it is an alternate way of
understanding how much regularization is performed by the
method. The larger the top eigenvalues are compared with the
subsequent ones, the more regularization there is toward the top
eigenvectors of the similarity matrix.
We can also plot the output from adaptivegpca
using a generic
plotting function. By default, this will produce a scree plot, as
shown below:
plot(out.agpca)
The plotting function has a type
argument, which can be either
scree
, samples
, or variables
. type = scree
will give a plot
showing the fraction of the variance explained by each of the axes,
type = samples
will show the sample scores along the principal axes,
and type = variables
will show the variable loadings on the
principal axes. The axes
argument allows the user to specify which
axes to plot, by default these are axes 1 and 2.
plot(out.agpca, type = "samples", axes = c(1,2)) plot(out.agpca, type = "variables", axes = c(1,2))
Finally, the inspectTaxonomy
function allows interactive inspection
of the taxa or variables plot, assuming you started the analysis with
a phyloseq object. This function will open a browser window with a
representation of the phylogenetic tree and the taxon loadings on the
adaptive gPCA axes. Taxa in this plot can be selected by "brushing"
the plot (drawing a rectangular box), and the selected taxa will be
highlighted on the phylogenetic tree and their taxonomic assignments
will be printed below. An example call is given below:
inspectTaxonomy(out.agpca, AntibioticPhyloseq)
The arguments to inspectTaxonomy
are
agpcafit
: The output from adaptivegpca
.
physeq
: A phyloseq object with a taxonomy table and a phylogenetic
tree. Should be the same one used for for the adaptive gPCA fit.
axes
: Which axes to plot. Defaults to the first two axes.
br.length
: Should the tree include branch lengths? The trees tend
to look better when the branches are all plotted as being the same
length, so this defaults to FALSE
.
height
: The height of the plotting area in pixels. Defaults to 600.
If you click the "done" button in the browser window, the app will close and the function will return a table describing the taxonomy, position along selected axes, and names of the selected taxa.
We can also choose the value of $r$ manually. In the previous section,
we described how to use the adaptivegpca
function, which chooses the
value of $r$ automatically, but if we would like to see the results for
different values of $r$ and to choose one manually, we can first use
the gpcaFullFamily
function to create a full set of models and then
use the visualizeFullFamily
function to visualize the biplots for
each value of $r$.
gpcaFullFamily
takes the same arguments as adaptivegpca
, along
with some additional arguments describing the form of the output.
The visualizeFullFamily
function is a shiny gadget. Calling this
function will open a browser window where you can visualize the data
set for a range of values of $r$. Clicking "done" in this window will
give as output an object of the same format as that given by the
adaptivegpca
function, the difference being that the value of $r$
was chosen manually instead of automatically. The
sample_data
/sample_mapping
and var_data
/var_mapping
arguments
allow you to customize the visualization. Without these arguments, the
first two axes will be plotted for both the samples and the
variables. If you include these arguments, you can customize the plots
by providing aesthetic mappings for ggplot. Note that the code below
is only included as an example and not evaluated in the vignette.
out.ff = gpcaFullFamily(pp$X, pp$Q, k = 2) out.agpca = visualizeFullFamily(out.ff, sample_data = sample_data(AntibioticPhyloseq), sample_mapping = aes(x = Axis1, y = Axis2, color = type), var_data = tax_table(AntibioticPhyloseq), var_mapping = aes(x = Axis1, y = Axis2, color = Phylum))
In either case, we can plot the results. This time we will do it
manually, using ggplot. The sample scores on the principal axes are
located in out.agpca$U
and the loadings of the variables on the
principal axes are located in out.agpca$QV
. To plot the samples, we
make a data frame that includes out.agpca$U
and the sample data, and
to plot the taxa we make a data frame containing out.agpca$QV
and
the taxonomy table.
ggplot(data.frame(out.agpca$U, sample_data(AntibioticPhyloseq))) + geom_point(aes(x = Axis1, y = Axis2, color = type, shape = ind)) + xlab("Axis 1") + ylab("Axis 2")
ggplot(data.frame(out.agpca$QV, tax_table(AntibioticPhyloseq))) + geom_point(aes(x = Axis1, y = Axis2, color = Phylum)) + xlab("Axis 1") + ylab("Axis 2")
Finally, note that the processPhyloseq
function also has an argument ca
(for correspondence analysis). This should be used with phyloseq
objects containing raw counts, and it will process a phyloseq object
so as to do an adaptive gPCA version of correspondence analysis (this
entails transforming counts to relative abunadnces, computing sample
weights based on the overall counts for the samples, and finally doing
a weighted centering of the relative abundances).
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.