library(knitr) opts_chunk$set(fig.width = 8, fig.height = 4)
Here we will describe how to use the treeDA package. The package provides functions to perform sparse discriminant analysis informed by the tree. The method was developed for microbiome data, but it could in principle be applied to any data with the same tree structure. The idea behind the package is that when we have predictor variables which are structured according to a tree, the mean values of the predictor variables at each node in the tree are natural predictor variables, and can be used in addition to the initial predictors defined at the leaves. For microbiome data, this means using both the abundances of the initial set of taxa as well as the abundances "pseudo-taxa", which correspond to nodes in the tree and are the agglomeration of all the taxa which descend from that node.
Without regularization, using both sets of predictors would yield an ill-defined problem because the node predictors are linear combinations of the leaf predictors. However, when we add regularization, the problem becomes well-posed and we can obtain a unique solution. Intuitively, the regularization allows us to incorporate the intuition that a solution where one node is selected is more parsimonious than one in which all the leaves descending from that node are selected.
This package is based on the implementation of sparse discriminant
analysis implemented in the
sparseLDA
package. The main function, treeda
, creates the node and leaf
predictors, performs sparse discriminant analysis on the combination
of node and leaf predictors, and then translates the results back in
terms of leaf predictors only. The package also includes functions to
perform cross-validation and plotting, which will be demonstrated in
this vignette.
Our first step is to load the required packages and data. We will
illustrate the method on an antibiotic dataset (AntibioticPhyloseq
)
provided by the package adaptiveGPCA
. Note that no other elements of
the adaptiveGPCA
package are used in this tutorial. The antibiotic
dataset consists of measurements taken from three subjects before,
during, and after taking each of two courses of an antibiotic. The
major groupings in the data are by subject (called ind
in the
phyloseq object) and by the the antibiotic condition. The antibiotic
treatment is discretized into abx/no abx in a variable called type
,
where abx corresponds to samples taken when the subject was taking the
antibiotic and the week following, and no abx corresponds to all the
other samples.
library(treeDA) library(ggplot2) library(phyloseq) library(adaptiveGPCA) library(Matrix) data(AntibioticPhyloseq) theme_set(theme_bw())
The main function in the package is called treeda
. It takes a response
vector giving the classes to be separated, a matrix of predictor
variables which are related to each other by a tree, the tree which
describes the relationships between the predictor variables, and the
sparsity (p, the number of predictors to use). In the antibiotic
dataset, we have several potential discriminatory variables. One of
these describes whether the sample was taken during or immediately
after the subject was subjected to antibiotics, and we can try to find
taxa which discriminate between these two groups using the following
command:
out.treeda = treeda(response = sample_data(AntibioticPhyloseq)$type, predictors = otu_table(AntibioticPhyloseq), tree = phy_tree(AntibioticPhyloseq), p = 15)
Here the output of the model is stored in an object called
out.treeda
. The print function will give an overview of the fitted
model, including the number of predictors used and the confusion
matrix for the training data.
out.treeda
From this, we see that 15 predictors were used (since this was what we specified in the initial call to the function). These predictors potentially include nodes in the tree (corresponding to taxonomic clades) and leaves on the tree (corresponding to individual species). The combination of nodes and leaves can be written purely in terms of the leaves (or species, or OTUs), in which case the model is using 903 of the leaves. This indicates that some of the nodes which were selected as predictive were quite deep in the tree and corresponded to large groups of taxa.
Finally, the confusion matrix shows us how well the model does on the trainnig data: we see that a total of 16 cases were classified incorrectly, split approximately evenly between cases which were actually from the abx condition and those which were actually from the no abx condition.
The object containing the output from the fit also contains other information. These are:
means
: The mean value of each predictor. This is only included if
the call to treeda
included center = TRUE
, in which case the
means are stored so that new data can be centered using the mean
values from the training data.
sds
: The standard deviation of each predictor. Like with the
means, this is only included if the call to treeda
included scale
= TRUE
, in which case the standard deviations are stored so that
the new data can be scaled using the standard devaiations from the
training data.
leafCoefficients
: A representation of the discriminating axis
using only the leaves. This is a list containing beta
, which are the
coefficients, and intercept
, which is the constant term.
input
: A list containing the response, predictors, and tree used
to fit the model.
nPredictors
: The number of predictors (in the node + leaf space)
used in the model.
nLeafPredictors
: The number of predictors in the leaf space used
in the model.
sda
: The sda object used in fitting the model.
class.names
: The names of the classes to be discriminated between.
projections
: The projections of the observations on the
discriminating axes.
classProperties
: The prior probabilities, mean in discriminating
space, and variance in the discriminating space of the classes.
predictedClasses
: Predicted classes for each observation.
rss
: Residual sum of squares: the sum of squared distances between
each observation and its class mean in the discriminating space.
Once we have fit the model, we can look at the samples projected onto
the discriminating axis. These projections are found in
out.treeda$projections
, and we can see them plotted for the
antibiotic data below. In the figure below we also separate out the
samples by individual to see whether the model works better for some
individuals than others. We see that positive scores along the
discriminating axis correspond to the no abx condition, and that there
is some difference between the individuals but that the quality of the
model is approximately the same across the three subjects.
ggplot(data.frame(sample_data(AntibioticPhyloseq), projections = out.treeda$projections)) + geom_point(aes(x = ind, y = projections, color = type))
We can also look at the coefficient vector describing the
discriminating axis using the plot_coefficients
function. This gives
a plot of the tree with the leaf coefficients aligned underneath.
plot_coefficients(out.treeda)
For comparison, we can look at the results when we try to discriminate between individuals instead of between the abx/no abx conditions. We try this with the same amount of sparsity, p = 15.
out.treeda.ind = treeda(response = sample_data(AntibioticPhyloseq)$ind, predictors = otu_table(AntibioticPhyloseq), tree = phy_tree(AntibioticPhyloseq), p = 15) out.treeda.ind
In this case, since we have three classes we obtain two discriminating axes, each of which uses 15 node or leaf predictors for a total of 30 predictors. This corresponds to only 85 leaves on the tree, indicating that the nodes which were chosen corresponded to individual leaves or to much smaller clades than when our aim was to discriminate between the abx and no abx conditions. We can see this more clearly when we look at the coefficient plot, where there are many more singleton leaves with non-zero coefficients than we saw in the corresponding plot for the abx/no abx model. Note that this model contains two discriminating axes because we have three classes, while the abx/no abx model had only one discriminating axis because there were two classes.
plot_coefficients(out.treeda.ind)
We would often like to choose the sparsity level automatically instead of manually. A common way of doing this is by cross validation, which we have implemented in the function treedacv. It takes most of the same arguments as as treeda: a vector containing the response, or the classes for each of the observations, a matrix of predictors which are related to each other by a tree, and the tree. In addition, the number of folds for the cross validation needs to be specified (the folds argument), and a vector giving the levels of sparsity to be compared by cross validation (the pvec argument). The folds argument can be given either as a single number, in which case the observations will be partitioned into that number of folds, or as a vector assigning each observation to a fold. In this case, the vector should have length equal to the number of observations, and the elements in the vector should be integers between 1 and the number of desired folds assigning the observations to a fold.
Here we are using four-fold cross validation, discriminating between
the individuals in our dataset, and comparing levels of sparsity
between 1 and 15. When we print the output from treedacv
, it tells us
both which value of p (amount of sparsity) corresponded to the minimum
CV error, and what the smallest value of p was which was within one
standard error of the minimum CV error. (The intuition behind using
this value of p instead of that with the minimum CV error is that we
would like the most parsimonious model which is statistically
indistinguishable from that with the minimum CV error). For us, the
minimum CV error is at 11, but if we were following the one standard
error rule we would use 7.
set.seed(0) out.treedacv = treedacv(response = sample_data(AntibioticPhyloseq)$type, predictors = otu_table(AntibioticPhyloseq), tree = phy_tree(AntibioticPhyloseq), folds = 4, pvec = 1:15) out.treedacv
The results from the cross validation are stored in
out.treedacv$loss.df
. This data frame contains the CV error for each
fold, the mean CV error, and the standard error of the CV error for
each value of p. We can use this matrix to plot the CV error as a
function of the sparsity, or we can use the plotting function defined
by the package, as shown below.
plot(out.treedacv)
This plot confirms what we said earlier: 11 predictors corresponds to the minimum cross validation error, and 7 predictors corresponds to the sparsest solution which is within 1 standard error of the minimum cross validation error.
We can then fit the model with 11 predictors to all the data and look at the plot of the coefficients along the discriminating axis.
out.treeda.11 = treeda(response = sample_data(AntibioticPhyloseq)$type, predictors = otu_table(AntibioticPhyloseq), tree = phy_tree(AntibioticPhyloseq), p = 11) out.treeda.11
plot_coefficients(out.treeda.11)
From the coefficient plot above, we might be interested in the
relatively large group of taxa with the largest positive
coefficients. Since the samples in the abx
condition have positive
scores on the discriminating axis, taxa with positive coefficients are
over-represented in the abx
condition. We can find out what these
are by examining the leaf coefficient vector. We first convert the
Matrix object containing the leaf coefficients into a vector, then
find all the taxa which have the maximum positive coefficient, and
then print out the unique elements of the taxonomy table corresponding
to those taxa. We see that this is a group of 74 Lachnospiraceae. They
are mostly not annotated beyond the family level, but one is annotated
as being from the genus Moryella.
coef = as.vector(out.treeda.11$leafCoefficients$beta) taxa.max = which(coef == max(coef)) length(taxa.max) unique(tax_table(AntibioticPhyloseq)[taxa.max,])
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.