This document is a user-friendly manual on how to use the Treefit package, which is the first toolkit for quantitative trajectory inference using single-cell gene expression data. In this tutorial, we demonstrate how to generate and analyze two kinds of toy data to help you become familiar with the practical workflow of Treefit. After learning some basics, we will use Treefit to perform more biologically interesting analysis in the next tutorial (vignette("working-with-seurat")).
While the Treefit package has been developed to help biologists who wish to perform trajectory inference from single-cell gene expression data, Treefit can also be viewed as a toolkit to generate and analyze a point cloud in d-dimensional Euclidean space (i.e., simulated gene expression data).
Treefit provides some useful functions to generate artificial datasets. For example, as we will now demonstrate, the function treefit::generate_2d_n_arms_star_data() creates data that approximately fit a star tree with a specified number of arms (branches). Here, the term star means a tree that looks like the star symbol "*"; for example, the letters Y and X can be viewed as star trees that have three and four arms, respectively.
The rows and columns of the generated data correspond to data points (i.e., n single cells) and their features (i.e., expression values of d different genes), respectively. The Treefit package can be used to analyze both raw count data and normalized expression data, but regarding the production of toy data, it is meant to be used to generate continuous data like normalized expression values.
Importantly, we can generate data with a desired level of noise by changing the value of the fatness parameter of this function. For example, if you set the fatness parameter to 0.0, then you will get precisely tree-like data without noise. By contrast, setting fatness to 1.0 gives very noisy data that are no longer tree-like. In this tutorial, we deal with two types of toy data whose fatness values are 0.1 and 0.8, respectively. We note that Treefit can be used to generate and analyze high-dimensional datasets, but we focus on generating 2-dimensional data to make things as simple as possible in this introductory tutorial.
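To build intuition for what the fatness parameter controls, here is a minimal base-R sketch. The helper toy_star() and its noise model (Gaussian noise with standard deviation fatness / 4) are assumptions made purely for illustration; this is not how treefit::generate_2d_n_arms_star_data() is implemented.

```r
# Hypothetical helper, NOT Treefit's implementation: place points along
# n_arms rays from the origin and add Gaussian noise scaled by `fatness`.
toy_star <- function(n_samples, n_arms, fatness, seed = 1) {
  set.seed(seed)
  arm <- sample(n_arms, n_samples, replace = TRUE)  # arm index of each point
  angle <- 2 * pi * (arm - 1) / n_arms              # direction of that arm
  radius <- runif(n_samples)                        # position along the arm
  # Assumed noise model: isotropic Gaussian with sd proportional to fatness.
  noise <- matrix(rnorm(2 * n_samples, sd = fatness / 4), ncol = 2)
  cbind(radius * cos(angle), radius * sin(angle)) + noise
}

thin <- toy_star(500, 3, 0.1)  # points stay close to the three rays
fat <- toy_star(500, 3, 0.8)   # points are smeared around the rays
dim(thin)                      # rows are points, columns are features
```

Calling plot(thin) and plot(fat) side by side should show the same qualitative contrast between a clear "Y" shape and a blurred cloud.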
Let us first generate 2-dimensional tree-like data that contain 500 data points and reasonably fit a star tree with three arms. We can create such data and draw a scatter plot of them (Figure 1) simply by executing the following two lines of code:
star.tree_like <- treefit::generate_2d_n_arms_star_data(500, 3, 0.1)
plot(star.tree_like)
Similarly, we can generate noisy data that do not look as tree-like as the previous data. In this tutorial, we simply change the value of the fatness parameter from 0.1 to 0.8 in order to obtain non-tree-like data. The following two lines of code yield the scatter plot shown in Figure 2:
star.less_tree_like <- treefit::generate_2d_n_arms_star_data(500, 3, 0.8)
plot(star.less_tree_like)
Having generated two datasets that differ in tree-likeness, we can analyze each of them. The Treefit package allows us to estimate how well the data can be explained by tree models and to predict how many "principal paths" there are in the best-fit tree. As shown in Figure 1, the points in the first dataset clearly form a shape like the letter "Y", and so the data are considered to fit a star tree with three arms very well, whereas Figure 2 indicates that the second dataset is no longer very tree-like because of the high level of added noise. Our goal in this tutorial is to reproduce these conclusions by using Treefit.
Let us estimate the goodness-of-fit between tree models and the first toy data. This can be done by using treefit::treefit() as follows. The name parameter is optional, but we recommend specifying it whenever possible because it makes the estimation easy to identify later.
fit.tree_like <- treefit::treefit(list(expression=star.tree_like), name="tree-like")
# Save the analysis result for use in other tutorials.
saveRDS(fit.tree_like, "fit.tree_like.rds")
fit.tree_like
treefit::treefit() returns a treefit object that summarizes the analysis of Treefit. We will explain how to interpret the results in the next section. For now, we focus on learning how to use Treefit.
As we will see later, it is helpful to visualize the results using plot(). By executing plot(fit.tree_like), we obtain the following two plots, which make it easier to interpret the results of the Treefit analysis.
plot(fit.tree_like)
We can analyze the second toy data in the same manner.
fit.less_tree_like <- treefit::treefit(list(expression=star.less_tree_like), name="less-tree-like")
# Save the analysis result for use in other tutorials.
saveRDS(fit.less_tree_like, "fit.less_tree_like.rds")
fit.less_tree_like
plot(fit.less_tree_like)
We can compare different results by passing all results to plot() as follows:
plot(fit.tree_like, fit.less_tree_like)
Before interpreting the previous results, we briefly summarize the process of the Treefit analysis that consists of the following three steps.
First, Treefit repeatedly "perturbs" the input data (i.e., adds some small noise to the original raw count data or normalized expression data) in order to produce many slightly different datasets that could have been acquired in the biological experiment.
Second, for each dataset, Treefit calculates a distance matrix that represents the dissimilarities between sample cells and then constructs a tree from each distance matrix. The current version of Treefit computes a minimum spanning tree (MST) that has been widely used for trajectory inference.
Finally, Treefit evaluates the goodness-of-fit between the data and tree models. The underlying idea of this method is that the structure of trees inferred from tree-like data tends to be more robust to noise than that of trees inferred from non-tree-like data. Therefore, Treefit measures the mutual similarity between estimated trees in order to check the stability of the tree structures. To this end, Treefit constructs a p-dimensional subspace that extracts the main features of each tree structure and then measures mutual similarities between the subspaces by using a special type of metric called the Grassmann distance. In principle, when the estimated trees are mutually similar in their structure, the mean and standard deviation (SD) of the Grassmann distance are small.
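The three steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not Treefit's implementation: the Gaussian noise model in perturb(), the helper names mst_edges() and tree_subspace(), and the use of unnormalized graph Laplacian eigenvectors to summarize each tree are all choices made here for the sake of a small runnable example.

```r
set.seed(42)

# Toy input: 60 points on three rays (a small "Y"-shaped point cloud).
arm <- rep(1:3, each = 20)
radius <- rep(seq(0.05, 1, length.out = 20), times = 3)
x <- cbind(radius * cos(2 * pi * arm / 3), radius * sin(2 * pi * arm / 3))

# Step 1 (assumed noise model): perturb the data with small Gaussian noise.
perturb <- function(x, sd = 0.02) x + rnorm(length(x), sd = sd)

# Step 2: minimum spanning tree via Prim's algorithm on Euclidean distances.
# Returns an (n - 1) x 2 integer matrix of edges.
mst_edges <- function(x) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  best_dist <- d[1, ]      # cheapest known connection of each vertex to the tree
  best_from <- rep(1L, n)  # tree vertex that provides that connection
  edges <- matrix(0L, nrow = n - 1, ncol = 2)
  for (k in seq_len(n - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]
    edges[k, ] <- c(best_from[j], j)
    in_tree[j] <- TRUE
    closer <- !in_tree & d[j, ] < best_dist
    best_dist[closer] <- d[j, closer]
    best_from[closer] <- j
  }
  edges
}

# Step 3 (assumption): summarize each tree by the p Laplacian eigenvectors
# with the smallest nonzero eigenvalues; their span is a p-dimensional subspace.
tree_subspace <- function(edges, n, p) {
  a <- matrix(0, n, n)
  a[edges] <- 1
  a[edges[, 2:1]] <- 1
  lap <- diag(rowSums(a)) - a
  vec <- eigen(lap, symmetric = TRUE)$vectors  # eigenvalues in decreasing order
  vec[, (n - 1):(n - p), drop = FALSE]         # skip the constant eigenvector
}

trees <- lapply(1:5, function(i) mst_edges(perturb(x)))
bases <- lapply(trees, tree_subspace, n = nrow(x), p = 2)
```

Each element of bases is an orthonormal basis of one subspace; the stability of the tree structure is then judged by how far apart these subspaces are from one another.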
Although the term "Grassmann distance" may sound unfamiliar to some readers, the concept appears in different guises in various practical contexts. For example, the Grassmann distance has a close connection to canonical correlation analysis (CCA). Treefit provides two Grassmann distances, $max_cca_distance and $rms_cca_distance, that can be used for different purposes, as we now explain.
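For readers who prefer code to definitions, the following base-R sketch shows how two such distances could be computed from the canonical correlations of a pair of subspaces. The formulas are assumptions inferred from the names (a distance based on the maximum canonical correlation and one based on the root mean square over all of them); the helper functions below are hypothetical and merely share their names with the result fields, so consult the package source for the exact definitions.

```r
# Canonical correlations between the subspaces spanned by the orthonormal
# columns of u and v: the singular values of t(u) %*% v.
canonical_correlations <- function(u, v) {
  pmin(svd(crossprod(u, v))$d, 1)  # clamp rounding error above 1
}

# Assumed formula: distance based on the largest canonical correlation.
max_cca_distance <- function(u, v) {
  sqrt(1 - max(canonical_correlations(u, v))^2)
}

# Assumed formula: root mean square over all canonical correlations.
rms_cca_distance <- function(u, v) {
  sqrt(mean(1 - canonical_correlations(u, v)^2))
}

# Two 2-dimensional subspaces of 3-dimensional space that share the x-axis.
u <- cbind(c(1, 0, 0), c(0, 1, 0))
v <- cbind(c(1, 0, 0), c(0, 0, 1))
max_cca_distance(u, v)  # 0, because the subspaces share one direction
rms_cca_distance(u, v)  # sqrt(1/2), because the other directions are orthogonal
```

Under these assumed definitions, both distances are zero when the two subspaces coincide and grow as the subspaces rotate away from each other.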
The Treefit analysis using the first Grassmann distance $max_cca_distance (shown in the left panel of Figure 5) tells us the goodness-of-fit between data and tree models. In principle, as mentioned earlier, if the mean and SD of $max_cca_distance are small, then this means that the estimated trees are mutually similar in their structure. As can be observed, the distance changes according to the dimensionality p of the feature space, but $max_cca_distance has the property that the value decreases monotonically as p increases for any dataset.
Comparing the Treefit results for the two datasets, we see that the mean Grassmann distance for the first data never exceeds that for the second data regardless of the value of p, and that the SD of the Grassmann distance for the first data is very small compared to the second data. These results imply that the estimated tree structures are very robust to noise in the first case but not in the second case. Thus, Treefit has verified that the first data are highly tree-like while the second data are not.
The Treefit analysis using the other Grassmann distance $rms_cca_distance (shown in the right panel of Figure 5) is useful to infer the number of "principal paths" in the best-fit tree. From a biological perspective, this analysis can be used to discover a novel or unexpected cell type from single-cell gene expression data (for details, see vignette("working-with-seurat")).
Unlike the previous Grassmann distance, the mean value of $rms_cca_distance can fluctuate depending on the value of p. Interestingly, we can predict the number of principal paths in the best-fit tree by exploring for which p the distance value reaches "the bottom of a valley" (i.e., attains a local minimum). More precisely, when $rms_cca_distance attains a local minimum at a certain p, the value p+1 indicates the number of principal paths in the best-fit tree. $n_principal_paths_candidates holds these p+1 values, so we do not need to calculate them manually. When Treefit produces a plot having more than one valley, the smallest p is usually the most informative for the prediction. The smallest p+1 value can be obtained by $n_principal_paths_candidates[1].
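The valley rule described above can be sketched in a few lines of base R. The helper local_minima() is hypothetical, it considers interior local minima only as a simplifying assumption, and the numbers below are made up for illustration rather than taken from a real Treefit result.

```r
# Hypothetical helper: positions at which a curve has a strict interior
# local minimum (the value is lower than both of its neighbors).
local_minima <- function(y) {
  n <- length(y)
  if (n < 3) return(integer(0))
  i <- 2:(n - 1)
  i[y[i] < y[i - 1] & y[i] < y[i + 1]]
}

# Made-up mean distances for p = 1, ..., 6 (NOT real Treefit output).
mean_rms <- c(0.60, 0.35, 0.50, 0.55, 0.48, 0.58)

p_at_valleys <- local_minima(mean_rms)  # valleys at p = 2 and p = 5
candidates <- p_at_valleys + 1          # candidate numbers of principal paths
candidates[1]                           # smallest candidate: 3
```

In practice there is no need to do this by hand, because the treefit object already exposes the candidates through $n_principal_paths_candidates.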
Comparing the Treefit results for the two datasets, we first see that both plots attain a local minimum at p=2. This means that for both datasets the best-fit tree has p+1=3 principal paths, which is correct because both were generated from the same star tree with three arms. Another important point is that the SD of the Grassmann distance for the first data is very small at p=2 compared to that for the second data; in other words, Treefit made this prediction more confidently for the first dataset than for the second one. This result is reasonable because the first dataset is much less noisy than the second one. Thus, Treefit has correctly determined the number of principal paths in the underlying tree together with the goodness-of-fit for each dataset.