library(knitr) opts_chunk$set( fig.pos = "!h", out.extra = "", fig.align = "center" )
We will use a synthetic dataset to illustrate the functionalities of the condiments package. We start directly with a dataset where the following steps are assumed to have been run:
# For analysis library(condiments) library(slingshot) # For data manipulation library(dplyr) library(tidyr) # For visualization library(ggplot2) library(RColorBrewer) library(viridis) set.seed(2071) theme_set(theme_classic())
data("toy_dataset", package = "condiments") df <- toy_dataset$sd
As such, we start with a matrix df
of metadata for the cells: coordinates in a reduced dimension space (Dim1, Dim2)
, a vector of conditions assignments conditions
(A or B) and a lineage assignment.
We can first plot the cells on the reduced dimensions
p <- ggplot(df, aes(x = Dim1, y = Dim2, col = conditions)) + geom_point() + scale_color_brewer(type = "qual") p
We can also visualize the underlying skeleton structure of the two conditions.
p <- ggplot(df, aes(x = Dim1, y = Dim2, col = conditions)) + geom_point(alpha = .5) + geom_point(data = toy_dataset$mst, size = 2) + geom_path(data = toy_dataset$mst, aes(group = lineages), size = 1.5) + scale_color_brewer(type = "qual") + facet_wrap(~conditions) + guides(col = FALSE) p
We can then compute the imbalance score of each cell using the imbalance_score function.
scores <- imbalance_score(Object = df %>% select(Dim1, Dim2) %>% as.matrix(), conditions = df$conditions) df$scores <- scores$scores df$scaled_scores <- scores$scaled_scores
There are two types of scores. The raw score is computed on each cell and looks at the condition distribution of its neighbors compared the the overall distribution. The size of the neighborhood can be set using the k
argument, which specify the number of neighbors to consider. Higher values means more local imbalance.
ggplot(df, aes(x = Dim1, y = Dim2, col = scores)) + geom_point() + scale_color_viridis_c(option = "C")
The local scores are quite noisy so we can then use local smoothers to smooth the scores of individual cells. The smoothness is dictated by the smooth
argument. Those smoothed scores were also computed using the imbalance_score
function.
ggplot(df, aes(x = Dim1, y = Dim2, col = scaled_scores)) + geom_point() + scale_color_viridis_c(option = "C")
As could be guessed from the original plot, the bottom lineage shows a lot of imbalance while the top one does not. The imbalance score can be used to check: + If the integration has been successful. In general, some regions should be balanced + To identify the regions of imbalance for further analyses.
The first step of our workflow is to decide whether or not to infer the trajectories separately or not. On average, it is better to infer a common trajectory, since a) this allow for a wider range of downstream analyses, and b) more cells are used to estimate the trajectory. However, the condition effect might be strong enough to massively disrupt the differentiation process, which would require fitting separate trajectories.
slingshot[@Street2018a] relies on a reduced dimensionality reduction representation of the data, as well as on cluster labels. We can visualize those below:
ggplot(df, aes(x = Dim1, y = Dim2, col = cl)) + geom_point()
The topologyTest assess the quality of the common trajectory inference done by slingshot and test whether we should fit a common or separate trajectory. This test relies on repeated permutations of the conditions followed by trajectory inference so it can take a few seconds.
rd <- as.matrix(df[, c("Dim1", "Dim2")]) sds <- slingshot(rd, df$cl) ## Takes ~1m30s to run top_res <- topologyTest(sds = sds, conditions = df$conditions) knitr::kable(top_res)
The test clearly fails to reject the null that we can fit a common trajectory so we can continue with the sds
object. This will facilitate downstream analysis. For an example of how to proceed if the topologyTest reject the null, we invite the user to refer to relevant case study used in our paper.
We can thus visualize the trajectory
ggplot(df, aes(x = Dim1, y = Dim2, col = cl)) + geom_point() + geom_path(data = slingCurves(sds, as.df = TRUE) %>% arrange(Order), aes(group = Lineage), col = "black", size = 2)
Even though we can fit a common trajectory, it does not mean that the cells will differentiate similarly between the conditions. The first question we can ask is: for a given lineage, are cells equally represented along pseudotime between conditions?
psts <- slingPseudotime(sds) %>% as.data.frame() %>% mutate(cells = rownames(.), conditions = df$conditions) %>% pivot_longer(starts_with("Lineage"), values_to = "pseudotime", names_to = "lineages")
ggplot(psts, aes(x = pseudotime, fill = conditions)) + geom_density(alpha = .5) + scale_fill_brewer(type = "qual") + facet_wrap(~lineages) + theme(legend.position = "bottom")
The pseudotime distributions are identical across conditions for the first lineage but there are clear differences between the two conditions in the second lineage.
To test for differential progression, we use the progressionTest. The test can be run with global = TRUE
to test when pooling all lineages, or lineages = TRUE
to test every lineage independently, or both. Several tests are implemented in the progressionTest. function. Here, we will use the default, the custom KS test [@smirnov1939estimation].
prog_res <- progressionTest(sds, conditions = df$conditions, global = TRUE, lineages = TRUE) knitr::kable(prog_res)
As expected, there is a global difference over all lineages, which is driven by differences of distribution across lineage 2 (i.e. the bottom one).
Even though we can fit a common trajectory, it does not mean that the cells will differentiate similarly between the conditions. The first question we can ask is: for a given lineage, are cells equally between the two lineages for the two conditions?
Visualizing differences 2D distributions can be somewhat tricky. However, it is important to note that the sum of all lineage weights should sum to 1. As such, we can only plot the weights for the first lineage.
df$weight_1 <- slingCurveWeights(sds, as.probs = TRUE)[, 1] ggplot(df, aes(x = weight_1, fill = conditions)) + geom_density(alpha = .5) + scale_fill_brewer(type = "qual") + labs(x = "Curve weight for the first lineage")
The distribution has tri modes, which is very often the case for two lineages: + A weight around 0 represent a cell that is mostly assigned to the other lineage (i.e. lineage 2 here) + A weight around .5 represent a cell that is equally assigned to both lineages, i.e. before the bifurcation. + A weight around 1 represent a cell that is mostly assigned to this lineage (i.e. lineage 1 here)
In condition A, we have many more cells with a weight of 0 and, since those are density plots, fewer cells with weights around .5 and 1. Visually, we can guess that cells in condition B differentiate preferentially along lineage 1.
To test for differential fate selection, we use the fateSelectionTest. The test can be run with global = TRUE
to test when pooling all pairs of lineages, or pairwise = TRUE
to test every pair independently, or both. Here, there is only one pair so the options are equivalent. Several tests are implemented in the fateSelectionTest. function. Here, we will use the default, the classifier test[@Lopez-Paz2016].
set.seed(12) dif_res <- fateSelectionTest(sds, conditions = df$conditions, global = FALSE, pairwise = TRUE) knitr::kable(dif_res)
As could be guessed from the plot, we have clear differential fate selection.
The workflow above focus on global differences, looking at broad patterns of differentiation. While this is a necessary first step, gene-level information is also quite meaningful.
To do so requires tradeSeq[@VandenBerge2020] > 1.3.0. Considering that we have a count matrix counts
, the basic workflow is:s
library(tradeSeq) sce <- fitGAM(counts = counts, sds = sds, conditions = df$conditions) cond_genes <- conditionTest(sds)
For more details on fitting the smoothers, we refer users to the tradeSeq website and to the accompanying Bioconductor workshop.
For both of the above procedures, it is important to note that we are making multiple comparisons (in this case, 5). The p-values we obtain from these tests should be corrected for multiple testing, especially for trajectories with a large number of lineages.
That said, trajectory inference is often one of the last computational methods in a very long analysis pipeline (generally including gene-level quantification, gene filtering / feature selection, and dimensionality reduction). Hence, we strongly discourage the reader from putting too much faith in any p-value that comes out of this analysis. Such values may be useful suggestions, indicating particular features or cells for follow-up study, but should generally not be treated as meaningful statistical quantities.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.