knitr::opts_chunk$set(echo = TRUE)
PHATE is a tool for visualizing high dimensional single-cell data with natural progressions or trajectories. PHATE uses a novel conceptual framework for learning and visualizing the manifold inherent to biological systems in which smooth transitions mark the progressions of cells from one state to another. To see how PHATE can be applied to single-cell RNA-seq datasets from hematopoietic stem cells, human embryonic stem cells, and bone marrow samples, check out our preprint on BioRxiv. Running this tutorial should take approximately 10 minutes from start to finish.
Moon, van Dijk, Wang, Gigante et al. (2019), Visualizing structure and transitions in high-dimensional biological data, Nature Biotechnology https://doi.org/10.1038/s41587-019-0336-3.
If you haven't yet installed PHATE, you can find installation instructions in our GitHub README.
We'll install a couple more tools for this tutorial.
if (!require(viridis)) install.packages("viridis") if (!require(ggplot2)) install.packages("ggplot2") if (!require(readr)) install.packages("readr") if (!require(Rmagic)) install.packages("Rmagic")
If you have never used MAGIC, you should also install MAGIC from the command line as follows:
```{bash install_python_phate, eval=FALSE} pip install --user magic-impute
### Loading packages We load the phateR package and a few others for convenience functions. ```r library(phateR) library(ggplot2) library(readr) library(viridis) library(Rmagic)
In this tutorial, we will analyze myeloid and erythroid cells in mouse bone marrow, as described in Paul et al., 2015. You can run this tutorial with your own data by downloading https://raw.githubusercontent.com/KrishnaswamyLab/phateR/master/inst/examples/bonemarrow_tutorial.Rmd and opening it in RStudio. The example data is located in the PHATE GitHub repository and we can load it directly from the web.
# load data bmmsc <- read_csv("https://github.com/KrishnaswamyLab/PHATE/raw/master/data/BMMC_myeloid.csv.gz") bmmsc <- bmmsc[,2:ncol(bmmsc)] bmmsc[1:5,1:10]
First, we need to remove lowly expressed genes and cells with small library size.
# keep genes expressed in at least 10 cells keep_cols <- colSums(bmmsc > 0) > 10 bmmsc <- bmmsc[,keep_cols] # look at the distribution of library sizes ggplot() + geom_histogram(aes(x=rowSums(bmmsc)), bins=50) + geom_vline(xintercept = 1000, color='red')
# keep cells with at least 1000 UMIs keep_rows <- rowSums(bmmsc) > 1000 bmmsc <- bmmsc[keep_rows,]
We should library size normalize and transform the data prior to PHATE. Many people use a log transform, which requires adding a "pseudocount" to avoid log(0). We square root instead, which has a similar form but doesn't suffer from instabilities at zero.
bmmsc <- library.size.normalize(bmmsc) bmmsc <- sqrt(bmmsc)
Let's examine the raw data with PCA.
bmmsc_PCA <- as.data.frame(prcomp(bmmsc)$x)
Now we'll plot the results with ggplot2
. We'll color the plot by Mpo, a myeloid marker.
ggplot(bmmsc_PCA) + geom_point(aes(PC1, PC2, color=bmmsc$Mpo)) + labs(color="Mpo") + scale_color_viridis(option="B") ggsave('BMMSC_data_R_PCA.png', width=5, height=5)
Because the data is noisy, PCA doesn't tell us much - but we can see that the data is broadly separated along the first components by myeloid <-> erythroid cells.
Running PHATE is as simple as running the phate
function.
We'll just use the default parameters for now, but the following parameters can be tuned (read our documentation at https://cran.r-project.org/web/packages/phateR/phateR.pdf or by running help(phateR::phate)
to learn more):
knn
: Number of nearest neighbors (default: 5). Increase this (e.g. to 20) if your PHATE embedding appears very disconnected. You should also consider increasing knn
if your dataset is extremely large (e.g. >100k cells)decay
: Alpha decay (default: 40). Decreasing a increases connectivity on the graph, increasing a decreases connectivity. This rarely needs to be tuned. Set it to NULL
for a k-nearest neighbors kernel.t
: Number of times to power the operator (default: 'auto'). This is equivalent to the amount of smoothing done to the data. It is chosen automatically by default, but you can increase it if your embedding lacks structure, or decrease it if the structure looks too compact.gamma
: Informational distance constant between -1 and 1 (default: 1). gamma=1
gives the PHATE log potential, but other informational distances can be interesting. If most of the points seem concentrated in one section of the plot, you can try gamma=0
.# run PHATE bmmsc_PHATE <- phate(bmmsc)
Now we plot the results using ggplot2
.
ggplot(bmmsc_PHATE) + geom_point(aes(PHATE1, PHATE2, color=bmmsc$Mpo)) + labs(color="Mpo") + scale_color_viridis(option="B")
The branches are a little collapsed, so we can decrease the connectivity by decreasing k
from the default value of 5, and increasing a
from the default value of 15. We could also change t
from the automatic value printed in the PHATE output (here it is 10) - an increase in t
reduces noise, a decrease in t
can prevent the data from collapsing subtle structures. We'll pass in the old PHATE object as the init
argument, which can sometimes use precomputed intermediate data to prevent recomputing things that haven't changed.
bmmsc_PHATE <- phate(bmmsc, knn=4, decay=100, t=10, init=bmmsc_PHATE) ggplot(bmmsc_PHATE) + geom_point(aes(PHATE1, PHATE2, color=bmmsc$Mpo)) + labs(color="Mpo") + scale_color_viridis(option="B") ggsave('BMMSC_data_R_phate.png', width=5, height=5)
Much better! Now we can see more subtle structure in the erythroid branch, and the myeloid branch isn't so collapsed.
Many genes suffer from dropout to the point that coloring by a gene gives little to no information. Take for example Ifitm1, which is an stem cell marker.
ggplot(bmmsc_PHATE) + geom_point(aes(PHATE1, PHATE2, color=bmmsc$Ifitm1)) + labs(color="Ifitm1") + scale_color_viridis(option="B") ggsave('BMMSC_data_R_phate_colored_by_Ifitm1_before_MAGIC.png', width=5, height=5)
Even though we expect the central population to be entirely stem cells, many of these cells express no Ifitm1. Let's run MAGIC and try again. You can learn more about using MAGIC at https://github.com/KrishnaswamyLab/MAGIC.
bmmsc_MAGIC <- magic(bmmsc, t=4, genes="Ifitm1") ggplot(bmmsc_PHATE) + geom_point(aes(x=PHATE1, y=PHATE2, color=bmmsc_MAGIC$result$Ifitm1)) + scale_color_viridis(option="B") + labs(color="Ifitm1") ggsave('BMMSC_data_R_phate_colored_by_Ifitm1_after_MAGIC.png', width=5, height=5)
That looks reasonable - the stem cell population expresses the most Ifitm1, as expected. That's it, we're done!
If you have any questions or require assistance using PHATE, please contact us at https://krishnaswamylab.org/get-help.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.