knitr::opts_chunk$set( collapse = TRUE, comment = "##" )
This vignette describes tree reconstruction procedure and basic routines to explore gene expression patterns associated with the tree. It demonstrates application of the tree analysis to neural crest system. The guideline starts with processed data, including normalized gene expression matrix and t-SNE embedding, shows how to reconstruct the tree, analyse transcriptional events along the tree and provides a number of visualization routines.
library(igraph) library(mgcv) library(quadprog) library(pcaMethods) library(Rcpp) library(inline) library(RcppArmadillo) library(pbapply) library(glmnet) library(crestree)
Tree reconstruction requires a gene expression matrix. Addditionally, exploration of the tree on top of existing 2D embedding significantly simplifies interpretation. It is thus important to provide embedding of cells (e.g. t-SNE, PCA etc.). The following command uploads processed neural crest data ( PAGODA pipeline used to preprocess the data can be found here), including expression matrix and t-SNE embedding:
data(crest)
Data list crest
contains embedding emb
, colors of clusters clcol
and a vector of neural crest-derived cells nc.cells
in the dataset:
emb <- crest$emb str(emb)
clcol <- crest$clcol str(clcol)
nc.cells <- crest$nc.cells str(nc.cells)
Here is a visualization of embedding with cells colored by clusters, and discriminated neural crest and neural tube cells:
par(mfrow=c(1,1),mar=c(4,4,1,1)) plot(crest$emb,col=crest$clcol,pch=ifelse( rownames(crest$emb)%in%crest$nc.cells,19,1),cex=0.2,xlab="tSNE 1",ylab="tSNE 2") legend("bottomright",c("neural crest","neural tube"),pch=c(19,1),cex=0.2)
The data crest
contains matrix of expression levels normalized to cell size fpm
and expression levels adjusted for mean-variance trend wgm
:
fpm <- crest$fpm str(fpm)
wgm <- crest$wgm str(wgm) wgwm <- crest$wgwm # matrix of expression weights
Of note, matrices contain only 1169 the most over-dispersed genes. Alternatively, we can upload the full matrix from a web server (it can take some time):
fpm <- read.table("http://pklab.med.harvard.edu/ruslan/neural_crest/fpm.txt",header=TRUE) fpm <- as.matrix(fpm) str(fpm)
Algorithm has a number of important parameters to be selected (they can be defined by default, but it is good to have control over them), in particular cell-cell distance metrics
(cosine-based or euclidean), number of tree principal points (PPs) M
and tree parameters lambda
(stringency of the tree) and sigma
(fuzziness of cells to principal points assignment):
metrics <- "cosine" M <- length(nc.cells) # use as many pricipal points as the number of cells
Now we can model a parsimonious tree using fpm
experssion matrix:
lambda <- 150 sigma <- 0.015 z <- ppt.tree(X=fpm[rownames(wgm),nc.cells], emb=emb, lambda=lambda, sigma=sigma, metrics=metrics, M=M, err.cut = 5e-3, n.steps=50, seed=1, plot=FALSE)
The reconstructed tree z
, that is modeled in high-dimensional expression space, can be visualized on top of embedding emb
using plotppt
routine:
plotppt(z,emb,tips=FALSE,cex.tree = 0.1,cex.main=0.2,lwd.tree = 1)
We next switch to expression matrix wgm
with weights wgmw
used in the paper. Of note, optimal tree parameters lambda
and sigma
are sensitive to the data properties, such as dataset size or choice of expression matrices. In section "Selection of optimal tree parameters" we discuss a strategy of parameters selection and suggest two guiding routines. Below the tree is modeled and visualized with a new choice of expression matrices:
lambda <- 250 sigma <- 0.04 ppt <- ppt.tree(X=wgm[,nc.cells], W=wgwm[,nc.cells], emb=emb, lambda=250, sigma=0.04, metrics="cosine", M=M, err.cut = 5e-3, n.steps=30, seed=1, plot=FALSE) plotppt(ppt,emb,tips=FALSE,cex.tree = 0.1,cex.main=0.2,lwd.tree = 1)
Optionally, stable properties of the tree can be assessed using sampling of cells. Below we generate 20 trees through subsampling of 90% of cells without replacement:
ppt_ensemble <- bootstrap.ppt(X=wgm[,nc.cells], W=wgwm[,nc.cells], emb=emb, metrics=metrics, M=as.integer(length(nc.cells)*0.9), lambda=lambda, sigma=sigma, plot=FALSE, n.samples=20, seed=NULL,replace=FALSE)
Sampling of trees can be visualized on embedding using routing plotpptl
:
plotpptl(ppt_ensemble,emb, cols=adjustcolor("grey",alpha=0.1),alpha=0.05, lwd=1)
Now we can prepare the tree for downstream analysis. For that, we will remove small spurious branches, orient the tree by assigning a root and project cells onto the tree.
While major stable branches reflect biologically strong signal, small spurious branches likely reflect artifacts or incomplete convergence of the algorithm. Tips and forks of the tree can be explored on top of the embedding with flags tips=TRUE
and forks=TRUE
that show ids of principal points of tips and forks. For example, visually five leaves (380, 178, 98, 99, 572)
correspond to notable branches, while leave 295
reflects spurious branch:
plotppt(ppt,emb,tips=TRUE,forks=FALSE,cex.tree = 0.2,lwd.tree = 2)
Spurious branchs are removed using cleanup.branches
routine, which suggests a number of criterion to eliminate undesired branches. Below we retain only tips.number
tips of the tree that maximally preserve the tree structure (alternatively, we could directly supply a vector of tip ids tips.remove
for removal):
ppt <- cleanup.branches(ppt,tips.remove = c(139,295))
Of note, after removing spurious branches, numeration of the remaining principal point changes:
plotppt(ppt,emb,tips=TRUE,forks=FALSE,cex.tree = 0.2,lwd.tree = 2)
The tree does not provide information about directionality of dynamics. Selection of a tree root with routine setroot
is sufficient to orient the tree:
ppt <- setroot(ppt,root=355)
Finally, each cell is projected onto the tree. It provides estimates of a cell pseudotime and branch. Probabilistic position of a cell on the tree is naturally delivered by the method and stored in the field R
of tree object. For example, probailistic distribution of a given cell on the tree is shown below:
cell <- nc.cells[2] # choose a cell pprobs <- ppt$R[cell,] # probabilities of tree projections plotppt(ppt,emb,pattern.tree = ppt$R[cell,],cex.tree = 1,lwd.tree = 0.1) # plot probabilities using pattern.tree parameter points(emb[cell,1],emb[cell,2],cex=1,pch=19,col="black") # show cell position on embedding
We next use routine project.cells.onto.ppt
to assign maximum likelihood projection of each cell on the tree and estimate cells pseudotime (if emb
is supplied than the routine plots cells colored by branch position). To account for uncertainty in cell projections, we can sample n.mapping
probabilistic mappings of cells onto the tree:
ppt <- project.cells.onto.ppt(ppt,emb,n.mapping = 100)
We are ready to study gene expression patterns along the tree. The first step is to identify genes that have expression levels significantly varied along the tree (tree-associated genes). Routine test.associated.genes
estimates significance of each gene's association with the tree using an input expression matrix, e.g. fpm
:
ppt <- test.associated.genes(ppt,n.map=1,fpm,summary=TRUE)
A field stat.association
of ppt
provides summary statistics of genes association, including amplitude of changes along the tree A
, p-value pval
, B-H adjustment for multiple testing fdr
and binary classification sign
of differential expression along the tree. Also, robustness of differential expression is estimated as a fraction st
of probabilistic projections (if n.mappings
> 1 in project.cells.onto.ppt) when a gene was detected as differentially expressed.
head(ppt$stat.association[order(ppt$stat.association$pval),])
knitr::kable(head(ppt$stat.association[order(ppt$stat.association$pval),]))
Only differentially expressed genes (TRUE in sign
) are later used to model expression patterns along the tree. A set of differentially expressed genes can be manually modified in a column sign
. In the original paper, differentially expressed genes (vector genes.tree
in crest data) were estimated based on 100 probabilistic cell mappings, here we define only them as significant (to avoid time-consuming calculations):
genes.tree <- crest$genes.tree ppt$stat.association$sign <- FALSE ppt$stat.association[genes.tree,]$sign <- TRUE
Now expression levels of differentially expressed genes can be modeled as a function of pseudotime along the tree.
ppt <- fit.associated.genes(ppt,fpm,n.map=1)
There are different ways to visualize expression trends of a gene along the tree. For example, as a function of pseudotime:
gene <- "Neurog2" visualise.trajectory(ppt,gene,fpm[gene,],cex.main = 3,lwd.t2=0.5)
The other way is to show how fitted expression levels fit.summary
change along the tree on the embedding:
par(mar=c(4,4,3,1)) plotppt(ppt,emb,pattern.cell = ppt$fit.summary[gene,],gene="Neurog2",cex.main=1,cex.tree = 1.0,lwd.tree = 0.1,par=FALSE)
We can now use matrix of expression profiles fit.summary
smoothed along the tree to cluster differentially expressed genes and explore major tree-associated patterns of expression. First, lets select a subset of genes that have large magnitude of variability along the tree:
genes <- rownames(ppt$stat.association)[ppt$stat.association$sign==TRUE & ppt$stat.association$A > 2]
str(genes)
Then smoothed expression profiles can be clustered using a variety of methods. Clusters of genes can be explored using visualise.clusters
visualization routine, using as a default hierarchical clustering with Ward linkage and cosine-based similarity with predefined number of clust.n
clusters:
visualise.clusters(ppt,emb,clust.n = 10,cex.gene=1,cex.cell=0.05,cex.tree=0.2)
Alternatively, it is possible to provide a vector of gene clusters for visualization. Below we use hierarchical clustering with euclidean distance to cluster genes:
hc <- hclust(dist(ppt$fit.summary[genes.tree,]),method="ward.D") # hierarchical clustering clust <- cutree(hc,10) # partition of genes in 4 clusters str(clust)
And supply a vector clust
for visualization:
visualise.clusters(ppt,emb,clust=clust,cex.gene=1,cex.cell=0.05,cex.tree=0.2)
In some cases a subtree of the tree, for example a single trajectory, is of particular interest. A set of routines used to select subtree, visualize gene patterns along the subtree and provide genes associated with the subtree. Below we choose a single trajectory:
plotppt(ppt,emb[,],tips=TRUE,tree.col = ppt$pp.info$color,forks=TRUE,cex.tree = 1,lwd.tree = 0.1) # visualize tree tips zseg <- extract.subtree(ppt,c("355","165")) # select root and terminal leave of the trajectory
Explore at expression patterns of a gene along selected subtree, defined by zseg
, using additional parameter subtree
with two visualization options:
plotppt(ppt,emb,gene=gene,mat=fpm,cex.main=1,cex.tree = 1.5,lwd.tree = 0.1,subtree=zseg)
visualise.trajectory(ppt,gene,fpm,cex.main = 3,subtree = zseg,lwd.t2=1)
We also can assess differential expression along the subtree:
stat.subtree <- test.associated.genes(ppt,n.map=1,fpm,subtree = zseg)
Resulting stat.subtree
is a summary table of genes associated with the subtree:
head(stat.subtree[order(stat.subtree$pval),])
knitr::kable(head(stat.subtree[order(stat.subtree$pval),]))
Expression levels of a transcription factor (TF) do not yet inform about its regulatory impact on target genes. Here we use coordinated changes of expression levels of TF target genes as a readout of regulatory impact. For that, we use a matrix of predicted target-TF scores (generated as described in the paper):
str(crest$motmat)
Smoothed expression levels ppt$fit.list[[1]]
of targets are modeled as a linear combination of unknown TF activities using lasso regression:
act <- activity.lasso(ppt$fit.list[[1]],crest$motmat) dim(act)
Matrix act
contains predicted activity of each TF in each cell. For example, tree-projected pattern of Neurog2 activity indicates its regulatory impact in sensory branch:
tf <- "Neurog2" par(mar=c(4,4,3,1)) plotppt(ppt,emb,pattern.cell = act[tf,],gene=tf,cex.main=0.5,cex.tree = 0.5,lwd.tree = 0.1,par=FALSE,pallete = colorRampPalette(c("darkgreen","gray50","orange")) )
A particularly interesting implication of the tree is analysis of bifurcation point. Usually, the first step of such analysis is infererence of genes that are differentially expressed between two post-bifurcation branches. Bifurcaiton point is formalized as a fork consisting of a root and two leaves. Below we select a root and two leaves:
plotppt(ppt,emb,tips=TRUE,forks=FALSE,cex.tree = 0.2,lwd.tree = 2)
root <- 355 leaves <- c(165,91)
A routine test.fork.genes
performs assessment of genes differentially expression between post-bifurcation branches:
fork.de <- test.fork.genes(ppt,fpm[,],root=root,leaves=leaves,n.mapping = 1)
A table fork.de
contains summary statistics of fold change effect
, p-value p
and adjusted p-value fdr
of differential expression between branches, magnitude pd1.a
and p-value pd1.p
of expression changes from derivative branch 1 to progenitor branch:
head(fork.de[order(fork.de$p),],)
knitr::kable(head(fork.de[order(fork.de$p),],))
See manual analysis of bifurcation point for detailed analysis.
Choice of parameters sigma
and lambda
for tree reconstruction is of crucial importance. We suggest a combination of formal criteria and exploratory analysis for selection of parameters. First, parameter sigma
is selected as an optimum of cross validation upon lambda
=0:
sig <- sig.explore(X=wgm[,nc.cells],metrics="cosine",sig.lims=seq(0.01,0.1,0.01),plot=TRUE)
Optimum sigma:
sig
Parameter lambda
is selected upon optimal sigma
using entropy criteria. However, the estimate is not fully robust. Using routine lambda.explore
we additionally show trees for two intermediate lambda
parameters and leave a final choice or further exploration to the user:
lambda.stat <- lambda.explore(X=wgm[,nc.cells],M=length(nc.cells),metrics="cosine",emb=emb,sigma=sig,base=2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.