This vignette describes analysis of individual bifurcation points based on a reconstructed transcriptional tree. As example, it explores bifurcation point between sensory and autonomic nervous systems in neural crest. The guideline starts with tree reconstruction, identifies fate-specific genes and estimates timing of their activation, assess existence and formation of fate-biases, and predicts time of genes inclusion in fate-biased phase.
library(igraph)
library(mgcv)
library(quadprog)
library(pcaMethods)
library(Rcpp)
library(inline)
library(RcppArmadillo)
library(pbapply)
library(crestree)
library(ggplot2); library(gridExtra); library(grid);
data(crest)
emb <- crest$emb
clcol <- crest$clcol
nc.cells <- crest$nc.cells
wgm <- crest$wgm
wgwm <- crest$wgwm # matrix of expression weights
fpm <- read.table("http://pklab.med.harvard.edu/ruslan/neural_crest/fpm.txt",header=TRUE)
fpm <- as.matrix(fpm)
genes.tree <- crest$genes.tree
See manual the guide for tree reconstruction for detailed instructions. The tree can be either manually reconstructed:
M <- length(nc.cells);
lambda <- 250;
sigma <- 0.04
ppt <- ppt.tree(X=wgm[,nc.cells], W=wgwm[,nc.cells], emb=emb, lambda=250, sigma=0.04, metrics="cosine", M=M, err.cut = 5e-3, n.steps=30, seed=1, plot=FALSE)
ppt <- cleanup.branches(ppt,tips.remove = c(139,295))
ppt <- setroot(ppt,355)
ppt <- project.cells.onto.ppt(ppt,n.mapping = 100)
Alternatively, the tree object used in the paper can be downloaded from:
ppt <- readRDS(url("http://pklab.med.harvard.edu/ruslan/neural_crest/tree_structure_full.rds"))
Bifurcation point is charactarized by a progenitor and derivative branches.
plotppt(ppt,emb,tips=TRUE,forks=FALSE,cex.tree = 0.2,lwd.tree = 2)
We thus start with selection a root of progenitor branch (355) and two leaves of derivative branches (165 and 91):
root <- 355
leaves <- c(165,91)
Here is a fork charactarizing sensory-autonomic bifurcation point:
subtree <- extract.subtree(ppt,c(root,leaves))
plotppt(ppt,emb,tips=TRUE,forks=FALSE,cex.tree = 0.3,lwd.tree = 3,subtree=subtree)
A routine test.fork.genes
performs assessment of genes differentially expression between post-bifurcation branches:
fork.de <- test.fork.genes(ppt,fpm,root=root,leaves=leaves,n.mapping = 10)
A table fork.de
contains summary statistics of fold change effect
, p-value p
and adjusted p-value fdr
of differential expression between branches, magnitude pd1.a
(pd2.a
) and p-value pd1.p
(pd2.p
) of expression changes from first (second) derivative branch to progenitor branch:
head(fork.de[order(fork.de$p),],)
| | effect| p| fdr| st| stf| pd1.a| pd1.p| pd2.a| pd2.p| |:-------|---------:|--:|---:|---:|---:|---------:|-----:|----------:|---------:| |Neurog2 | 1.9032796| 0| 0| 0.9| 0.9| 0.1012255| 0| -0.0262510| 0.0001858| |Dll1 | 1.4404789| 0| 0| 1.0| 1.0| 0.0976255| 0| 0.0043581| 0.3941055| |Srrm4 | 0.8383684| 0| 0| 1.0| 1.0| 0.0646809| 0| 0.0099734| 0.0002647| |Mfng | 0.9532576| 0| 0| 1.0| 1.0| 0.0630761| 0| 0.0034360| 0.4424417| |Eya2 | 1.0311280| 0| 0| 0.9| 0.9| 0.0737302| 0| 0.0082446| 0.0035890| |Pcdh8 | 1.1417276| 0| 0| 1.0| 0.9| 0.0515220| 0| -0.0193619| 0.0000199|
We next consider a gene to be preferentially expressed along the first (second) branch if it has effect.b1
(effect.b2
) increased expression compared to another post-bifurcation branch and significant increase (p < 0.05) relative to progenitor branch:
fork.de <- branch.specific.genes(fork.de,effect.b1 = 0.1,effect.b2 = 0.3)
Column state
characterizes genes that are specific to first (1), second (2), or neither (0) of derivative branches.
genes.sensory <- rownames(fork.de)[fork.de$state==1]
genes.autonomic <- rownames(fork.de)[fork.de$state==2]
For consistency with the original results, we also limit genes to genes.tree
set associated with the tree:
genes.sensory <- intersect(genes.sensory,genes.tree)
str(genes.sensory)
## chr [1:98] "Rdh10" "Hes6" "Cxcr4" "Nfasc" "5730559C18Rik" "Rgs16" "Nhlh1" "Zbtb18" "Utrn" "Gamt" "Neurod4" "Upp1" ...
genes.autonomic <- intersect(genes.autonomic,genes.tree)
str(genes.autonomic)
## chr [1:122] "Serpine2" "Lrrfip1" "Cdh19" "Ralb" "Angptl1" "Pbx1" "Mpz" "Lama4" "Cnn2" "Timp3" "Ascl1" "Elk3" ...
Dynamics of gene expression is reflected in timing of activation and expression optimum. Routine activation.fork
estimates timing of optimum expression of smoothed expression and activation point as a first passage of derivative through deriv.cutoff
cutoff:
fork.de.act <- activation.fork(ppt,fork.de,fpm,root,leaves,deriv.cutoff = 0.015,n.mapping=10)
fork.de.act
table provides additional columns optimum
and activation
for genes predicted to be differentially expressed (stat
= 1 or 2).
Branch-specific sets of genes (genes.sensory
and genes.autonomic
) can now be partitioned in early and late genes based on time of expression activation. A logical solution is to orient early/late genes relative to bifurcation point. Timing of root, bifurcation point and leaves are
fork.pt(ppt,root,leaves)
| root| bifurcation| leave 1| leave 2| |----:|-----------:|--------:|--------:| | 0| 17.5719| 27.59971| 31.49521|
We use cutoff=16.0
on timing of activation to define early and late genes:
cutoff <- 16.0
genes.sensory.late <- genes.sensory[fork.de.act[genes.sensory,]$activation > cutoff]
genes.sensory.early <- setdiff(genes.sensory,genes.sensory.late)
genes.autonomic.late <- genes.autonomic[fork.de.act[genes.autonomic,]$activation > cutoff]
genes.autonomic.early <- setdiff(genes.autonomic,genes.autonomic.late)
Now we can check if early/late genes modules follow co-activation or mutually-exclusive patterns:
cells <- rownames(ppt$cell.summary)[ppt$cell.summary$seg %in% extract.subtree(ppt,c(root,leaves))$segs]
par(mfrow=c(1,2))
plot(t(programs[c(1,3),cells]),col=ppt$cell.summary[cells,]$color,pch=19,cex=0.5)
plot(t(programs[c(2,4),cells]),col=ppt$cell.summary[cells,]$color,pch=19,cex=0.5)
Co-activation of both fate-specific programs poses a question of when cell acquire bias in favor of one or another program. For that, we look for coordinated expression of each module inside more homogeneous subpopulations. First, bifurcation fork is partitioned in non-intersecting windows of wind
cells:
freq <- slide.cells(ppt,root,leaves,wind=50)
Visualization of group of cells assigned to each non-intersecting window:
fig_cells <- fig.cells(emb,freq)
marrangeGrob( c(fig_cells),ncol=length(fig_cells),nrow=1,top=NA)
Windows can be also selected manually, below we follow selection used in the paper:
regions = list( list(7,151,200,1),list(7,101,151,1),list(7,51,100,1),list(7,1,50,1),list(list(6,5,1,2),1,50, -1),list(list(6,5,1,2),51,100, -1),list(5,1,50,1),list(1,1,50,1))
freq <- slide.cells(ppt,root,leaves,wind=50,regions=regions)
fig_cells <- fig.cells(emb,freq)
marrangeGrob( c(fig_cells),ncol=length(fig_cells),nrow=1,top=NA)
Next, routine slide.cors
estimates average correlation of each early fate-specific gene with both modules (genes.sensory.early
and genes.autonomic.early
) in each window of cells:
cors <- slide.cors(freq,fpm,genes.sensory.early,genes.autonomic.early)
Now joint visualization enables tracking how genes of fate-specific modules coordinate expression during progression along pseudotime:
fig_cor <- fig.cors(cors,genes.sensory.early,genes.autonomic.early)
marrangeGrob( c(fig_cells,fig_cor),ncol=length(fig_cells),nrow=2,
layout_matrix = matrix(seq_len(2*length(fig_cells)), nrow = 2, ncol = length(fig_cells),byrow=TRUE),top=NA)
To obtain more contrasted (and reproducible with the paper) view, a set of early genes could be further cleaned up by removing fate-specific genes having low correlation with its modules around bifurcation point:
corA <- cors[[5]][,1]
genesetA <- names(which(corA[genes.sensory.early] > 0.07))
corB <- cors[[5]][,2]
genesetB <- names(which(corB[genes.autonomic.early] > 0.07))
Re-estimation average window-specific correlations for cleaned up sets of genes genesetA
and genesetB
:
cors <- slide.cors(freq,fpm,genesetA,genesetB)
fig_cor <- fig.cors(cors,genesetA,genesetB)
marrangeGrob( c(fig_cells,fig_cor),ncol=length(fig_cells),nrow=2,
layout_matrix = matrix(seq_len(2*length(fig_cells)), nrow = 2, ncol = length(fig_cells),byrow=TRUE),top=NA)
More generally, formal trends of local coordination of fate-specific modules along branching trajectories can be estimated using synchro
routine:
w=30
step=10
crd <- synchro(ppt,fpm,root,leaves,genesetA,genesetB,w,step,n.mapping=100,n.points = 300,span.smooth = 0.1,perm=FALSE)
And visualized:
visualize.synchro(crd)
In order to infer genes orchestrating emergence of local intra-module correlations, we developed an iterative approach that estimates timing of genes inclusion in a core of correlated genes of the module. Routine onset
estimates inclusion timing of early sensory genes genes.sensory.early
for vector mappings
of probabilistic cell projections:
inclusion.sensory <- onset(ppt,geneset=genes.sensory.early,nodes=c(root,leaves),expr=fpm,alp=20,w=40,step=10,n.cells=280,mappings=1:100,do.perm=FALSE,winperm=30,permut.n=10)
Average inclusion timing of each gene is
head(apply(inclusion.sensory,1,mean))
## Rdh10 Hes6 Cxcr4 5730559C18Rik Utrn Gamt
## 11.08199 12.97442 15.67575 12.01563 13.08833 15.35606
As an example, predicted cumulative probability of gene Pou4f1 inclusion is shown below:
show.gene.inclusion("Pou4f1",inclusion.sensory)
Overall summary statistics of inclusion probabilistics is visualized using show.inclusion.summary
routine with each row of the heatmap reflecting cumulative inclusion probability of a single gene (as in example above):
show.inclusion.summary(inclusion.sensory,gns.show=c("Neurog2","Pou4f1"))
Analogous inference of inclusion timing for early atuonomic module genes.autonomic.early
:
inclusion.autonomic <- onset(ppt,geneset=genes.autonomic.early,nodes=c(root,leaves),expr=fpm,alp=20,w=40,step=10,n.cells=280,mappings=1:100,do.perm=FALSE,winperm=30,permut.n=10)
show.inclusion.summary(inclusion.autonomic,gns.show = c("Mef2c","Pbx1"))
For comparison, inclusion events are not detected in a control expression matrix, where expression levels were locally permuted using option do.perm=TRUE
among winperm=10
batches of cells along pseudotime:
inclusion.sensory.control <- onset(ppt,geneset=genes.sensory.early,nodes=c(root,leaves),expr=fpm,alp=20,w=40,step=10,n.cells=280,
mappings=1:10,do.perm=TRUE,winperm=10,permut.n=10)
show.inclusion.summary(inclusion.sensory.control,gns.show=NA)
Parameter alp
in routine onset
, which was set to 20 in the analysis above, regulates stringency of inclusion point detection. Assessing summary statistics on inclusion times for a range of alp
levels for real and permutated expression profiles enables selection of optimal alp
. First, select a range of alp
levels:
alps <- seq(1,40,length.out = 10)
Then estimate mean gene inclusion time for alps
levels:
incl.real <- unlist(lapply(alps,function(alp){
cat(alp);cat('\n')
res.est <- onset(ppt,geneset=genes.sensory.early,nodes=c(root,leaves),expr=fpm,alp=alp,w=40,step=10,n.cells=280, mappings=1:10,do.perm=FALSE,winperm=10,permut.n=10)
return(mean(res.est))
}))
Output is average inclusion time for each alp level:
head(incl.real)
## [1] 8.137490 9.916677 11.244118 11.996381 12.592830 13.203767
Estimate inclusion time for locally permuted expression levels by setting do.perm=TRUE
and local permutation window winperm=10
:
incl.perm <- unlist(lapply( alps,function(alp){
cat(alp);cat('\n')
res.est <- onset(ppt,geneset=genes.sensory.early,nodes=c(root,leaves),expr=fpm,alp=alp,w=40,step=10,n.cells=280, mappings=1:10,do.perm=TRUE,winperm=30,permut.n=10)
mean(res.est)
}))
Comparison of early inclusion timing among all genes for real and control expression levels:
inclusion.stat(alps, incl.real, incl.perm)
From that one can see that if alp > 10
than overall false positive rate in permutations is close to zero, while detection sensitivity of real data gradually declines.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.