```{css, echo=FALSE} pre code { white-space: pre !important; overflow-x: scroll !important; word-break: keep-all !important; word-wrap: initial !important; }
```r knitr::opts_chunk$set(collapse = TRUE, comment = "#>", echo=TRUE, message=FALSE, warnings=FALSE )
A wealth of clustering algorithms are available for Single-cell RNA sequencing (scRNA-seq), but it remains challenging to compare and characterize the features across different scales of resolution. To resolve this challenge Multi-resolution Reconciled Tree (MRtree), builds a hierarchical tree structure based on multi-resolution partitions that are highly flexible and can work with most scRNA-seq clustering algorithms. This tutorial will show the usage of the MRtree R package and how to couple it with different clustering methods.
The main function to perform MRtree analysis is mrtree()
, where the data input is one of the following:
prefix
, and suffix, suffix
,. For example, by specifying prefix=’K_‘
, columns in the data frame starting with K_
will be considered.RNA_snn_res.
as the output of Seurat Louvain clustering. Need to modify the prefix if using a different assay and different neighbor graph or method.sc3_
and suffix _clusters
as output from SC3 clustering.sc_clustering_methods.R
file provide a few clustering wrappers (note that some of the dependent packages are not imported in package installation, to avoid redundant dependencies, so make sure the respective clustering package is installed before the analysis, see instructions below), including:
sc_clustering.seurat
, graph clustering provided by Seurat package (@RN226). Make sure to install Seurat
and its dependent R packages (R >= 4.0.0).sc_clustering.sc3
, Consensus clustering provided by SC3 package (@RN33). Make sure to install SC3
and SingleCellExperiment
R packages from Bioconductor (R >= 3.3)sc_clustering.soup
, Semi-soft clustering provided in SOUP package (@RN195). Make sure to install the SOUP
package following the instructions on \url{https://github.com/lingxuez/SOUPR} (R >= 3.4.2). sc_clustering.umap_kmeans
, UMAP dimension reduction followed by K-means clustering in the low dimensional space. Make sure to install uwot
package for performing UMAP and ADPclust
for selecting the number of clusters for comparison (R ≥ 3.0.0).sc_clustering.simlr
, Single-cell Interpretation via Multi-kernel LeaRning (SIMLR) (@wang2018simlr). Make sure to install SIMLR
package from Bioconductor (R >= 4.0.0)To use other clustering methods, we first generate the clusterings at multiple resolutions. Save the results into a label matrix, where rows are samples and columns are clusterings.
The output of the main function mrtree
contains:
labelmat.mrtree
The reconciled tree represented using the label matrix, where the first and last column corresponds to the top and bottom layer in the tree. The repeated columns are removed to leave the unique layers of the tree.labelmat.recon
The reconciled tree is stored in a label matrix, where the columns match with the columns from the raw input label matrix. This enables matching columns between initial raw clustering and final clusters in MRtree-constructed tree to calculate the stability metric.labelmat.flat
The initial flat clustering label matrix as input.labelmat.ccs
The results from with-resolution concensus clustering if performed ('concensus=TRUE').resolutions
Resolution parameters if provided by the input object.paths
The remaining tree paths stored as a label matrix.params
Store the parameters to reproduce the results.In this tutorial, we illustrate the usage of MRtree functions using a simulated dataset. The data is simulated from SymSim by supplying a tree structure as shown below, containing UMI counts of 500 cells on 500 genes. If you want generate the simulated data using the simulation functions, make sure to install SymSim and the dependent packages listed on \url{https://github.com/YosefLab/SymSim}. Otherwise you can use the saved data example data_example
.
library(mrtree)
# # The data simulation can be performed using SymSim package with the following wrapper function, # # The following simulation code take some time to run (skip to load data with data("data_example")) # tree = tree1(plot=T)$phylo.tree # symsim.param.zeisel = list(gene_effects_sd=2, sigma=0.2, scale_s=0.4, alpha_mean=0.5, alpha_sd=0.025, # depth_mean=2e5,depth_sd=1.5e4, nPCR1=14, prop_hge=0.01, mean_hge=3) # ngenes = 500 # ncells = 500 # truek = 8; min_popsize = floor(ncells/truek) # i_minpop = which(tree$tip.label=='Microglia') # # # seed = 42 # simu.out = generateDataSymSim(ncells=ncells, ngenes=ngenes, tree=tree, # params=symsim.param.zeisel, plot.tsne=T, # i_minpop=i_minpop, min_popsize=min_popsize, # nevf=150, n_de_evf=40, sigma=0.9, seed=seed) # # simu.out$tsne_UMI_counts # dat = simu.out # # data_example = simu.out; usethis::use_data(data_example) data("data_example", package = 'mrtree') # load data dat = data_example # tsne plot of 500 cells dat$tsne_UMI_counts
As the first step, we conduct multi-resolution flat clustering using Seurat with a range of resolution parameters. We recommend choosing the maximum resolution (gamma.max
) between $2$ and $3$. The initial multi-resolution clustering results can be visualized using a cluster tree, where each layer corresponds to one clustering, the tree nodes represent clusters, and edges between clusters in adjacent layers indicate shared data points between two clusters.
set.seed(42) metadata = dat$metadata rownames(metadata) = dat$metadata$cellid ref.labels = dat$metadata$type # specify the resolution parameters # resolutions = seq(0.1, sqrt(3), 0.05)^2 # alternatively and preferrably, we provide a sampling tool to # sample resolution parameters to uniformly cover different scales A = seurat_get_nn_graph(counts=dat$counts, metadata=metadata, npc=10) resolutions = modularity_event_sampling(A=A, n.res=30, gamma.min=0.01, gamma.max=2.5) # sample based on the similarity matrix # clustering using Suerat seurat.out = sc_clustering.seurat(counts=dat$counts, resolutions=resolutions, metadata=metadata, npcs=10, return.seurat.object=TRUE, vars.to.regress=NULL, find.variable.features=FALSE,verbose=FALSE) # initial cluster tree from Seurat flat clustering plot_clustree(labelmat=seurat.out$seurat.clusters, prefix ='RNA_snn_res.', ref.labels = ref.labels, plot.ref = FALSE)
Then we apply MRtree to obtain the hierarchical cluster tree. The resulted tree structure can be visualized using a dendrogram, with a pie chart on each tree node detailing the cluster composition given the known true labels.
out = mrtree(seurat.out$obj, n.cores = 2, consensus=FALSE, augment.path=FALSE) # if there are few partitions per k, within resolution consensus step can speed up the algorithm # weight per sample is encoraged if the classes are imbalanced plot_tree(labelmat=out$labelmat.mrtree, ref.labels=ref.labels, plot.piechart = TRUE, node.size = 0.4, tip.label.dist = 10, bottom.margin=30 )
We evaluate the per-resolution clustering performance with a novel metric, Adjusted Multiresolution Rand Index (AMRI). The metric is adapted from Adjusted Rand Index (ARI) to address the problem of ARI biasing for higher resolution.
ks.flat = apply(out$labelmat.flat, 2, FUN=function(x) length(unique(x))) ks.mrtree = apply(out$labelmat.mrtree, 2, FUN=function(x) length(unique(x))) amri.flat = sapply(1:ncol(out$labelmat.flat), function(i) AMRI(out$labelmat.flat[,i], ref.labels)$amri) amri.flat = aggregate(amri.flat, by=list(k=ks.flat), FUN=mean) amri.recon = sapply(1:ncol(out$labelmat.mrtree), function(i) AMRI(out$labelmat.mrtree[,i], ref.labels)$amri) df = rbind(data.frame(k=amri.flat$k, amri=amri.flat$x, method='Seurat flat'), data.frame(k=ks.mrtree, amri=amri.recon, method='MRtree')) ggplot2::ggplot(data=df, aes(x=k, y=amri, color=method)) + geom_line() + theme_bw()
We calculate the similarity between the initial flat clustering and MRtree clusters across resolution on the x-axis. Lower similarity indicates the selected clustering algorithm is not able to generate stable clusters at the given resolution. In this example, the changing point is observed at $k=7$, while the stability drops steeply when $k>8$.
stab.out = stability_plot(out) stab.out$plot
Next, we apply MRtree together with SC3, which is itself a consensus method, despite that a user-provided resolution parameter is still required.
ks = 2:10 metadata = dat$metadata rownames(metadata) = dat$metadata$cellid ref.labels = dat$metadata$type set.seed(1) clust.out = sc_clustering.sc3(exprs=dat$counts, Ks=ks, type = 'counts', colData=metadata, biology = F, gene_filter = F, n_cores=2, pct_dropout_min=-1)
Plot the initial clustering reults from SC3 via clustree:
plot_clustree(labelmat=clust.out$sce@colData, prefix='sc3_', suffix = "_clusters", ref.labels = ref.labels, plot.ref = F)
Run MRtree and plot the resulted hierarchical cluster tree:
out = mrtree(clust.out$sce, n.cores = 2, prefix='sc3_', suffix = "_clusters", consensus=F, sample.weighted=F) plot_tree(labelmat=out$labelmat.mrtree, ref.labels=ref.labels, plot.piechart = T, node.size = 0.4, tip.label.dist = 10, bottom.margin=30 )
We evaluate the per-resolution clustering performance via AMRI:
ks.flat = apply(out$labelmat.flat, 2, FUN=function(x) length(unique(x))) ks.mrtree = apply(out$labelmat.mrtree, 2, FUN=function(x) length(unique(x))) amri.flat = sapply(1:ncol(out$labelmat.flat), function(i) AMRI(out$labelmat.flat[,i], ref.labels)$amri) amri.flat = aggregate(amri.flat, by=list(k=ks.flat), FUN=mean) amri.recon = sapply(1:ncol(out$labelmat.mrtree), function(i) AMRI(out$labelmat.mrtree[,i], ref.labels)$amri) df = rbind(data.frame(k=amri.flat$k, amri=amri.flat$x, method='SC3 flat'), data.frame(k=ks.mrtree, amri=amri.recon, method='MRtree')) ggplot2::ggplot(data=df, aes(x=k, y=amri, color=method)) + geom_line() + theme_bw()
We calculate the similarity between the initial flat clustering and MRtree clusters across scales.
stab.out = stability_plot(out) stab.out$plot
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.