```{css, echo=FALSE} pre code { white-space: pre !important; overflow-x: scroll !important; word-break: keep-all !important; word-wrap: initial !important; }

```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      echo=TRUE, message=FALSE, warnings=FALSE
                      )

Introduction

A wealth of clustering algorithms are available for Single-cell RNA sequencing (scRNA-seq), but it remains challenging to compare and characterize the features across different scales of resolution. To resolve this challenge Multi-resolution Reconciled Tree (MRtree), builds a hierarchical tree structure based on multi-resolution partitions that are highly flexible and can work with most scRNA-seq clustering algorithms. This tutorial will show the usage of the MRtree R package and how to couple it with different clustering methods.

Usage

The main function to perform MRtree analysis is mrtree(), where the data input is one of the following:

sc_clustering_methods.R file provide a few clustering wrappers (note that some of the dependent packages are not imported in package installation, to avoid redundant dependencies, so make sure the respective clustering package is installed before the analysis, see instructions below), including:

To use other clustering methods, we first generate the clusterings at multiple resolutions. Save the results into a label matrix, where rows are samples and columns are clusterings.

The output of the main function mrtree contains:

Example with simulated data

In this tutorial, we illustrate the usage of MRtree functions using a simulated dataset. The data is simulated from SymSim by supplying a tree structure as shown below, containing UMI counts of 500 cells on 500 genes. If you want generate the simulated data using the simulation functions, make sure to install SymSim and the dependent packages listed on \url{https://github.com/YosefLab/SymSim}. Otherwise you can use the saved data example data_example. Tree structure used for simulating single cell data using SymSim packages

library(mrtree)
# # The data simulation can be performed using SymSim package with the following wrapper function,
# # The following simulation code take some time to run (skip to load data with data("data_example"))
# tree = tree1(plot=T)$phylo.tree
# symsim.param.zeisel = list(gene_effects_sd=2, sigma=0.2, scale_s=0.4, alpha_mean=0.5, alpha_sd=0.025,
#                            depth_mean=2e5,depth_sd=1.5e4, nPCR1=14, prop_hge=0.01, mean_hge=3)
# ngenes = 500
# ncells = 500
# truek = 8; min_popsize = floor(ncells/truek)
# i_minpop =  which(tree$tip.label=='Microglia')
# 
# 
# seed =  42
# simu.out = generateDataSymSim(ncells=ncells, ngenes=ngenes, tree=tree,
#                               params=symsim.param.zeisel, plot.tsne=T,
#                               i_minpop=i_minpop, min_popsize=min_popsize,
#                               nevf=150, n_de_evf=40, sigma=0.9, seed=seed)
# 
# simu.out$tsne_UMI_counts
# dat = simu.out
# # data_example = simu.out; usethis::use_data(data_example)

data("data_example", package = 'mrtree') # load data
dat = data_example
# tsne plot of 500 cells
dat$tsne_UMI_counts

MRtree with Seurat clustering

As the first step, we conduct multi-resolution flat clustering using Seurat with a range of resolution parameters. We recommend choosing the maximum resolution (gamma.max) between $2$ and $3$. The initial multi-resolution clustering results can be visualized using a cluster tree, where each layer corresponds to one clustering, the tree nodes represent clusters, and edges between clusters in adjacent layers indicate shared data points between two clusters.

set.seed(42)

metadata = dat$metadata
rownames(metadata) = dat$metadata$cellid
ref.labels = dat$metadata$type

# specify the resolution parameters
# resolutions = seq(0.1, sqrt(3), 0.05)^2

# alternatively and preferrably, we provide a sampling tool to 
# sample resolution parameters to uniformly cover different scales
A = seurat_get_nn_graph(counts=dat$counts, metadata=metadata, npc=10)
resolutions = modularity_event_sampling(A=A, n.res=30, 
                                        gamma.min=0.01, gamma.max=2.5) # sample based on the similarity matrix

# clustering using Suerat 
seurat.out = sc_clustering.seurat(counts=dat$counts, resolutions=resolutions, 
                                  metadata=metadata, npcs=10,
                                  return.seurat.object=TRUE, vars.to.regress=NULL,
                                  find.variable.features=FALSE,verbose=FALSE)

# initial cluster tree from Seurat flat clustering
plot_clustree(labelmat=seurat.out$seurat.clusters, prefix ='RNA_snn_res.', 
              ref.labels = ref.labels, plot.ref = FALSE)

Then we apply MRtree to obtain the hierarchical cluster tree. The resulted tree structure can be visualized using a dendrogram, with a pie chart on each tree node detailing the cluster composition given the known true labels.

out = mrtree(seurat.out$obj, n.cores = 2, consensus=FALSE, augment.path=FALSE)
# if there are few partitions per k, within resolution consensus step can speed up the algorithm
# weight per sample is encoraged if the classes are imbalanced

plot_tree(labelmat=out$labelmat.mrtree, ref.labels=ref.labels, plot.piechart = TRUE,
          node.size = 0.4, tip.label.dist = 10, bottom.margin=30 )

We evaluate the per-resolution clustering performance with a novel metric, Adjusted Multiresolution Rand Index (AMRI). The metric is adapted from Adjusted Rand Index (ARI) to address the problem of ARI biasing for higher resolution.

ks.flat = apply(out$labelmat.flat, 2, FUN=function(x) length(unique(x)))
ks.mrtree = apply(out$labelmat.mrtree, 2, FUN=function(x) length(unique(x)))
amri.flat = sapply(1:ncol(out$labelmat.flat), function(i) AMRI(out$labelmat.flat[,i], ref.labels)$amri)
amri.flat = aggregate(amri.flat, by=list(k=ks.flat), FUN=mean)
amri.recon = sapply(1:ncol(out$labelmat.mrtree), function(i) AMRI(out$labelmat.mrtree[,i], ref.labels)$amri)

df = rbind(data.frame(k=amri.flat$k, amri=amri.flat$x, method='Seurat flat'), 
           data.frame(k=ks.mrtree, amri=amri.recon, method='MRtree'))
ggplot2::ggplot(data=df, aes(x=k, y=amri, color=method)) + geom_line() + theme_bw()

We calculate the similarity between the initial flat clustering and MRtree clusters across resolution on the x-axis. Lower similarity indicates the selected clustering algorithm is not able to generate stable clusters at the given resolution. In this example, the changing point is observed at $k=7$, while the stability drops steeply when $k>8$.

stab.out = stability_plot(out)
stab.out$plot

MRtree with SC3 clustering

Next, we apply MRtree together with SC3, which is itself a consensus method, despite that a user-provided resolution parameter is still required.

ks = 2:10
metadata = dat$metadata
rownames(metadata) = dat$metadata$cellid
ref.labels = dat$metadata$type

set.seed(1)
clust.out =  sc_clustering.sc3(exprs=dat$counts, Ks=ks, type = 'counts', colData=metadata,
                               biology = F, gene_filter = F,  n_cores=2,  pct_dropout_min=-1)

Plot the initial clustering reults from SC3 via clustree:

plot_clustree(labelmat=clust.out$sce@colData, prefix='sc3_', suffix = "_clusters",
              ref.labels = ref.labels, plot.ref = F)

Run MRtree and plot the resulted hierarchical cluster tree:

out = mrtree(clust.out$sce, n.cores = 2, prefix='sc3_', suffix = "_clusters", 
             consensus=F, sample.weighted=F)

plot_tree(labelmat=out$labelmat.mrtree, ref.labels=ref.labels, plot.piechart = T,
          node.size = 0.4, tip.label.dist = 10, bottom.margin=30 )

We evaluate the per-resolution clustering performance via AMRI:

ks.flat = apply(out$labelmat.flat, 2, FUN=function(x) length(unique(x)))
ks.mrtree = apply(out$labelmat.mrtree, 2, FUN=function(x) length(unique(x)))
amri.flat = sapply(1:ncol(out$labelmat.flat), function(i) AMRI(out$labelmat.flat[,i], ref.labels)$amri)
amri.flat = aggregate(amri.flat, by=list(k=ks.flat), FUN=mean)
amri.recon = sapply(1:ncol(out$labelmat.mrtree), function(i) AMRI(out$labelmat.mrtree[,i], ref.labels)$amri)

df = rbind(data.frame(k=amri.flat$k, amri=amri.flat$x, method='SC3 flat'), 
           data.frame(k=ks.mrtree, amri=amri.recon, method='MRtree'))
ggplot2::ggplot(data=df, aes(x=k, y=amri, color=method)) + geom_line() + theme_bw()

We calculate the similarity between the initial flat clustering and MRtree clusters across scales.

stab.out = stability_plot(out)
stab.out$plot

Reference



pengminshi/MRtree documentation built on March 6, 2023, 4:20 p.m.