In this vignette we describe the basic usage of the DuoClustering2018 package:
how to retrieve data sets and clustering results, and how to construct various
plots summarizing the performance of different methods across several data sets.
suppressPackageStartupMessages({
  library(ExperimentHub)
  library(SingleCellExperiment)
  library(DuoClustering2018)
  library(plyr)
})
The clustering evaluation [@Duo2018-F1000] is based on 12 data sets (9 real and
3 simulated), which are all provided via ExperimentHub and retrievable via
this package. We include the full data sets (after quality filtering of cells
and removal of genes with zero counts across all cells) as well as three
filtered versions of each data set (by expression, variability and dropout
pattern, respectively), each containing 10% of the genes in the full data set.
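As a rough illustration of what a 10% gene filter does, the sketch below keeps the 10% of genes with the highest mean count in a simulated toy matrix. It is not the procedure used to build the filtered data sets in the package; the matrix, the object names, and the use of the mean count as ranking criterion are all assumptions made for this example.

```r
# Toy count matrix (simulated, not from the package): 50 genes x 8 cells,
# with a different expected count for each gene.
set.seed(1)
n_genes <- 50
n_cells <- 8
gene_lambda <- seq(0.5, 25, length.out = n_genes)
counts <- matrix(
  rpois(n_genes * n_cells, lambda = rep(gene_lambda, times = n_cells)),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("cell", seq_len(n_cells)))
)

# Keep the 10% of genes with the highest mean count (assumed criterion).
n_keep <- round(0.10 * n_genes)
gene_means <- rowMeans(counts)
keep <- order(gene_means, decreasing = TRUE)[seq_len(n_keep)]
counts_expr10 <- counts[keep, , drop = FALSE]

dim(counts_expr10)
```

The filtered objects shipped with the package already have this kind of selection applied, so a step like this is only needed when preparing additional data sets.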
To get an overview, we can list all records from this package that are available
in ExperimentHub:
eh <- ExperimentHub()
query(eh, "DuoClustering2018")
The records with names starting with sce_ represent (filtered or unfiltered)
data sets (in SingleCellExperiment format). The records with names starting with
clustering_summary_ correspond to data.frame objects with clustering results
for each of the filtered data sets. Finally, the
duo_clustering_all_parameter_settings object contains the parameter settings
we used for all the clustering methods. For clustering summaries and parameter
settings, the version number (e.g., _v2) corresponds to the version of the
publication.
The records can be retrieved using their ExperimentHub ID, e.g.:
eh[["EH1500"]]
Alternatively, the shortcut functions provided by this package can be used:
sce_filteredExpr10_Koh()
For each included data set, we have applied a range of clustering methods (see
the run_clustering vignette for more details on how this was done, and how to
apply additional methods). As mentioned above, the results of these clusterings
are also available from ExperimentHub, and can be loaded either by their
ExperimentHub ID or using the provided shortcut functions, as above. For
simplicity, the results of all methods for a given data set are combined into
a single object. As an illustration, we load the clustering summaries for two
different data sets (Koh and Zhengmix4eq), each with two different gene
filterings (Expr10 and HVG10):
res <- plyr::rbind.fill(
  clustering_summary_filteredExpr10_Koh_v2(),
  clustering_summary_filteredHVG10_Koh_v2(),
  clustering_summary_filteredExpr10_Zhengmix4eq_v2(),
  clustering_summary_filteredHVG10_Zhengmix4eq_v2()
)
dim(res)
The resulting data.frame contains 10 columns:

- dataset: The name of the data set
- method: The name of the clustering method
- cell: The cell identifier
- run: The run ID (each method was run five times for each data set and number of clusters)
- k: The imposed number of clusters (for all methods except Seurat)
- resolution: The imposed resolution (only for Seurat)
- cluster: The assigned cluster label
- trueclass: The true class of the cell
- est_k: The estimated number of clusters (for methods allowing such estimation)
- elapsed: The elapsed time of the run

head(res)
For some of the plots generated below, the points will be colored according to the clustering method. We can enforce a consistent set of colors for the methods by defining a named vector of colors to use for all plots.
method_colors <- c(
  CIDR = "#332288", FlowSOM = "#6699CC", PCAHC = "#88CCEE",
  PCAKmeans = "#44AA99", pcaReduce = "#117733", RtsneKmeans = "#999933",
  Seurat = "#DDCC77", SC3svm = "#661100", SC3 = "#CC6677",
  TSCAN = "grey34", ascend = "orange", SAFE = "black",
  monocle = "red", RaceID2 = "blue"
)
Each plotting function described below returns a list of ggplot objects. These
can be plotted directly, or further modified if desired.
The plot_performance() function generates plots related to the performance of
the clustering methods. We quantify performance using the adjusted Rand Index
(ARI) [@hubert1985comparing], comparing the obtained clustering to the true
clusters. As we noted in the publication [@Duo2018-F1000], defining a true
partitioning of the cells is difficult, since they can often be grouped together
in several different, but still interpretable, ways. We refer to our paper for
more information on how the true clusters were defined for each of the data
sets.
perf <- plot_performance(res, method_colors = method_colors)
names(perf)
perf$median_ari_vs_k
perf$median_ari_heatmap_truek
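To make the performance measure concrete, the following base R sketch computes the adjusted Rand Index from the contingency table of two label vectors, following the standard formula of Hubert and Arabie. It is an illustration only and not the code used inside plot_performance(); the function name adjusted_rand_index and the toy label vectors are assumptions made for this example.

```r
# Adjusted Rand Index between two label vectors of equal length.
adjusted_rand_index <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)                       # contingency table of the labels
  n_pairs <- choose(sum(tab), 2)           # total number of cell pairs
  sum_cells <- sum(choose(tab, 2))         # pairs grouped together in both
  sum_rows <- sum(choose(rowSums(tab), 2)) # pairs grouped together in x
  sum_cols <- sum(choose(colSums(tab), 2)) # pairs grouped together in y
  expected <- sum_rows * sum_cols / n_pairs
  maximum <- (sum_rows + sum_cols) / 2
  # Degenerate case (e.g. both partitions consist of a single cluster)
  if (maximum == expected) return(1)
  (sum_cells - expected) / (maximum - expected)
}

# Toy labels (made up): the ARI ignores how the clusters are named.
truth <- c("a", "a", "b", "b", "c", "c")
relabeled <- c(2, 2, 3, 3, 1, 1)
adjusted_rand_index(relabeled, truth)

# A clustering that splits every true pair scores below zero.
adjusted_rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

An ARI of 1 means the two partitions are identical up to relabeling, while values around 0 correspond to the agreement expected by chance.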
The plot_stability() function evaluates the stability of the clustering
results from each method, with respect to random starts. Each method was run
five times on each data set (for each k), and we quantify the stability by
comparing each pair of such runs using the adjusted Rand Index.
stab <- plot_stability(res, method_colors = method_colors)
names(stab)
stab$stability_vs_k
stab$stability_heatmap_truek
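The pairwise comparison of runs described above can be sketched in base R as follows. The three toy runs, the helper name ari, and the use of the median to summarize the pairwise values are assumptions made for this example; plot_stability() may aggregate the values differently.

```r
# Compact adjusted Rand Index helper (standard contingency-table formula).
ari <- function(x, y) {
  tab <- table(x, y)
  a <- sum(choose(tab, 2))
  b <- sum(choose(rowSums(tab), 2))
  d <- sum(choose(colSums(tab), 2))
  e <- b * d / choose(sum(tab), 2)
  if ((b + d) / 2 == e) return(1)  # degenerate case
  (a - e) / ((b + d) / 2 - e)
}

# Three toy runs (made up) of one method on the same six cells.
runs <- list(
  run1 = c(1, 1, 1, 2, 2, 2),
  run2 = c(2, 2, 2, 1, 1, 1),  # same partition as run1, labels swapped
  run3 = c(1, 1, 2, 2, 2, 2)   # one cell assigned differently
)

# Compare every pair of runs.
run_pairs <- combn(names(runs), 2)
pairwise_ari <- apply(run_pairs, 2, function(p) ari(runs[[p[1]]], runs[[p[2]]]))
names(pairwise_ari) <- apply(run_pairs, 2, paste, collapse = "_vs_")
pairwise_ari

# One possible summary of the stability (assumed here: the median).
median(pairwise_ari)
```

With five runs per method, data set and k, the same approach yields ten pairwise values per combination.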
In order to evaluate the tendency of the clustering methods to favor equally
sized clusters, we calculate the Shannon entropy [@Shannon1948-cw] of each
clustering solution (based on the proportions of cells in the different
clusters) and plot this using the plot_entropy() function. Since the maximal
entropy that can be obtained depends on the number of clusters, we use
normalized entropies, defined by dividing the observed entropy by log2(k). We
also compare the entropies for the clusterings to the entropy of the true
partition for each data set.
entr <- plot_entropy(res, method_colors = method_colors)
names(entr)
entr$entropy_vs_k
entr$normentropy
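The normalized entropy described above can be sketched in a few lines of base R. The function name normalized_entropy and the toy label vectors are assumptions made for this example, and k is taken here to be the number of non-empty clusters, which may differ from how plot_entropy() handles it.

```r
# Shannon entropy (base 2) of the cluster proportions, and the same value
# divided by log2(k), where k is the number of non-empty clusters.
normalized_entropy <- function(labels) {
  counts <- table(labels)
  p <- as.numeric(counts) / sum(counts)
  p <- p[p > 0]                     # drop empty factor levels, if any
  k <- length(p)
  h <- -sum(p * log2(p))
  # With a single cluster, log2(k) is 0 and the ratio is undefined.
  normalized <- if (k > 1) h / log2(k) else NA_real_
  c(entropy = h, normalized = normalized)
}

# Equally sized clusters reach the maximal entropy log2(k).
normalized_entropy(c(1, 1, 2, 2, 3, 3))

# One dominant cluster gives a lower normalized entropy.
normalized_entropy(c(1, 1, 1, 1, 1, 2))
```

A method whose normalized entropy is consistently higher than that of the true partition tends to produce clusters that are more equal in size than the true classes.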
The plot_timing() function plots various aspects of the timing of the
different methods.
timing <- plot_timing(res, method_colors = method_colors, scaleMethod = "RtsneKmeans")
names(timing)
timing$time_normalized_by_ref
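One way to read the scaleMethod argument is that run times are expressed relative to those of a reference method. The sketch below shows that idea on a made-up table; the elapsed values, the object names, and the choice to normalize within each data set are assumptions made for this example and may not match what plot_timing() does internally.

```r
# Toy timing table (elapsed values are made up for illustration).
toy_timing <- data.frame(
  dataset = rep(c("d1", "d2"), each = 3),
  method = rep(c("RtsneKmeans", "PCAKmeans", "SC3"), times = 2),
  elapsed = c(10, 2, 50, 20, 5, 200)
)

# Elapsed time of the reference method in each data set.
ref_method <- "RtsneKmeans"
ref <- toy_timing[toy_timing$method == ref_method, c("dataset", "elapsed")]
names(ref)[names(ref) == "elapsed"] <- "ref_elapsed"

# Attach the reference time to every row and take the ratio.
toy_timing <- merge(toy_timing, ref, by = "dataset")
toy_timing$normalized <- toy_timing$elapsed / toy_timing$ref_elapsed
toy_timing
```

Values above 1 indicate methods slower than the reference on that data set, and values below 1 indicate faster ones.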
Most performance evaluations above are performed on the clustering solutions
obtained by imposing the "true" number of clusters. The plot_k_diff() function
evaluates the difference between the true number of clusters and the number of
clusters giving the best agreement with the true partition, as well as the
difference between the estimated and the true number of clusters, for the
methods that allow estimation of k.
kdiff <- plot_k_diff(res, method_colors = method_colors)
names(kdiff)
kdiff$diff_kest_ktrue
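The two differences described above can be illustrated with a small base R sketch. All values and object names below are made up for this example, and the sign convention (estimated or best k minus true k) is an assumption suggested by the plot name diff_kest_ktrue rather than something confirmed here.

```r
# True partition of a toy data set; the true k is the number of classes.
true_labels <- c("a", "a", "b", "b", "c", "c", "d", "d")
true_k <- length(unique(true_labels))

# Made-up median ARI values for one method across imposed values of k.
ari_by_k <- data.frame(
  k = 2:6,
  median_ari = c(0.40, 0.62, 0.81, 0.85, 0.70)
)

# The k giving the best agreement with the true partition, and its
# difference from the true k.
best_k <- ari_by_k$k[which.max(ari_by_k$median_ari)]
diff_best_true <- best_k - true_k

# For methods that estimate k, the analogous difference for the estimate.
est_k <- 3
diff_est_true <- est_k - true_k

c(true_k = true_k, best_k = best_k,
  diff_best_true = diff_best_true, diff_est_true = diff_est_true)
```

Under this convention, positive values mean the method favors or estimates more clusters than the true partition contains, and negative values mean fewer.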
sessionInfo()