Introduction

This vignette provides a quick guide to using the Wind package to compute the weighted normalized mutual information (wNMI) and the weighted Rand index (wRI). Both metrics evaluate a clustering result by comparing it with a reference whose class labels have a hierarchical structure.

The motivating example comes from single-cell RNA sequencing (scRNA-seq), but the metrics apply to any situation in which the true class labels in the reference have a hierarchical structure. For example, the subjects being clustered could be animals, plants, or movies, with reference labels given by breed, species or cultivar, and genre, respectively.

Background

Cell clustering is one of the most common and routinely performed steps in scRNA-seq analysis. A number of clustering methods are tailored specifically to scRNA-seq data. These methods usually partition the cells into several groups, with each group representing a cell type or subtype. To evaluate the performance of a clustering method, the common practice is to compare the clustering result with reference labels, where the reference is obtained from another source with high confidence. The most widely used measures of agreement between the clustering and the reference labels are the Adjusted Rand Index (ARI) and the Normalized Mutual Information (NMI). These metrics assume that the groups are completely exchangeable, and so overlook an important characteristic of single-cell data: the true cluster structure of a cell population is often hierarchical. Failing to take this hierarchy into account when evaluating clustering results leads to assessments that do not accurately reflect a method's ability to group cells.
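To make the exchangeability assumption concrete, the base-R sketch below computes the plain (unadjusted) Rand index and one common NMI variant (normalized by the mean of the two entropies) on toy labels. The function names rand_index and nmi, the class names, and both toy clusterings are illustrative assumptions and are not part of the Wind package. One clustering merges two T-cell subtypes, the other merges B-cells with a T-cell subtype, yet the unweighted metrics score them identically.

```r
# Plain Rand index: fraction of item pairs on which two partitions agree.
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")        # TRUE where a pair shares a label in a
  same_b <- outer(b, b, "==")        # TRUE where a pair shares a label in b
  upper <- upper.tri(same_a)         # count each unordered pair once
  mean(same_a[upper] == same_b[upper])
}

# Shannon entropy of a probability vector, ignoring empty categories.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# NMI normalized by the mean of the two marginal entropies
# (other normalizations exist; this choice is an assumption of the sketch).
nmi <- function(a, b) {
  p_ab <- unclass(table(a, b)) / length(a)   # joint distribution
  p_a <- rowSums(p_ab)
  p_b <- colSums(p_ab)
  nz <- p_ab > 0
  mi <- sum(p_ab[nz] * log(p_ab[nz] / outer(p_a, p_b)[nz]))
  2 * mi / (entropy(p_a) + entropy(p_b))
}

# Toy labels (hypothetical): one B-cell class and two T-cell subtypes.
truth    <- c("B", "B", "Th", "Th", "Tm", "Tm")
clusterA <- c(1, 1, 2, 2, 2, 2)   # merges the two T-cell subtypes
clusterB <- c(1, 1, 1, 1, 2, 2)   # merges B-cells with one T-cell subtype

ri_A  <- rand_index(truth, clusterA)
ri_B  <- rand_index(truth, clusterB)
nmi_A <- nmi(truth, clusterA)
nmi_B <- nmi(truth, clusterB)
```

Here ri_A equals ri_B and nmi_A equals nmi_B, even though merging two T-cell subtypes is arguably a milder error than merging B-cells with T-cells.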

This package provides functions to compute two new metrics for the evaluation of scRNA-seq clustering results: the weighted Rand index (wRI) and the weighted normalized mutual information (wNMI). The general idea is to obtain weights from the cell type hierarchy and to use those weights in the RI and mutual information calculations, so that correct classifications are rewarded and incorrect ones are penalized.
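As a rough illustration of how hierarchy-derived weights can change a pair-counting score, the sketch below gives partial credit when two cells from different but related classes are clustered together. This is a deliberately simplified toy scheme, not the formula implemented by wRI; the function toy_weighted_ri, the similarity matrix sim, and the toy labels are all hypothetical.

```r
# Toy pair-weighted Rand-style score (illustrative only, not Wind's wRI).
# sim: square matrix of class similarities in [0, 1], with class names
#      as row and column names.
toy_weighted_ri <- function(truth, cluster, sim) {
  pairs <- combn(length(truth), 2)          # all unordered pairs of items
  i <- pairs[1, ]
  j <- pairs[2, ]
  same_truth <- truth[i] == truth[j]
  same_clust <- cluster[i] == cluster[j]
  # Full credit when truth and clustering agree on a pair, none otherwise.
  credit <- ifelse(same_truth == same_clust, 1, 0)
  # Partial credit for pairs from different true classes that were merged:
  # the credit is the similarity between the two classes.
  merged <- !same_truth & same_clust
  credit[merged] <- sim[cbind(truth[i][merged], truth[j][merged])]
  mean(credit)
}

# Hypothetical classes: the two T-cell subtypes are similar, B-cells are not.
classes <- c("B", "Th", "Tm")
sim <- diag(3)
dimnames(sim) <- list(classes, classes)
sim["Th", "Tm"] <- sim["Tm", "Th"] <- 0.8
sim["B", "Th"]  <- sim["Th", "B"]  <- 0.1
sim["B", "Tm"]  <- sim["Tm", "B"]  <- 0.1

truth    <- c("B", "B", "Th", "Th", "Tm", "Tm")
clusterA <- c(1, 1, 2, 2, 2, 2)   # merges the two T-cell subtypes
clusterB <- c(1, 1, 1, 1, 2, 2)   # merges B-cells with one T-cell subtype

score_A <- toy_weighted_ri(truth, clusterA, sim)
score_B <- toy_weighted_ri(truth, clusterB, sim)
```

Under this toy scheme score_A exceeds score_B, because merging closely related classes costs less than merging distant ones. With all off-diagonal similarities set to zero, the score reduces to the plain Rand index.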

Citation

Instruction

Quick start

Computation of wRI and wNMI requires three inputs: an expression matrix, the ground-truth class labels, and a clustering result.

The example below uses Y for the expression matrix, trueclass for the ground truth, and clusterRes for the clustering result.
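Before calling the package functions, it can help to confirm that the three inputs line up. The helper below is a hypothetical sanity check, not part of Wind. It assumes that the columns of Y are cells; if your matrix stores cells in rows, transpose it first.

```r
# Hypothetical helper (not part of Wind): basic consistency checks.
# Assumption: columns of Y correspond to cells.
check_inputs <- function(Y, trueclass, clusterRes) {
  stopifnot(length(dim(Y)) == 2)             # Y must be matrix-like
  n_cells <- ncol(Y)
  stopifnot(
    length(trueclass) == n_cells,            # one true label per cell
    length(clusterRes) == n_cells,           # one cluster label per cell
    !anyNA(trueclass),
    !anyNA(clusterRes)
  )
  # Return the contingency table of true classes versus clusters.
  table(truth = trueclass, cluster = clusterRes)
}

# Toy inputs (made up for illustration): 5 genes, 6 cells.
Y_toy          <- matrix(0, nrow = 5, ncol = 6)
trueclass_toy  <- c("a", "a", "b", "b", "c", "c")
clusterRes_toy <- c(1, 1, 2, 2, 2, 2)
tab <- check_inputs(Y_toy, trueclass_toy, clusterRes_toy)
```

The returned contingency table gives a quick view of which true classes each cluster absorbs.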

Compute weighted normalized mutual information (wNMI)

Computation of wNMI is done in two steps: first construct the cell type hierarchy, then compute wNMI.

ctStruct = createRef(Y, trueclass)
this_wNMI = wNMI(ctStruct, trueclass, clusterRes)

Compute weighted Rand index (wRI)

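Computation of wRI is also done in two steps: first compute the weights, then compute wRI.

weights = createWeights(Y, trueclass)
# Pass the weight matrices to wRI; without them the call returns the unweighted RI
this_wRI = wRI(trueclass, clusterRes, weights$W0, weights$W1)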

A real-life example with a PBMC dataset

We first load an example dataset distributed with the package. The data were generated with the 10x Genomics GemCode protocol to profile the transcriptome of eight pre-sorted cell types (B-cells, naive cytotoxic T-cells, CD14 monocytes, regulatory T-cells, CD56 NK cells, memory T-cells, CD4 T-helper cells and naive T-cells) in peripheral blood mononuclear cells (PBMC). The original data contain more than 3000 cells. We randomly sampled 500 cells from the original data and use them for demonstration.

The dataset contains:

In this example, we want to evaluate the clustering results for five methods, and compare the evaluations from the weighted and traditional unweighted NMI and RI.

library(Wind)
data(Zhengmix8eq)

Use weighted normalized mutual information (wNMI)

  1. The first step is to create a reference hierarchical tree. The figure below shows the hierarchical structure of the cell types.
ctStruct = createRef(Y, trueclass)
plot(ctStruct$hc, xlab="", axes=FALSE, ylab="", ann=FALSE)
  2. Next we compute wNMI and compare it with NMI. The results show that all methods score higher under wNMI than under NMI. The gains differ across methods: CIDR and TSCAN show more substantial gains under wNMI, indicating that their performance is not as poor as traditional NMI suggests.
methods = names(clusterRes)
allNMI = matrix(0, nrow=length(methods), ncol=2)
rownames(allNMI) = methods
colnames(allNMI) = c("NMI", "wNMI")
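for(i in seq_along(clusterRes)) {
  # Fourth argument FALSE gives the unweighted score stored in the "NMI" column
  allNMI[i,1] = wNMI(ctStruct, trueclass, clusterRes[[i]], FALSE)
  # Default call gives the weighted score stored in the "wNMI" column
  allNMI[i,2] = wNMI(ctStruct, trueclass, clusterRes[[i]])
}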
barplot(t(allNMI), beside=TRUE, ylim=c(0.4,1.05), 
        legend.text=TRUE, xpd=FALSE)
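For intuition about what a reference tree over cell types captures, the sketch below builds one on simulated data by averaging expression within each true class and clustering the class means with hclust. This is one plausible construction under stated assumptions, not necessarily the algorithm used by createRef. The simulated matrix Y_toy (genes in rows, cells in columns), the class names, and all parameter values are made up for illustration.

```r
set.seed(1)
n_genes <- 20
classes <- c("B", "Th", "Tm")
trueclass_toy <- rep(classes, each = 10)     # 10 simulated cells per class

# Class-level mean profiles: Th and Tm stay close to a shared T-cell
# profile, while B deviates strongly from it.
base_T <- rnorm(n_genes)
centers <- cbind(
  B  = base_T + rnorm(n_genes, sd = 3),
  Th = base_T + rnorm(n_genes, sd = 0.3),
  Tm = base_T + rnorm(n_genes, sd = 0.3)
)

# Simulated expression matrix: each cell is its class center plus noise.
Y_toy <- centers[, trueclass_toy] +
  matrix(rnorm(n_genes * length(trueclass_toy), sd = 0.5), nrow = n_genes)

# Average expression within each true class (genes x classes).
class_means <- sapply(classes, function(k) {
  rowMeans(Y_toy[, trueclass_toy == k, drop = FALSE])
})

# Hierarchical clustering of the class-level profiles.
hc <- hclust(dist(t(class_means)), method = "average")

# Cophenetic distances: the tree height at which two classes first join.
coph <- as.matrix(cophenetic(hc))
```

In this simulation the two T-cell subtypes join low in the tree and B-cells join last, so coph["Th", "Tm"] is smaller than coph["B", "Th"]. Calling plot(hc) draws the toy tree in the same way as the dendrogram in step 1.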

Use weighted Rand index (wRI)

  1. The first step is to create weight matrices for correct/incorrect classification.
weights = createWeights(Y, trueclass)
  2. Next we compute wRI and compare it with RI.
allRI = matrix(0, nrow=length(methods), ncol=6)
rownames(allRI) = methods
colnames(allRI) = c("RI", "NI1","NI2","wRI","wNI1","wNI2")
for(i in seq_along(clusterRes)) {
  # Unweighted RI, NI1, NI2
  allRI[i,1:3] = wRI(trueclass, clusterRes[[i]])[1:3]
  # Weighted versions, using the weight matrices W0 and W1
  allRI[i,4:6] = wRI(trueclass, clusterRes[[i]],
                     weights$W0, weights$W1)[1:3]
}
barplot(t(allRI[,c(1,4)]), beside=TRUE, ylim=c(0.7,1.05), 
        legend.text=TRUE, xpd=FALSE)

Session Info

sessionInfo()


haowulab/Wind documentation built on Nov. 4, 2019, 1:27 p.m.