neighborPurity: Compute neighborhood purity
In bluster: Clustering Algorithms for Bioconductor


Use a hypersphere-based approach to compute the “purity” of each cluster based on the number of contaminating observations from different clusters in its neighborhood.

neighborPurity(
  x,
  clusters,
  k = 50,
  weighted = TRUE,
  BNPARAM = KmknnParam(),
  BPPARAM = SerialParam()
)

`x`	A numeric matrix-like object containing observations in rows and variables in columns.
`clusters`	Vector of length equal to `nrow(x)`, specifying the cluster assigned to each observation.
`k`	Integer scalar specifying the number of nearest neighbors to use to determine the radius of the hyperspheres.
`weighted`	A logical scalar indicating whether to weight each observation in inverse proportion to the size of its cluster. Alternatively, a numeric vector of the same length as `clusters` containing the weight to use for each observation.
`BNPARAM`	A BiocNeighborParam object specifying the nearest neighbor algorithm. This should be an algorithm supported by `findNeighbors`.
`BPPARAM`	A BiocParallelParam object indicating whether and how parallelization should be performed across observations.

The purity of a cluster is quantified by creating a hypersphere around each observation in the cluster and computing the proportion of observations in that hypersphere from the same cluster. If all observations in a cluster have proportions close to 1, this indicates that the cluster is highly pure, i.e., there are few observations from other clusters in its region of the coordinate space. The distribution of purities for each cluster can be used as a measure of separation from other clusters.
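
The calculation described above can be sketched in base R. This is an illustration only, not the package's implementation: the simulated data are made up, and the choice of a single shared radius (the median distance to the `k`-th nearest neighbor) and the inclusion of each observation in its own hypersphere are assumptions, since the text does not say exactly how the radius is derived from `k`.

```r
# Sketch only: a base-R illustration of hypersphere-based purity.
set.seed(42)

# Two well-separated clusters of 50 observations each, in 2 dimensions.
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)
clusters <- rep(c("A", "B"), each = 50)
k <- 10

d <- as.matrix(dist(x))  # pairwise Euclidean distances

# Assumption: one shared radius, the median distance to the k-th neighbor.
# sort(row)[k + 1] skips the zero distance from an observation to itself.
radius <- median(apply(d, 1, function(row) sort(row)[k + 1]))

# Purity: proportion of observations inside the hypersphere (the observation
# itself included) that carry the same cluster label.
purity <- vapply(seq_len(nrow(x)), function(i) {
  inside <- d[i, ] <= radius
  mean(clusters[inside] == clusters[i])
}, numeric(1))

# Distribution of purities per cluster.
lapply(split(purity, clusters), summary)
```

With clusters this far apart, the purities should sit at or very near 1 for both clusters.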

In most cases, the majority of observations in a cluster will have high purities, corresponding to observations close to the cluster center. A fraction of observations will have low values because they lie at the boundary between two adjacent clusters. A high degree of over-clustering will manifest as a majority of observations with purities close to zero. The `maximum` field in the output identifies the cluster with the greatest presence in an observation's neighborhood; for observations lying on a boundary, this is usually an adjacent cluster.
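
One way to inspect boundary observations is to cross-tabulate the assigned cluster against the `maximum` field. The sketch below assumes the bluster package is installed and is skipped otherwise; the simulated data and the 0.5 cutoff are arbitrary choices for illustration.

```r
# Sketch assuming the bluster package is installed; skipped otherwise.
if (requireNamespace("bluster", quietly = TRUE)) {
  set.seed(100)

  # Three simulated clusters with overlapping boundaries (illustrative data).
  centers <- matrix(rnorm(30), nrow = 3)
  clusters <- sample(1:3, 1000, replace = TRUE)
  y <- centers[clusters, , drop = FALSE] + rnorm(1000 * 10)

  out <- bluster::neighborPurity(y, clusters)

  # Rows: assigned cluster. Columns: cluster dominating the neighborhood.
  # Off-diagonal counts are observations whose neighborhood is dominated
  # by a different cluster.
  print(table(assigned = clusters, maximum = out$maximum))

  # Fraction of low-purity observations per cluster (0.5 is arbitrary).
  print(tapply(out$purity < 0.5, clusters, mean))
}
```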

The value of `k` is used only to determine an appropriate hypersphere radius. We use hyperspheres because a fixed radius is robust to changes in density throughout the coordinate space, in contrast to computing purity as the proportion of the `k` nearest neighbors in the same cluster. The latter fails most obviously when a cluster contains fewer than `k` observations, as some of its neighbors must then come from other clusters.
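
The failure of the nearest-neighbor proportion for small clusters can be reproduced in base R. This is a sketch with made-up data; the fixed-radius comparison again assumes a shared radius equal to the median distance to the `k`-th neighbor, which is not confirmed by the text above.

```r
# Sketch only: illustrative data and hand-rolled neighbor searches.
set.seed(1)
big   <- matrix(rnorm(400, mean = 0), ncol = 2)   # 200 observations
small <- matrix(rnorm(20, mean = 20), ncol = 2)   # 10 observations, far away
x <- rbind(big, small)
clusters <- rep(c("big", "small"), c(200, 10))
k <- 50

d <- as.matrix(dist(x))

# Proportion of the k nearest neighbors (self excluded) sharing the label.
knn_prop <- vapply(seq_len(nrow(x)), function(i) {
  nn <- order(d[i, ])[2:(k + 1)]
  mean(clusters[nn] == clusters[i])
}, numeric(1))

# Fixed-radius alternative (assumed radius: median k-th neighbor distance).
radius <- median(apply(d, 1, function(row) sort(row)[k + 1]))
sphere_purity <- vapply(seq_len(nrow(x)), function(i) {
  mean(clusters[d[i, ] <= radius] == clusters[i])
}, numeric(1))

# Each "small" observation has only 9 same-cluster neighbors available,
# so its k-NN proportion cannot exceed 9 / 50 despite perfect separation.
max(knn_prop[clusters == "small"])

# The hypersphere never reaches the distant "big" cluster.
min(sphere_purity[clusters == "small"])
```

The small cluster is perfectly separated, yet its nearest-neighbor proportion is capped at 9/50, while its hypersphere purity remains 1.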

A DataFrame with one row per observation in `x` and the columns:

`purity`, a numeric field containing the purity value for the current observation.
`maximum`, the cluster with the highest proportion of observations neighboring the current observation.

Row names are defined as the row names of `x`.

By default, purity values are computed after weighting each observation by the reciprocal of the number of observations in the same cluster. Otherwise, clusters with more observations will have higher purities as any contamination is offset by the bulk of observations, which would compromise comparisons of purities between clusters. One can interpret the weighted purities as the expected value after downsampling all clusters to the same size.
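
A small worked example shows the effect of the default weighting. The counts below are hypothetical and describe the hypersphere around a single observation from cluster A.

```r
# Worked arithmetic for one observation from cluster A (hypothetical counts).
cluster_size <- c(A = 900, B = 100)  # total observations per cluster
in_sphere    <- c(A = 90,  B = 10)   # observations of each cluster in the hypersphere

# Unweighted: raw proportion of same-cluster observations.
unweighted <- in_sphere[["A"]] / sum(in_sphere)

# Weighted: each observation counts for 1 / (size of its cluster).
w <- in_sphere / cluster_size
weighted <- w[["A"]] / sum(w)

c(unweighted = unweighted, weighted = weighted)
# unweighted = 0.9, weighted = 0.5
```

Both clusters contribute the same fraction (10%) of their members to this hypersphere, so after equalizing cluster sizes the neighborhood is evenly split, even though the raw counts suggest high purity.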

Advanced users can achieve greater control by manually supplying a numeric vector of weights to `weighted`. For example, we may wish to check the purity of batches after batch correction in single-cell RNA-seq. In this application, `clusters` should be set to the batch blocking factor (not the cluster identities!) and `weighted` should be set to the reciprocal of the frequency of each combination of cell type and batch. This accounts for differences in cell type composition between batches when computing purities.
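
The weights for the batch-correction case could be built as follows. This is a sketch: the `celltype` and `batch` vectors and the `corrected` matrix are hypothetical names, not part of the package.

```r
# Hypothetical annotations for 8 cells; names are illustrative only.
celltype <- c("T", "T", "T", "B", "T", "B", "B", "B")
batch    <- c("b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2")

# Frequency of each (cell type, batch) combination, mapped back to each cell.
combo <- paste(celltype, batch, sep = "_")
freq <- as.numeric(table(combo)[combo])

# Reciprocal frequencies: every combination carries the same total weight.
w <- 1 / freq
w

# With bluster installed and `corrected` (hypothetical) holding the
# batch-corrected coordinates with cells in rows, the call would look like:
# out <- bluster::neighborPurity(corrected, clusters = batch, weighted = w)
```

Under this scheme, a rare cell type in one batch is not swamped by an abundant one, because the weights within each combination sum to 1.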

If weighted=FALSE, no weighting is performed.

Aaron Lun

library(bluster)

# Random data with no real cluster structure.
m <- matrix(runif(1000), ncol=10)
clusters <- clusterRows(m, BLUSPARAM=NNGraphParam())
out <- neighborPurity(m, clusters)
boxplot(split(out$purity, clusters))

# Mocking up a stronger example:
centers <- matrix(rnorm(30), nrow=3)
clusters <- sample(1:3, 1000, replace=TRUE)
y <- centers[clusters,,drop=FALSE]
y <- y + rnorm(length(y))

out2 <- neighborPurity(y, clusters)
boxplot(split(out2$purity, clusters))