makeSNNGraph: Build a nearest-neighbor graph
In LTLA/bluster: Clustering Algorithms for Bioconductor

makeSNNGraph

R Documentation

Build a nearest-neighbor graph

Description

Build a shared or k-nearest-neighbors graph of observations for downstream community detection.

Usage

makeSNNGraph(
  x,
  k = 10,
  type = c("rank", "number", "jaccard"),
  BNPARAM = KmknnParam(),
  num.threads = 1,
  BPPARAM = SerialParam()
)

makeKNNGraph(
  x,
  k = 10,
  directed = FALSE,
  BNPARAM = KmknnParam(),
  num.threads = 1,
  BPPARAM = SerialParam()
)

neighborsToSNNGraph(
  indices,
  type = c("rank", "number", "jaccard"),
  num.threads = 1
)

neighborsToKNNGraph(indices, directed = FALSE)

Arguments

`x`	A matrix-like object containing expression values for each observation (row) and dimension (column).
`k`	An integer scalar specifying the number of nearest neighbors to consider during graph construction.
`type`	A string specifying the type of weighting scheme to use for shared neighbors.
`BNPARAM`	A BiocNeighborParam object specifying the nearest neighbor algorithm.
`num.threads`	Integer scalar specifying the number of threads to use.
`BPPARAM`	Deprecated, use `num.threads` instead.
`directed`	A logical scalar indicating whether the output of `buildKNNGraph` should be a directed graph.
`indices`	An integer matrix where each row corresponds to an observation and contains the indices of the `k` nearest neighbors (by increasing distance and excluding self) from that observation.

Details

The makeSNNGraph function builds a shared nearest-neighbour graph using observations as nodes. For each observation, its k nearest neighbours are identified using the findKNN function, based on distances between their expression profiles (Euclidean by default). An edge is drawn between all pairs of observations that share at least one neighbour, weighted by the characteristics of the shared nearest neighbors - see “Weighting Schemes” below.

The aim is to use the SNN graph to perform clustering of observations via community detection algorithms in the igraph package. This is faster and more memory efficient than hierarchical clustering for large numbers of observations. In particular, it avoids the need to construct a distance matrix for all pairs of observations. Only the identities of nearest neighbours are required, which can be obtained quickly with methods in the BiocNeighbors package.

The choice of k controls the connectivity of the graph and the resolution of community detection algorithms. Smaller values of k will generally yield smaller, finer clusters, while increasing k will increase the connectivity of the graph and make it more difficult to resolve different communities. The value of k can be roughly interpreted as the anticipated size of the smallest subpopulation. If a subpopulation in the data has fewer than k+1 observations, buildSNNGraph and buildKNNGraph will forcibly construct edges between observations in that subpopulation and observations in other subpopulations. This increases the risk that the subpopulation will not form its own cluster as it is more interconnected with the rest of the observations in the dataset.

The makeKNNGraph method builds a simpler k-nearest neighbour graph. Observations are again nodes, and edges are drawn between each observation and its k-nearest neighbours. No weighting of the edges is performed. In theory, these graphs are directed as nearest neighour relationships may not be reciprocal. However, by default, directed=FALSE such that an undirected graph is returned.

The neighborsToSNNGraph and neighborsToKNNGraph functions operate directly on a matrix of nearest neighbor indices, obtained using functions like findKNN. This may be useful for constructing a graph from precomputed nearest-neighbor search results. Note that the user is responsible for ensuring that the indices are valid, i.e., range(indices) is positive and no greater than max(indices).

Value

A graph where nodes are cells and edges represent connections between nearest neighbors. For buildSNNGraph, these edges are weighted by the number of shared nearest neighbors. For buildKNNGraph, edges are not weighted but may be directed if directed=TRUE.

Shared neighbor weighting schemes

If type="rank", the weighting scheme defined by Xu and Su (2015) is used. The weight between two nodes is k - r/2 where r is the smallest sum of ranks for any shared neighboring node. For example, if one node was the closest neighbor of each of two nodes, the weight between the two latter nodes would be k - 1. For the purposes of this ranking, each node has a rank of zero in its own nearest-neighbor set. More shared neighbors, or shared neighbors that are close to both observations, will generally yield larger weights.

If type="number", the weight between two nodes is simply the number of shared nearest neighbors between them. The weight can range from zero to k + 1, as the node itself is included in its own nearest-neighbor set. This is a simpler scheme that is also slightly faster but does not account for the ranking of neighbors within each set.

If type="jaccard", the weight between two nodes is the Jaccard similarity between the two sets of neighbors for those nodes. This weight can range from zero to 1, and is a monotonic transformation of the weight used by type="number". It is provided for consistency with other clustering algorithms such as those in seurat.

Tehcnically, edges with zero weights are assigned a nominal small positive weight of the order of 1e-6. This is done only to satisfy the requirements for positive weights in many igraph clustering algorithms. We do not just remove these edges as that might lead to the situation where some observations have no edges at all and thus form single-observation clusters.

Note that the behavior of k for type="rank" is slightly different from that used in the original SNN-Cliq implementation by Xu and Su. The original implementation considers each observation to be its first nearest neighbor that contributes to k. Here, the k nearest neighbours refers to the number of other observations.

Author(s)

Aaron Lun, with KNN code contributed by Jonathan Griffiths.

References

Xu C and Su Z (2015). Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 31:1974-80

Examples

m <- matrix(rnorm(10000), ncol=10)

g <- makeSNNGraph(m)
clusters <- igraph::cluster_fast_greedy(g)$membership
table(clusters)

# Any clustering method from igraph can be used:
clusters <- igraph::cluster_walktrap(g)$membership
table(clusters)

# Smaller 'k' usually yields finer clusters:
g <- makeSNNGraph(m, k=5)
clusters <- igraph::cluster_walktrap(g)$membership
table(clusters)

LTLA/bluster documentation built on Sept. 8, 2024, 4:37 a.m.