knitr::opts_chunk$set(echo = TRUE)
R[@R] and Python are popular languages for machine learning. Both are easy to code and, more importantly, have many packages that provide APIs that commonly used by researchers and developers. But there are some differences between them. Python is more powerful and the language is more complex. And it is used in a wide range of fields, not just machine learning. Therefore there is a larger community and it is easier to get support. The R language is the opposite. Focusing on data science, the language is simpler and easier to use. Compared to Python, the community is relatively smaller and it is sightly harder to get help.
Spectral clustering[@von2007tutorial] is a very useful clustering algorithm in manifold learning. The main idea of spectral clustering is that, instead of clustering the original data directly, it clusters the graph of the data. First, it computes the similarity matrix of the data, which is a graph. Second, it calculates the K smallest eigenvectors of the Laplacian matrix that comes from the similarity matrix, which is equivalent to clustering the graph. To obtain the labels, an assignment method is used on the eigenvectors, which is usually k-means. Because the assignment is done in low dimensions, the method is non-linear and is good at high-dimensional data.
In Python, Scikit-learn[@sklearn_api] is one of the most widely used machine learning packages. It has good APIs for spectral clustering and is continuously updating its code. This project tries to bring the spectral clustering from sklearn to R, which provides similar APIs. This project has a better hierarchy and is easier for further development than the original code in Python. More importantly, this project in R is easier to use than sklearn in Python.
The source code is available at https://github.com/arthans/SpectralClustering. Since this is an R implementation of sklearn.cluster.SpectralClustering, some of the documentation and comments in the code are identical to those in sklearn. In addition, because there are so many files, it is not a good idea to append all of them in this report. Therefore, please refer to the previously mentioned website, and I will not update the code after the deadline of the final project.
Lots of papers[@shi2000normalized] provide very clear demonstrations of spectral clustering. And the theory of spectral clustering will not be discussed in this report. This report is more focused on the algorithm and code implementation.
First, the algorithm can be described as follows.
In the first step, the affinity matrix $W$ is computed from the data. There are three main types[@von2007tutorial] of affinity matrix: $\epsilon$-neighborhood, k-nearest neighbor and fully connected. $\epsilon$-neighborhood only keeps the edge if the distance between two points is less than $\epsilon$. K-nearest neighbor keeps the edge when one point is among the K-nearest neighbors of another point. Fully connected uses a similarity function to weight the edges, which is regularly used. The Gaussian function, aka RBF, $s(x_i,x_j)=exp(-\|x_i-x_j\|^2/(2\sigma^2))$ is a common choice for the similarity function.
In the second step, the graph, the affinity matrix, needs to be cut into K clusters. The goal is to maximize the edge within the clusters, minimize the edge between the clusters and each cluster has the most points. [@shi2000normalized] proved that it is equivalent to minimizing the eigenvalues of the normalized Laplacian matrix. In this way, there are two minor steps. First, calculate the Laplacian Matrix $L=D-W$, and $D$ is the degree matrix $D_{i,i}=\sum_jW_{i,j}$.The normalized Laplacian Matrix is $L_n=D^{-\frac{1}{2}}LD^{-\frac{1}{2}}$. Then, compute eigenvalues and eigenvectors of $L_n$. The eigenvectors corresponding to the K smallest eigenvalues are the low-dimensional representations of the data.
In the final step, the new points are clustered. There are three methods: k-means[@von2007tutorial], QR[@damle2019simple] and discretize[@stella2003multiclass]. QR is the fastest method because it computes labels directly. Discretize is robust and converges faster than k-means. And k-means is the most popular choice.
This project is an R implementation of sklearn.cluster.SpectralClustering. Some changes are made to adapt it to the R language. The structure of the functions is modified to simplify further development. Some of the documentation and comments are the same as those in sklearn due to the similar API, which benefits migration.
According to the algorithm, there are four R scripts: Main Function, Affinity Matrix, Spectral Embedding and Assignment.
The Main Function is simple. It just calls the other three functions in turn.
spectral_clustering <- function(X, n_clusters = 8, n_components = NULL, affinity = "rbf", assign_labels = "kmeans", n_init = 10) { affinity_matrix <- affinity_kernel(X, kernel = affinity) n_components <- if (is.null(n_components)) n_clusters else n_components maps <- spectral_embedding(affinity_matrix, n_components = n_components) labels <- assignment(maps, assign_labels, centers = n_clusters, nstart = n_init) return(labels) }
For more information, please refer to the 'spectral_clustering' function in the package documentation.
Affinity Matrix has three kernels: precomputed, nearest_neighbors and RBF. Precomputed means the input X is already the affinity matrix. Since the RBF is one kind of kernel functions that $K_{ij}=k(x_i,x_j)$, pairwise_kernels is used to simply the further development.
affinity_kernel <- function(X, kernel = "rbf") { if (kernel == "precomputed") { affinity_matrix <- X } else if (kernel == "nearest_neighbors") { connectivity <- kneighbors_graph(X) affinity_matrix <- 0.5 * (connectivity + t(connectivity)) } else { affinity_matrix <- pairwise_kernels(X, kernel) } return(affinity_matrix) }
For more information, please refer to the 'affinity_kernel', 'kneighbors_graph', 'pairwise_kernels' and 'rbf_kernel' function in the package documentation.
Spectral Embedding computes the laplacian matrix and its eigenvectors. Sklearn provides three eigen solvers, each with its own advantages. Unfortunately, the eigen function in the base package only has LAPACK.
spectral_embedding <- function(adjacency, n_components = 8) { # Laplacian laplacian <- laplacian(adjacency, normed = TRUE) # Eigenvector: Only one solver: LAPACK # TODO: Add more solvers embedding <- eigen(laplacian, symmetric = TRUE)$vectors # http://www.di.fc.ul.pt/~jpn/r/spectralclustering/spectralclustering.html return(embedding[, (ncol(embedding)-n_components+1):ncol(embedding)]) }
For more information, please refer to the 'spectral_embedding' and 'laplacian' function in the package documentation.
Assignment clusters the point in the maps. There are three algorithms: kmeans, cluster_qr and discretize. The properties of these three algorithms have been discussed in the algorithm section.
assignment <- function(maps, assign_labels = "kmeans", centers = 8, nstart = 10) { if (assign_labels == "kmeans") { labels <- stats::kmeans(maps, centers = centers, nstart = nstart)$cluster } else if (assign_labels == "cluster_qr") { labels <- cluster_qr(maps) } else if (assign_labels == "discretize") { labels <- discretize(maps) } return(labels) }
For more information, please refer to the 'assign', 'cluster_qr' and 'discretize' function in the package documentation.
It is very easy to use this package. Take Homework 3 for example.
library(SpectralClustering) data(authors) pca <- prcomp(authors, retx = TRUE, center = TRUE, scale. = TRUE, rank. = 2)$x par(mfrow = c(2, 3)) for (affinity in c('nearest_neighbors', 'rbf')) { for (assign_labels in c('kmeans', 'cluster_qr', 'discretize')) { label <- spectral_clustering(authors, 4, affinity = affinity, assign_labels = assign_labels) plot(pca, col = label, main = paste0(affinity, ' ', assign_labels)) } }
And it is also easy to use if there is a precomputed affinity matrix.
# affinity matrix: epsilon-neighborhood d <- pairwise_kernels(authors, 'rbf') A <- d < 1e-40 # spectral_clustering label <- spectral_clustering(A, 4, affinity = 'precomputed', assign_labels = 'cluster_qr') plot(pca, col = label, main = 'spectral clustering')
It is not a good idea to use $\epsilon$-neighborhood for this problem.
In this section, the performance of the package is tested.
tt = c() for (assign_labels in c('kmeans', 'cluster_qr', 'discretize')) { for (affinity in c('nearest_neighbors', 'rbf')) { t <- system.time( spectral_clustering(authors, 4, affinity = affinity, assign_labels = assign_labels) )[3] names(t) <- assign_labels tt <- c(tt, t) } } barplot( tt, ylab = 'time(s)', legend.text = c('nearest_neighbors', 'rbf'), col = c(2, 4), main = 'Performance' )
For three assign algorithms, there is no significant difference. For affinity, nearest_neighbors is much faster than RBF, which needs to do the expensive exp() function.
To improve performance, there is a more detailed test.
tt2 = c() t <- system.time(affinity_matrix <- affinity_kernel(authors, kernel = 'rbf'))[3] names(t) <- 'Affinity Matrix' tt2 <- c(tt2, t) t <- system.time(maps <- spectral_embedding(affinity_matrix, n_components = 4))[3] names(t) <- 'Spectral Embedding' tt2 <- c(tt2, t) t <- system.time(labels <- assignment(maps, 'kmeans', centers = 4))[3] names(t) <- 'Assignment' tt2 <- c(tt2, t) barplot(tt2, ylab = 'time(s)', main = 'rbf kmeans')
It is obvious that the Affinity Matrix is the key to improve the performance. One of the possible ways is to parallelize the function, since the computation of the affinity matrix is highly parallel.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.