title: "Dimensionality Reduction and Clustering: single cell RNA-seq Data" author: "Group 8: Spencer Haupert, Boya Jiang, Kailin Wang" date: "12/17/2021" output: pdf_document: default documentclass: article classoption: a3paper geometry: margin=3.5cm
```r
library(knitr)
library(clusteringscRNA)
library(microbenchmark)
library(ggplot2)
load("../../data/tumor_reduced.rda")
tumor_reduced2 = tumor_reduced[, 1:200]
```
In recent years, there has been considerable interest in analyzing single-cell RNA (scRNA) data. For complex pathologies like cancer, scRNA data can illuminate the heterogeneities and commonalities across different cell types. Unsupervised learning techniques are often employed to cluster cells.
However, scRNA data analysis can suffer from the so-called "curse of dimensionality" since it is often the case that $p \gg n$ or at least $p \approx n$. Therefore, a dimensionality reduction step is often in order before clustering to make this computationally challenging problem tractable. Another challenge is that the performance of common clustering algorithms can vary wildly depending on the initialization procedure. Therefore, our objectives are twofold: 1) implement a robust and efficient dimensionality reduction technique and 2) improve existing, widely-used clustering methods to mitigate issues related to initialization.
Our test data comes from The Cancer Genome Atlas's Pan-Cancer Atlas data products [1]. The particular dataset we use is hosted on UCI's ML repository [2], and it contains labeled scRNA data for 5 cancer types. For demonstration, we selected a subset of data containing 801 cells and 200 genes, each cell corresponding to one of the 5 cancer types. The first five rows and columns of the data are shown here:
```r
tumor_reduced2[1:5, 1:5]
```
For our project, we implement a sparse version of Principal Component Analysis (PCA), k-means clustering, and the EM algorithm for a Gaussian mixture model. We make these implementations available to the public through two R packages, `spcaRcpp` (for sparse PCA) and `clusteringscRNA` (for k-means and the EM algorithm). Links to these packages, to a Google Colab page with a brief tutorial, and to the dataset used for evaluation can be found below.
- R Packages: https://github.com/srhaup2/clustering_scRNA and https://github.com/BoyaJiang/spcaRcpp
- Google Colab: https://colab.research.google.com/drive/14U0oFzB21j1-rswnQfkHt3YT93l2Z9-7#scrollTo=v3tym2Lcq5v-
- Database: https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq
Principal component analysis (PCA) is a popular data-processing and dimension-reduction technique. However, PCA suffers from the fact that each principal component is a linear combination of all the variables, making it difficult to interpret the results. Sparse principal component analysis (SPCA) was designed to remedy this drawback and to give additional interpretability to the projected data. Specifically, SPCA promotes sparsity in the modes, and the resulting sparse modes have only a few active (non-zero) coefficients, while the majority of coefficients are zero. As a consequence, the model has improved interpretability, because the principal components are formed as a linear combination of only a few of the original variables. This method also helps prevent overfitting in settings where the number of variables is much greater than the number of observations ($p \gg n$).
The formulation of SPCA by Zou, Hastie and Tibshirani [3] directly incorporates sparsity inducing regularizers into the optimization problem:
$$
\begin{aligned}
&\underset{\mathbf{A}, \mathbf{B}}{\operatorname{minimize}}\; f(\mathbf{A}, \mathbf{B})=\frac{1}{2}\left\|\mathbf{X}-\mathbf{X B A}^{\top}\right\|_{\mathrm{F}}^{2}+\psi(\mathbf{B}) \\
&\text{subject to } \mathbf{A}^{\top} \mathbf{A}=\mathbf{I}
\end{aligned}
$$
where $\mathbf{B}$ is a sparse weight matrix and $\mathbf{A}$ is an orthonormal matrix. The penalty $\psi$ denotes a sparsity inducing regularizer such as the elastic net. Specifically, the optimization problem is minimized using an alternating algorithm:
The problem splits across the $k$ columns of $\mathbf{B}$, yielding a regularized regression problem in each case:
$$
\mathbf{b}_{j}^{*}=\underset{\mathbf{b}_{j}}{\arg \min }\; \frac{1}{2}\left\|\mathbf{X} \mathbf{A}(:, j)-\mathbf{X} \mathbf{b}_{j}\right\|^{2}+\psi\left(\mathbf{b}_{j}\right)
$$
The principal components can then be calculated as a sparsely weighted linear combination of the observed variables, $\mathbf{Z = XB}$. The $\mathbf{B}$ update step relies on proximal gradient methods to find a stationary point.
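To make the alternating scheme concrete, the following is a minimal sketch of a single iteration (illustrative only, not the `spcaRcpp` internals; the fixed step size and the helper names are assumptions):

```r
# Soft-thresholding operator: proximal map of the l1 penalty
soft <- function(v, thresh) sign(v) * pmax(abs(v) - thresh, 0)

# One alternating SPCA iteration (sketch): a proximal gradient step on B,
# then an orthogonal Procrustes update of A via the SVD of X'XB.
spca_iterate_once <- function(X, A, B, alpha, beta, step = 1e-3) {
  # Gradient of the smooth part 1/2 ||XA - XB||_F^2 + beta/2 ||B||_F^2
  G <- crossprod(X, X %*% B - X %*% A) + beta * B
  # Proximal (soft-thresholding) step handles the l1 penalty alpha * ||B||_1
  B <- soft(B - step * G, step * alpha)
  # A-update: maximize trace(A' X'X B) subject to A'A = I
  sv <- svd(crossprod(X, X %*% B))
  A <- sv$u %*% t(sv$v)
  list(A = A, B = B)
}
```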
There are several existing R packages that implement SPCA, e.g. `sparsepca`, `elasticnet`, and `EESPCA`. Among these, `sparsepca::spca` provided a starting point for optimizing the SPCA function in terms of computational efficiency [4,5]. In order to improve the performance of the function, the iterative step was re-implemented in RcppArmadillo, which provides an interface to the Armadillo C++ numerical linear algebra library. RcppArmadillo offers a balance between performance and ease of use. The resulting package is `spcaRcpp`. The `spcaRcpp` function from the package takes in an $n \times p$ data matrix or data frame $X$, a parameter $k$ indicating the maximal rank, the sparsity controlling parameter $\alpha$, the ridge shrinkage parameter $\beta$, a logical value `center`, the maximum number of iterations, and the stopping criterion for convergence. The function then returns a list containing the following: a matrix of variable loadings, standard deviations, eigenvalues, centering, variance, and the principal component scores.
We performed SPCA on the `tumor` data using the following code:

```r
spcaRcpp(tumor, k = 15, alpha = 1e-04, beta = 1e-04, center = TRUE,
         max_iter = 1000, tol = 1e-05)
```
The function returns 15 principal components (PCs) with a cumulative explained variance ratio of 70%. Figure 1 is a visualization of PC1 vs PC2 colored by the known true tumor label. The validity of `spcaRcpp` is confirmed by `all.equal()` tests against the original `sparsepca::spca` function. The performance of `spcaRcpp` is tested using `microbenchmark` over 100 runs. As shown in Figure 2a, the re-implementation of SPCA using Rcpp (top) successfully increased its speed compared to the original `sparsepca::spca` (bottom) function.
knitr::include_graphics(c("rcpp2.png", "rcpp4.png"))
It is also worth noting that when both $\alpha$ and $\beta$ are set to 0, the `spca` function no longer introduces sparsity, and the results returned are the same as traditional PCA. Figure 2b shows that the speeds of `spcaRcpp` and `stats::prcomp` are similar when the sparsity in the modes is removed.
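A quick check of this equivalence might look like the following (a sketch; the `scores` element name is taken from the package description above, and the comparison is only up to sign flips of individual components):

```r
# With alpha = beta = 0, sparse PCA should reduce to ordinary PCA
spca_out <- spcaRcpp(tumor, k = 15, alpha = 0, beta = 0, center = TRUE,
                     max_iter = 1000, tol = 1e-05)
pca_out  <- stats::prcomp(tumor, center = TRUE, rank. = 15)

# Principal component scores should agree up to sign flips of individual PCs
all.equal(abs(spca_out$scores), abs(unname(pca_out$x)), tolerance = 1e-6)
```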
EM clustering is one of the most widely-used algorithms for clustering RNA-seq data. There are several existing R packages that implement the EM algorithm for clustering multivariate normal mixture data. One of the most popular is `mixtools::mvnormalmixEM` [6,7,8]. However, this is a general implementation that clusters multivariate normal mixture data without exploiting mutual independence among the variables, which brings in unnecessary complexity when the variables are in fact mutually independent. For this reason, we developed the `normMixEM` class in the `clusteringscRNA` package specifically for data with mutually independent variables.
The simplification rests on the fact that when independence holds, the covariance matrix of each multivariate normal component is diagonal and can therefore be compressed into a vector. In this case, `for` loops over a list of matrices can be replaced by matrix operations, which makes efficiency improvements possible. The likelihood and log-likelihood functions for the mutually independent case can be expressed as follows:
knitr::include_graphics(c("em1.png"))
where $\log P(x_{i,j} \mid \mu_{z_i,j}, \sigma_{z_i,j}) = \log\!\left(\frac{1}{\sqrt{2\pi}\,\sigma_{z_i,j}} e^{-\frac{1}{2}\left(\frac{x_{i,j}-\mu_{z_i,j}}{\sigma_{z_i,j}}\right)^2}\right) = -\frac{1}{2}\log(2\pi) - \log(\sigma_{z_i,j}) - \frac{1}{2}\left(\frac{x_{i,j}-\mu_{z_i,j}}{\sigma_{z_i,j}}\right)^2$
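To illustrate how the diagonal-covariance case vectorizes, the following is a minimal sketch of the per-component log-density and E-step (hypothetical helper names, not the `normMixEM` internals):

```r
# Log-density of one mixture component under mutual independence:
# mu_k and sigma_k are length-p vectors (the compressed diagonal covariance).
loglik_component <- function(X, mu_k, sigma_k) {
  Z <- sweep(sweep(X, 2, mu_k, "-"), 2, sigma_k, "/")
  rowSums(-0.5 * log(2 * pi) - rep(log(sigma_k), each = nrow(X)) - 0.5 * Z^2)
}

# E-step: responsibilities from the component log-densities and weights lambda
estep <- function(X, lambda, mu, sigma) {
  logd <- sapply(seq_along(lambda), function(k)
    log(lambda[k]) + loglik_component(X, mu[k, ], sigma[k, ]))
  logd <- logd - apply(logd, 1, max)   # stabilize before exponentiating
  P <- exp(logd)
  P / rowSums(P)                       # n x K matrix of responsibilities
}
```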
The `normMixEM` algorithm consists of three main parts: `normMixEM` (class), `rmvnorm_chol` (function), and `normalmix_init` (function). A comparison table between the `clusteringscRNA` and `mixtools` implementations is displayed below:
knitr::include_graphics(c("em2.png"))
We conducted an example test comparing the runtime of `normMixEM()` with `mvnormalmixEM()` using the following code. In this case, we can see that `normMixEM()` gives 'perfect' clustering, and runs 6 times faster than `mvnormalmixEM()` in `mixtools`. Further application of `normMixEM()` to the tumor data is discussed in the Results section.
```r
library(mixtools)   # for rmvnorm() and mvnormalmixEM()
library(dplyr)      # for count() and mutate()

set.seed(1000)
# Simulate two well-separated bivariate normal components (40 and 60 points)
x.1 <- rmvnorm(40, c(0, 0))
x.2 <- rmvnorm(60, c(3, 4))
X.1 <- rbind(x.1, x.2)
class_t <- c(rep(1, 40), rep(2, 60))   # true component labels

# Fit the two-component mixture with normMixEM
EM <- normMixEM$new(input_dat = X.1, num_components = 2L)
res_EM <- EM$run.EM(loglik_tol = 1e-5)
class <- apply(res_EM$prob_mat, 1, which.max)   # hard cluster assignments

# Log-likelihood trace and confusion table against the true labels
plot(1L:res_EM$iter, res_EM$loglik_list, type = "l")
kable(data.frame(est = class, true = class_t) %>%
        count(true, est) %>%
        mutate(freq = n / sum(n)))
```
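The 6x speed difference cited above can be reproduced along these lines (a sketch on the same simulated data; the exact settings behind the reported benchmark may differ):

```r
# Benchmark normMixEM against mixtools::mvnormalmixEM on the simulated data
mb <- microbenchmark(
  clusteringscRNA = {
    EM <- normMixEM$new(input_dat = X.1, num_components = 2L)
    EM$run.EM(loglik_tol = 1e-5)
  },
  mixtools = mvnormalmixEM(X.1, k = 2),
  times = 10
)
mb
```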
knitr::include_graphics(c("em4.png", "em_mb.png"))
knitr::include_graphics(c("em5.png","em3.png"))
Last but not least, we should emphasize the importance of validating the mutual independence assumption before applying the algorithm, since `normMixEM` is built on that assumption. In particular, when the data dimension is large, mutual independence is unlikely to hold, so `normMixEM` typically performs poorly on high-dimensional datasets. In that case, dimensionality reduction is necessary before clustering with the `normMixEM` algorithm.
K-means is a very popular and relatively straightforward clustering algorithm. Many variations exist, but the most common implementation is referred to as Lloyd's algorithm [9]. Lloyd's K-means algorithm is as follows:

1. Choose $k$ initial centroids uniformly at random from the data.
2. Assign each observation to its nearest centroid.
3. Recompute each centroid as the mean of the observations assigned to it.
4. Repeat steps 2-3 until the assignments no longer change or a maximum number of iterations is reached.
Formally, cluster quality is assessed by within-cluster sum of squared error (WCSSE) given by:
For clusters $C_1, ..., C_k$, $WCSSE = \sum_{i = 1}^k \sum_{x \in C_i} ||x - \mu_i||^2$ where $\mu_i$ is the centroid for each cluster.
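As a concrete illustration, WCSSE can be computed for any assignment in a few lines of R (a sketch, not the package's internal implementation):

```r
# Within-cluster sum of squared error for a given cluster assignment
wcsse <- function(X, cluster) {
  X <- as.matrix(X)
  sum(sapply(unique(cluster), function(ci) {
    Xi <- X[cluster == ci, , drop = FALSE]
    mu <- colMeans(Xi)             # cluster centroid
    sum(sweep(Xi, 2, mu)^2)        # squared distances to the centroid
  }))
}
```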
Lloyd's algorithm is attractive because of its ease of implementation and because it is rather intuitive. However, because the initialization procedure is completely random and because K-means converges only to a local optimum, Lloyd's method can produce quite poor clusters if bad initial centroids are chosen by chance.
For this reason, attempts have been made to improve Lloyd's method's initialization [10]. One very popular method is kmeans++, which is the default method in many software packages including the ubiquitous Python scikit-learn package [11]. The method is as follows:

1. Choose the first centroid uniformly at random from the data.
2. For each observation $x$, compute $D(x)$, the distance to the nearest centroid chosen so far.
3. Sample the next centroid from the data with probability proportional to $D(x)^2$.
4. Repeat steps 2-3 until $k$ centroids have been chosen, then run Lloyd's algorithm as usual.
Kmeans++ represents a considerable improvement over random initialization. This method operates under the idea that cluster centers should be spread out to maximize the opportunity to find the global optimum. Despite the popularity of kmeans++, some studies suggest it is not the optimal initialization procedure [12]. The authors of kmeans++ also propose another, even better method called greedy kmeans++. The algorithm is as follows:

1. Choose the first centroid uniformly at random from the data.
2. For each observation $x$, compute $D(x)$, the distance to the nearest centroid chosen so far.
3. **Sample $j$ candidate centroids with probability proportional to $D(x)^2$ and keep the candidate that most reduces the total squared distance of all points to their nearest centroid.**
4. Repeat steps 2-3 until $k$ centroids have been chosen, then run Lloyd's algorithm as usual.
The difference between kmeans++ and greedy kmeans++ is highlighted in bold. In essence, instead of sampling one new centroid each iteration, $j$ candidates are sampled and the best one (the one that most reduces the total squared distance to the nearest centroid) is chosen.
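A minimal sketch of greedy kmeans++ initialization is shown below (illustrative only; the `kmeans_clust()` internals may differ, and the helper name is hypothetical):

```r
# Greedy kmeans++ initialization: at each step, sample j candidate centers with
# probability proportional to D(x)^2 and keep the one with the lowest total cost.
greedy_kmeanspp_init <- function(X, k, j = 2L) {
  X <- as.matrix(X)
  n <- nrow(X)
  centers <- X[sample.int(n, 1L), , drop = FALSE]   # first center: uniform at random
  for (i in seq_len(k - 1L)) {
    # squared distance from each point to its nearest chosen center
    d2 <- apply(X, 1L, function(x) min(colSums((t(centers) - x)^2)))
    cand <- sample.int(n, j, prob = d2, replace = TRUE)   # j candidate centers
    # total cost (sum of squared distances to nearest center) for each candidate
    cost <- sapply(cand, function(ci) {
      cc <- rbind(centers, X[ci, ])
      sum(apply(X, 1L, function(x) min(colSums((t(cc) - x)^2))))
    })
    centers <- rbind(centers, X[cand[which.min(cost)], ])
  }
  centers
}
```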
The most popular implementation of k-means in R is `stats::kmeans()`, but several others exist, including `flexclust::kcca()`, `ClusterR::KMeans_rcpp()`, and `pracma::kmeanspp`. Some of these functions implement kmeans++, but to our knowledge, no current R package implements greedy kmeans++.
Our kmeans implementation is capable of running Lloyd's algorithm with the following initialization methods: random, kmeans++, and greedy kmeans++. The function can be used as shown below:
```r
kmeans_clust(X, k, nstart = 1L, iter.max = 10L, init.method = "random",
             center = TRUE, scale = TRUE)
```
It takes as input: an $n \times p$ data matrix `X`, the number of clusters `k`, the number of random initializations `nstart`, the maximum number of iterations `iter.max`, the initialization method `init.method` (`"random"`, `"kmeans++"`, or `"gkmeans++"`), and logical flags `center` and `scale` indicating whether the data should be centered and scaled.

It returns, among other output, a matrix of cluster assignments (`clusters`) and the within-cluster sum of squared error (`wcsse`).
First, we present a speed comparison of `kmeans_clust()` with `stats::kmeans()`.
```r
res = microbenchmark(
  stats = kmeans(tumor_reduced2, centers = 5, nstart = 1, algorithm = "Lloyd"),
  clusteringscRNA_random = kmeans_clust(tumor_reduced2, k = 5, nstart = 1,
                                        init.method = "random"),
  clusteringscRNA_kmeanspp = kmeans_clust(tumor_reduced2, k = 5, nstart = 1,
                                          init.method = "kmeans++"),
  clusteringscRNA_gkmeanspp = kmeans_clust(tumor_reduced2, k = 5, nstart = 1,
                                           init.method = "gkmeans++"),
  times = 10
)
autoplot(res) + labs(title = "Figure 5. Speed Comparison of stats vs clusteringscRNA")
```
We can see that the function in the `stats` package is the fastest, which is not surprising since it is written in C and Fortran. However, considering that only a small portion of the `clusteringscRNA` function is written in Rcpp, these results are not too bad. The random initialization and kmeans++ methods have similar runtimes, but the greedy kmeans++ method has a much longer runtime because it does many more distance calculations.
Next, we can evaluate accuracy on the raw data using Adjusted Rand Index:
```r
library(mclust)   # for adjustedRandIndex()
# TC: vector of true tumor labels, assumed available from the loaded data

set.seed(114)
tumor_reduced2 = tumor_reduced[, 1:1000]

# Each method: 10 runs, recording ARI against the true labels and the WCSSE
res1 = matrix(0, 2, 10)
rownames(res1) = c("ARI", "WCSSE")
res2 = res1
res3 = res1
res4 = res1

for (i in 1:10) {
  m = kmeans(tumor_reduced2, centers = 5, nstart = 1, iter.max = 10,
             algorithm = "Lloyd")
  res1[1, i] = adjustedRandIndex(m$cluster, TC)
  res1[2, i] = m$tot.withinss
}
for (i in 1:10) {
  m = kmeans_clust(tumor_reduced2, k = 5, nstart = 1, iter.max = 10,
                   init.method = "random", center = FALSE, scale = FALSE)
  res2[1, i] = adjustedRandIndex(m$clusters[, 1], TC)
  res2[2, i] = m$wcsse
}
for (i in 1:10) {
  m = kmeans_clust(tumor_reduced2, k = 5, nstart = 1, iter.max = 10,
                   init.method = "kmeans++", center = FALSE, scale = FALSE)
  res3[1, i] = adjustedRandIndex(m$clusters[, 1], TC)
  res3[2, i] = m$wcsse
}
for (i in 1:10) {
  m = kmeans_clust(tumor_reduced2, k = 5, nstart = 1, iter.max = 10,
                   init.method = "gkmeans++", center = FALSE, scale = FALSE)
  res4[1, i] = adjustedRandIndex(m$clusters[, 1], TC)
  res4[2, i] = m$wcsse
}

# Summarize the median ARI and WCSSE across the 10 runs for each method
res_df = data.frame(
  `Median ARI` = c(median(res1[1, ]), median(res2[1, ]),
                   median(res3[1, ]), median(res4[1, ])),
  `Median WCSSE` = c(median(res1[2, ]), median(res2[2, ]),
                     median(res3[2, ]), median(res4[2, ]))
)
rownames(res_df) = c("stats", "clusteringscRNA_random",
                     "clusteringscRNA_kmeanspp", "clusteringscRNA_gkmeanspp")
print.data.frame(res_df)
```
Greedy kmeans++ offers an improvement in accuracy and precision as measured by ARI and WCSSE. It is worth noting that the lowest WCSSE does not always correspond to the best ARI. In testing these functions, we found greedy kmeans++ fairly consistently provided the lowest WCSSE, but it does not always provide the highest ARI.
In order to test the performance and speed of our functions, we performed SPCA followed by EM or k-means clustering on the `tumor` data. Based on prior knowledge, the number of clusters was set to 5. First, dimensionality reduction was applied to the data using `spcaRcpp`. The resulting principal components were clustered using either the EM or the k-means algorithm. Table 2a shows the frequency of observations assigned to each cluster compared with the true labels from one run of EM clustering. The percentage of accurately clustered entries is approximately 98.9%. However, since the EM algorithm depends on the initialization point, a single run is not reliable. Therefore, we also calculated the average Adjusted Rand Index (ARI) from 10 runs. ARI measures the similarity between two clusterings by considering all pairs of samples and counting pairs that are assigned to the same or different clusters in the predicted and true clusterings. The resulting average ARI for the EM algorithm was 0.733, and the average runtime was 0.172 seconds.
knitr::include_graphics(c("results3.png"))
Next, we computed the accuracy of k-means clustering after dimensionality reduction using the following code. The initialization method `gkmeans++` was chosen since it was expected to outperform the other methods, as discussed previously. `nstart` was set to 10 in order to achieve higher accuracy.
```r
clusteringscRNA::kmeans_clust(spca_out$scores, k = 5, nstart = 10,
                              init.method = "gkmeans++", center = FALSE, scale = FALSE)
```
From Table 2b, the accuracy of k-means clustering was approximately 93%. Again, in order to better assess the reliability and accuracy of this method, the average ARI was computed from 10 runs. The average ARI of k-means clustering was 0.818, and the average runtime was 1.095 seconds. In comparison, the same k-means function took approximately 1.612 seconds to cluster the raw data without dimensionality reduction. This is expected: SPCA reduces the number of features, which speeds up clustering, although the accompanying loss of information can also affect clustering accuracy.
In this study, we presented new versions of the SPCA, EM, and k-means algorithms with improvements in efficiency and accuracy. To assess the accuracy of our methods, we applied SPCA followed by EM or k-means clustering to the `tumor` data and measured the agreement between the true labels and ours by calculating ARI values. Specifically, the greedy k-means method achieved higher accuracy (average ARI = 0.818) than the EM algorithm (average ARI = 0.733). However, greedy k-means is much slower than the EM algorithm (1.095 secs vs 0.172 secs on average). This result provides some insight into the speed/accuracy trade-off when selecting clustering methods for scRNA data.
Although our methods were able to demonstrate similar or even better performance compared to their existing counterparts, there are still several limitations. One main difficulty is that both the k-means and EM algorithms are sensitive to the initialization parameters, which include the starting point and the cluster number `k`. Given the randomness of the initialization, the clustering result may vary from run to run, which reduces the stability of the algorithms and increases the difficulty of interpretation.
The speed of clustering algorithms is another concern. As single-cell datasets have become larger and larger, improvements in efficiency will be required, and this can be achieved by writing in C++, Fortran, etc. Since only a small part of the code in our study is written in C++, it is reasonable to expect further efficiency gains from rewriting more of it in a compiled language.