knitr::opts_chunk$set( comment = "#>" )
This vignette aims to illustrate the basic usage of SWKM
package.
The released version of SWKM
can be installed directly from Github with devtools
:
library(devtools) devtools::install_github("Van1yu3/SWKM")
As this package depends on the RcppArmadillo
package, Windows users should install Rtools in advance to compile the package.
Then we can load SWKM
:
library(SWKM)
We illustrate the usage of SWKM
on two datasets.
First load the dataset:
data("NormalDisData")
The dataset contains the following information:
names(NormalDisData)
data
is a 60 by 500 matrix indicating 60 observations with 500 features. The first 50 features are cluster-specific features.
ncluster
is the true number of clusters, which is 3 here.
true.label
is a 60-dimension vector indicating the true cluster.
noisy.label
indicates the position of noisy observations. The noisy observations are generated from the normal distribution with the same mean but a larger variance.
The number of clusters K
must be determined before the clustering method is performed. The function ChooseK
, which employs Gap Statistic(Robert T. et al. (2001)), is designed for this goal. A good way of lessening the impact of noisy observations on the choice of K
is to remove them.
set.seed(1) cK <- ChooseK(NormalDisData$data[-NormalDisData$noisy.label,],nClusters = 1:6) plot(cK)
Next, let's tune weight for the noisy observations:
K <- cK$OptimalK system.time( res.tuneU <- kmeans.weight.tune(x = NormalDisData$data,K = K,noisy.lab = NormalDisData$noisy.label,weight.seq = NULL) )
If weight.seq
is provided, then noisy.lab
will be omitted. Otherwise, the default setting is to fix the weight of normal observations as 1 and assign a sequence of candidate weight c(1,0.8,0.5,0.2,0.1,0.08,0.05,0.02,0.01,0.005,0.001)
to noisy observations. See help("kmeans.weight.tune")
for more details.
We can plot the result to see which weight label is chosen:
plot(res.tuneU)
From the figure, the weight of noisy observations is chosen to be 0.02.
The next step is to tune the sparsity parameter (Daniela M Witten and Robert Tibshirani (2010)):
res.tunes <- KMeansSparseCluster.permute.weight(x = NormalDisData$data,K = K,weight = res.tuneU$bestweight)
Finally we can perform sparse weighted K-means with the weight and sparsity parameter:
res <- KMeansSparseCluster.weight(x = NormalDisData$data,K = K,wbounds = res.tunes$bestw,weight = res.tuneU$bestweight)
Check the clustering result and the number of features selected:
table(res[[1]]$Cs,NormalDisData$true.label) sum(res[[1]]$ws!=0) order(res[[1]]$ws,decreasing = TRUE)[1:50]
As expected, the 50 most important features lie mainly in the first 50 features.
This dataset has a dimension of 100 by 5000, and simulates the read counts in single-cell epigenomic data. The noisy observations in this dataset are set to be with 90\% missing values.
The usage in this dataset is quite similar to the previous one, except that this dataset needs data preprocessing. One useful preprocessing method is to first take rank for each cell, then standardize the rank matrix to mean 0 and unit standard deviation.
set.seed(1) data("DMdata") # data preprocessing data <- t(DMdata$data) data_rank <- apply(data, 2, rank) data_rank_center<- t(t(data_rank) - colMeans(data_rank)) data_rank_center_scale <- t(t(data_rank_center)/apply(data_rank_center, 2, sd)) data_processed <- t(data_rank_center_scale) # tune the number of cluster K # nperms and nstart are set to be small in order to save computation time cK <- ChooseK(data_processed[-DMdata$noisy.label,],nClusters = 1:6,nperms = 10,nstart = 5) plot(cK) K <- cK$OptimalK # tune weight # the process of tuning weight will take some time system.time( res.tuneU <- kmeans.weight.tune(x = data_processed,K = K,noisy.lab = DMdata$noisy.label,nperms = 10,nstart = 5) ) plot(res.tuneU) # perform weighted K-means res <- kmeans.weight(x = data_processed,K = K,weight = res.tuneU$bestweight) # check the result table(res$cluster,DMdata$true.label)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.