This is a fast expectation-maximization (EM) algorithm for Gaussian mixture model (GMM) clustering, implemented with RcppArmadillo and OpenMP. Rcpp speeds up functions that contain many loops, and parallel computation makes the algorithm scalable to massive data sets.
To interface R with C++, the R packages Rcpp and RcppArmadillo must be installed.
This package uses parallel computing in both R and C++. For parallel computing in R, install the packages foreach and doSNOW. Parallel computing via OpenMP in Rcpp requires a clang version above 4.0. For macOS users, MacPorts is recommended for installing clang:
port select --list clang
sudo port select --set clang <version>
which clang
Choose a <version> higher than 4.0 (for example 6.0 or 7.0) from the listed options. Then, in Terminal, make R use the clang we installed:
mkdir -p ~/.R
cd ~/.R
touch Makevars
emacs Makevars
In the Makevars file, type:
CC = /opt/local/bin/clang
CXX = /opt/local/bin/clang++
CXX11 = /opt/local/bin/clang++
Save and quit, then restart R. OpenMP is now available to Rcpp.
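To confirm that the compiler configured above really builds OpenMP code, a quick check from the R console can help. The following is only a sketch: omp_thread_count is a hypothetical helper name, not part of this package, and it assumes Rcpp is installed and that its "openmp" plugin supplies the right compiler flags for your clang.

```r
library(Rcpp)

# Hypothetical helper (not part of RcppParallelGMM): counts how many threads
# take part in an OpenMP parallel region. Each thread adds 1 to its private
# copy of n, and the reduction sums the copies when the region ends.
cppFunction(
  plugins = "openmp",
  code = '
int omp_thread_count() {
  int n = 0;
  #pragma omp parallel reduction(+:n)
  n += 1;
  return n;
}'
)

n_threads <- omp_thread_count()
n_threads
```

If compilation fails, the compiler R is using most likely lacks OpenMP support. A result of 1 means the code compiled but only a single thread was used.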
First, compile the Rcpp attributes in the R console:
Rcpp::compileAttributes()
Then, in Terminal, build and install the package:
R CMD build RcppParallelGMM
R CMD INSTALL RcppParallelGMM_1.0.tar.gz
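With the package installed, the R-level half of the parallelism (foreach with a doSNOW backend, as listed in the requirements) can be tried on its own. This is a minimal sketch of the general pattern, not code taken from the package: the worker count of 2 and the simulated per-item work are assumptions made for illustration.

```r
library(doSNOW)   # also attaches foreach and snow

# Start two local worker processes and register them as the foreach backend
cl <- makeCluster(2, type = "SOCK")
registerDoSNOW(cl)

# Hypothetical per-item work: each task summarises one simulated 10 x 3 matrix
# (a stand-in for per-image processing) and the rows are bound together
col_means <- foreach(i = 1:4, .combine = rbind) %dopar% {
  set.seed(i)
  colMeans(matrix(runif(30), ncol = 3))
}

stopCluster(cl)   # always release the workers
dim(col_means)
```

Each iteration runs in a separate R process, so any package an iteration needs must be named in the .packages argument of foreach.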
We download the data from Kaggle: https://www.kaggle.com/puneet6060/intel-image-classification. This is an image data set of natural scenes around the world, containing 25k images of buildings, forests, mountains, streets, sea and glacier. For each photo, the three RGB channels (red, green, blue) are extracted and the pixel frequency/density is computed, so that each photo corresponds to a vector. This step relies on the jpeg package. The photos can then be clustered using a Gaussian mixture model (GMM) fitted with the EM algorithm.
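To make the "each photo corresponds to a vector" step concrete, here is one way such a feature vector could be built. It is a sketch under assumptions: rgb_histogram is a hypothetical helper, the choice of 8 bins is arbitrary, and the package's own Readjpeg may compute its features differently. A synthetic array stands in for the output of jpeg::readJPEG, which returns a height x width x channels array with values in [0, 1].

```r
# Hypothetical helper (not part of RcppParallelGMM): turn one RGB image array
# into a vector of per-channel relative pixel frequencies.
rgb_histogram <- function(img, bins = 8) {
  stopifnot(length(dim(img)) == 3, dim(img)[3] >= 3)
  breaks <- seq(0, 1, length.out = bins + 1)
  unlist(lapply(1:3, function(ch) {
    counts <- hist(img[, , ch], breaks = breaks, plot = FALSE)$counts
    counts / sum(counts)   # relative frequency within this channel
  }))
}

# Synthetic 4 x 4 RGB image standing in for a real photo read with jpeg::readJPEG
set.seed(1)
img <- array(runif(4 * 4 * 3), dim = c(4, 4, 3))

features <- rgb_histogram(img, bins = 8)
length(features)   # 3 channels x 8 bins
```

Stacking one such vector per photo with rbind gives a matrix with one row per image, which is the shape the example script expects for Y.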
[Figure: example of image clustering with K = 3]
[Figure: example of image clustering with K = 4]
The code is:
library(RcppParallelGMM)
library(doSNOW)
library(foreach)
options(digits=10)
path = "data/seg_train"
fileNames = dir(path)
## Full paths of the sub-folders of path
filePath = file.path(path, fileNames)
for (j in seq_along(filePath)) {
  imageNames = dir(filePath[j])
  ## Store the full image paths of folder j in a variable named imagePath_<j>
  assign(paste0("imagePath_", j), file.path(filePath[j], imageNames))
}
train = c(
imagePath_1[201:300],
imagePath_2[201:300],
imagePath_3[201:300],
imagePath_4[201:300],
imagePath_5[201:300],
imagePath_6[201:300]
)
## Randomly shuffle
#set.seed(257)
rows <- sample(length(train))
shuffle_train = train[rows]
Y = Readjpeg(shuffle_train)
## Initialization
N = nrow(Y)
p = ncol(Y)
## Components
K = 3
#prob = rep(1/K, K)
prob = runif(K, min=0.1, max = 0.9)
prob = prob/sum(prob)
#set.seed(7)
mean = matrix(rnorm(K*p), K, p)
sigma = array(rep(diag(p), K), dim = c(p, p, K))
ans = EM(prob, mean, sigma, Y)
table(ans$class)
picn <- as.matrix(shuffle_train)
image(K, ans, shuffle_train)
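For readers who want to see what a call like EM(prob, mean, sigma, Y) has to compute, the following base-R sketch runs EM for a full-covariance GMM on simulated data. It is not the package's C++ implementation: em_gmm_sketch, log_dmvnorm, the small ridge added to each covariance, and the stopping rule are all assumptions made for illustration.

```r
# Log-density of N(mu, Sigma) at every row of Y, via a Cholesky factor
log_dmvnorm <- function(Y, mu, Sigma) {
  R <- chol(Sigma)                                # Sigma = t(R) %*% R
  z <- backsolve(R, t(Y) - mu, transpose = TRUE)  # solves t(R) z = y - mu
  -0.5 * colSums(z^2) - sum(log(diag(R))) - 0.5 * ncol(Y) * log(2 * pi)
}

# Hypothetical EM loop for a K-component GMM (illustration only)
em_gmm_sketch <- function(prob, mu, sigma, Y, max_iter = 100,
                          tol = 1e-8, ridge = 1e-6) {
  N <- nrow(Y); p <- ncol(Y); K <- length(prob)
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: N x K matrix of log(prob_k) + log N(y_i | mu_k, sigma_k)
    logw <- sapply(seq_len(K), function(k)
      log(prob[k]) + log_dmvnorm(Y, mu[k, ], sigma[, , k]))
    m <- apply(logw, 1, max)                      # row maxima for log-sum-exp
    w <- exp(logw - m)
    loglik <- sum(m + log(rowSums(w)))
    resp <- w / rowSums(w)                        # responsibilities
    # M-step: weighted proportions, means and covariances
    Nk <- colSums(resp)
    prob <- Nk / N
    for (k in seq_len(K)) {
      mu[k, ] <- colSums(resp[, k] * Y) / Nk[k]
      Yc <- sweep(Y, 2, mu[k, ])                  # centre on the new mean
      sigma[, , k] <- crossprod(Yc * sqrt(resp[, k])) / Nk[k] + ridge * diag(p)
    }
    if (abs(loglik - loglik_old) < tol * abs(loglik)) break
    loglik_old <- loglik
  }
  list(prob = prob, mean = mu, sigma = sigma, loglik = loglik,
       class = max.col(resp, ties.method = "first"))
}

# Two well-separated simulated clusters of 100 points each
set.seed(42)
Y <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 8), ncol = 2))
K <- 2; p <- ncol(Y)
prob0  <- rep(1 / K, K)
mu0    <- Y[c(1, 101), ]                          # one observation per cluster
sigma0 <- array(rep(diag(p), K), dim = c(p, p, K))

fit <- em_gmm_sketch(prob0, mu0, sigma0, Y)
table(fit$class)
```

The sketch starts each mean at an observed point, whereas the script above draws the initial means from rnorm. Random starts far from the data can leave a component with almost no responsibility, so results may vary between runs unless a seed is set.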