
RcppParallelGMM: parallel computing for GMM model via Rcpp

Introduction

This package provides a fast expectation-maximization (EM) implementation of Gaussian mixture model (GMM) clustering, built on RcppArmadillo and OpenMP. Rcpp speeds up functions that contain many loops, and parallel computation makes the algorithm scale to massive data sets.
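To make the two EM steps concrete, here is a minimal single-threaded sketch in plain R. It is not the package's implementation, which relies on RcppArmadillo and OpenMP. The names `log_dmvnorm` and `em_gmm`, the fixed iteration count, and the small ridge added to each covariance are assumptions made for illustration only.

```r
# Log-density of N(mu, Sigma) at every row of Y, using a Cholesky factor.
log_dmvnorm <- function(Y, mu, Sigma) {
  R <- chol(Sigma)                                  # Sigma = t(R) %*% R
  z <- backsolve(R, t(Y) - mu, transpose = TRUE)    # solves t(R) z = y - mu
  -0.5 * colSums(z^2) - sum(log(diag(R))) - 0.5 * ncol(Y) * log(2 * pi)
}

# Illustrative EM loop (hypothetical helper, not the package's EM()).
em_gmm <- function(Y, prob, mu, sigma, iters = 50) {
  N <- nrow(Y); p <- ncol(Y); K <- length(prob)
  for (it in seq_len(iters)) {
    # E-step: responsibilities, computed on the log scale for stability.
    logw <- sapply(seq_len(K), function(k)
      log(prob[k]) + log_dmvnorm(Y, mu[k, ], sigma[, , k]))
    resp <- exp(logw - apply(logw, 1, max))
    resp <- resp / rowSums(resp)
    # M-step: weighted proportions, means and covariances.
    Nk <- colSums(resp)
    prob <- Nk / N
    for (k in seq_len(K)) {
      mu[k, ] <- colSums(resp[, k] * Y) / Nk[k]
      Yc <- sweep(Y, 2, mu[k, ])
      # Assumed safeguard: a tiny ridge keeps the covariance positive definite.
      sigma[, , k] <- crossprod(Yc * sqrt(resp[, k])) / Nk[k] + 1e-6 * diag(p)
    }
  }
  list(prob = prob, mean = mu, sigma = sigma, class = apply(resp, 1, which.max))
}

# Synthetic demo: two well-separated clusters in two dimensions.
set.seed(1)
Y <- rbind(matrix(rnorm(100), 50, 2), matrix(rnorm(100, mean = 10), 50, 2))
fit <- em_gmm(Y, prob = c(0.5, 0.5), mu = Y[c(1, 51), ],
              sigma = array(rep(diag(2), 2), dim = c(2, 2, 2)))
table(fit$class)
```

The argument shapes (a length-K weight vector, a K x p mean matrix, a p x p x K covariance array) follow the ones used in the example script later in this README.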

Installation

To interface R with C++, the R packages Rcpp and RcppArmadillo must be installed.
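A quick way to see whether these packages are already present is sketched below. The helper name `missing_pkgs` is hypothetical, and the install line is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: report which of the named packages are not installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(c("Rcpp", "RcppArmadillo"))
print(to_install)
# Uncomment to install whatever is missing from CRAN:
# if (length(to_install) > 0) install.packages(to_install)
```

The same helper can be pointed at any other package this README names.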

This package uses parallel computing in both R and C++. For parallel computing in R, install the packages foreach and doSNOW. Parallel computing via OpenMP in Rcpp requires a clang version above 4.0. For macOS users, MacPorts is the recommended way to install clang.

port select --list clang
sudo port select --set clang <version>
which clang

Choose a version higher than 4.0 (for example, 6.0 or 7.0). Then, in Terminal, point R at the clang just installed:

mkdir -p ~/.R
cd ~/.R
touch Makevars
emacs Makevars

In the Makevars file, type:

CC = /opt/local/bin/clang
CXX = /opt/local/bin/clang++
CXX11 = /opt/local/bin/clang++

Save the file, quit the editor, and restart R. OpenMP is then available to Rcpp.
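One way to confirm that R picked up the compilers named in Makevars is to ask `R CMD config` from within R. This is only a sketch: the helper name `r_config` is hypothetical, and the check reports the configured compilers rather than proving that OpenMP code compiles.

```r
# Hypothetical helper: ask R which value it uses for a build variable.
r_config <- function(var) {
  r_bin <- file.path(R.home("bin"), "R")
  out <- suppressWarnings(
    system2(r_bin, c("CMD", "config", var), stdout = TRUE, stderr = TRUE)
  )
  paste(out, collapse = " ")
}

# Compilers R will use when building packages that contain C or C++ code.
compilers <- vapply(c("CC", "CXX"), r_config, character(1))
print(compilers)
```

If the Makevars settings are in effect, the reported values should mention the clang paths entered above.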

Install the package

First, compile the C++ attributes from the R console:

Rcpp::compileAttributes()

Then, in Terminal, build and install the package:

R CMD build RcppParallelGMM
R CMD INSTALL RcppParallelGMM_1.0.tar.gz

Dataset

The data come from Kaggle: https://www.kaggle.com/puneet6060/intel-image-classification. This is an image data set of natural scenes from around the world, containing 25k images. The three RGB channels (red, green, blue) of each photo are extracted, and from them the pixel frequency/density can be computed; this step relies on the package jpeg. The scenes include buildings, forests, mountains, streets, sea, and glaciers. Each photo corresponds to one vector, so photo clustering can be achieved with a Gaussian mixture model (GMM) fitted by the EM algorithm.
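As a rough illustration of turning one photo into one vector, the sketch below computes a normalized histogram per colour channel. This is an assumption about what such features could look like, not a description of the package's `Readjpeg`; the helper name `rgb_density` and the `bins` parameter are hypothetical. The input format (an H x W x 3 array with values in [0, 1]) is the one returned by `jpeg::readJPEG()`, and a synthetic array stands in for a real photo here.

```r
# Hypothetical helper: per-channel histogram densities for one RGB image.
# `img` is an H x W x 3 array with values in [0, 1], as jpeg::readJPEG() returns.
rgb_density <- function(img, bins = 8) {
  stopifnot(length(dim(img)) == 3, dim(img)[3] == 3)
  breaks <- seq(0, 1, length.out = bins + 1)
  unlist(lapply(1:3, function(ch) {
    counts <- hist(img[, , ch], breaks = breaks, plot = FALSE)$counts
    counts / sum(counts)   # proportions within this channel sum to 1
  }))
}

# Synthetic stand-in for a photo; with a real file one would call
# img <- jpeg::readJPEG("path/to/photo.jpg") instead.
set.seed(1)
fake_img <- array(runif(4 * 5 * 3), dim = c(4, 5, 3))
features <- rgb_density(fake_img)
length(features)
```

Stacking one such vector per photo as the rows of a matrix gives an N x p input of the kind the clustering step expects.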

Output

Here is an example of image clustering when K is chosen to be 3,

[Figure: clustering result for K = 3]

and when K = 4,

[Figure: clustering result for K = 4]

The code is:

library(RcppParallelGMM)
library(doSNOW)
library(foreach)
options(digits=10)

path = "data/seg_train"
## One sub-directory per scene category
fileNames = dir(path)
filePath = file.path(path, fileNames)

## Store the image paths of category j in imagePath_j
for (j in seq_along(filePath)){
  imageNames = dir(filePath[j])
  assign(paste0("imagePath_", j), sapply(imageNames, function(x){file.path(filePath[j], x)}))
}

train = c(
  imagePath_1[201:300],
  imagePath_2[201:300],
  imagePath_3[201:300],
  imagePath_4[201:300],
  imagePath_5[201:300],
  imagePath_6[201:300]
)
## Randomly shuffle
#set.seed(257)
rows <- sample(length(train))
shuffle_train = train[rows]

Y = Readjpeg(shuffle_train)
## Initialization
N = nrow(Y)
p = ncol(Y)
## Components
K = 3
#prob = rep(1/K, K)
prob = runif(K, min=0.1, max = 0.9)
prob = prob/sum(prob)
#set.seed(7)
mean = matrix(rnorm(K*p), K, p)
sigma = array(rep(diag(p), K), dim = c(p, p, K))

ans = EM(prob, mean, sigma, Y)
table(ans$class)

picn <- as.matrix(shuffle_train)
image(K, ans, shuffle_train)
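The script above draws the initial means from a standard normal distribution, which can start far from the data when the features are not centred near zero. A common alternative, sketched here as an assumption rather than as the package's recommended practice, is to use K randomly chosen rows of the data as the starting means. The helper name `init_gmm` is hypothetical, and a small random matrix stands in for the output of `Readjpeg`.

```r
# Hypothetical helper: starting values shaped like the README's
# prob (length K), mean (K x p) and sigma (p x p x K).
init_gmm <- function(Y, K) {
  p <- ncol(Y)
  list(
    prob  = rep(1 / K, K),                            # equal mixing weights
    mean  = Y[sample(nrow(Y), K), , drop = FALSE],    # K distinct data rows
    sigma = array(rep(diag(p), K), dim = c(p, p, K))  # identity covariances
  )
}

set.seed(7)
Y_demo <- matrix(runif(60), nrow = 20, ncol = 3)  # stand-in for Readjpeg() output
init <- init_gmm(Y_demo, K = 3)
str(init)
# With the package loaded, the call would mirror the script above:
# ans <- EM(init$prob, init$mean, init$sigma, Y)
```

Because EM only finds a local optimum, fixing the seed (as the commented-out `set.seed` calls in the script suggest) makes a given run reproducible.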


yehanxuan/tamu-689-final documentation built on Dec. 8, 2019, 5:25 p.m.