A quick tour of **HDclust**

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

HDclust is a contributed R package for clustering high dimensional data using Hidden Markov model on Variable Blocks (HMM-VB). Briefly, the model consists of sequentially dependent groups of variables , i.e., the variable blocks. Each variable block follows a Gaussian mixture distribution. Statistical dependence among the blocks is modelled by the Markov process of the underlying states (i.e. the mixture component identities). The Baum-Welch (BW) estimation algorithm is used to fit the model to data. The clustering is performed by finding the density modes with Modal Baum-Welch algorithm (MBW). In the case of a single variable block the model reduces to a regular Gaussian Mixture Model (GMM). The package provides functions to train HMM-VB model with known or unknown variable block structure; search for the optimal number of states of HMM-VB model using BIC; cluster data with trained HMM-VB model; and align clustering results for different data sets using the same HMM-VB model.

To illustrate HDclust functionality we will use a standard R data set "faithful" from package 'datasets', and two synthetic data sets sim2 and sim3. sim2 is a 5-dimensional data set with 5000 data points. Samples were drawn from a 10-component Gaussian Mixture Model (GMM). By specific choice of the means, the data contains 10 distinct clusters. sim3 is a 40-dimensional data set. The first 10 dimensions were generated from a 3-component Gaussian Mixture Model (GMM). The remaining 30 dimensions were generated from a 5-component GMM. By specific design of the means, covariance matrices and transition probabilities, the data contain 5 distinct clusters. For more details on sim2 and sim3 see the reference.

library(HDclust)

Define a variable block structure

Dependence among the groups of variables is defined in a variable block structure. When there is only a single variable block in the structure, the model reduces to a regular GMM. Note, however, that clustering is performed by finding density modes, so results will be different from traditional GMM clustering.

# variable structure with one variable block of 3 variables and two mixture components, which corresponds to GMM model
Vb<- vb(1, dim=3, numst=2)
show(Vb)

In case of the variable block structure with more than one block, definition becomes slightly more complicated. A user has to provide dimensionality, number of mixture components and variable order for each variable block:

# variable structure with two blocks. Dimensionality of data is 40. First block contains 10 variable with 3 mixture components. Second block has 30 variables with 5 mixture components. Variable order is natural.
Vb <- vb(2, dim=40, bdim=c(10,30), numst=c(3,5), varorder=list(c(1:10),c(11:40)))
show(Vb)

Model selection

While dependence among the groups of variables can be inferred using the domain knowledge, it is hard to predict the number of mixture states in each variable block. hmmvbBIC() can search for an optimal number of states in the variable block structure. A user needs to provide all parameters of the variable block structure for which the search will be performed. The argument numst in variable block structure provided to hmmvbBIC() will be ignored during the search. The search minimizes Bayesian information criterion (BIC) and provides two options for the user.

By default the search changes the number of states in the variable blocks and estimates BIC for each model. In this case, number of states is kept the same for all variable blocks.

# by default number of states in each block is varied from 1 to 9
data("faithful")
Vb <- vb(1, dim=2, numst=1)
set.seed(12345)
modelBIC <- hmmvbBIC(faithful, VbStructure=Vb)
show(modelBIC)

Model selection results can be visualized in the following way:

plot(modelBIC, xlab='# mixture components per block', ylab='BIC')

Second option is to provide a list with configurations for the number of states in variable blocks. In this case, the algorithm computes BIC for each configuration and estimates the optimal configuration as the one with the smallest BIC value.

# user-provided configurations for the number of states in each block
data("sim3")
Vb <- vb(2, dim=40, bdim=c(10,30), numst=c(1,1), varorder=list(c(1:10),c(11:40)))
set.seed(12345)
configs = list(c(1,2), c(3,5))
modelBIC <- hmmvbBIC(sim3[,1:40], VbStructure=Vb, configList=configs)
show(modelBIC)

In both cases, the output is an object of class HMMVBBIC that contains HMM-VB parameters for the optimal configuration. The optimal HMM-VB can be accessed with getOptHMMVB method.

If you want to see more information like loglikelihood for each data point, you can use a get function like getLoglikehd().

# we just illustrate the loglikelihood of the first 10 data points beacause the data set is huge
getLoglikehd(modelBIC)[1:10]

Train an HMM-VB model

There are two ways to train an HMM-VB model. One way is to use a variable block structure motivated by domain knowledge.

data("sim3")
set.seed(12345)
Vb <- vb(2, dim=40, bdim=c(10,30), numst=c(3,5), varorder=list(c(1:10),c(11:40)))
hmmvb <- hmmvbTrain(sim3[1:40], VbStructure=Vb)
show(hmmvb)

Parameters of trained HMM-VB can be accessed by get functions:

Vb <- getVb(hmmvb) # variable block structure
hmmChain <- getHmmChain(hmmvb) # list with HMM models
diagCov <- getDiagCov(hmmvb) # indicator whether covariance matrices are diagonal 
bic <- getBIC(hmmvb) # BIC value

# below we show HMM-VB parameters for the first variable block
numst <- getNumst(hmmChain[[1]]) # number of mixture components in variable block
prenumst <- getPrenumst(hmmChain[[1]]) # number of mixture components in the previous variable block
hmmParam <- getHmmParam(hmmChain[[1]]) # list with priors, transition probabilities, means, covariance matrices and other parameters of all states of HMM

Train an HMM-VB model with unknown variable block structure

If a variable block structure is unknown, one can search for it with hmmvbTrain() function. The search is performed by a greedy algorithm. Thus multiple permutations of variables are needed for the best estimation of the variable block structure. With many permutations, the search becomes a time-consuming process, and to accelerate it, we advice to use as many cores as your machine has. This is done with parameter nthread, which we set to 4 in the example below, since our machine has 4 cores.

data("sim2")
set.seed(12345)
hmmvb <- hmmvbTrain(sim2[1:5], searchControl=vbSearchControl(nperm=5), nthread=4)
show(hmmvb)

Suggested variable block structure has a single variable block. That is in hand with the data, which was generated from a mixture of 10 Gaussian components. After finding the variable block structure, we recommend to run model selection to refine the number of mixture components:

modelBIC <- hmmvbBIC(sim2[1:5], VbStructure=getVb(hmmvb), numst=1:15, nthread=4)
show(modelBIC)

Clustering with HMM-VB

After training the HMM-VB model we can perform data clustering. If variable block structure is known, we can start with model selection to refine number of mixture components in variable blocks. The output object of type HMMVBBIC is then used in function hmmvbClust() with argument bicObj to perform clustering.

data("faithful")
VbStructure <- vb(nb=1, dim=2, numst=1)
set.seed(12345)
modelBIC <- hmmvbBIC(faithful, VbStructure)
clust <- hmmvbClust(faithful, bicObj=modelBIC)
show(clust)
plot(faithful[,1], faithful[,2], xlab='eruptions', ylab='waiting', col=getClsid(clust))

hmmvbClust() returns an object of type HMMVBclust that contains the clustering results, including Viterbi sequences and cluster modes.

clustParam <- getClustParam(clust)
mode <- clustParam$mode # a matrix with cluster modes
vseq <- clustParam$vseq # A list with integer vectors representing distinct Viterbi sequences for the dataset

If variable block structure is unknown, we first search for it, then refine number of mixture components with model selection, and perform the clustering.

# If variable block structure is unknown
data("sim2")
set.seed(12345)

# find variable block structure
hmmvb <- hmmvbTrain(sim2[,1:5], searchControl=vbSearchControl(nperm=5), nthread=4)

# refine number of states in variable block structure by model selection
modelBIC <- hmmvbBIC(sim2[,1:5], VbStructure=getVb(hmmvb), numst=1:15, nthread=4)
clust <- hmmvbClust(data=sim2[,1:5], bicObj=modelBIC)
show(clust)

Clustering results can also be plotted using two visualization algorithms.

By default results will be plotted using t-SNE algorithm from Rtsne package.

palette(c(palette(), "purple", "brown")) # extend palette to show all 10 clusters
plot(clust)

Results can also be shown in 2 dimensional space reduced by PCA:

plot(clust, method='PCA')

Cluster alignment

Imagine that you have already performed clustering with HMM-VB for one data set and now another similar data set is available. One can use existing HMM-VB model and align cluster labels of the second data set with the clusters obtained for the first data set:

data("sim3")

# split data set in two halves
X1 = sim3[1:500,]
X2 = sim3[501:1000,]

set.seed(12345)
Vb <- vb(2, 40, c(10,30), c(3,5), list(c(1:10),c(11:40)))

# train HMM-VB on dataset X1 and cluster data
hmmvb <- hmmvbTrain(X1[1:40], VbStructure=Vb)
clust1 <- hmmvbClust(X1[1:40], model=hmmvb)
show(clust1)

# cluster data set X2 using HMM-VB and cluster parameters for dataset X1
clust2 <- hmmvbClust(X2[1:40], model=hmmvb, rfsClust=getClustParam(clust1))

show(clust2)

If you also want to compute the loglikelihood for the new data set, it can be realized by setting the parameter getlikelh to be TRUE in clustControl():

clust2 <- hmmvbClust(X2[1:40], model=hmmvb, rfsClust=getClustParam(clust1), control=clustControl(getlikelh = TRUE))
# we just illustrate the loglikelihood of the first 10 data points beacause the data set is huge
getLoglikehd(clust2)[1:10]

Solution for slow data clustering with many changes in mode merging parameters

If data clustering takes a significant amount of time and many changes in mode merging parameters are required to achieve the desired clustering result, we recommend to use hmmvbFindModes() and clustModes(). Function hmmvbFindModes() finds all density modes in the data, and clustModes() clusters the modes using hierarchical clustering. Under the hood clustModes() uses dist(), hclust() and cutree() calls from the package stats, and a user can specify parameters for all of them with dist.args, hclust.args and cutree.args arguments.

data("sim3")

set.seed(12345)
Vb <- vb(2, 40, c(10,30), c(3,5), list(c(1:10),c(11:40)))

# train HMM-VB
hmmvb <- hmmvbTrain(sim3[1:40], VbStructure=Vb)

# find all density modes
modes <- hmmvbFindModes(sim3[1:40], model=hmmvb)
show(modes)

# cluster density modes
mergedModes <- clustModes(modes, cutree.args = list(h=1.0))

show(mergedModes)
plot(mergedModes)

References

Lin Lin and Jia Li, "Clustering with hidden Markov model on variable blocks," Journal of Machine Learning Research, 18(110):1-49, 2017.



Try the HDclust package in your browser

Any scripts or data that you put into this service are public.

HDclust documentation built on May 2, 2019, 9:20 a.m.