Merging mixture components

The most standard parametric approach in cluster analysis assumes data can be modelled by a finite mixture distribution. The approach has two steps; first, a finite mixture distribution with probability density function [ f(\;\cdot\; ; \pi_1, \dots, \pi_k, \theta_1, \dots, \theta_k) = \pi_1 (\;\cdot\; ; \theta_1) + \dots + \pi_k f(\;\cdot\; ; \theta_k), ] with $\sum_{j=1}^k = \pi_j =1$ is fitted to a sample $ X$, obtaining estimates $\hat{\pi}_1, \dots, \hat{\pi}_k$ and $\hat{\theta}_1 \dots \hat{\theta}_k$. After the fitting process, each observation $ x$ is assigned to the finite mixture component $j$, $1\leq j \leq k$, with $\hat{\pi}_j f( x ; \hat{\theta}_j)$ maximum.

We are going to work with the dataset ex4.1 used in Baudry et el. (2010) and available in package mclust. To fit a finite mixture of gaussian distributions we are going to use the same package.

library(mclust)
library(mixpack)
library(ggplot2)
library(dplyr)
data(Baudry_etal_2010_JCGS_examples)
qplot(data=ex4.1, X1, X2)

Finite mixture fitting

Function Mclust allows us to fit a mixture function to a dataset.

m <- Mclust(ex4.1)
summary(m)

Using the function dmixnorm_solution we can evaluate the probability density function and calculate the posterior probabilities

dens.mixt = dmixnorm_solution(ex4.1, solution = m)
(df <- lapply(1:6, function(i) 
  (m$parameters$pro[i] * dmixnorm_solution(ex4.1, solution = m, part=i)) %>%
    data.frame %>% {./dens.mixt} %>%
    setNames(sprintf('p%02d', i)) ) %>% bind_cols) %>% tbl_df

The posterior probabilities are also available in the object returned by function Mclust.

m$z %>% tbl_df
xlimits = seq(-3, 11, 0.05)
ylimits = seq(-3, 8, 0.05)
cm0 = expand.grid(X1 = xlimits, X2 = ylimits) %>% tbl_df %>%
  mutate(z = dmixnorm_solution(., solution=m))
ggplot() + 
  geom_point(data=ex4.1, aes(x=X1, y=X2),alpha=0.2) + 
  stat_contour(data = cm0, aes(x=X1, y=X2, z=z))
partition = list(1,2,3,4,5,6)
CN6 = lapply(partition, function(part){
  expand.grid(X1 = xlimits, X2 = ylimits) %>%
    tbl_df %>%
    mutate(z = dmixnorm_solution(., m, part = part),
           id = sprintf('{%s}',paste(part, collapse=',')))
}) %>% bind_rows

ggplot() + 
  geom_point(data=ex4.1, aes(x=X1, y=X2),alpha=0.2) + 
  stat_contour(data = CN6, aes(x=X1, y=X2, z=z, col=id))

Using a specific partition

partition = list(c(1,6,2),c(3,4),5)
CN6 = lapply(partition, function(part){
  expand.grid(X1 = xlimits, X2 = ylimits) %>%
    mutate(z = dmixnorm_solution(., m, part = part),
           id = sprintf('{%s}',paste(part, collapse=',')))
}) %>% bind_rows

ggplot() + 
  geom_point(data=ex4.1, aes(x=X1, y=X2),alpha=0.2) + 
  stat_contour(data = CN6, aes(x=X1, y=X2, z=z, col=id))


Try the mixpack package in your browser

Any scripts or data that you put into this service are public.

mixpack documentation built on May 29, 2017, 11:24 p.m.