require(graphstats)
require(mclust)
require(MASS)
require(ggplot2)

Here, we use the gclust function to select the optimal number of clusters for a Gaussian Mixture Model. This function uses the Bayesian Information Criterion (BIC) for data-based model selection.

A Mixture of 2 Normal Densities

We generate a mixture of 2 Gaussian densities, and try to estimate the number of components from simulated data.

# Mixture parameters
mu0 <- -1.5
mu1 <- 2
sigma0 <- 1
sigma1 <- 0.5
pi0 <- 0.5
pi1 <- 1 - pi0

# Defined separate and mixed PDFs as functions.
mixture_pdf <- function(x) { pi0*dnorm(x, mu0, sigma0) + pi1*dnorm(x, mu1, sigma1) }
norm0_pdf <- function(x) { pi0*dnorm(x, mu0, sigma0) }
norm1_pdf <- function(x) { pi1*dnorm(x, mu1, sigma1) }

# Display.
p <- ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
        stat_function(fun = mixture_pdf, size = 1.5) +
        stat_function(fun = norm0_pdf, colour = "deeppink", size = 1) +
        stat_function(fun = norm1_pdf, colour = "dodgerblue3", size = 1)
p + xlab("X") + ylab("PDF") + ggtitle("Mixture of Two Gaussian Densities")

Given sufficient data, we expect gclust to return 2 as the optimal number of components, when considering component numbers from 1 to 5.

# Condition on class sizes of pi0*n and pi1*n to sample data matrix X.
set.seed(123)
n <- 100
X0 <- rnorm(pi0*n, mu0, sigma0)
X1 <- rnorm(pi1*n, mu1, sigma1)
X <- as.matrix(c(X0, X1), nrow = n)

# Run gclust, which will consider models from K = 1 to 5.
G <- gclust(X, K = 5)
cat("The BIC-optimal number of components for the data is:", G, "\n")

The selected number of components can then be used as a parameter for a clustering algorithm such as EM.



neurodata/graphstats documentation built on May 14, 2019, 5:19 p.m.