View source: R/gaussian_mixture.R
gaussian_mixture: Gaussian Mixture Model Clustering

Description
Perform Gaussian mixture model clustering on a data matrix.
Usage

gaussian_mixture(data, k, max_iter = 10, details = FALSE, waiting = TRUE, ...)
Arguments

data: a set of observations, presented as a matrix-like object where every row is a new observation.

k: the number of clusters to find.

max_iter: the maximum number of iterations to perform.

details: a Boolean determining whether intermediate logs explaining how the algorithm works should be printed.

waiting: a Boolean determining whether the intermediate logs should be printed in chunks, waiting for user input before printing the next chunk.

...: additional arguments passed to the initial k-means clustering (see Details).
Details

The data given by data is clustered by a model-based algorithm that assumes every cluster follows a normal distribution, hence the name "Gaussian Mixture".

The normal distributions are parameterized by their mean vector, covariance matrix, and mixing proportion. Initially, each mean vector is set to a cluster center obtained by running a k-means clustering on the data, each covariance matrix is set to the covariance matrix of the data points belonging to the corresponding cluster, and each mixing proportion is set to the proportion of data points belonging to that cluster. The algorithm then optimizes the Gaussian models by means of the Expectation-Maximization (EM) algorithm.
The EM algorithm is an iterative procedure that alternates between two steps:

E-step: compute how strongly each observation is expected to belong to each component of the GMM.

M-step: recompute the GMM parameters according to the expectations from the E-step, so as to maximize the expected log likelihood.

The algorithm stops when the changes in the expectations are sufficiently small or when the maximum number of iterations is reached; a minimal sketch of this loop follows.
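The sketch below illustrates the initialization and EM loop described above, assuming a GMM with full covariance matrices. It is not the package's implementation: the names dens1 and em_gmm_sketch are hypothetical, and stats::kmeans() stands in for the k-means initialization step.

dens1 <- function(X, mu, sigma) {
  # Multivariate normal density evaluated at each row of X
  d <- ncol(sigma)
  diff <- sweep(as.matrix(X), 2, mu)
  exp(-0.5 * rowSums((diff %*% solve(sigma)) * diff)) /
    sqrt((2 * pi)^d * det(sigma))
}

em_gmm_sketch <- function(X, k, max_iter = 100, tol = 1e-6) {
  X <- as.matrix(X)
  n <- nrow(X)
  # Initialization via k-means, as described above
  km <- stats::kmeans(X, k)
  mu <- km$centers
  sigma <- lapply(seq_len(k),
                  function(i) cov(X[km$cluster == i, , drop = FALSE]))
  lambda <- tabulate(km$cluster, k) / n
  loglik <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: responsibility of each component for each observation
    dens <- sapply(seq_len(k),
                   function(i) lambda[i] * dens1(X, mu[i, ], sigma[[i]]))
    resp <- dens / rowSums(dens)
    # M-step: re-estimate the parameters from the responsibilities
    nk <- colSums(resp)
    lambda <- nk / n
    for (i in seq_len(k)) {
      mu[i, ] <- colSums(resp[, i] * X) / nk[i]
      diff <- sweep(X, 2, mu[i, ])
      sigma[[i]] <- crossprod(diff * resp[, i], diff) / nk[i]
    }
    # Stop when the log likelihood no longer improves noticeably
    new_loglik <- sum(log(rowSums(dens)))
    if (abs(new_loglik - loglik) < tol) break
    loglik <- new_loglik
  }
  list(cluster = max.col(resp), mu = mu, sigma = sigma,
       lambda = lambda, loglik = loglik, iter = iter)
}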
Value

A gaussian_mixture() object: a list with the following components:
cluster: a vector of integers (from 1:k) indicating the cluster to which each point belongs.

mu: the final mean parameters.

sigma: the final covariance matrices.

lambda: the final mixing proportions.

loglik: the final log likelihood.

all.loglik: a vector with each iteration's log likelihood.

iter: the number of iterations performed.

size: a vector with the number of data points belonging to each cluster.
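As the Examples below suggest (this reading is inferred from the example code, not stated explicitly on this page), for d-dimensional data mu is a k x d matrix, sigma is a k x d x d array, and lambda has length k:

# cl <- clustlearn::gaussian_mixture(db, k)  # hypothetical call
# dim(cl$mu)         # k x d   (cl$mu[i, ] in the examples)
# dim(cl$sigma)      # k x d x d   (cl$sigma[i, , ])
# length(cl$lambda)  # k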
Author(s)

Eduardo Ruiz Sabajanes, eduardo.ruizs@edu.uah.es
Examples

### !! This algorithm is very slow, so we'll only test it on some datasets !!
### Helper functions
# Multivariate normal density evaluated at each row of x
dmnorm <- function(x, mu, sigma) {
  k <- ncol(sigma)
  x <- as.matrix(x)
  diff <- t(t(x) - mu)  # center each observation at the mean
  num <- exp(-1 / 2 * diag(diff %*% solve(sigma) %*% t(diff)))
  den <- sqrt(((2 * pi)^k) * det(sigma))
  num / den
}
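
## Optional sanity check (an added suggestion, not part of the original
## example): if the 'mvtnorm' package is installed, dmnorm() should agree
## with mvtnorm::dmvnorm() on the same inputs.
# if (requireNamespace("mvtnorm", quietly = TRUE)) {
#   pts <- matrix(rnorm(10), ncol = 2)
#   stopifnot(all.equal(dmnorm(pts, c(0, 0), diag(2)),
#                       mvtnorm::dmvnorm(pts, c(0, 0), diag(2))))
# }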
test <- function(db, k) {
  # Cluster the data and print the fitted model
  print(cl <- clustlearn::gaussian_mixture(db, k, 100))

  # Grid over the bounding box of the data for the density contours
  x <- seq(min(db[, 1]), max(db[, 1]), length.out = 100)
  y <- seq(min(db[, 2]), max(db[, 2]), length.out = 100)

  plot(db, col = cl$cluster, asp = 1, pch = 20)
  for (i in seq_len(k)) {
    m <- cl$mu[i, ]
    s <- cl$sigma[i, , ]
    # Weighted density of component i, drawn as contour lines
    f <- function(x, y) cl$lambda[i] * dmnorm(cbind(x, y), m, s)
    z <- outer(x, y, f)
    contour(x, y, z, col = i, add = TRUE)
  }
}
### Example 1
test(clustlearn::db1, 2)
### Example 2
# test(clustlearn::db2, 2)
### Example 3
test(clustlearn::db3, 3)
### Example 4
test(clustlearn::db4, 3)
### Example 5
test(clustlearn::db5, 3)
### Example 6
# test(clustlearn::db6, 3)
### Example 7 (with explanations, no plots)
cl <- clustlearn::gaussian_mixture(
clustlearn::db5[1:20, ],
3,
details = TRUE,
waiting = FALSE
)
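
### Inspecting the returned components (an added illustration; the
### component names come from the Value section above)
# cl$size                    # number of points assigned to each cluster
# cl$lambda                  # final mixing proportions
# plot(cl$all.loglik, type = "b",
#      xlab = "iteration", ylab = "log likelihood")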