Single Data Source Mixture Model

Description

Fit a finite mixture model to a single source of data using one of several distributions.

Usage

mixmod(X, K, family=names(LC_FAMILY), prior=NULL, iter.max=LC_ITER_MAX, 
    dname=deparse(substitute(X)))
## S3 method for class 'mixmod'
print(x, ...)

Arguments

X

for univariate data, a vector; for multivariate data, a matrix or data frame. Must consist only of numeric values. Each element of the vector, or each row of the matrix or data frame, should represent an independent observation.

K

the number of components, an integer greater than or equal to 1. K=1 will result in the distribution specified by family being fitted to the entire data set, and is not particularly useful.

family

a string, one of the supported distribution family names given in LC_FAMILY. By default, "normal" is used. Partial matches are allowed.

prior

prior probability distribution on Y. This feature is under development and its use is not currently recommended.

iter.max

the maximum number of iterations for the EM algorithm, by default equal to LC_ITER_MAX.

dname

the name of the data.

x

an object of class mixmod.

...

further arguments to print.default.

Details

In the finite mixture model used here, a hidden categorical random variable Y, which takes values from 1 to some positive integer K, determines which component distribution generates each observation of the observed random variable X. Specifically, mixmod fits a mixture model of the form

f(x) = sum_k p_k f_k(x)

where k = 1, …, K and each f_k(.) is a density function on the sample space of X. The p_k's, that is, the component probabilities, sum to 1.
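As an illustration of the density above, the following base R sketch evaluates f(x) for a two-component univariate normal mixture. The parameter values and the helper name mixture_density are hypothetical and chosen only for this example; they are not part of the package.

```r
# Hypothetical parameters for a K = 2 univariate normal mixture
p     <- c(0.3, 0.7)   # component probabilities p_k, summing to 1
mu    <- c(-2, 1)      # component means (illustrative)
sigma <- c(0.5, 1.5)   # component standard deviations (illustrative)

# f(x) = sum_k p_k f_k(x), with f_k the normal density of component k
mixture_density <- function(x) {
    # Column k of G holds f_k evaluated at every element of x
    G <- sapply(seq_along(p), function(k) dnorm(x, mu[k], sigma[k]))
    G <- matrix(G, nrow = length(x))   # keep an N-by-K shape even when N = 1
    as.vector(G %*% p)                 # weighted sum over components
}

mixture_density(c(-2, 0, 1))

# Each f_k integrates to 1 and the p_k sum to 1, so f should integrate to about 1
integrate(mixture_density, -Inf, Inf)$value
```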

The EM algorithm used in model fitting attempts to maximize the Q-value, that is, the expected complete data log-likelihood, for the model. The parameter values which maximize the Q-value also maximize the log-likelihood for the density given above.
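The alternation of E-step and M-step can be sketched in base R for the simplest case, a univariate normal mixture. This is a minimal illustration under assumed starting values and an assumed stopping rule, not the implementation used by mixmod; the simulated data and all variable names are chosen for the example.

```r
set.seed(1)
# Simulated data from two normal components (illustrative only)
x <- c(rnorm(150, mean = -2, sd = 1), rnorm(350, mean = 3, sd = 1))
N <- length(x)
K <- 2

# Crude starting values (an assumption of this sketch)
p     <- rep(1 / K, K)
mu    <- as.vector(quantile(x, c(0.25, 0.75)))
sigma <- rep(sd(x), K)

llik <- numeric(0)
for (iter in 1:200) {
    # E-step: G[n, k] = f_k(x_n); W[n, k] = posterior weight of component k
    G  <- sapply(1:K, function(k) dnorm(x, mu[k], sigma[k]))
    fX <- as.vector(G %*% p)
    W  <- sweep(G, 2, p, "*") / fX
    llik <- c(llik, sum(log(fX)))   # log-likelihood at the current parameters

    # M-step: weighted estimates, which maximize the Q-value for these weights
    Nk    <- colSums(W)
    p     <- Nk / N
    mu    <- colSums(W * x) / Nk
    sigma <- sqrt(colSums(W * outer(x, mu, "-")^2) / Nk)

    # Stop once the log-likelihood no longer improves (assumed tolerance)
    if (iter > 1 && llik[iter] - llik[iter - 1] < 1e-8) break
}

# Q-value: expected complete data log-likelihood under the last weights
logG <- sapply(1:K, function(k) dnorm(x, mu[k], sigma[k], log = TRUE))
qval <- sum(W * sweep(logG, 2, log(p), "+"))

round(c(iter = iter, llik = llik[length(llik)], qval = qval), 2)
```

In this sketch the recorded log-likelihood should not decrease from one iteration to the next, which is the property that makes maximizing the Q-value a useful proxy for maximizing the log-likelihood.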

Value

A list of class mixmod, having the following elements:

N

the length of the data, that is, length(X) if X is a vector, or nrow(X) if X is a matrix or data frame.

D

the width of the data, that is, 1 if X is a vector, or ncol(X) if X is a matrix or data frame.

K

the number of components in the mixture model.

X

the original data; if X was a data frame, it will have been converted to a matrix.

npar

the total number of parameters in the model.

npar.hidden

the number of parameters for the hidden component portion of the model.

npar.observed

the number of parameters for the observed data portion of the model.

iter

the number of iterations required to fit the model.

params

the parameters estimated for the model. This is a list with elements hidden and observed, corresponding to the distributions for the hidden and observed portions of the model. hidden always has one element, prob, the vector of p_k's. The elements of observed depend on the distribution family chosen in fitting the model.

stats

a vector with named elements corresponding to the number of iterations (iter), log-likelihood (llik), Q-value (qval), and BIC (bic) for the estimated parameters. The printed output in Examples also shows an iclbic element.

weights

a list with the single element W, the N-by-K matrix of weights used in the M-step of the EM algorithm for estimating the final set of parameters for the observed data portion of the model.

pdfs

a list with two elements: G, the N-by-K matrix of which the (n,k)th element is the estimated value of f_k(x_n), where x_n is the nth observation in X; and fX, the vector of length N of which the nth element is the estimated value of f(x_n).

posterior

the N-by-K matrix of which the (n,k)th element is the estimated posterior probability that the nth observation was generated by the kth component. Equal to the W element of weights.

assignment

the vector of length N of which the nth element is the most probable component to have generated the nth observation. In other words, assignment[n] = which.max(posterior[n,]).

iteration.params

a list of length iter giving the estimated parameters at each iteration of the algorithm.

iteration.stats

a data frame of iter rows giving iteration statistics, as in stats, at each iteration of the algorithm.

family

the name of the distribution family used in the model. See LC_FAMILY.

distn

the name of the actual distribution used in the model. See LC_FAMILY.

prior

the value of the prior parameter used in model fitting. See Arguments.

iter.max

the maximum number of iterations allowed in model fitting.

dname

the name of the data.

dattr

attributes of the data, used by model likelihood functions to determine if the data have been scaled or otherwise transformed.

kvec

a vector of integers from 1 to K.
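To show how several of the elements above relate to one another, the following base R sketch rebuilds analogues of pdfs, posterior, assignment, and stats from a set of hypothetical parameters for a univariate normal mixture. The data and parameter values are invented for illustration. The parameter counts and the form of the BIC and ICL-BIC are assumptions inferred from the printed output in Examples (where bic equals 2 * llik - npar * log(N) with npar = 44), not taken from the package source.

```r
# Hypothetical data and fitted parameters, K = 2 univariate normal components
x     <- c(-2.1, -1.7, 1.5, 2.9, 3.3)   # N = 5 observations
prob  <- c(0.4, 0.6)                    # plays the role of params$hidden$prob
mu    <- c(-2, 3)                       # illustrative observed-portion parameters
sigma <- c(1, 1)

N    <- length(x)
K    <- length(prob)
kvec <- seq_len(K)

# pdfs: G[n, k] = f_k(x_n) and fX[n] = f(x_n)
G  <- sapply(kvec, function(k) dnorm(x, mu[k], sigma[k]))
fX <- as.vector(G %*% prob)

# posterior[n, k] = p_k f_k(x_n) / f(x_n); each row sums to 1
posterior <- sweep(G, 2, prob, "*") / fX

# assignment[n] = which.max(posterior[n, ])
assignment <- apply(posterior, 1, which.max)

# Parameter counts (assumed): K - 1 free probabilities, mean and sd per component
npar.hidden   <- K - 1
npar.observed <- 2 * K
npar          <- npar.hidden + npar.observed

# Statistics, using the sign convention inferred from the Examples output
llik   <- sum(log(fX))
qval   <- sum(posterior * log(sweep(G, 2, prob, "*")))
bic    <- 2 * llik - npar * log(N)
iclbic <- 2 * qval - npar * log(N)

assignment
c(llik = llik, qval = qval, bic = bic, iclbic = iclbic)
```

Under this convention larger (less negative) values of bic indicate a better trade-off between fit and complexity, and qval can never exceed llik, so iclbic can never exceed bic.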

Author(s)

Daniel Dvorkin

References

McLachlan, G.J. and Krishnan, T. (2008) The EM Algorithm and Extensions, 2nd edition, John Wiley & Sons.

See Also

LC_FAMILY for distributions and families; mdmixmod for fitting multiple-data mixture models; reporting and likelihood for model reporting; rocinfo for performance evaluation; convergencePlot for behavior of the algorithm; simulation for simulating from the parameters of a model; packages mixtools and mclust.

Examples

## Not run:  
data(CiData)
data(CiGene)
fit <- mixmod(CiData$expression, 3)
fit
# Normal mixture model ('mvnorm')
# Data 'CiData$expression' of size 10244-by-4 fitted to 3 components
# Model statistics:
#       iter       llik       qval        bic     iclbic 
#      42.00  -47499.54  -50052.71  -95405.40 -100511.73
plot(rocinfo(fit, CiGene$target))

## End(Not run)
