# mixmod: Single Data Source Mixture Model

In lcmix: Layered and chained mixture models

## Description

Fit a finite mixture model to a single source of data using one of several distributions.

## Usage

```r
mixmod(X, K, family=names(LC_FAMILY), prior=NULL, iter.max=LC_ITER_MAX,
    dname=deparse(substitute(X)))

## S3 method for class 'mixmod'
print(x, ...)
```

## Arguments

| Argument | Description |
|---|---|
| `X` | for univariate data, a vector; for multivariate data, a matrix or data frame. Must consist only of numeric values. Each element of the vector, or each row of the matrix or data frame, should represent an independent observation. |
| `K` | the number of components, an integer greater than or equal to 1. `K=1` will result in the distribution specified by `family` being fitted to the entire data set, and is not particularly useful. |
| `family` | a string, one of the supported distribution family names given in `LC_FAMILY`. By default, `"normal"` is used. Partial matches are allowed. |
| `prior` | prior probability distribution on Y. This feature is under development and its use is not currently recommended. |
| `iter.max` | the maximum number of iterations for the EM algorithm, by default equal to `LC_ITER_MAX`. |
| `dname` | the name of the data. |
| `x` | an object of class `mixmod`. |
| `...` | further arguments to `print.default`. |
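The default and partial-matching behaviour described for `family` is what base R's `match.arg` provides when an argument's default is a character vector of choices, as in `family=names(LC_FAMILY)`. The sketch below illustrates that idiom only. Whether `mixmod` calls `match.arg` internally is an assumption, and `families` and `pick_family` are hypothetical stand-ins: apart from `"normal"`, the family names are placeholders rather than the actual contents of `LC_FAMILY`.

```r
# Hypothetical stand-in for names(LC_FAMILY); only "normal" is taken from
# this page, the other names are placeholders.
families <- c("normal", "gamma", "weibull")

# Hypothetical helper showing the match.arg idiom.
pick_family <- function(family = families) {
  match.arg(family, families)
}

pick_family()        # no argument given: the first choice is selected
pick_family("gam")   # a unique prefix is expanded to the full name
```

An argument that matches none of the choices causes `match.arg` to raise an error.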

## Details

In the finite mixture model used here, a hidden categorical random variable Y, which can take on values from 1 to some positive integer K, generates the distribution of the observed random variable X, from which the observed `X` is assumed to be drawn. Specifically, `mixmod` fits a mixture model of the form

f(x) = sum_k p_k f_k(x)

where k = 1, …, K and each f_k(.) is a density function on the sample space of X. The p_k's, that is, the component probabilities, sum to 1.
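As a concrete illustration of the density above, the following base R sketch evaluates f(x) for a two-component univariate normal mixture. It does not use lcmix. The function `mixture_density` and all parameter values are hypothetical and chosen only for illustration.

```r
# Hypothetical two-component univariate normal mixture (illustrative values).
p     <- c(0.3, 0.7)   # component probabilities p_k; must sum to 1
mu    <- c(-1, 2)      # component means
sigma <- c(0.5, 1.0)   # component standard deviations

# f(x) = sum_k p_k f_k(x), evaluated for a vector of points x.
mixture_density <- function(x, p, mu, sigma) {
  # G[n, k] = f_k(x_n): one column of component densities per component
  G <- sapply(seq_along(p), function(k) dnorm(x, mu[k], sigma[k]))
  G <- matrix(G, nrow = length(x))  # keep matrix shape when length(x) == 1
  as.vector(G %*% p)
}

x  <- c(-1, 0, 2)
fx <- mixture_density(x, p, mu, sigma)
```

Because the p_k sum to 1 and each f_k is a density, the mixture f integrates to 1 as well.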

The EM algorithm used in model fitting attempts to maximize the Q-value, that is, the expected complete data log-likelihood, for the model. The parameter values which maximize the Q-value also maximize the log-likelihood for the density given above.
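The alternation between E-step and M-step can be sketched in base R for a univariate normal mixture. This is a minimal illustration under stated assumptions, not the lcmix implementation: the simulated data, starting values, convergence tolerance, and variable names are all hypothetical choices.

```r
set.seed(1)
# Simulated data from two normal components (illustrative only).
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1))
N <- length(x)
K <- 2

# Crude starting values (an assumption; other initialisations are possible).
prob  <- rep(1 / K, K)
mu    <- as.numeric(quantile(x, c(0.25, 0.75)))
sigma <- rep(sd(x), K)

llik <- numeric(0)
for (iter in 1:200) {
  # E-step: W[n, k] = posterior probability that component k generated x_n
  G    <- sapply(1:K, function(k) dnorm(x, mu[k], sigma[k]))
  joint <- sweep(G, 2, prob, "*")       # p_k * f_k(x_n)
  fX   <- rowSums(joint)                # mixture density f(x_n)
  W    <- joint / fX

  llik <- c(llik, sum(log(fX)))         # log-likelihood at current parameters
  qval <- sum(W * log(joint))           # expected complete data log-likelihood

  # M-step: weighted maximum likelihood updates
  Nk    <- colSums(W)
  prob  <- Nk / N
  mu    <- colSums(W * x) / Nk
  sigma <- sqrt(colSums(W * outer(x, mu, "-")^2) / Nk)

  # Stop once the log-likelihood has stopped improving
  if (iter > 1 && llik[iter] - llik[iter - 1] < 1e-8) break
}
```

In this sketch the recorded log-likelihood never decreases from one iteration to the next, and the Q-value at the current parameters is never larger than the log-likelihood, since the two differ by the sum of `W * log(W)`, which is at most zero.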

## Value

A list of class `mixmod`, having the following elements:

| Element | Description |
|---|---|
| `N` | the length of the data, that is, `length(X)` if `X` is a vector, or `nrow(X)` if `X` is a matrix or data frame. |
| `D` | the width of the data, that is, 1 if `X` is a vector, or `ncol(X)` if `X` is a matrix or data frame. |
| `K` | the number of components in the mixture model. |
| `X` | the original data; if `X` was a data frame, it will have been converted to a matrix. |
| `npar` | the total number of parameters in the model. |
| `npar.hidden` | the number of parameters for the hidden component portion of the model. |
| `npar.observed` | the number of parameters for the observed data portion of the model. |
| `iter` | the number of iterations required to fit the model. |
| `params` | the parameters estimated for the model. This is a list with elements `hidden` and `observed`, corresponding to the distributions for the hidden and observed portions of the model. `hidden` always has one element, `prob`, the vector of p_k's. The elements of `observed` depend on the distribution family chosen in fitting the model. |
| `stats` | a vector with named elements corresponding to the number of iterations, log-likelihood, Q-value, and BIC for the estimated parameters. |
| `weights` | a list with the single element `W`, the N-by-K matrix of weights used in the M-step of the EM algorithm for estimating the final set of parameters for the observed data portion of the model. |
| `pdfs` | a list with two elements: `G`, the N-by-K matrix of which the (n,k)th element is the estimated value of f_k(x_n), where x_n is the nth observation in `X`; and `fX`, the vector of length `N` of which the nth element is the estimated value of f(x_n). |
| `posterior` | the N-by-K matrix of which the (n,k)th element is the estimated posterior probability that the nth observation was generated by the kth component. Equal to the `W` element of `weights`. |
| `assignment` | the vector of length N of which the nth element is the most probable component to have generated the nth observation. In other words, `assignment[n] = which.max(posterior[n,])`. |
| `iteration.params` | a list of length `iter` giving the estimated parameters at each iteration of the algorithm. |
| `iteration.stats` | a data frame of `iter` rows giving iteration statistics, as in `stats`, at each iteration of the algorithm. |
| `family` | the name of the distribution family used in the model. See `LC_FAMILY`. |
| `distn` | the name of the actual distribution used in the model. See `LC_FAMILY`. |
| `prior` | the value of the `prior` parameter used in model fitting. See Arguments. |
| `iter.max` | the maximum number of iterations allowed in model fitting. |
| `dname` | the name of the data. |
| `dattr` | attributes of the data, used by model likelihood functions to determine if the data have been scaled or otherwise transformed. |
| `kvec` | a vector of integers from 1 to K. |
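Several of the returned elements are related by simple formulas. The base R sketch below shows how `fX`, `posterior`, and `assignment` follow from `G` and the component probabilities, using small hypothetical numbers rather than real `mixmod` output. It then checks arithmetically that the statistics printed in the example on this page are consistent with BIC = 2 * llik - npar * log(N), and the analogous formula with the Q-value for `iclbic`. The value `npar = 44` is an inference (three unconstrained 4-dimensional normal components plus two free component probabilities), not something stated on this page.

```r
# Hypothetical stand-ins for fitted quantities (not lcmix output).
prob <- c(0.4, 0.6)                       # component probabilities p_k
G <- matrix(c(0.30, 0.02,
              0.05, 0.25,
              0.10, 0.10), ncol = 2, byrow = TRUE)  # G[n, k] = f_k(x_n)

fX         <- as.vector(G %*% prob)            # f(x_n) = sum_k p_k f_k(x_n)
posterior  <- sweep(G, 2, prob, "*") / fX      # p_k f_k(x_n) / f(x_n)
assignment <- apply(posterior, 1, which.max)   # most probable component

# Arithmetic check against the statistics printed in the example.
llik <- -47499.54
qval <- -50052.71
N    <- 10244
npar <- 44   # assumption: 3 * (4 means + 10 covariance terms) + 2 probabilities

bic    <- 2 * llik - npar * log(N)   # approximately -95405.40
iclbic <- 2 * qval - npar * log(N)   # approximately -100511.73
```

Under this sign convention, larger (less negative) values of the criterion indicate a better trade-off between fit and model size.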

## Author(s)

Daniel Dvorkin

## References

McLachlan, G.J. and Krishnan, T. (2008) The EM Algorithm and Extensions, John Wiley & Sons.

## See Also

`LC_FAMILY` for distributions and families; `mdmixmod` for fitting multiple-data mixture models; `reporting` and `likelihood` for model reporting; `rocinfo` for performance evaluation; `convergencePlot` for behavior of the algorithm; `simulation` for simulating from the parameters of a model; packages `mixtools` and `mclust`.

## Examples

```r
## Not run: 
data(CiData)
data(CiGene)
fit <- mixmod(CiData$expression, 3)
fit
# Normal mixture model ('mvnorm')
# Data 'CiData$expression' of size 10244-by-4 fitted to 3 components
# Model statistics:
#      iter      llik      qval       bic    iclbic
#     42.00 -47499.54 -50052.71 -95405.40 -100511.73
plot(rocinfo(fit, CiGene$target))

## End(Not run)
```