Multiple Data Mixture Models

Description

Fits a layered or chained mixture model to a list representing multiple sources of data, using a choice of distributions and number of components for each data source.

Usage

mdmixmod(X, K, K0=min(K), topology=LC_TOPOLOGY, family=NULL, prior=NULL, 
    prefit=TRUE, iter.max=LC_ITER_MAX, dname=deparse(substitute(X)))
## S3 method for class 'mdmixmod'
print(x, ...)

Arguments

X

a list of observed data sources; the elements must be numeric vectors, matrices, or data frames. Each element of X must have the same data length, that is, length of vectors or number of rows of matrices and data frames; but each element of X may be of arbitrary width, that is, number of columns of matrices and data frames.

K

the vector of numbers of mixture components for the hidden variables corresponding to each observed data source. If length(K) < length(X), the values in K will be recycled; thus K may be a scalar, in which case the same number of components will be used throughout. If length(K) > length(X), then K will be truncated and a warning will be given.

K0

the number of mixture components for the top-level hidden variable.

topology

one of the model topologies in LC_TOPOLOGY, either "layered" or "chained". By default, "layered" is used. Partial matches are allowed.

family

a vector of names of distribution families to be used in fitting the models for each observed data source; each element of family must be one of the distribution names in LC_FAMILY. The usual recycling and truncation rules are followed. By default, "normal" is used and recycled to c("normal", ..., "normal"). Partial matches are allowed.

prior

prior probability distribution on Y_0. This feature is under development and its use is not currently recommended.

prefit

logical; if TRUE, the marginal models will be fitted first and the resulting weights used for initialization.

iter.max

the maximum number of iterations for the EM algorithm, by default equal to LC_ITER_MAX.

dname

the name of the data.

x

an object of class mdmixmod.

...

further arguments to print.default.
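
The recycling, truncation, and partial-matching rules described for K and topology can be illustrated in base R. This is a minimal sketch only: recycle_to is a hypothetical helper, and the topologies vector is a stand-in for LC_TOPOLOGY; neither is taken from the package source, and the warning text is invented.

```r
# Hypothetical helper (not part of the package): recycle or truncate a
# per-source argument so that it has exactly Z elements.
recycle_to <- function(arg, Z, name = "arg") {
  if (length(arg) > Z) {
    warning(sprintf("%s has more than %d elements; truncating", name, Z))
  }
  rep_len(arg, Z)
}

Z <- 3                                   # number of data sources, length(X)

K_scalar <- recycle_to(2, Z, "K")        # scalar K is reused for every source
K_trunc  <- suppressWarnings(
  recycle_to(c(2, 3, 2, 5), Z, "K")      # the extra fourth element is dropped
)

# Stand-in for LC_TOPOLOGY; partial matching as done by match.arg()
topologies <- c("layered", "chained")
topology   <- match.arg("chain", topologies)
```

Here K_scalar is c(2, 2, 2), K_trunc is c(2, 3, 2), and topology is "chained".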

Details

In the layered model, a top-level hidden categorical random variable Y_0, which can take on values from 1 to some positive integer K_0, generates categorical hidden random variables Y_1, …, Y_Z for some positive integer Z. For z = 1,…,Z, each Y_z can take on values from 1 to some positive integer K_z. In the chained model, Y_0 generates Y_1, which in turn generates Y_2, etc., up to Y_{Z-1}, which generates Y_Z.
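
Read as conditional-independence statements, the two topologies correspond to the following factorizations of the joint distribution of the hidden variables. This is a sketch inferred from the description above, not a formula quoted from the package documentation.

```latex
% Layered: every Y_z depends directly on Y_0
P(Y_0, Y_1, \dots, Y_Z) = P(Y_0) \prod_{z=1}^{Z} P(Y_z \mid Y_0)

% Chained: each Y_z depends only on its predecessor Y_{z-1}
P(Y_0, Y_1, \dots, Y_Z) = P(Y_0) \prod_{z=1}^{Z} P(Y_z \mid Y_{z-1})
```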

In both models, the Y_z's generate the observed mixture random variables X_1, …, X_Z, from which the elements of the observed data X are assumed to be drawn. (That is, Z = length(X), the number of list elements in X.) The relationship between each Y_z and X_z is the same as the relationship between Y and X in mixmod.
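
The generative process for both topologies can be sketched in base R. Everything numeric here is made up for illustration: the probabilities, the component means, and the choice of univariate normal observations with unit variance are assumptions, and draw_rows is a hypothetical helper rather than a package function.

```r
set.seed(1)

# Hypothetical helper: draw one categorical value per row of a
# matrix whose rows are probability vectors.
draw_rows <- function(P) {
  apply(P, 1, function(p) sample.int(length(p), 1, prob = p))
}

N  <- 500
K0 <- 2
K  <- c(3, 2)                        # K_1 and K_2, so Z = 2
Z  <- length(K)
prob0 <- c(0.6, 0.4)                 # P(Y_0 = k_0)

# Layered: the zth matrix is K_0-by-K_z, rows are P(Y_z | Y_0 = k_0)
cprob_layered <- list(
  matrix(c(0.7, 0.2, 0.1,
           0.1, 0.3, 0.6), nrow = K0, byrow = TRUE),
  matrix(c(0.9, 0.1,
           0.2, 0.8), nrow = K0, byrow = TRUE)
)
# Chained: the zth matrix is K_{z-1}-by-K_z, rows are P(Y_z | Y_{z-1})
cprob_chained <- list(
  cprob_layered[[1]],
  matrix(c(0.9, 0.1,
           0.5, 0.5,
           0.1, 0.9), nrow = K[1], byrow = TRUE)
)
mu <- list(c(-3, 0, 3), c(-2, 2))    # assumed component means for X_1, X_2

Y0 <- sample.int(K0, N, replace = TRUE, prob = prob0)

# Layered: each Y_z is generated directly from Y_0
Y_layered <- lapply(seq_len(Z), function(z) {
  draw_rows(cprob_layered[[z]][Y0, , drop = FALSE])
})

# Chained: Y_1 is generated from Y_0, then Y_2 from Y_1, and so on
Y_chained <- vector("list", Z)
prev <- Y0
for (z in seq_len(Z)) {
  Y_chained[[z]] <- draw_rows(cprob_chained[[z]][prev, , drop = FALSE])
  prev <- Y_chained[[z]]
}

# Observed data: X_z is drawn from the component selected by Y_z
X <- lapply(seq_len(Z), function(z) rnorm(N, mean = mu[[z]][Y_layered[[z]]]))
names(X) <- paste0("X", seq_len(Z))
```

The resulting X is a list of Z numeric vectors of common length N, which is the shape of input that the X argument expects.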

As in mixmod, the EM algorithm attempts to maximize the Q-value, that is, the expected complete data (hidden and observed variables) log-likelihood.
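
In the usual EM notation, the Q-value at iteration t can be written as below. The symbol \theta for the full parameter set and the superscript (t) for the current estimate are notational assumptions, not symbols used elsewhere on this page.

```latex
Q(\theta \mid \theta^{(t)})
  = \mathrm{E}\!\left[ \log p(X_1, \dots, X_Z, Y_0, Y_1, \dots, Y_Z \mid \theta)
      \,\middle|\, X_1, \dots, X_Z, \theta^{(t)} \right]
```

The E-step evaluates this expectation over the hidden variables given the observed data and the current parameters; the M-step chooses the next \theta to maximize it.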

Value

A list of class mdmixmod, a subclass of mixmod, having the following elements:

N

the length of the data, that is, length(X[[1]]) if X[[1]] is a vector, or nrow(X[[1]]) if X[[1]] is a matrix or data frame.

Z

the number of data sources, that is, Z = length(X).

D

the vector of widths of the data, that is, D[z] = 1 if X[[z]] is a vector, or D[z] = ncol(X[[z]]) if X[[z]] is a matrix or data frame.

K

the vector of the numbers of components in the lower-level mixture models.

K0

the number of components in the top-level mixture model, that is, K_0.

X

the original data, with data frames converted to matrices. If the elements of X were not named, they will be named "X1",...,"XZ" here.

npar

the total number of parameters in the model.

npar.hidden

the number of parameters for the hidden component portion of the model.

npar.observed

the number of parameters for the observed data portion of the model.

iter

the number of iterations required to fit the model.

params

the parameters estimated for the model. This is a list with elements hidden and observed, corresponding to the distributions for the hidden and observed portions of the model. hidden has elements prob0, the vector of probabilities for the possible values of Y_0, and cprob, the list of matrices of conditional probabilities for the possible values of the Y_z's. In the layered model, the zth element of cprob is a K_0-by-K_z matrix of which the (k_0,k_z)th element represents P(Y_z = k_z | Y_0 = k_0). In the chained model, the zth element of cprob is a K_{z-1}-by-K_z matrix of which the (k_{z-1},k_z)th element represents P(Y_z = k_z | Y_{z-1} = k_{z-1}). In the chained model, hidden also has elements probz, a list of vectors representing the marginal probabilities for the Y_z's, and rprob, representing the conditional probabilities of the Y_z's given the values of the Y_{z+1}'s. The elements of observed depend on the distribution family chosen in fitting the model.

stats

a vector with named elements corresponding to the number of iterations, log-likelihood, Q-value, and BIC for the estimated parameters.

weights

a list with elements U, an N-by-K_0 matrix; V, a Z-length list of K_0-length lists of N-by-K_z matrices in the layered model, or a Z-length list of K_{z-1}-length lists of N-by-K_z matrices in the chained model; and W, a Z-length list of N-by-K_z matrices. These represent the weights used in the M-step of the EM algorithm for estimating the final set of parameters for the observed data portion of the model.

pdfs

a list with elements G, alpha, beta, and gamma, representing various estimated density functions for the data. gamma represents the estimated density of the observed data across all data sources under the fitted mixture model.

posterior

the N-by-K_0 matrix of which the (n,k_0)th element is the estimated posterior probability that the nth observation (across all data sources) was generated by the k_0th component. Equal to the U element of weights.

assignment

the vector of length N of which the nth element is the most probable top-level component to have generated the nth observation. In other words, assignment[n] = which.max(posterior[n,]).

iteration.params

a list of length iter giving the estimated parameters at each iteration of the algorithm.

iteration.stats

a data frame of iter rows giving iteration statistics, as in stats, at each iteration of the algorithm.

topology

the topology of the model.

family

the vector of names of the distribution families used in the model. See LC_FAMILY.

distn

the vector of names of the actual distributions used in the model. See LC_FAMILY.

iter.max

the maximum number of iterations allowed in model fitting.

dname

the name of the data.

dattr

attributes of the data, used by model likelihood functions to determine if the data have been scaled or otherwise transformed.

zvec

the vector of names of X; if the elements of X are unnamed, names are assigned.

kvec

a list of which the zth element is a vector of integers from 1 to K_z.

k0vec

a vector of integers from 1 to K_0.

prior

the value of the prior parameter used in model fitting. See Arguments.

marginals

if prefit is TRUE, the marginal fits to the data, otherwise NULL.
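
The relationship between the posterior and assignment elements can be reproduced in base R. The posterior matrix below is made up for illustration and is not output from a fitted model.

```r
# Made-up posterior for N = 3 observations and K_0 = 2 top-level components;
# each row sums to 1.
posterior <- matrix(c(0.90, 0.10,
                      0.25, 0.75,
                      0.55, 0.45), ncol = 2, byrow = TRUE)

# assignment[n] = which.max(posterior[n, ]), applied to every row
assignment <- apply(posterior, 1, which.max)
assignment
# 1 2 1
```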

Author(s)

Daniel Dvorkin

References

McLachlan, G.J. and Krishnan, T. (2008) The EM Algorithm and Extensions, John Wiley & Sons.

See Also

LC_FAMILY for distributions and families; mixmod for fitting single-data mixture models; reporting and likelihood for model reporting; rocinfo for performance evaluation; convergencePlot for behavior of the algorithm; simulation for simulating from the parameters of a model.

Examples

## Not run: 
data(CiData)
data(CiGene)
fit <- mdmixmod(CiData, c(2,3,2), topology="chained",
           family=c("pvii", "norm", "pvii"))
fit
# Chained (PVII, normal, PVII) mixture model ('pvii', 'mvnorm', 'pvii')
# Data 'CiData' of size 10244-by-(1,4,1) fitted to 2 (2,3,2) components
# Model statistics:
#       iter       llik       qval        bic     iclbic 
#     377.00  -75859.81  -87065.28 -152310.62 -174721.56 
margs <- marginals(fit)
allFits <- c(list(chained=fit), margs)
plot(multiroc(allFits, CiGene$target))

## End(Not run)
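
The statistics printed in the example appear consistent with a penalized criterion of the form 2 * llik - npar * log(N) for bic, with qval in place of llik for iclbic. This is an inference from the printed numbers, not something this page states. In particular, npar = 64 is an assumed value: it is the count obtained from 10 free hidden-layer probabilities plus 54 observed-layer parameters, if each univariate PVII component has three parameters and each four-dimensional normal component has 14.

```r
# Values copied from the printed example output
N      <- 10244
llik   <- -75859.81
qval   <- -87065.28
bic    <- -152310.62
iclbic <- -174721.56

npar <- 64                       # assumed; not shown in the printed output

penalty      <- npar * log(N)    # log() is the natural logarithm
bic_check    <- 2 * llik - penalty
iclbic_check <- 2 * qval - penalty

# Differences from the printed values; both are below rounding error
abs(bic_check - bic)
abs(iclbic_check - iclbic)
```

Under this sign convention larger (less negative) values indicate a better fit, which is the opposite of the more common -2 * llik + npar * log(N) form.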
