flexmixedruns: Fitting mixed Gaussian/multinomial mixtures with flexmix
In fpc: Flexible Procedures for Clustering

flexmixedruns

R Documentation

Description

flexmixedruns fits a latent class mixture (clustering) model in which some variables are continuous and modelled within the mixture components by Gaussian distributions, and some variables are categorical and modelled within components by independent multinomial distributions. The model is fitted by maximum likelihood, computed with the EM algorithm. The number of components can be estimated using the BIC.

Note that at least one categorical variable is needed, but it is possible to use data without continuous variables.

Usage

flexmixedruns(x,diagonal=TRUE,xvarsorted=TRUE,
                          continuous,discrete,ppdim=NULL,initial.cluster=NULL,
                          simruns=20,n.cluster=1:20,verbose=TRUE,recode=TRUE,
                          allout=TRUE,control=list(minprior=0.001),silent=TRUE)

Arguments

`x`	data matrix or data frame. The data need to be organised case-wise, i.e., one row per case: if there are categorical variables only, and 15 cases with values c(1,1,2) on the 3 variables, the data matrix needs 15 rows with values 1 1 2. (Categorical variables can take numbers, strings, or anything else that can be coerced to factor levels as values.)
`diagonal`	logical. If `TRUE`, Gaussian models are fitted restricted to diagonal covariance matrices. Otherwise, covariance matrices are unrestricted. `TRUE` is consistent with the "within class independence" assumption for the multinomial variables.
`xvarsorted`	logical. If `TRUE`, the continuous variables are assumed to come first, followed by the categorical variables.
`continuous`	vector of integers giving positions of the continuous variables. If `xvarsorted=TRUE`, a single integer, number of continuous variables.
`discrete`	vector of integers giving positions of the categorical variables. If `xvarsorted=TRUE`, a single integer, number of categorical variables.
`ppdim`	vector of integers specifying the number of (in the data) existing categories for each categorical variable. If `recode=TRUE`, this can be omitted and is computed automatically.
`initial.cluster`	this corresponds to the `cluster` parameter in `flexmix` and should only be specified if `simruns=1` and `n.cluster` is a single number. Either a matrix with `n.cluster` columns of initial cluster membership probabilities for each observation; or a factor or integer vector with the initial cluster assignments of observations at the start of the EM algorithm. Default is random assignment into `n.cluster` clusters.
`simruns`	integer. Number of starts of the EM algorithm with random initialisation. More starts improve the chance of finding a good approximation to the global optimum.
`n.cluster`	vector of integers, numbers of components (the optimum one is found by minimising the BIC).
`verbose`	logical. If `TRUE`, some information about the different runs of the EM algorithm is printed.
`recode`	logical. If `TRUE`, the function `discrete.recode` is applied in order to recode categorical data so that the `lcmixed`-method can use it. Only set this to `FALSE` if your data already have that format (even in that case, `TRUE` does no harm). If `recode=FALSE`, the categorical variables are assumed to be coded 1,2,3,...
`allout`	logical. If `TRUE`, the regular `flexmix`-output is returned for every single number of clusters, which can create a huge output object.
`control`	list of control parameters for `flexmix`, for details see the help page of `FLXcontrol-class`.
`silent`	logical. This is passed on to the `try`-function. If `TRUE`, error messages from failed runs of `flexmix` are suppressed. (The information that a `flexmix`-error occurred is still reported if `verbose=TRUE`.)
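
The case-wise layout required for `x` and the 1,2,3,... coding assumed under `recode=FALSE` can be illustrated with base R alone. The following is a minimal sketch; the table `patterns`, its column names and its counts are hypothetical illustration values, and the integer coding shown is only an analogy for what a recoding step produces, not the implementation of `discrete.recode`.

```r
# Hypothetical summary table: one row per distinct response pattern,
# with the number of cases showing that pattern (illustrative values only).
patterns <- data.frame(
  d1 = c("a", "a", "b"),
  d2 = c("x", "y", "y"),
  d3 = c("low", "high", "high"),
  count = c(15, 3, 2),
  stringsAsFactors = FALSE
)

# Repeat each pattern row 'count' times to obtain one row per case.
casewise <- patterns[rep(seq_len(nrow(patterns)), times = patterns$count),
                     c("d1", "d2", "d3")]
rownames(casewise) <- NULL
print(dim(casewise))

# Integer codes 1, 2, 3, ... per variable, in the order of the factor levels.
coded <- as.data.frame(lapply(casewise, function(v) as.integer(factor(v))))
print(head(coded))
```

With the default `recode=TRUE`, a case-wise object such as `casewise` would presumably be passed to `flexmixedruns` directly, so the manual coding step is only relevant if `recode=FALSE` is used.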

Details

Sometimes flexmix produces errors because of degenerate covariance matrices, clusters that are too small, etc. flexmixedruns tolerates these and treats them as non-optimal runs. (A higher value of simruns or different control settings may be required to get a valid solution.)
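
One way to see how many runs failed is to look at the `errcount` and `bicvals` components of the returned list after fitting. The sketch below assumes the fpc package is installed and uses simulated data of the same kind as in the Examples section; the settings `simruns=3` and `n.cluster=2:3` are arbitrary choices to keep the run short.

```r
# Only run if fpc is available; this is a sketch, not a recommended workflow.
if (requireNamespace("fpc", quietly = TRUE)) {
  library(fpc)
  set.seed(776655)
  # Two continuous columns first, then two categorical columns
  # (matches xvarsorted=TRUE with continuous=2, discrete=2).
  ldata <- cbind(v1 = rnorm(100), v2 = rnorm(100),
                 d1 = sample(1:5, 100, replace = TRUE),
                 d2 = sample(1:4, 100, replace = TRUE))
  fr <- flexmixedruns(ldata, continuous = 2, discrete = 2,
                      simruns = 3, n.cluster = 2:3,
                      allout = FALSE, verbose = FALSE)
  print(fr$errcount)   # numbers of failed EM runs
  print(fr$bicvals)    # BIC values for the numbers of components tried
  print(fr$optimalk)   # number of components chosen by the BIC
}
```

If `errcount` shows that most runs failed for some number of components, rerunning with a larger `simruns` or a different `control` list is the remedy suggested above.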

General documentation on flexmix can be found in Friedrich Leisch's "FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R", https://CRAN.R-project.org/package=flexmix

Value

A list with components

`optsummary`	summary object for `flexmix` object with optimal number of components.
`optimalk`	optimal number of components.
`errcount`	vector with numbers of EM runs for each number of components that led to flexmix errors.
`flexout`	if `allout=TRUE`, list of flexmix output objects for all numbers of components; for details see the help page of `flexmix-class`. Useful slots include `cluster` and `components`. If `fo` is the `flexmixedruns` output object, `fo$flexout[[fo$optimalk]]@cluster` gives a vector of component numbers for the observations (maximum posterior rule), and `fo$flexout[[fo$optimalk]]@components` gives the estimated model parameters, which for `lcmixed` and therefore `flexmixedruns` are called `center` (mean vector), `cov` (covariance matrix) and `pp` (list of categorical variable-wise category probabilities). If `allout=FALSE`, `flexout` is only the flexmix output object for the optimal number of components, i.e., the `[[fo$optimalk]]` indexing above can then be omitted.
`bicvals`	vector of values of the BIC for each number of components.
`ppdim`	vector of categorical variable-wise numbers of categories.
`discretelevels`	list of levels of the categorical variables belonging to what is treated by `flexmixedruns` as category 1, 2, 3 etc.
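
The following sketch shows how the components of the returned list might be used when `allout=TRUE`, following the indexing described for `flexout`. It assumes the fpc package is installed; the simulated data and the object names `fo`, `best` and `cl` are illustrative only.

```r
# Only run if fpc is available.
if (requireNamespace("fpc", quietly = TRUE)) {
  library(fpc)
  set.seed(776655)
  ldata <- cbind(v1 = rnorm(100), v2 = rnorm(100),
                 d1 = sample(1:5, 100, replace = TRUE),
                 d2 = sample(1:4, 100, replace = TRUE))
  fo <- flexmixedruns(ldata, continuous = 2, discrete = 2,
                      simruns = 2, n.cluster = 2:3,
                      allout = TRUE, verbose = FALSE)
  # flexmix output object for the number of components chosen by the BIC
  best <- fo$flexout[[fo$optimalk]]
  # Component assignment for each observation (maximum posterior rule)
  cl <- best@cluster
  print(table(cl))
  # Estimated parameters of the first mixture component
  print(best@components[[1]])
  # Numbers of categories per categorical variable
  print(fo$ppdim)
}
```

Keeping all fits with `allout=TRUE` is convenient for comparing solutions across numbers of components, at the cost of a larger output object.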

Author(s)

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en

References

Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C Applied Statistics, 62, 309-369.

Examples

  library(fpc)
  options(digits=3)
  set.seed(776655)
  v1 <- rnorm(100)
  v2 <- rnorm(100)
  d1 <- sample(1:5,100,replace=TRUE)
  d2 <- sample(1:4,100,replace=TRUE)
  ldata <- cbind(v1,v2,d1,d2)
  fr <- flexmixedruns(ldata,
    continuous=2,discrete=2,simruns=2,n.cluster=2:3,allout=FALSE)
  print(fr$optimalk)
  print(fr$optsummary)
  print(fr$flexout@cluster)
  print(fr$flexout@components)

fpc documentation built on Sept. 24, 2024, 9:07 a.m.