mixdir: Cluster high dimensional categorical datasets


View source: R/mixdir.R

Cluster high dimensional categorical datasets

mixdir(X, n_latent = 3, alpha = NULL, beta = NULL,
  select_latent = FALSE, max_iter = 100, epsilon = 0.001,
  na_handle = c("ignore", "category"), repetitions = 1, ...)

`X`	A matrix or data.frame of size (N_ind x N_quest) that contains the categorical responses. The values can be characters, integers or factors. The most flexibility is provided if factors are used.
`n_latent`	The number of latent classes used to fit the model. Default: 3.
`alpha`	A single number, or a vector of two numbers if select_latent=TRUE. If it is NULL, alpha is initialized to 1. It is the prior for the Dirichlet distribution over the latent groups and acts as a pseudo count of individuals per group.
`beta`	A single number. If it is NULL, beta is initialized to 0.1. It is the prior for the Dirichlet distributions over the categorical responses. Large values favor an even distribution of responses to a question among the individuals of the same latent group; small values indicate that individuals of the same latent group usually answer a question the same way.
`select_latent`	A boolean indicating whether exactly n_latent classes should be used, or whether a Dirichlet Process prior should be used that shrinks the number of latent classes in use appropriately (can be controlled with alpha=c(a1, a2) and beta). Default: FALSE.
`max_iter`	The maximum number of iterations. Default: 100.
`epsilon`	A number that sets the numerical precision required to consider the algorithm converged. Default: 0.001.
`na_handle`	Either "ignore" or "category". If it is "category" all `NA`'s in the dataset are converted to the string "(Missing)" and treated as their own category. If it is "ignore" the `NA`'s are treated as missing completely at random and are ignored during the parameter updates.
`repetitions`	A number specifying how often to repeat the calculation with different initializations. Automatically selects the best run (i.e. max(ELBO)). Default: 1.
`...`	Additional parameters passed on to the underlying functions. The parameters are verbose, phi_init, zeta_init and, if select_latent=FALSE, omega_init or, if select_latent=TRUE, kappa1_init and kappa2_init.
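
To illustrate what na_handle = "category" means for the data, here is a minimal base R sketch of the conversion described above. It is not the package's internal code; the toy data frame and the names X_cat and missing_counts are made up for illustration.

```r
# Hypothetical toy data: 4 individuals, 2 questions, some answers missing.
X <- data.frame(
  q1 = c("yes", NA, "no", "yes"),
  q2 = c("a", "b", NA, NA),
  stringsAsFactors = FALSE
)

# Replace every NA with the string "(Missing)" so that it is treated
# as one more response category.
X_cat <- X
X_cat[is.na(X_cat)] <- "(Missing)"

# Convert each column to a factor; "(Missing)" is now an ordinary level.
X_cat[] <- lapply(X_cat, factor)

# Count how often the new category occurs per question.
missing_counts <- sapply(X_cat, function(col) sum(col == "(Missing)"))
print(missing_counts)
```

With na_handle = "ignore" no such conversion takes place; the missing entries are skipped during the parameter updates instead.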

The function uses a mixture of multinomials to fit the model. The full model specification is

lambda | alpha ~ DirichletProcess(alpha)

z_i | lambda ~ Multinomial(lambda)

U_{j,k} | beta ~ Dirichlet(beta)

X_{i,j} | U_j, z_i=k ~ Multinomial(U_{j,k})

If select_latent=FALSE, the first line is replaced with

lambda | alpha ~ Dirichlet(alpha)

The initial inspiration came from Dunson and Xing (2009), who proposed a Gibbs sampling algorithm to fit this model. To speed up inference, a variational inference approach was derived and implemented in this package.
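
The model specification can be made concrete by simulating data from it. The following base R sketch draws a small dataset from the finite variant (select_latent=FALSE), assuming symmetric Dirichlet priors with concentrations alpha and beta. The helper rdirichlet1 and the dimensions are illustrative assumptions, not part of the package.

```r
set.seed(1)

# Hypothetical helper: one draw from a symmetric Dirichlet distribution,
# obtained by normalizing independent Gamma variates.
rdirichlet1 <- function(n_cat, conc) {
  g <- rgamma(n_cat, shape = conc, rate = 1)
  g / sum(g)
}

# Illustrative sizes and prior values (alpha and beta use the documented
# initial values 1 and 0.1).
n_ind <- 50; n_quest <- 4; n_latent <- 3; n_cat <- 3
alpha <- 1; beta <- 0.1

# lambda | alpha ~ Dirichlet(alpha)
lambda <- rdirichlet1(n_latent, alpha)

# z_i | lambda ~ Multinomial(lambda)
z <- sample(n_latent, n_ind, replace = TRUE, prob = lambda)

# U_{j,k} | beta ~ Dirichlet(beta); U[[j]][k, ] holds the response
# probabilities for question j in latent class k.
U <- lapply(seq_len(n_quest), function(j) {
  t(sapply(seq_len(n_latent), function(k) rdirichlet1(n_cat, beta)))
})

# X_{i,j} | U_j, z_i = k ~ Multinomial(U_{j,k})
X <- sapply(seq_len(n_quest), function(j) {
  sapply(seq_len(n_ind), function(i) sample(n_cat, 1, prob = U[[j]][z[i], ]))
})

dim(X)  # n_ind rows, n_quest columns
```

Because beta is small, each row of U[[j]] tends to put most of its mass on a single response, so individuals in the same latent class mostly give the same answer to a question.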

A list that is tagged with the class "mixdir" containing the following 9 elements:

converged: a boolean indicator if the model has converged
convergence: a numerical vector with the ELBO of each iteration
ELBO: the final ELBO of the converged model
lambda: a numerical vector with the n_latent class probabilities
pred_class: an integer vector with the most likely class assignment for each individual.
class_prob: a matrix of size n_ind x n_latent which has for each individual the probability to belong to class k.
category_prob: a list with one entry for each feature (i.e. column of X). Each entry is again a list with one entry for each class, that contains the probability of individuals of that class to answer with a specific response.
specific_params: A list whose content depends on the parameter select_latent. If select_latent=FALSE it contains the two entries omega and phi which are the Dirichlet hyperparameters that the model has fitted. If select_latent=TRUE it contains kappa1, kappa2 and phi, which are the hyperparameters for the Dirichlet Process and the Dirichlet of the answer.
na_handle: a string indicating the method used to handle missing values. This is important for subsequent calls to predict.mixdir.
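
As a rough illustration of how pred_class and class_prob relate, the sketch below takes a made-up class_prob matrix and picks the most probable class per individual. It assumes that pred_class is the row-wise argmax of class_prob, which the descriptions above suggest but do not state explicitly.

```r
# Hypothetical class_prob matrix: 3 individuals (rows), 2 latent classes.
class_prob <- matrix(c(0.9, 0.1,
                       0.2, 0.8,
                       0.6, 0.4),
                     ncol = 2, byrow = TRUE)

# Each row is a probability distribution over the latent classes.
stopifnot(all(abs(rowSums(class_prob) - 1) < 1e-9))

# Most likely class per individual: index of the largest entry in each row.
pred_class <- apply(class_prob, 1, which.max)
print(pred_class)
```

For a fitted model the same comparison could be made between res$pred_class and res$class_prob, where res is the list returned by mixdir.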

1. C. Ahlmann-Eltze and C. Yau, "MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data", 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.

2. Dunson, D. B. and Xing, C. Nonparametric Bayes Modeling of Multivariate Categorical Data. J. Am. Stat. Assoc. 104, 1042–1051 (2009).

3. Blei, D. M., Ng, A. Y. and Jordan, M. I. Latent Dirichlet Allocation. J. Machine Learn. Res. 3, 993–1022 (2003).

4. Blei, D. M. and Jordan, M. I. Variational inference for Dirichlet process mixtures. Bayesian Anal. 1, 121–144 (2006).

data("mushroom")
res <- mixdir(mushroom[1:30, ])

mixdir documentation built on Sept. 20, 2019, 5:04 p.m.
