README.md

jDirichletMixtureModels

Package jDirichletMixtureModels provides utilities for clustering using Dirichlet Process mixture models in R. It supports a number of existing conjugate distribution pairs, as well as user-specified distributions.

Currently, the package only supports clustering using conjugate distributions using the methods in Markov Chain Sampling Methods for Dirichlet Process Mixture Models by Radford Neal. Clustering using non-conjugate distributions is under active development.

This package is an R wrapper of the Julia package DirechletMixtureModels, which it is being co-develped with. Julia is a high-level, high-performance dynamic programming language for numerical computing.

Getting Started

Installation

This package can be installed from Github by using devtools: ```{r, eval=FALSE} devtools::install_github("nsdumont/jDirichletMixtureModels")


### Basic Usage
To use the package, the first thing you need to do is initial setup. Call the `dmm.setup()` function after loading the package. Input to `dmm.setup()` should be the path name of the folder in which your Julia binary executable is located (inside the Julia application bundle).
Mac OSX users using the latest Julia version do not need to pass any input to `dmm.setup()`. Since by default, it will use the correct path  `/Applications/Julia-0.6.app/Contents/Resources/julia/bin`.

```{r, eval=FALSE}
library(jDirichletMixtureModels)
dmm.setup()

Next you must define a model. A number of conjugate models are provided in the package (see section Bulit-In Models for more information). To use a predefined model you must pass it's name and any requried parameters. ```{r, eval=FALSE} model <- dmm.model("UnivariateNormal") # Using the default parameters model <- dmm.model("MultivariateNormal", params=c(mu,kappa,T,nu)) # Using user defined parameters: they are the paramters for the Normal Wishart distribution

For more info:

?dmm.BaseModel


The rest of this section refers to package feautures in active development and not fully tested.

User defined models (both conjugate and non-conjugate) can also be used. Users must supply the following funtions:
- The probability density likelihood
- The sample posterior function
- The marginal likelihood

If these functions are R functions, models are defined as follows:
 ```{r, eval=FALSE}
model <- dmm.model(pdf_func, sample_func, marg_func, params, isconjugate)

# For more info:
?dmm.RModel

where params is a list of all hyperparameters. The functions pdf_func, sample_func, and marg_func must all take these parameters. The option isconjugate is either TRUE or FALSE.

Users can also define these functions as Julia functions with names pdf_name, sample_name, marg_name stored in file filename: ```{r, eval=FALSE} dmm.addfile(filename) model <- dmm.model(pdf_name, sample_name, marg_name, params, isconjugate)

For more info:

?dmm.JModel

All of these inputs must be strings


### Example

Suppose you have a dataset of 2D N-by-2 data you wish to find clusters over, with no prior knowledge of the distribution. If your data is stored as a 2D array called 'Xdata', you could run the following:

```{r, eval=FALSE}
  library(jDirichletMixtureModels)
  dmm.setup()

  model <- dmm.model(data = Xdata)

  states <- dmm.cluster(model, Xdata)

This will define a conjugate multivariate Normal model (multivariate Normal likelihood with Normal-Wishart prior) over your data, with hyper-parameters inferred from the data. It will then perform a run of the MCMC clustering algorithm over the data, returning a list of cluster states. This is a list of (effectively) I.I.D draws from the posterior over the clusters (default 5000 iterations with a burnin of 200).

A cluster state, states[[i]], consists of labeled data, states[[i]]$data, and cluster information, states[[i]]$clusters.

Suppose I wanted to take a random draw and see a summary of the clusters for that draw. I would then run the following:

  dmm.summarize(states[[1]]$clusters)

This will print a summary of the clusters in the first state in the list. Note that by default the states are randomized so that they may be used as IID draws. To see a plot of the data colored by cluster I would run:

  dmm.plot(states[[1]]$data)

Bulit-In Models

A complete guide to the bulit-in models avaible, their inputs, and their default inputs is currently being created. The current version is avaible in the resources/Models.pdf in this repository.

The avaible bulit-in models are: - "MultivariateNormal" (default): A multivariate Normal likelihood with Normal-Wishart prior. Requries paramters: prior mean, \lambda, scale matrix, \nu (see https://en.wikipedia.org/wiki/Normal-Wishart_distribution)

Note: parameters must be passed in the order they are listed in above.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments



nsdumont/jDirichletMixtureModels documentation built on May 23, 2019, 2:51 p.m.