profBinary: Binary Product Partition Models

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/profBinary.R

Description

This function finds the most probable cluster partition in a binary product partition model (PPM). The Dirichlet process mixture of binary models is the default PPM.

Usage

1
2
profBinary(formula, data, clust, param, method="agglomerative",
           maxiter=1000, crit=1e-6, verbose=FALSE)

Arguments

formula

a one-sided formula specifying a set of binary response variables.

data

a dataframe where formula is evaluated

clust

optional vector of factors (or coercible to factors) indicating initial clustering among observations.

param

optional list containing the any of the named elements alpha, a0, and b0 corresponding to the prior parameters of the beta-binary Dirichlet process mixture. The prior parameters of the beta-binary Dirichlet process mixture should all be scalars.

method

character string indicating the optimization method to be used. Meaningful values for this string are "stochastic", "gibbs", "agglomerative", and "none".

  • The "stochastic" method is an iterative stochastic search utilizing ‘explode’ and ‘merge’ operations on the clusters of a partition. At the explode step, a randomly selected subset of observations are redistributed uniformly at random to an existing or new cluster. Each of the exploded observations are then merged with an existing cluster in a sequentially optimal fashion. Optimization involves computing a moving average of the relative change in the marginal posterior distribution over the possible clusters after each iteration. The optimization stopping criterion is the minumim value this quantity can take before stopping the optimization cycle. If the optimization cycle reaches the maximum allowable iterations before meeting the stopping criterion, a warning is issued.

  • The "gibbs" method implements the Polya urn Gibbs sampler. This method draws samples from the posterior distribution over the cluster partition in a sequential Gibbs fashion. The sample value with greatest posterior mass is returned. See MacEachern (1994) for details.

  • The "agglomerative" method initially places each observation into seperate clusters. At each iteration, two of the remaining clusters are merged, where the merged clusters are chosen such that the resulting increase in the posterior mass function is maximized. This is repeated until only one cluster remains. The MAP estimate is the cluster partition, among those considered, which maximizes the posterior mass function over the possible cluster partitions. See Ward (1963) for additional details.

  • The "fast" method is a modified version of Sequential Update and Greedy Search (SUGS). This SUGS algorithm assigns observations to clusters sequentially. Initally, the first observation forms a singleton cluster. Subsequent observations are assigned to an existing cluster, or form a new singleton cluster by optimizing the associated posterior probabilities, conditional on previous cluster assignments. See Wang and Dunson (2010) for additional details.

  • The "none" method is typically used in conjunction with the clust option to specify an initial cluster partition. If "none" is specified without clust, a simple algorithm is used to initialize the cluster partition. Otherwise, the cluster partition is initialized using the clust argument. The posterior statistics are then computed for initialized clusters.

maxiter

integer value specifying the maximum number of iterations for the optimization algorithm.

crit

numeric scalar constituting a stopping criterion for the "stochastic" and "gibbs" optimization methods.

verbose

logical value indicating whether the routine should be verbose in printing.

Details

This function fits a Dirichlet process mixture of binary models (DPMBM) using the profile method. This method will cluster binary observations vectors (rows of y) into clusters. The cluster partition is estimated by maximizing the marginal posterior distribution over all possible cluster partitions. Each cluster has an associated binary model. The binary model assigns Bernoulli probabilities independently to each binary valued outcome, corresponding to the columns of y. The prior parameters a0 and b0 assign a beta prior distribution to each outcome probability. Conditional on the estimated cluster partition, each outcome probability is beta distributed a posteriori. The function profBinary returns the associated posterior parameters of the beta destribution for each cluster and outcome probability.

Missing observations (NA) are removed automatically and a warning is issued. The return value contains the reduced observation matrix.

Value

An instance of the class profBinary containing the following objects

y

the numeric matrix of observations, where rows with missing observations (NA) are removed

param

the list of prior parameters

clust

a numeric vector of integers indicating cluster membership for each non-missing observation

a

a list of numeric vectors containing the posterior vector a for each cluster

b

a list of numeric vectors containing the posterior vector b for each cluster

logp

the unnormalized log value of the marginal posterior mass function for the cluster partition evaluated at clust

model

a model frame, resulting from a call to model.frame

Author(s)

Matt Shotwell

References

Ward, J. H. (1963) Heirarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association 58:236-244 MacEachern, S. N. (1994) Estimating Normal Means with Conjugate Style Dirichlet Process Prior. Communications in Statistics B 23:727-741

See Also

pci

Examples

1
library(profdpm)

profdpm documentation built on May 2, 2019, 4:56 p.m.