clusterCVR: Cluster CVR data with an EM algorithm

View source: R/clusterCVR.R

clusterCVR    R Documentation

Cluster CVR data with an EM algorithm

Description

Compute cluster assignment probabilities with an EM algorithm. The required inputs are the data, supplied as a list containing the matrix y (see Arguments), and the number of clusters.
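To make "cluster assignment probabilities from an EM algorithm" concrete, here is a minimal base-R sketch of EM for a mixture of independent Bernoulli variables. It is a generic textbook illustration, not the clusterCVR implementation: the function name em_bernoulli, the Bernoulli model, the random starting values, and the finite n_iter default are all assumptions made for this example. Only the idea of stopping on a small relative change in log likelihood (loglik_thresh) is taken from this page.

```r
# Generic EM for a mixture of independent Bernoulli variables.
# Illustrative only: this is NOT the clusterCVR implementation.
em_bernoulli <- function(y, n_clusters, loglik_thresh = 1e-5,
                         n_iter = 500, seed = 2138) {
  set.seed(seed)
  n <- nrow(y)
  # Random soft assignments as starting values (each row sums to 1)
  zeta <- matrix(runif(n * n_clusters), n, n_clusters)
  zeta <- zeta / rowSums(zeta)
  loglik_old <- -Inf
  for (iter in seq_len(n_iter)) {
    # M-step: mixing proportions and per-cluster success probabilities
    pi_k <- colMeans(zeta)
    mu <- t(zeta) %*% y / colSums(zeta)   # n_clusters x ncol(y)
    mu <- pmin(pmax(mu, 1e-6), 1 - 1e-6)  # keep away from 0 and 1
    # E-step: log joint density of each row under each cluster
    log_joint <- y %*% t(log(mu)) + (1 - y) %*% t(log(1 - mu))
    log_joint <- sweep(log_joint, 2, log(pi_k), "+")
    # Log-sum-exp per row for numerical stability
    row_max <- apply(log_joint, 1, max)
    log_row <- row_max + log(rowSums(exp(log_joint - row_max)))
    zeta <- exp(log_joint - log_row)      # assignment probabilities
    loglik <- sum(log_row)
    # Stop when the relative change in log likelihood is small
    if (is.finite(loglik_old) &&
        abs((loglik - loglik_old) / loglik_old) < loglik_thresh) break
    loglik_old <- loglik
  }
  list(zeta = zeta, pi = pi_k, mu = mu, loglik = loglik, iters = iter)
}

# Simulated toy data: two groups with different success probabilities
set.seed(1)
y_toy <- rbind(
  matrix(rbinom(100 * 6, 1, 0.9), 100, 6),
  matrix(rbinom(100 * 6, 1, 0.1), 100, 6)
)
fit <- em_bernoulli(y_toy, n_clusters = 2)
round(head(fit$zeta), 3)
```

Each row of fit$zeta sums to 1 and gives that observation's probability of belonging to each cluster. Because EM can settle in a local optimum, repeating the fit from several seeds and keeping the highest log likelihood is the usual safeguard.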

Usage

clusterCVR(
  data,
  user_K = 3,
  loglik_thresh = 1e-05,
  runs = 1,
  n_iter = Inf,
  fast = FALSE,
  IIA = FALSE,
  init = "kmeans",
  subset = NULL,
  ignore_X = FALSE,
  recode_key = NULL,
  seed = 2138,
  verbose = TRUE,
  pi = NULL,
  mu = NULL,
  zeta_hat = NULL
)

.cluster(data, user_K, seed, n_iter, loglik_thresh, fast, IIA, init, verbose)

Arguments

data

the dataset, in list form, with the following slots.

y

An n by K matrix of split indicators

m

An n by K matrix of missingness indicators. 3 means nothing is missing, 2 means only straight is available, and 1 means only split is available.

X

An optional n by P matrix of respondent-specific covariates.

n_u

An integer scalar for the number of voters

L

An integer scalar for the number of possible y values (0-indexed)

uy

An n by K matrix of unique profiles y, needed if fast = TRUE

user_K

the number of clusters to presume / compute

loglik_thresh

the threshold value for convergence. The EM will stop when the relative change in log likelihood is less than the threshold.

runs

Number of replications to run, each with different starting values. Default is 1, but more than 1 is highly recommended if computing time is not prohibitive.

n_iter

manual limit to iterations

fast

Should the data be summarized to unique profiles so that estimation is faster? Currently only possible if IIA = FALSE. Defaults to FALSE.

IIA

assume that the data$y matrix is generated from a varying choice set as defined by data$m? Defaults to FALSE.

init

method of initialization

subset

A vector of row indices or row names to subset all the data by. Useful when wanting to test a small subset of the data without modifying the data list. If fast = TRUE, it will subset n_u and uy; if fast = FALSE, it will subset y.

ignore_X

Should X be set to NULL even if it is provided? Useful when switching between the covariate and no-covariate cases. Defaults to FALSE.

recode_key

A named vector to be passed on to dplyr::recode, in the form (old1 = new1, old2 = new2, ...)

seed

seed for initialization

verbose

Defaults to TRUE.

pi, mu, zeta_hat

Initial values of the key parameters, if there are any good guesses. If left NULL, they are initialized by the method given in init. Follow the format of the output.
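As a rough illustration of the data argument described above, the following base-R sketch builds a toy list with y and m slots and shows one way to collapse y to unique row profiles, which is the idea behind fast = TRUE. The values are made up, and the names toy_data, profile_id, and profile_counts are hypothetical. How the package itself stores profile counts (the n_u and uy slots) is not specified beyond the descriptions above, so treat this as a sketch of the concept rather than the required input format.

```r
# Toy inputs: 5 respondents (rows) and 3 columns. All values are made up.
y <- matrix(c(0, 1, 0,
              1, 1, 0,
              0, 1, 0,
              0, 0, 1,
              1, 1, 0), nrow = 5, byrow = TRUE)

# m follows the coding documented above:
# 3 = nothing missing, 2 = only straight available, 1 = only split available
m <- matrix(3L, nrow = nrow(y), ncol = ncol(y))
m[4, 2] <- 2L

toy_data <- list(y = y, m = m)

# Basic sanity checks: matching dimensions and valid missingness codes
stopifnot(identical(dim(toy_data$y), dim(toy_data$m)))
stopifnot(all(toy_data$m %in% 1:3))

# Collapse y to its unique row profiles and count how often each occurs
uy <- unique(toy_data$y)
key_all <- apply(toy_data$y, 1, paste, collapse = "")
key_unique <- apply(uy, 1, paste, collapse = "")
profile_id <- match(key_all, key_unique)
profile_counts <- tabulate(profile_id, nbins = nrow(uy))

# Subsetting by row index keeps the matrix shape (compare the subset argument)
rows <- c(1, 3)
y_subset <- toy_data$y[rows, , drop = FALSE]
```

Here the five rows reduce to three unique profiles, so a likelihood evaluated once per profile and weighted by its count covers the same data with fewer rows.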

Details

See fmt_mu_viz for a quick way to visualize the output.

Value

ests

The last iteration

iters

Stored iterations

aux

A list of stored items not specific to iterations. These include the initial values, parameters, total time data, and settings.

seeds_run

A vector of length runs containing the seeds that were used.

loglik_run

A vector of length runs containing the final log likelihood of each run of the model. Only the model with the highest log likelihood is stored.
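The relationship between seeds_run and loglik_run can be sketched with plain vectors. The numbers below are hypothetical stand-ins for what a fit with runs = 3 might return. They are not output from the package, and the assumption that the two vectors are aligned element by element is inferred from the descriptions above.

```r
# Hypothetical stand-ins for the seeds_run and loglik_run components
seeds_run <- c(2138, 2139, 2140)
loglik_run <- c(-1520.4, -1498.7, -1503.2)

# Index of the run with the highest final log likelihood
best <- which.max(loglik_run)
best_seed <- seeds_run[best]

# Gap between each run and the best one; large gaps suggest that some
# starting values ended in a poorer local optimum
gap_to_best <- max(loglik_run) - loglik_run
```

Re-running with seed set to best_seed and runs = 1 would be one plausible way to reproduce the stored fit, assuming seeds map to runs as described.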

Examples

em_full <- clusterCVR(simdata_full, init = "kmeans", runs = 2)

summary(em_full)

## Not run: 
 pars <- summ_params(em_full)
 graph_trend(pars, simdata_full)

## End(Not run)

em_miss  <- clusterCVR(simdata_miss, IIA = TRUE)


kuriwaki/clusterCVR documentation built on July 31, 2024, 8:28 p.m.