clusterCVR: Cluster CVR data with an EM algorithm
In kuriwaki/clusterCVR: A Clustering Algorithm for Cast Vote Records

clusterCVR

R Documentation

Cluster CVR data with an EM algorithm

Description

Compute cluster assignment probabilities by an EM algorithm. The required inputs are a numeric data matrix and the number of clusters.

Usage

clusterCVR(
  data,
  user_K = 3,
  loglik_thresh = 1e-05,
  runs = 1,
  n_iter = Inf,
  fast = FALSE,
  IIA = FALSE,
  init = "kmeans",
  subset = NULL,
  ignore_X = FALSE,
  recode_key = NULL,
  seed = 2138,
  verbose = TRUE,
  pi = NULL,
  mu = NULL,
  zeta_hat = NULL
)

.cluster(data, user_K, seed, n_iter, loglik_thresh, fast, IIA, init, verbose)

Arguments

`data`	the dataset, in list form, with the following slots:
  `y`: An n by K matrix of split indicators.
  `m`: An n by K matrix of missingness indicators. 3 means no missing, 2 means only straight is available, and 1 means only split is available.
  `X`: An optional n by P matrix of respondent-specific covariates.
  `n_u`: An integer scalar for the number of voters.
  `L`: An integer scalar for the number of possible y values (0-indexed).
  `uy`: An n by K matrix of unique profiles of y, needed if `fast = TRUE`.
`user_K`	the number of clusters to presume / compute
`loglik_thresh`	the threshold value for convergence. The EM will stop when the relative change in log likelihood is less than the threshold.
`runs`	Number of replications to run, each with different starting values. Default is 1, but more than 1 is highly recommended if computing time is not prohibitive.
`n_iter`	manual limit on the number of iterations. Defaults to `Inf`, i.e. no limit.
`fast`	Should the data be summarized to unique profiles so that estimation is faster? Currently only possible if `IIA = FALSE`. Defaults to `FALSE`.
`IIA`	Should the `data$y` matrix be assumed to be generated from a varying choice set as defined by `data$m`? Defaults to `FALSE`.
`init`	method of initialization. Defaults to `"kmeans"`.
`subset`	A vector of row indices or row names by which to subset all the data. Useful for testing on a small subset of the data without modifying the `data` list. If `fast = TRUE`, it will subset `n_u` and `uy`. If `fast = FALSE`, it will subset `y`.
`ignore_X`	Should X be set to NULL even if it is provided? Useful when switching between the covariate and no-covariate cases. Defaults to `FALSE`.
`recode_key`	A named vector to be passed on to `dplyr::recode`, in the form `(old1 = new1, old2 = new2, ...)`
`seed`	seed for initialization
`verbose`	Defaults to TRUE.
`pi`, `mu`, `zeta_hat`	initial values of the key parameters, if there are any good guesses. If left `NULL`, they are initialized using the method given in `init`. Follow the format of the output.
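
The idea behind `fast` (summarizing the data to unique profiles) can be sketched in base R as follows. This is only an illustration, not the package's code: the 0/1 coding of the toy matrix and the names `uy_toy` and `counts_toy` are hypothetical, and whether the package stores per-profile counts in this form is an assumption.

```r
# Toy ballot data: 6 voters (rows) by 3 contests (columns).
# The 0/1 coding is illustrative only, not the package's required coding.
y <- matrix(
  c(0, 1, 0,
    0, 1, 0,
    1, 1, 0,
    0, 1, 0,
    1, 1, 0,
    1, 0, 1),
  ncol = 3, byrow = TRUE
)

# Collapse each row to a string key so identical profiles can be grouped.
key <- apply(y, 1, paste, collapse = "-")

# Unique profiles, in order of first appearance (hypothetical name).
first_row <- !duplicated(key)
uy_toy <- y[first_row, , drop = FALSE]

# Number of voters sharing each unique profile, aligned with rows of uy_toy.
counts_toy <- as.integer(table(factor(key, levels = key[first_row])))

# Subsetting rows by index before collapsing, similar in spirit to `subset`.
y_sub <- y[1:4, , drop = FALSE]
```

Working with the unique profiles and their counts means that any per-row computation is done once per distinct profile rather than once per voter, which is where the speed gain would come from when many voters share the same profile.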

Details

See fmt_mu_viz for a quick way to visualize the output.

Value

ests: The last iteration
iters: Stored iterations
aux: A list of stored items not specific to iterations. These include the initial values, parameters, total time data, and settings.
seeds_run: A vector of the seeds that were used, one per run.
loglik_run: A vector of final log-likelihood estimates, one for each run of the model. Only the model with the highest log likelihood is stored.
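
The stopping rule described for `loglik_thresh` and the keep-the-best-run behavior described for `runs` and `loglik_run` can be illustrated with a toy EM in base R. This is a minimal sketch under stated assumptions, not the package's implementation: the model here is a plain mixture of independent Bernoullis, the function `toy_em` and the objects `seeds_toy` and `loglik_toy` are hypothetical, and the exact form of the relative change used by `clusterCVR` is assumed.

```r
# Toy EM for a mixture of independent Bernoullis (NOT the package's model).
toy_em <- function(y, K, seed, loglik_thresh = 1e-05, n_iter = 500) {
  set.seed(seed)
  J <- ncol(y)
  pi_k <- rep(1 / K, K)                          # mixing proportions
  mu <- matrix(runif(K * J, 0.25, 0.75), K, J)   # P(y_j = 1 | cluster k)
  loglik_old <- -Inf
  iter <- 0
  while (iter < n_iter) {
    iter <- iter + 1
    # E-step: log joint of each row with each cluster, an n by K matrix.
    log_joint <- y %*% t(log(mu)) + (1 - y) %*% t(log(1 - mu))
    log_joint <- sweep(log_joint, 2, log(pi_k), "+")
    row_max <- apply(log_joint, 1, max)          # for a stable log-sum-exp
    log_marg <- row_max + log(rowSums(exp(log_joint - row_max)))
    zeta <- exp(log_joint - log_marg)            # assignment probabilities
    loglik <- sum(log_marg)
    # Stop when the relative change in log likelihood is below the threshold
    # (assumed form of the rule).
    if (is.finite(loglik_old) &&
        abs((loglik - loglik_old) / loglik_old) < loglik_thresh) break
    loglik_old <- loglik
    # M-step: update proportions and cluster-specific probabilities.
    pi_k <- colMeans(zeta)
    mu <- t(zeta) %*% y / colSums(zeta)
    mu <- pmin(pmax(mu, 1e-6), 1 - 1e-6)         # keep the logs finite
  }
  list(loglik = loglik, zeta = zeta, pi = pi_k, mu = mu, iter = iter)
}

# Simulate binary data from two well-separated clusters.
set.seed(1)
z <- rep(1:2, each = 100)
p <- rbind(c(0.9, 0.9, 0.1, 0.1), c(0.1, 0.1, 0.9, 0.9))
y <- matrix(rbinom(200 * 4, 1, p[z, ]), 200, 4)

# Several runs with different seeds; keep the run with the highest
# final log likelihood.
seeds_toy <- c(11, 12, 13)
fits <- lapply(seeds_toy, function(s) toy_em(y, K = 2, seed = s))
loglik_toy <- vapply(fits, function(f) f$loglik, numeric(1))
best <- fits[[which.max(loglik_toy)]]
```

Because EM can converge to different local optima depending on the starting values, comparing the final log likelihood across seeds is a common way to choose among runs, which is consistent with the recommendation to use more than one run.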

Examples

em_full <- clusterCVR(simdata_full, init = "kmeans", runs = 2)

summary(em_full)

## Not run: 
 pars <- summ_params(em_full)
 graph_trend(pars, simdata_full)

## End(Not run)

em_miss  <- clusterCVR(simdata_miss, IIA = TRUE)

kuriwaki/clusterCVR documentation built on July 31, 2024, 8:28 p.m.