build_mmm: Fit a Mixed Markov Model
In Nestimate: Network Estimation, Bootstrap, and Higher-Order Analysis

build_mmm

R Documentation

Fit a Mixed Markov Model

Description

Discovers latent subgroups with different transition dynamics using Expectation-Maximization. Each mixture component has its own transition matrix. Sequences are probabilistically assigned to components.

Usage

build_mmm(
  data,
  k = 2L,
  n_starts = 50L,
  max_iter = 200L,
  tol = 1e-06,
  smooth = 0.01,
  seed = NULL,
  covariates = NULL,
  covariate_effect = c("em", "posthoc"),
  estimator = c("auto", "firth", "multinom", "chisq")
)

Arguments

`data`	A data.frame (wide format), `netobject`, or `tna` model. For tna objects, extracts the stored data.
`k`	Integer. Whole finite number of mixture components, >= 2. Default: 2.
`n_starts`	Integer. Positive whole finite number of random restarts. Default: 50.
`max_iter`	Integer. Positive whole finite maximum EM iterations per start. Default: 200.
`tol`	Numeric. Finite positive convergence tolerance. Default: 1e-6.
`smooth`	Numeric. Finite non-negative Laplace smoothing constant. Default: 0.01.
`seed`	Integer or NULL. Random seed.
`covariates`	Optional. Covariates integrated into the EM algorithm to model covariate-dependent mixing proportions. Accepts a string, character vector, formula, or data.frame (same forms as `build_clusters`). For `netobject` or `cograph_network` input, names are resolved against `$metadata` first, so a typical call is `build_mmm(net, k = 3, covariates = "session_label")`. Unlike the post-hoc analysis in `build_clusters()`, these covariates directly influence cluster membership during EM estimation (see `covariate_effect`).
`covariate_effect`	How `covariates` enter the model. `"em"` (default) folds them into the EM as covariate-dependent mixing proportions, so they shape the cluster fit itself (and rows with missing covariates are dropped before fitting). `"posthoc"` fits a plain mixture on every sequence and uses the covariates only for the after-fit multinomial logit, so covariate values — and their missingness — never change which clusters are found. Ignored when `covariates` is `NULL`.
`estimator`	Multinomial fitter for the post-hoc covariate analysis (does not affect EM): `"auto"` (default) inspects the cluster x covariate cross-tab and falls back to `"firth"` only when any cell has fewer than 5 observations (separation risk), otherwise the much faster `"multinom"`; `"firth"` forces Firth's penalised likelihood via `brglm2::brmultinom` (finite under separation); `"multinom"` forces `nnet::multinom` (warns about separation risk); `"chisq"` runs descriptive tests (no logit). See `build_clusters` for full details.

Value

An object of class net_mmm with components:

data: The full N-row sequence frame used for estimation.
models: List of netobjects, one per component. Each component carries the rows assigned to that component in its $data slot, while its transition matrix is the EM-estimated component transition matrix.
k: Number of components.
mixing: Numeric vector of mixing proportions.
posterior: N x k matrix of posterior probabilities.
assignments: Integer vector of hard assignments (1..k).
quality: List: avepp (per-class), avepp_overall, entropy, relative_entropy, classification_error.
log_likelihood, BIC, AIC, ICL: Model fit statistics.
states: Character vector of state names.

Initial states

The first sequence column has special status: it is read directly as the per-sequence initial state (init_state[i] <- match(raw_data[i, state_cols[1L]], states)). The function does not scan forward to the first non-missing position, and it does not apply any na_syms-style symbol conversion (unlike build_clusters). The state vocabulary is built from the unique non-NA values across all columns, so if your data uses a sentinel character such as "*" or "%" for missing cells, that sentinel becomes a real state and the first column reads it as a valid initial state. If you want padded leading missings to be treated as missing, recode them to NA before calling build_mmm() (then match() returns NA, which the EM treats as an uninformative initial distribution), or left-trim the leading missings so each sequence's first column carries an observed state.

Examples

seqs <- data.frame(V1 = sample(c("A","B","C"), 30, TRUE),
                   V2 = sample(c("A","B","C"), 30, TRUE))
mmm <- build_mmm(seqs, k = 2, n_starts = 1, max_iter = 10, seed = 1)
mmm

seqs <- data.frame(
  V1 = sample(LETTERS[1:3], 30, TRUE), V2 = sample(LETTERS[1:3], 30, TRUE),
  V3 = sample(LETTERS[1:3], 30, TRUE), V4 = sample(LETTERS[1:3], 30, TRUE)
)
mmm <- build_mmm(seqs, k = 2, seed = 42)
print(mmm)
summary(mmm)

Nestimate documentation built on July 11, 2026, 1:09 a.m.