cluster_summary: Cluster Summary Statistics
In Nestimate: Network Estimation, Bootstrap, and Higher-Order Analysis

Description

Aggregates node-level network weights to cluster-level summaries. Computes both between-cluster transitions (how clusters connect to each other) and within-cluster transitions (how nodes connect within each cluster).

Usage

cluster_summary(
  x,
  clusters = NULL,
  method = c("sum", "mean", "median", "max", "min", "density", "geomean"),
  type = c("tna", "cooccurrence", "semi_markov", "raw"),
  directed = TRUE,
  compute_within = TRUE
)

Arguments

`x`	Network input. Accepts multiple formats:
	matrix: Numeric adjacency/weight matrix. Row and column names are used as node labels. Values represent edge weights (e.g., transition counts, co-occurrence frequencies, or probabilities).
	netobject: A cograph network object. The function extracts the weight matrix from `x$weights` or converts via `to_matrix()`. Clusters can be auto-detected from node attributes.
	tna: A tna object from the tna package. Extracts `x$weights`.
	cluster_summary: If `x` is already a cluster_summary, it is returned unchanged.
`clusters`	Cluster/group assignments for nodes. Accepts multiple formats:
	NULL (default): Auto-detect from netobject. Looks for columns named 'clusters', 'cluster', 'groups', or 'group' in `x$nodes`. Throws an error if no cluster column is found. This option only works when `x` is a netobject.
	vector: Cluster membership for each node, in the same order as the matrix rows/columns. Can be numeric (1, 2, 3) or character ("A", "B"). Cluster names are derived from the unique values. Example: `c(1, 1, 2, 2, 3, 3)` assigns the first two nodes to cluster 1.
	data.frame: A data frame where the first column contains node names and the second column contains group/cluster names. Example: `data.frame(node = c("A", "B", "C"), group = c("G1", "G1", "G2"))`
	named list: Explicit mapping of cluster names to node labels. List names become cluster names; values are character vectors of node labels that must match matrix row/column names. Example: `list(Alpha = c("A", "B"), Beta = c("C", "D"))`
`method`	Aggregation method for combining edge weights within/between clusters. Controls how multiple node-to-node edges are summarized:
	"sum" (default): Sum of all edge weights. Best for count data (e.g., transition frequencies). Preserves total flow.
	"mean": Average edge weight. Best when cluster sizes differ and you want to control for size. When the input is already a transition matrix (rows sum to 1), "mean" avoids size bias: a cluster with 5 nodes does not get 5x the weight of a cluster with 1 node.
	"median": Median edge weight. Robust to outliers.
	"max": Maximum edge weight. Captures the strongest connection.
	"min": Minimum edge weight. Captures the weakest connection.
	"density": Sum divided by the number of possible edges. Normalizes by cluster size combinations.
	"geomean": Geometric mean of positive weights. Useful for multiplicative processes.
`type`	Post-processing applied to aggregated weights. Determines the interpretation of the resulting matrices:
	"tna" (default): Row-normalize so each row sums to 1. Creates transition probabilities suitable for Markov chain analysis. Interpretation: "Given I'm in cluster A, what's the probability of transitioning to cluster B?" Required for use with tna package functions. The diagonal represents the within-cluster transition probability.
	"raw": No normalization. Returns aggregated counts/weights as-is. Use for frequency analysis or when you need raw counts. Compatible with igraph's contract + simplify output.
	"cooccurrence": Symmetrize the matrix: (A + t(A)) / 2. For undirected co-occurrence analysis.
	"semi_markov": Row-normalize with duration weighting. For semi-Markov process analysis.
`directed`	Logical. If `TRUE` (default), treat network as directed. A->B and B->A are separate edges. If `FALSE`, edges are undirected and the matrix is symmetrized before processing.
`compute_within`	Logical. If `TRUE` (default), compute within-cluster transition matrices for each cluster. Each cluster gets its own n_i x n_i matrix showing internal node-to-node transitions. Set to `FALSE` to skip this computation for better performance when only between-cluster summary is needed.
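
As a rough illustration of what the `method` argument controls, the sketch below aggregates a small node-level matrix into cluster-level blocks in base R. It is not the package's implementation: `aggregate_blocks()` and `geomean()` are hypothetical helpers written for this example, and the sketch assumes that diagonal blocks include self-loops.

```r
# Hypothetical helper: collapse a node-level matrix into a k x k cluster
# matrix by applying `fun` to every (source cluster, target cluster) block.
aggregate_blocks <- function(mat, membership, fun = sum) {
  groups <- split(seq_len(nrow(mat)), membership)  # node indices per cluster
  k <- length(groups)
  out <- matrix(0, k, k, dimnames = list(names(groups), names(groups)))
  for (i in seq_len(k)) {
    for (j in seq_len(k)) {
      block <- mat[groups[[i]], groups[[j]], drop = FALSE]
      out[i, j] <- fun(block)
    }
  }
  out
}

mat <- matrix(1:16, 4, 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
membership <- c(1, 1, 2, 2)

agg_sum  <- aggregate_blocks(mat, membership, sum)   # total flow per block
agg_mean <- aggregate_blocks(mat, membership, mean)  # average edge weight
agg_max  <- aggregate_blocks(mat, membership, max)   # strongest edge

# Assumed form of the geometric mean: positive weights only.
geomean <- function(w) exp(mean(log(w[w > 0])))
agg_geo <- aggregate_blocks(mat, membership, geomean)
```

With equal cluster sizes, as here, "sum" and "mean" differ only by a constant factor; they diverge once clusters have different numbers of nodes.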

Details

This is the core function for Multi-Cluster Multi-Level (MCML) analysis. Use as_tna() to convert results to tna objects for further analysis with the tna package.

Workflow

Typical MCML analysis workflow:

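# 1. Create network
# (`data` and `group_assignments` are placeholders for your own sequence
# data and a per-node vector of cluster labels)
net <- build_network(data, method = "relative")
net$nodes$clusters <- group_assignments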

# 2. Compute cluster summary
cs <- cluster_summary(net, type = "tna")

# 3. Convert to tna models
tna_models <- as_tna(cs)

# 4. Analyze/visualize
plot(tna_models$macro)
tna::centralities(tna_models$macro)

Between-Cluster Matrix Structure

The macro$weights matrix has clusters as both rows and columns:

Off-diagonal (row i, col j): Aggregated weight from cluster i to cluster j
Diagonal (row i, col i): Within-cluster total (sum of internal edges in cluster i)

When type = "tna", rows sum to 1 and diagonal values represent "retention rate" - the probability of staying within the same cluster.
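
The row normalization described above can be sketched in base R. The matrix `raw` is made-up data for illustration, and the sketch does not handle rows that sum to zero, which the package may treat differently.

```r
# Made-up cluster-level counts (rows = source cluster, columns = target).
raw <- matrix(c(14, 46,
                22, 54),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("A", "B"), c("A", "B")))

# Row-normalize: divide every entry by its row total.
prob <- raw / rowSums(raw)

rowSums(prob)  # each row sums to 1
diag(prob)     # retention: probability of staying in the same cluster
```

Here cluster A keeps 14 of its 60 outgoing units (about 0.23) and cluster B keeps 54 of 76 (about 0.71).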

Choosing method and type

Input data	Recommended	Reason
Edge counts	method="sum", type="tna"	Preserves total flow, normalizes to probabilities
Transition matrix	method="mean", type="tna"	Avoids cluster size bias
Frequencies	method="sum", type="raw"	Keep raw counts for analysis
Correlation matrix	method="mean", type="raw"	Average correlations
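
To see why "mean" is suggested for transition-matrix input, consider a small worked example in base R. `block_probs()` is a hypothetical helper, not a package function; it aggregates blocks with a chosen function and then row-normalizes.

```r
# Uniform 3-node transition matrix: every transition has probability 1/3.
P <- matrix(1 / 3, 3, 3, dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
groups <- list(G1 = c("A", "B"), G2 = "C")  # unequal cluster sizes

# Hypothetical helper: aggregate each block with `fun`, then row-normalize.
block_probs <- function(P, groups, fun) {
  agg <- sapply(groups, function(to) {
    sapply(groups, function(from) fun(P[from, to, drop = FALSE]))
  })
  agg / rowSums(agg)  # rows = source cluster, columns = target cluster
}

by_sum  <- block_probs(P, groups, sum)   # each row: G1 = 2/3, G2 = 1/3
by_mean <- block_probs(P, groups, mean)  # each row: G1 = 1/2, G2 = 1/2
```

Although every node is equally likely to be visited, summing makes the two-node cluster G1 look twice as attractive as G2. Averaging removes that size effect.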

Value

A cluster_summary object (S3 class) containing:

between

List with two elements:

weights: k x k matrix of cluster-to-cluster weights, where k is the number of clusters. Row i, column j contains the aggregated weight from cluster i to cluster j. Diagonal contains within-cluster totals. Processing depends on type.
inits: Numeric vector of length k. Initial state distribution across clusters, computed from column sums of the original matrix. Represents the proportion of incoming edges to each cluster.
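
One way to read the description of inits is sketched below in base R. The exact normalization used by the package is an assumption here; the sketch simply totals the column sums per cluster and rescales them to sum to 1.

```r
mat <- matrix(1:16, 4, 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
membership <- c("X", "X", "Y", "Y")  # cluster label per node

# Incoming weight per node, then totalled per cluster.
incoming <- tapply(colSums(mat), membership, sum)

# Assumed normalization: share of all incoming weight per cluster.
inits <- incoming / sum(incoming)
```

In this toy matrix, cluster X receives 36 of 136 units of incoming weight and cluster Y receives the remaining 100.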

within

Named list with one element per cluster. Each element contains:

weights: n_i x n_i matrix for nodes within that cluster. Shows internal transitions between nodes in the same cluster.
inits: Initial distribution within the cluster.

NULL if compute_within = FALSE.
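
The within-cluster matrices can be pictured as square sub-matrices of the original network. The base R sketch below extracts them and row-normalizes each one; applying the "tna"-style normalization to the sub-matrices is an assumption made for this example.

```r
mat <- matrix(1:16, 4, 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
members <- list(Alpha = c("A", "B"), Beta = c("C", "D"))

# One n_i x n_i matrix per cluster, restricted to that cluster's nodes.
within_mats <- lapply(members, function(nodes) {
  sub <- mat[nodes, nodes, drop = FALSE]
  sub / rowSums(sub)  # assumed: rows rescaled to sum to 1
})

within_mats$Alpha  # transitions among A and B only
```

Skipping this step, as `compute_within = FALSE` does, avoids building one sub-matrix per cluster.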

clusters

Named list mapping cluster names to their member node labels. Example: list(A = c("n1", "n2"), B = c("n3", "n4", "n5"))
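
The vector and data.frame forms of the `clusters` argument carry the same information as this named list. A base R sketch of the conversion, using made-up node names, might look like this (the package's own conversion code may differ):

```r
nodes <- c("A", "B", "C", "D")

# From a membership vector: cluster names come from the unique values.
membership <- c(1, 1, 2, 2)
from_vector <- split(nodes, membership)

# From a two-column data frame (node name, group name).
df <- data.frame(node = nodes,
                 group = c("G1", "G1", "G2", "G2"),
                 stringsAsFactors = FALSE)
from_df <- split(df$node, df$group)

# Cluster sizes follow directly from the list.
sizes <- lengths(from_df)
```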

meta

List of metadata:

type: The type argument used ("tna", "raw", etc.)
method: The method argument used ("sum", "mean", etc.)
directed: Logical, whether network was treated as directed
n_nodes: Total number of nodes in original network
n_clusters: Number of clusters
cluster_sizes: Named vector of cluster sizes

Examples

# -----------------------------------------------------
# Basic usage with matrix and cluster vector
# -----------------------------------------------------
mat <- matrix(runif(100), 10, 10)
rownames(mat) <- colnames(mat) <- LETTERS[1:10]

clusters <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
cs <- cluster_summary(mat, clusters)

# Access results
cs$macro$weights    # 3x3 cluster transition matrix
cs$macro$inits      # Initial distribution
cs$clusters$`1`$weights # Within-cluster 1 transitions
cs$meta               # Metadata

# -----------------------------------------------------
# Named list clusters (more readable)
# -----------------------------------------------------
clusters <- list(
  Alpha = c("A", "B", "C"),
  Beta = c("D", "E", "F"),
  Gamma = c("G", "H", "I", "J")
)
cs <- cluster_summary(mat, clusters, type = "tna")
cs$macro$weights    # Rows/cols named Alpha, Beta, Gamma
cs$clusters$Alpha       # Within Alpha cluster

# -----------------------------------------------------
# netobject input with an explicit cluster vector
# -----------------------------------------------------

seqs <- data.frame(
  V1 = sample(LETTERS[1:10], 30, TRUE), V2 = sample(LETTERS[1:10], 30, TRUE),
  V3 = sample(LETTERS[1:10], 30, TRUE)
)
net <- build_network(seqs, method = "relative")
# One cluster id per node; assumes all 10 states occur in seqs
cs2 <- cluster_summary(net, c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))


# -----------------------------------------------------
# Different aggregation methods
# -----------------------------------------------------
cs_sum <- cluster_summary(mat, clusters, method = "sum")   # Total flow
cs_mean <- cluster_summary(mat, clusters, method = "mean") # Average
cs_max <- cluster_summary(mat, clusters, method = "max")   # Strongest

# -----------------------------------------------------
# Raw counts vs TNA probabilities
# -----------------------------------------------------
cs_raw <- cluster_summary(mat, clusters, type = "raw")
cs_tna <- cluster_summary(mat, clusters, type = "tna")

rowSums(cs_raw$macro$weights)  # Unnormalized row totals
rowSums(cs_tna$macro$weights)  # All equal to 1

# -----------------------------------------------------
# Skip within-cluster computation for speed
# -----------------------------------------------------
cs_fast <- cluster_summary(mat, clusters, compute_within = FALSE)
cs_fast$clusters  # NULL

# -----------------------------------------------------
# Convert to tna objects for tna package
# -----------------------------------------------------
cs <- cluster_summary(mat, clusters, type = "tna")
tna_models <- as_tna(cs)
# tna_models$macro      # tna object
# tna_models$clusters$Alpha # tna object

Nestimate documentation built on April 20, 2026, 5:06 p.m.