term_per_cluster: Extract Terms and Segments for Document Clusters
In tall: Text Analysis for All

term_per_cluster

R Documentation

Description

This function processes the results of a document clustering algorithm based on the Reinert method. It computes the terms and their significance for each cluster, as well as the associated document segments.

Usage

term_per_cluster(res, cutree = NULL, k = 1, negative = TRUE)

Arguments

`res`	A list containing the results of the Reinert clustering algorithm. Must include at least `dtm` (a document-term matrix) and `corresp_uce_uc_full` (a correspondence between segments and clusters).
`cutree`	A custom cutree structure. If `NULL`, the default `cutree_reinert` is used to determine cluster membership.
`k`	A vector of integers specifying the clusters to analyze. Default is `1`.
`negative`	Logical. If `TRUE`, include negative terms in the results. If `FALSE`, exclude them. Default is `TRUE`.

Details

The function adds document-term matrix rows for segments that are missing from the clustering results, computes term statistics for each cluster, and filters the terms by significance. The set of excluded significance classes (`signExcluded`) determines which terms are dropped from the output.
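
As a rough illustration of the kind of per-cluster term statistic described above, the sketch below computes a chi-squared statistic, a p-value, and a sign for a single term from a toy 2x2 contingency table in base R. The counts, the 0.05 threshold, and the sign convention are assumptions made for illustration; this is not the package's internal code.

```r
# Toy counts (hypothetical): occurrences of one term inside and outside a cluster
term_in_cluster  <- 30   # term occurrences in the cluster
term_elsewhere   <- 20   # term occurrences in all other clusters
other_in_cluster <- 170  # occurrences of all other terms in the cluster
other_elsewhere  <- 780  # occurrences of all other terms elsewhere

tab <- matrix(
  c(term_in_cluster, term_elsewhere,
    other_in_cluster, other_elsewhere),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("term", "other"), c("cluster", "rest"))
)

# Chi-squared test of independence, without continuity correction
test <- chisq.test(tab, correct = FALSE)

freq  <- tab["term", "cluster"]            # observed frequency
indep <- test$expected["term", "cluster"]  # expected frequency under independence

# Assumed convention: "positive" if over-represented, "negative" if
# under-represented, "none" if not significant at the 0.05 level
term_sign <- if (test$p.value > 0.05) {
  "none"
} else if (freq > indep) {
  "positive"
} else {
  "negative"
}

stat <- data.frame(
  chi_square = unname(test$statistic),
  p_value    = test$p.value,
  sign       = term_sign,
  freq       = freq,
  indep      = indep
)
print(stat)
```

Here the term is observed 30 times in the cluster against an expected count of 10, so it would be labelled positive under the assumed convention.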

Value

A list with the following components:

terms

A data frame of significant terms for each cluster. Columns include:

chi_square: Chi-squared statistic for the term.
p_value: P-value of the chi-squared test.
sign: Significance of the term (positive, negative, or none).
term: The term itself.
freq: Observed frequency of the term in the cluster.
indep: Expected frequency of the term under independence.
cluster: The cluster ID.
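
A minimal sketch of working with a data frame of this shape. The values below are invented, and `top_terms` is a hypothetical helper that is not part of the package. It keeps the positive terms and returns the highest chi-squared terms within each cluster.

```r
# Toy data frame mimicking the documented columns (all values are invented)
terms <- data.frame(
  chi_square = c(52.6, 12.1, 8.4, 30.2, 4.1, 19.7),
  p_value    = c(0.0001, 0.0005, 0.004, 0.0001, 0.04, 0.0001),
  sign       = c("positive", "negative", "positive",
                 "positive", "positive", "negative"),
  term       = c("whale", "land", "sea", "captain", "ship", "whale"),
  freq       = c(30, 2, 12, 25, 9, 1),
  indep      = c(10, 9.5, 6.1, 9.8, 5.2, 8.7),
  cluster    = c(1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: top n positive terms per cluster, ranked by chi-squared
top_terms <- function(terms, n = 2) {
  pos <- terms[terms$sign == "positive", ]
  # Sort by cluster, then by decreasing chi-squared within each cluster
  pos <- pos[order(pos$cluster, -pos$chi_square), ]
  # Position of each row within its cluster after sorting
  rank_in_cluster <- ave(pos$chi_square, pos$cluster, FUN = seq_along)
  out <- pos[rank_in_cluster <= n,
             c("cluster", "term", "chi_square", "freq", "indep")]
  rownames(out) <- NULL
  out
}

top <- top_terms(terms, n = 2)
print(top)
```

The same helper could be applied to the `terms` component of a real result, provided its columns match the ones listed above.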

segments

A data frame of document segments associated with each cluster. Columns include:

uc: Unique segment identifier.
doc_id: Document ID for the segment.
cluster: Cluster ID.
segment: The text content of each segment.
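
A data frame with these four columns can be inspected with base R alone. The rows below are invented, including the integer `uc` values, and the summaries shown are only one possible way to check cluster sizes and read the text assigned to a cluster.

```r
# Toy data frame mimicking the documented columns (all values are invented)
segments <- data.frame(
  uc      = 1:5,
  doc_id  = c("ch1", "ch1", "ch2", "ch2", "ch3"),
  cluster = c(1, 2, 1, 1, 2),
  segment = c("the sea was calm that morning",
              "the captain paced the deck",
              "waves rolled under the hull",
              "a pale shape rose from the sea",
              "the crew waited for orders"),
  stringsAsFactors = FALSE
)

# Number of segments assigned to each cluster
cluster_sizes <- table(segments$cluster)

# Number of segments each document contributes to each cluster
doc_by_cluster <- table(segments$doc_id, segments$cluster)

# Text of the segments assigned to cluster 1
cluster1_text <- segments$segment[segments$cluster == 1]

print(cluster_sizes)
print(doc_by_cluster)
print(cluster1_text)
```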

Examples


library(tall)

# Load the example corpus used in this example
data(mobydick)
res <- reinert(
  x = mobydick,
  k = 10,
  term = "token",
  segment_size = 40,
  min_segment_size = 5,
  min_split_members = 10,
  cc_test = 0.3,
  tsj = 3
)

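# Extract terms and segments for clusters 1 to 10, excluding negative terms
tc <- term_per_cluster(res, cutree = NULL, k = 1:10, negative = FALSE)

# First 10 segments with their cluster assignment
head(tc$segments, 10)

# First 10 terms with their statistics
head(tc$terms, 10)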

tall documentation built on June 8, 2025, 11:08 a.m.