term_per_cluster: Extract Terms and Segments for Document Clusters

View source: R/reinert.R

term_per_cluster R Documentation

Extract Terms and Segments for Document Clusters

Description

This function processes the results of a document clustering produced with the Reinert method. For each requested cluster, it computes the terms and their significance, and returns the document segments assigned to that cluster.

Usage

term_per_cluster(res, cutree = NULL, k = 1, negative = TRUE)

Arguments

res

A list containing the results of the Reinert clustering algorithm. Must include at least dtm (a document-term matrix) and corresp_uce_uc_full (a correspondence between segments and clusters).

cutree

A custom cutree structure. If NULL, the default cutree_reinart is used to determine cluster membership.

k

A vector of integers specifying the clusters to analyze. Default is 1.

negative

Logical. If TRUE, include negative terms in the results. If FALSE, exclude them. Default is TRUE.

Details

The function integrates document-term matrix rows for missing segments, calculates term statistics for each cluster, and filters terms by the sign of their significance (signExcluded). With negative = FALSE, terms with a negative sign are excluded.
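To make the term statistics concrete, the following is a minimal base R sketch of a per-cluster chi-squared computation on a toy document-term matrix. It is not the package's implementation: the helper term_stats, the toy data, the 0.05 threshold, and the use of raw occurrence counts are all assumptions made for illustration.

```r
# Toy document-term matrix: 6 segments x 3 terms (made-up counts).
dtm <- matrix(
  c(2, 0, 1,
    3, 0, 0,
    1, 1, 0,
    0, 2, 1,
    0, 3, 0,
    0, 1, 2),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("seg", 1:6), c("whale", "ship", "sea"))
)
cluster <- c(1, 1, 1, 2, 2, 2)  # hypothetical cluster membership per segment

# Hypothetical helper: chi-squared statistics for every term in cluster k.
# alpha is an assumed significance threshold, not a documented default.
term_stats <- function(dtm, cluster, k, alpha = 0.05) {
  in_k <- cluster == k
  total <- sum(dtm)
  cluster_total <- sum(dtm[in_k, , drop = FALSE])
  rows <- lapply(colnames(dtm), function(term) {
    freq <- sum(dtm[in_k, term])                  # observed count in cluster k
    term_total <- sum(dtm[, term])
    indep <- term_total * cluster_total / total   # expected count under independence
    # 2 x 2 contingency table: term vs. other terms, cluster k vs. other clusters
    tab <- matrix(c(freq, term_total - freq,
                    cluster_total - freq,
                    total - term_total - cluster_total + freq), nrow = 2)
    test <- suppressWarnings(chisq.test(tab, correct = FALSE))
    term_sign <- if (test$p.value >= alpha) {
      "none"
    } else if (freq > indep) {
      "positive"
    } else {
      "negative"
    }
    data.frame(term = term, freq = freq, indep = indep,
               chi_square = unname(test$statistic), p_value = test$p.value,
               sign = term_sign, cluster = k, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

stats <- term_stats(dtm, cluster, k = 1)
print(stats)
```

In this sketch a term is labeled positive when it is significantly over-represented in the cluster (freq above indep), negative when significantly under-represented, and none otherwise.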

Value

A list with the following components:

terms

A data frame of significant terms for each cluster. Columns include:

  • chi_square: Chi-squared statistic for the term.

  • p_value: P-value of the chi-squared test.

  • sign: Significance of the term (positive, negative, or none).

  • term: The term itself.

  • freq: Observed frequency of the term in the cluster.

  • indep: Expected frequency of the term under independence.

  • cluster: The cluster ID.

segments

A data frame of document segments associated with each cluster. Columns include:

  • uc: Unique segment identifier.

  • doc_id: Document ID for the segment.

  • cluster: Cluster ID.

  • segment: The text content of each segment.
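As an illustration of how the terms component might be used, the sketch below ranks positive terms within each cluster by their chi-squared statistic. It runs on a hand-made data frame that mimics the documented columns. All values are invented, and the literal sign labels ("positive", "negative", "none") are an assumption based on the column description above.

```r
# Stand-in for tc$terms, built by hand with the documented columns.
terms <- data.frame(
  term       = c("whale", "harpoon", "ship", "sea", "captain", "whale"),
  chi_square = c(10.4, 12.0, 5.1, 1.0, 7.2, 6.3),
  p_value    = c(0.001, 0.0005, 0.024, 0.312, 0.007, 0.012),
  sign       = c("positive", "positive", "negative", "none", "positive", "negative"),
  freq       = c(6, 5, 1, 1, 4, 0),
  indep      = c(2.8, 1.9, 3.3, 1.9, 1.5, 3.2),
  cluster    = c(1, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Keep over-represented terms only, then sort by cluster and descending chi-squared.
pos <- terms[terms$sign == "positive", ]
pos <- pos[order(pos$cluster, -pos$chi_square), ]

# One character vector of ranked terms per cluster, named by cluster ID.
top_terms <- split(pos$term, pos$cluster)
print(top_terms)
```

The same pattern should apply to a real result by replacing the toy data frame with tc$terms, provided the column names and sign labels match.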

Examples


data(mobydick)
res <- reinert(
  x = mobydick,
  k = 10,
  term = "token",
  segment_size = 40,
  min_segment_size = 5,
  min_split_members = 10,
  cc_test = 0.3,
  tsj = 3
)

tc <- term_per_cluster(res, cutree = NULL, k = 1:10, negative = FALSE)

head(tc$segments, 10)

head(tc$terms, 10)



tall documentation built on April 16, 2025, 5:10 p.m.