LexCHCca: Chronological Constrained Hierarchical Clustering on...
In Xplortext: Statistical Analysis of Textual Data

LexCHCca

R Documentation

Chronological Constrained Hierarchical Clustering on Correspondence Analysis Components (LexCHCca)

Description

Chronological constrained agglomerative hierarchical clustering on a corpus of documents

Usage

LexCHCca (object, ncp=5, nb.clust=0, min=2, max=NULL, nb.par=5, 
 graph=TRUE, proba=0.05, cut.test=FALSE, alpha.test =0.05, description=FALSE,
 nb.desc=5, size.desc=80)

Arguments

`object`	object of LexCA class
`ncp`	number of dimensions used from LexCA object (by default 5)
`nb.clust`	number of clusters; used only when no test is performed (cut.test=FALSE). If 0 (or "click"), the tree is cut at the level the user clicks on. If -1 (or "auto"), the tree is cut automatically at the suggested level. If a positive integer, the tree is cut into nb.clust clusters (by default 0)
`min`	minimum number of clusters. Available only if cut.test=FALSE. (by default 2)
`max`	maximum number of clusters. Available only if cut.test=FALSE. (by default NULL; then max is computed as the minimum between 10 and the number of documents divided by 2)
`nb.par`	number of paragons (para) and specific documents (dist) whose labels are listed (by default 5)
`graph`	if TRUE, graphs are displayed (by default TRUE)
`proba`	threshold on the p-value used to describe the clusters (by default 0.05)
`cut.test`	if FALSE (by default), the Legendre test is not performed when joining two nodes. This test is used to determine whether two clusters should be joined or not; see details
`alpha.test`	threshold on the p-value used by the Legendre test when deciding whether two clusters are aggregated (by default 0.05)
`description`	if TRUE, the clusters are described by their characteristic words/documents, paragons (para), specific documents (dist) and contextual variables, if the latter were selected in the previous LexCA call (by default FALSE)
`nb.desc`	number of paragons (para) and specific documents (dist) that are printed when describing the clusters (by default 5)
`size.desc`	maximum number of characters printed for each paragon (para) and specific document (dist) when describing the clusters (by default 80)

Details

LexCHCca starts from the document coordinates obtained from a textual correspondence analysis (LexCA). The hierarchical tree is built so that only chronologically contiguous nodes can be joined. The documents must therefore be ranked in chronological order in the source data frame before the TextData function is applied.
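The contiguity constraint described above can be illustrated with a minimal base R sketch. It is not the code used by LexCHCca: the toy coordinates, the helper name `constrained_merge` and the unweighted Ward-type merging cost are all assumptions made for illustration. The only point it shows is that, at each step, candidate merges are restricted to clusters that are neighbours in the chronological order.

```r
# Toy coordinates: 6 documents in chronological order on 2 axes (illustrative)
coords <- matrix(c(0.0, 0.1,
                   0.1, 0.0,
                   0.2, 0.1,
                   2.0, 2.1,
                   2.1, 2.0,
                   4.0, 0.0),
                 ncol = 2, byrow = TRUE)
rownames(coords) <- paste0("doc", 1:6)

# Hypothetical helper: agglomerate until k clusters remain,
# merging only chronologically adjacent clusters
constrained_merge <- function(x, k) {
  # Each cluster is a vector of consecutive row indices
  clusters <- as.list(seq_len(nrow(x)))
  while (length(clusters) > k) {
    # Ward-type cost of merging each pair of neighbouring clusters
    cost <- vapply(seq_len(length(clusters) - 1), function(i) {
      a <- clusters[[i]]
      b <- clusters[[i + 1]]
      ca <- colMeans(x[a, , drop = FALSE])
      cb <- colMeans(x[b, , drop = FALSE])
      length(a) * length(b) / (length(a) + length(b)) * sum((ca - cb)^2)
    }, numeric(1))
    # Merge the cheapest adjacent pair
    i <- which.min(cost)
    clusters[[i]] <- c(clusters[[i]], clusters[[i + 1]])
    clusters[[i + 1]] <- NULL
  }
  # One cluster label per document, in chronological order
  rep(seq_along(clusters), lengths(clusters))
}

partition <- constrained_merge(coords, k = 3)
print(partition)
```

Because only neighbouring clusters are ever merged, the resulting labels never decrease along the document order, so every cluster is a block of consecutive documents.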

The Legendre test determines whether merging two contiguous nodes would produce a heterogeneous new node (that is, the two clusters are not homogeneous with each other), in which case the nodes are not joined. If the Legendre test is applied (cut.test=TRUE), the number of clusters is the one resulting from the test and nb.clust has no effect.

If no Legendre test is applied (cut.test=FALSE), the number of clusters is determined either a priori or from the constrained hierarchical tree structure.

The object $para contains the distance between each document and the centroid of its cluster.

The object $dist contains the distance between each document and the centroid of the farthest cluster.
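As a rough illustration of the distances behind $para and $dist, the base R sketch below computes the distance from each document to every cluster centroid on toy data. The coordinates, the partition and the use of unweighted centroids are assumptions; the function itself may weight documents and select documents differently.

```r
# Toy coordinates and partition (illustrative only)
coords <- matrix(c(0, 0,
                   1, 0,
                   2, 0,
                   5, 4,
                   6, 4),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:5), c("Dim1", "Dim2")))
clust <- c(1, 1, 1, 2, 2)

# Unweighted centroid of each cluster (one row per cluster)
centers <- rowsum(coords, clust) / as.vector(table(clust))

# Euclidean distance from every document to every centroid
dist_to_centers <- sapply(seq_len(nrow(centers)), function(k) {
  sqrt(rowSums(sweep(coords, 2, centers[k, ])^2))
})
colnames(dist_to_centers) <- paste0("cluster", rownames(centers))

# Distance of each document to the centroid of its own cluster
own_dist <- dist_to_centers[cbind(seq_len(nrow(coords)), clust)]
names(own_dist) <- rownames(coords)

# A paragon-like document per cluster: the one closest to its own centroid
paragon <- tapply(own_dist, clust, function(d) names(d)[which.min(d)])

print(round(dist_to_centers, 3))
print(own_dist)
print(paragon)
```

The columns of `dist_to_centers` that do not correspond to a document's own cluster hold its distances to the other centroids, which is the kind of quantity the specific documents are based on.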

The results of the description of the clusters and graphs are provided.

Value

Returns a list including:

`data.clust`	the active lexical table used in LexCA plus a new column called Clust_ containing the partition
`coord.clust`	coordinates table issued from CA plus a new column called weigths and another column called Clust_, which contains the partition
`centers`	coordinates of the gravity centers of the clusters
`description`	$des.word describes the clusters of documents by their characteristic words; des.doc$para and des.doc$dist give the paragons and specific documents of each cluster; see details
`call`	list of internal objects. `call$t` gives the results for the hierarchical tree
`dendro`	hclust object. This allows for using the dendrogram in other packages
`phases`	details of the tracking of the agglomerative hierarchical process. In particular, the cut points (joining documents not allowed) can be identified
`sum.squares`	sum of squares decomposition for documents and clusters
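Since `dendro` is described above as an hclust object, generic tools from the stats package should apply to it. The sketch below uses a small unconstrained `hclust` result as a stand-in, because it has to run without the package; the toy matrix `x` and the object names are assumptions, and in practice the `dendro` component of the returned list would take the place of `hc`.

```r
# Toy coordinates standing in for documents on two CA axes (illustrative only)
x <- matrix(c(0.0, 0.1,
              0.2, 0.0,
              3.0, 3.1,
              3.2, 3.0),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("Dim1", "Dim2")))

# Stand-in for the dendro component: any object of class "hclust"
hc <- hclust(dist(x), method = "ward.D2")

# Convert to a dendrogram, the format many plotting packages expect
dend <- as.dendrogram(hc)
leaf_labels <- labels(dend)        # document labels in plotting order
n_leaves <- attr(dend, "members")  # number of documents in the tree

print(leaf_labels)
print(n_leaves)
```

Calling `plot(dend)` would then draw the tree with base graphics.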

Author(s)

Monica Bécue-Bertaut, Ramón Alvarez-Esteban ramon.alvarez@unileon.es, Josep-Antón Sánchez-Espigares, Belchin Kostov

References

Bécue-Bertaut, M., Kostov, B., Morin, A., & Naro, G. (2014). Rhetorical Strategy in Forensic Speeches: Multidimensional Statistics-Based Methodology. Journal of Classification, 31, 85-106. doi:10.1007/s00357-014-9148-9.

Husson F., Lê S., Pagès J. (2017). Exploratory Multivariate Analysis by Example Using R. Chapman & Hall/CRC. doi:10.1201/b21874.

Lebart L. (1978). Programme d'agrégation avec contraintes. Les Cahiers de l'Analyse des Données, 3, pp. 275–288.

Legendre, P. & Legendre, L. (1998), Numerical Ecology (2nd ed.), Amsterdam: Elsevier Science.

Murtagh F. (1985). Multidimensional Clustering Algorithms. Vienna: Physica-Verlag, COMPSTAT Lectures.

Examples

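library(Xplortext)

# Build the lexical table, aggregating the answers by age group
data(open.question)
res.TD <- TextData(open.question, var.text=c(9,10), var.agg="Age_Group", Fmin=10, Dmin=10,
        stop.word.tm=TRUE)
# Correspondence analysis of the aggregated lexical table
res.LexCA <- LexCA(res.TD, graph=FALSE)
# Chronologically constrained clustering, cut into 4 clusters
res.ccah <- LexCHCca(res.LexCA, nb.clust=4, min=3)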

Xplortext documentation built on April 12, 2025, 2:13 a.m.