LexHCca: Hierarchical Clustering on Textual Correspondence Analysis...
In Xplortext: Statistical Analysis of Textual Data

LexHCca

R Documentation

Hierarchical Clustering on Textual Correspondence Analysis Coordinates (LexHCca)

Description

Hierarchical clustering (agglomerative by default) of documents or words, computed from their correspondence analysis coordinates.

Usage

LexHCca(x, cluster.CA="docs", type="agnes", ncp=5, nb.clust="click", min=2,
   max=NULL, kk=Inf, consol=FALSE, iter.max=500, graph=TRUE, description=TRUE,
   proba=0.05, nb.desc=5, size.desc=80, marg.doc="before", seed=12345, ...)

Arguments

`x`	object of LexCA class
`cluster.CA`	if "rows" or "docs" cluster analysis is performed on documents; if "columns" or "words", cluster analysis is performed on words (by default "docs")
`type`	type of clustering: "agnes" (agglomerative) or "diana" (divisive) (by default "agnes")
`ncp`	number of dimensions used from LexCA object (by default 5)
`nb.clust`	number of clusters. If 0 (or "click"), the tree is cut at the level the user clicks on. If -1 (or "auto"), the tree is automatically cut at the suggested level. If a (positive) integer, the tree is cut with nb.clust clusters (by default "click")
`min`	minimum number of clusters (by default 2)
`max`	maximum number of clusters (by default NULL, then max is computed as the minimum between 10 and the number of documents divided by 2)
`kk`	number of clusters of an optional Kmeans partition performed before the hierarchical clustering (preprocessing step). The hierarchical tree is then built using the clusters of this partition as terminal elements, which is very useful when the number of elements to be classified is very large. By default, the value is Inf and no Kmeans preprocessing is performed
`consol`	if TRUE, a Kmeans consolidation step is performed after the hierarchical clustering (consolidation cannot be performed if kk is set to a finite number) (by default FALSE)
`iter.max`	maximum number of iterations in the consolidation step (by default 500)
`graph`	if TRUE, graphs are displayed (by default TRUE)
`description`	if TRUE, the clusters are described by the axes and by their characteristic words (when clustering documents) or their characteristic documents (when clustering words). The documents or words considered as paragons (para) or specific (dist) are identified. When clustering documents, the contextual variables also characterize the clusters; these variables have to be selected in LexCA (by default TRUE)
`proba`	threshold on the p-value used to select the elements that significantly characterize the clusters (by default 0.05)
`nb.desc`	number of paragons (para) and specific documents (dist) whose labels are displayed (by default 5)
`size.desc`	text size of the displayed paragons (para) and specific documents (dist) when describing the clusters of documents (by default 80)
`marg.doc`	if "after" or "before", the document frequencies after or before the TextData selection, respectively, are used as document weights when characterizing the clusters; only used if description=TRUE (by default "before")
`seed`	seed used to obtain the same results in successive Kmeans runs (by default 12345)
`...`	other arguments from other methods
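
The consolidation step controlled by consol and iter.max can be pictured with base R as follows. This is a minimal sketch on made-up coordinates, assuming that consolidation amounts to running kmeans started from the centroids of the hierarchical partition. The object names (coords, hc_partition, start_centers) are illustrative, and LexHCca's internal code may differ.

```r
set.seed(12345)  # fixed seed so that the simulated data are reproducible

# Hypothetical coordinates: 60 elements on 2 axes around 3 made-up centers
coords <- rbind(matrix(rnorm(40, mean = -2, sd = 0.5), ncol = 2),
                matrix(rnorm(40, mean =  0, sd = 0.5), ncol = 2),
                matrix(rnorm(40, mean =  2, sd = 0.5), ncol = 2))

# Hierarchical partition into 3 clusters (Ward criterion, Euclidean distances)
tree <- hclust(dist(coords), method = "ward.D2")
hc_partition <- cutree(tree, k = 3)

# Centroids of the hierarchical clusters (row i corresponds to cluster i)
start_centers <- rowsum(coords, hc_partition) / as.vector(table(hc_partition))

# Within-cluster sum of squares of the hierarchical partition
wss_before <- sum((coords - start_centers[hc_partition, ])^2)

# Consolidation (assumed form): kmeans started from those centroids
km <- kmeans(coords, centers = start_centers, iter.max = 500)
consolidated <- km$cluster
wss_after <- km$tot.withinss

c(before = wss_before, after = wss_after)
```

Because kmeans starts from the hierarchical centroids, the within-cluster sum of squares after consolidation should not exceed its value before consolidation.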

Details

LexHCca starts from the coordinates of the documents/words on the correspondence analysis axes. The Euclidean metric and Ward's method are used.
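
As a rough illustration of this step, the sketch below applies Ward clustering with Euclidean distances to a small made-up coordinate matrix using base R. The matrix coords, its values, and the choice of hclust with method = "ward.D2" are assumptions made for illustration; they are not taken from LexHCca's internals.

```r
# Hypothetical coordinates of 6 documents on 2 CA axes (made-up values)
coords <- matrix(c(-1.0, -0.9, -1.1, 1.0, 0.9, 1.1,
                    0.2,  0.1,  0.0, 0.1, 0.2, 0.0),
                 ncol = 2,
                 dimnames = list(paste0("doc", 1:6), c("Dim1", "Dim2")))

# Euclidean distances between documents
d <- dist(coords, method = "euclidean")

# Ward's criterion; "ward.D2" is the hclust variant that takes unsquared distances
tree <- hclust(d, method = "ward.D2")

# Cut the tree into 2 clusters and count the documents in each
partition <- cutree(tree, k = 2)
table(partition)
```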

If the agglomerative clustering starts from many elements (documents or words), a Kmeans partition into kk clusters can be performed first; the tree is then built from these kk (weighted) clusters.
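
The two-stage idea can be sketched with base R on simulated coordinates. The data, the value kk = 20, and the use of the members argument of hclust to weight the centroids by cluster size are assumptions made for illustration, not a description of the package's actual code.

```r
set.seed(12345)  # fixed seed so that kmeans gives reproducible results

# Hypothetical coordinates: 200 elements on 2 axes, drawn around 4 made-up centers
centers_true <- matrix(c(-3, -3, 3, 3, -3, 3, -3, 3), ncol = 2)
coords <- centers_true[rep(1:4, each = 50), ] +
  matrix(rnorm(400, sd = 0.3), ncol = 2)

# Step 1: kmeans partition into kk preliminary clusters
kk <- 20
km <- kmeans(coords, centers = kk, iter.max = 500, nstart = 5)

# Step 2: Ward tree built on the kk centroids, weighted by the cluster sizes
tree <- hclust(dist(km$centers), method = "ward.D2", members = km$size)

# Step 3: cut the tree, then map each element to the final cluster of its centroid
final_of_center <- unname(cutree(tree, k = 4))
partition <- final_of_center[km$cluster]
table(partition)
```

The tree has only kk leaves, so its cost depends on kk and not on the number of original elements.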

The object $para contains the distance between each document and the centroid of its cluster.

The object $dist contains the distance between each document and the centroid of the farthest cluster.
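
The two distances described above can be computed by hand as in the following sketch. The coordinates and the partition are made up, and the variable names (own, farthest, paragons) are hypothetical; the code follows the wording of this section and is not taken from the package.

```r
# Hypothetical coordinates of 6 documents on 2 axes and a made-up 2-cluster partition
coords <- matrix(c(-1.0, -0.8, -1.2, 1.0, 0.8, 1.5,
                    0.0,  0.1, -0.1, 0.0, 0.1, -0.1),
                 ncol = 2, dimnames = list(paste0("doc", 1:6), NULL))
partition <- c(1, 1, 1, 2, 2, 2)

# Centroid of each cluster (one row per cluster)
centroids <- rowsum(coords, partition) / as.vector(table(partition))

# Euclidean distance from every document to every centroid (documents x clusters)
dist_to_centroids <- apply(centroids, 1, function(ctr) {
  sqrt(rowSums(sweep(coords, 2, ctr)^2))
})

# Distance to the centroid of the document's own cluster ("para"-like quantity)
own <- dist_to_centroids[cbind(seq_len(nrow(coords)), partition)]
names(own) <- rownames(coords)

# Distance to the farthest centroid ("dist"-like quantity, as worded above)
farthest <- apply(dist_to_centroids, 1, max)

# In this sketch, the paragon of a cluster is the document closest to its centroid
paragons <- tapply(own, partition, function(v) names(v)[which.min(v)])
paragons
```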

The results include a thorough description of the clusters. Graphs are provided.

Value

Returns a list including:

`data.clust`	the active lexical table used in LexCA plus a new column called Clust_ containing the partition
`coord.clust`	coordinates table issued from CA plus a new column called Clust_ containing the partition
`centers`	coordinates of the gravity centers of the clusters
`clust.count`	counts of documents/words belonging to each cluster and contribution of the clusters to the variability decomposition
`clust.content`	list of the document/word labels according to the cluster they belong to
`call`	list of internal objects. `call$t` gives the results for the hierarchical tree. See the second reference for more details
`description`	$desc.axes describes the clusters by their characteristic axes ($axes) and gives the eta-squared between axes and clusters ($quanti.var). $des.cluster.doc describes the clusters by their characteristic words ($word), supplementary words ($wordsup) and, if contextual variables were considered in LexCA, describes the partition/clusters by qualitative ($qualisup) and quantitative ($quantisup) variables, paragons ($para) and specific words ($dist) of each cluster. $des.word.doc describes the clusters of words by their characteristic documents ($docs), paragons ($para) and specific documents ($dist) of each cluster.
`type`	type of clustering used (by default agnes)
`coef.hclust`	agglomerative coefficient (divisive coefficient for diana), measuring the clustering structure of the dataset
`t$tree`	tree object to use with hclust function

If graph=TRUE, the function also displays the hierarchical tree and the first CA map of the documents/words, with the labels colored according to the cluster.

Author(s)

Ramón Alvarez-Esteban ramon.alvarez@unileon.es, Monica Bécue-Bertaut, Josep-Anton Sánchez-Espigares

References

Bécue-Bertaut M. Textual Data Science with R. Chapman & Hall/CRC. doi:10.1201/9781315212661.

Husson F., Lê S., Pagès J. (2017). Exploratory Multivariate Analysis by Example Using R. Chapman & Hall/CRC. doi:10.1201/b21874.

Lebart L., Salem A., Berry L. (1998). Exploring Textual Data. Kluwer Academic Publishers. doi:10.1007/978-94-017-1525-6.

Examples

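# Load the open.question data set
data(open.question)
# Build the lexical table from text columns 9 and 10, keeping contextual variables
res.TD <- TextData(open.question, var.text=c(9,10), Fmin=10, Dmin=10, stop.word.tm=TRUE,
        context.quali=c("Gender","Age_Group","Education"), context.quanti=c("Age"))
# Correspondence analysis keeping 8 dimensions, without graphs
res.LexCA <- LexCA(res.TD, graph=FALSE, ncp=8)
# Cluster the documents into 5 clusters (ncp=5 axes are used by default)
res.hcca <- LexHCca(res.LexCA, graph=FALSE, nb.clust=5)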

Xplortext documentation built on June 16, 2026, 1:07 a.m.