LexHCca: Hierarchical Clustering on Textual Correspondence Analysis...


LexHCca    R Documentation

Hierarchical Clustering on Textual Correspondence Analysis Coordinates (LexHCca)

Description

Agglomerative hierarchical clustering of documents or words, based on their coordinates on the correspondence analysis axes.

Usage

LexHCca(x, cluster.CA="docs", nb.clust="click", min=2, max=NULL, kk=Inf, 
   consol=FALSE, iter.max=500, graph=TRUE, description=TRUE, 
   proba=0.05, nb.desc=5, size.desc=80, seed=12345,...)

Arguments

x

object of LexCA class

cluster.CA

if "rows" or "docs", cluster analysis is performed on documents; if "columns" or "words", cluster analysis is performed on words (by default "docs")

nb.clust

number of clusters. If 0 (or "click"), the tree is cut at the level the user clicks on. If -1 (or "auto"), the tree is automatically cut at the suggested level. If a (positive) integer, the tree is cut with nb.clust clusters (by default "click")

min

minimum number of clusters (by default 2)

max

maximum number of clusters (by default NULL, in which case max is computed as the minimum of 10 and the number of documents divided by 2)

kk

number of clusters of an optional Kmeans partition computed before the hierarchical clustering (preprocessing step). If kk is an integer, the hierarchical tree is built using the clusters of this partition as terminal elements. This is very useful when the number of elements to be clustered is very large. By default, the value is Inf and no Kmeans preprocessing is performed

consol

if TRUE, a Kmeans consolidation step is performed after the hierarchical clustering (consolidation cannot be performed if kk is set to a finite number) (by default FALSE)

iter.max

maximum number of iterations in the consolidation step (by default 500)

graph

if TRUE, graphs are displayed (by default TRUE)

description

if TRUE, the clusters of documents or words are described by the axes and by their characteristic words (when clustering documents) or their characteristic documents (when clustering words). The documents or words considered as paragons (para) or specific (dist) are identified. When clustering documents, the contextual variables also characterize the clusters; these variables have to be selected in LexCA (by default TRUE)

proba

threshold on the p-value used in selecting the elements characterizing significantly the clusters (by default 0.05)

nb.desc

number of paragons (para) and specific documents (dist) edited to describe the clusters (by default 5)

size.desc

text size, in characters, of the edited paragons (para) and specific documents (dist) when describing the clusters of documents (by default 80)

seed

seed used to obtain the same results in successive Kmeans runs (by default 12345)

...

other arguments from other methods
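The consolidation step described for consol and iter.max can be illustrated with base R alone. The sketch below is not the LexHCca implementation: it uses simulated coordinates (the object names coords, init and km are hypothetical), cuts a Ward tree, and then runs kmeans starting from the centroids of the hierarchical clusters.

```r
# Illustrative sketch with simulated coordinates, not LexHCca internals.
set.seed(12345)
coords <- matrix(rnorm(60 * 2), ncol = 2)

# Hierarchical step: Ward tree on Euclidean distances, cut into 3 clusters
tree <- hclust(dist(coords), method = "ward.D2")
init <- cutree(tree, k = 3)

# Centroids of the hierarchical clusters (one row per cluster)
centers <- rowsum(coords, init) / as.vector(table(init))

# Consolidation: Kmeans initialised at those centroids
km <- kmeans(coords, centers = centers, iter.max = 500)

# Cross-tabulate the partition before and after consolidation
table(hierarchical = init, consolidated = km$cluster)
```

Elements that change cluster in the cross-tabulation are those that the Kmeans iterations reassigned to a closer centroid.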

Details

LexHCca starts from the coordinates of the documents or words on the correspondence analysis axes. The Euclidean metric and Ward's method are used.
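As a rough illustration of this combination (Euclidean metric, Ward's method), the following base R sketch clusters simulated coordinates. The matrix coords stands in for CA coordinates and is an assumption of the example; the code does not reproduce what LexHCca does internally.

```r
# Illustrative sketch: Ward clustering on Euclidean distances (base R only).
set.seed(12345)
coords <- matrix(rnorm(30 * 3), ncol = 3,
                 dimnames = list(paste0("doc", 1:30), paste0("Dim", 1:3)))

d <- dist(coords, method = "euclidean")   # pairwise Euclidean distances
tree <- hclust(d, method = "ward.D2")     # Ward's criterion on those distances
clusters <- cutree(tree, k = 4)           # cut the tree into 4 clusters

table(clusters)                           # cluster sizes
```

In hclust, method "ward.D2" is the variant that applies Ward's criterion to unsquared Euclidean distances.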

If the agglomerative clustering starts from many elements (documents or words), a Kmeans partition into kk clusters can be computed first; the tree is then built from these kk (weighted) clusters.
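One way to picture this two-stage procedure with base R is sketched below, under the assumption that each Kmeans cluster is weighted by its size. The data are simulated, all object names are hypothetical, and the weighting formula is the standard Ward dissimilarity between weighted centres rather than code taken from LexHCca.

```r
# Illustrative sketch: Kmeans preprocessing followed by a weighted Ward tree.
set.seed(12345)
coords <- matrix(rnorm(200 * 2), ncol = 2)
kk <- 20

# Stage 1: Kmeans partition into kk clusters
km <- kmeans(coords, centers = kk, iter.max = 500)
n <- km$size

# Ward dissimilarity between weighted centres i and j:
# sqrt(2 * n_i * n_j / (n_i + n_j)) * ||c_i - c_j||
w <- sqrt(2 * outer(n, n) / outer(n, n, "+"))
d <- as.dist(w * as.matrix(dist(km$centers)))

# Stage 2: tree built on the kk centres, each weighted by its size
tree <- hclust(d, method = "ward.D2", members = n)
groups <- cutree(tree, k = 3)

# Map every original element to the final cluster of its Kmeans centre
final <- unname(groups[km$cluster])
table(final)
```

The tree has only kk leaves, so its cost no longer depends on the number of original elements.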

The object $para contains the distance between each document and the centroid of its own cluster.

The object $dist contains the distance between each document and the centroid of the farthest cluster.
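Reading the two definitions above literally, both distances can be computed from a coordinate matrix and a partition. The sketch below uses simulated data and hypothetical names (para_dist, far_dist); it illustrates the definitions and is not the code LexHCca runs.

```r
# Illustrative sketch: distances to own centroid and to the farthest centroid.
set.seed(12345)
coords <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                matrix(rnorm(20, mean = 5), ncol = 2))
rownames(coords) <- paste0("doc", seq_len(nrow(coords)))

clust <- cutree(hclust(dist(coords), method = "ward.D2"), k = 2)
centers <- rowsum(coords, clust) / as.vector(table(clust))

# Distance from every document (rows) to every centroid (columns)
d_all <- sapply(seq_len(nrow(centers)), function(k)
  sqrt(rowSums(sweep(coords, 2, centers[k, ])^2)))

# Distance to the centroid of the document's own cluster
para_dist <- d_all[cbind(seq_len(nrow(coords)), clust)]
names(para_dist) <- rownames(coords)

# Distance to the farthest centroid
far_dist <- apply(d_all, 1, max)

# Paragon of each cluster: the document closest to its own centroid
tapply(para_dist, clust, function(d) names(which.min(d)))
```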

The results include a thorough description of the clusters. Graphs are provided.

Value

Returns a list including:

data.clust

the active lexical table used in LexCA plus a new column called Clust_ containing the partition

coord.clust

coordinates table issued from CA plus a new column called Clust_ containing the partition

centers

coordinates of the gravity centers of the clusters

clust.count

counts of documents/words belonging to each cluster and contribution of the clusters to the variability decomposition

clust.content

list of the document/word labels according to the cluster they belong to

call

list of internal objects; call$t gives the results for the hierarchical tree. See the second reference (Husson et al.) for more details

description

$desc.axes for description of the clusters by the characteristic axes ($axes) and eta-squared between axes and clusters ($quanti.var).

$des.cluster.doc for description of the clusters of documents by their characteristic words ($word), supplementary words ($wordsup) and, if contextual variables were considered in LexCA, description of the partition/clusters by qualitative ($qualisup) and quantitative ($quantisup) variables, together with the paragons ($para) and specific documents ($dist) of each cluster.

$des.word.doc for description of the clusters of words by their characteristic documents ($docs), paragons ($para) and specific documents ($dist) of each cluster.

Returns the hierarchical tree with a barplot of the successive inertia gains, and the first CA map of the documents/words. The labels are colored according to the cluster.

Author(s)

Ramón Alvarez-Esteban ramon.alvarez@unileon.es, Monica Bécue-Bertaut, Josep-Anton Sánchez-Espigares

References

Bécue-Bertaut M. Textual Data Science with R. Chapman & Hall/CRC. doi:10.1201/9781315212661.

Husson F., Lê S., Pagès J. (2017). Exploratory Multivariate Analysis by Example Using R. Chapman & Hall/CRC. doi:10.1201/b21874.

Lebart L., Salem A., Berry L. (1998). Exploring Textual Data. Kluwer, Dordrecht. doi:10.1007/978-94-017-1525-6.

See Also

LexCA, plot.LexHCca

Examples

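library(Xplortext)
data(open.question)
# Build the lexical table from the text columns 9 and 10, with contextual variables
res.TD <- TextData(open.question, var.text=c(9,10), Fmin=10, Dmin=10, stop.word.tm=TRUE,
        context.quali=c("Gender","Age_Group","Education"), context.quanti=c("Age"))
# Correspondence analysis keeping 8 dimensions, without graphs
res.LexCA <- LexCA(res.TD, graph=FALSE, ncp=8)
# Hierarchical clustering of the documents into 5 clusters
res.hcca <- LexHCca(res.LexCA, graph=FALSE, nb.clust=5)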

Xplortext documentation built on Nov. 10, 2023, 1:06 a.m.