clusterizer: Text clustering

View source: R/clusterizer.R

Description

Given a data frame of texts, returns a clustering of its documents (or features).

Usage

clusterizer(
  df,
  docid_field = NULL,
  text_field = NULL,
  min_docfreq = 0.5,
  max_docfreq = 99,
  tfidf = TRUE,
  element = "documents",
  k = NULL,
  k.max = NULL,
  nstart = 25,
  method = "kmeans",
  hc_method = NULL,
  return_fit = FALSE
)

Arguments

df

a data frame with at least one column of textual data and one column of document IDs

docid_field

name of the column (in quotation marks) containing the document IDs (default NULL)

text_field

name of the column (in quotation marks) containing the textual data (default NULL)

min_docfreq

minimum document frequency of a feature, given as a percentile; features below it are removed (default 0.5)

max_docfreq

maximum document frequency of a feature, given as a percentile; features above it are removed (default 99)

tfidf

whether to apply term frequency-inverse document frequency (tf-idf) weighting (default TRUE)

element

elements to cluster. Available options are "documents" and "features" (default "documents")

k

desired number of clusters (default NULL). If NULL, the silhouette method is used to estimate the appropriate number of clusters

k.max

maximum number of clusters to consider when k is not specified (default NULL)

nstart

number of initial configurations (default 25)

method

clustering method. Available options are "kmeans" and "hclust" (default "kmeans")

hc_method

the agglomeration method to be used in case of "hclust" method. This should be one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). See hclust

return_fit

whether to return the fitted cluster_model in the main environment (default FALSE)
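
The percentile trimming controlled by min_docfreq and max_docfreq can be illustrated with a small base R sketch. It assumes that the two values are read as percentiles of the distribution of document frequencies across features. That reading follows the argument descriptions above, but the package's internal implementation may differ. The toy texts and the object names (texts, docfreq, kept) are hypothetical.

```r
# Toy corpus (hypothetical data, for illustration only)
texts <- c(
  d1 = "the cat sat on the mat",
  d2 = "the dog sat on the log",
  d3 = "the cat chased the dog",
  d4 = "a bird sang on the log"
)

# Naive tokenization: lower-case and split on non-letters
tokens <- strsplit(tolower(texts), "[^a-z]+")

# Document frequency: number of documents in which each feature occurs
tab <- table(unlist(lapply(tokens, unique)))
docfreq <- setNames(as.integer(tab), names(tab))

# Assumed interpretation: thresholds are percentiles of the
# document-frequency distribution
min_docfreq <- 0.5
max_docfreq <- 99
lower <- quantile(docfreq, probs = min_docfreq / 100)
upper <- quantile(docfreq, probs = max_docfreq / 100)

# Keep only the features whose document frequency lies within the thresholds
kept <- names(docfreq)[docfreq >= lower & docfreq <= upper]
print(kept)
```

In this toy corpus the upper threshold drops the one feature that occurs in every document, which is the usual purpose of an upper document-frequency cutoff.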

Details

the function is essentially a wrapper around functions available in quanteda and factoextra. Please refer to the documentation of textstat_simil and eclust. The fitted cluster_model can be used to create k-means plots with fviz_cluster or dendrograms (hclust) with fviz_dend
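
The plotting step described above can be sketched as follows. This is a minimal illustration that calls eclust directly on the built-in USArrests data instead of going through clusterizer, so the input is a stand-in numeric matrix, not a document-feature matrix. The object names km_fit and hc_fit are hypothetical and stand in for the fitted cluster_model.

```r
library(factoextra)  # eclust(), fviz_cluster(), fviz_dend()

set.seed(1)
x <- scale(USArrests)  # stand-in numeric data, not text

# k-means fit; nstart is passed on to stats::kmeans
km_fit <- eclust(x, FUNcluster = "kmeans", k = 4, nstart = 25, graph = FALSE)
fviz_cluster(km_fit)   # scatter plot of the clusters on the first two principal components

# Hierarchical fit with Ward's agglomeration method
hc_fit <- eclust(x, FUNcluster = "hclust", k = 4, hc_method = "ward.D2", graph = FALSE)
fviz_dend(hc_fit)      # dendrogram with branches colored by cluster
```

A model obtained from clusterizer with return_fit = TRUE would presumably be passed to fviz_cluster or fviz_dend in the same way.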

Value

a vector of cluster IDs. Silhouette information, including the clusters' sizes and average silhouette widths, is printed to the console
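
To show what the silhouette information refers to, the sketch below computes the average silhouette width for several candidate values of k and reports the cluster sizes for the best one. It is a generic illustration using stats::kmeans and cluster::silhouette on the built-in iris measurements. It is not the package's own code path, and the names k_candidates, avg_sil and best_k are hypothetical.

```r
library(cluster)  # provides silhouette()

set.seed(1)
x <- scale(iris[, 1:4])  # stand-in numeric matrix, not text data
d <- dist(x)             # Euclidean distances between observations

# Average silhouette width for each candidate number of clusters
k_candidates <- 2:8
avg_sil <- sapply(k_candidates, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})

# The silhouette method picks the k with the largest average width
best_k <- k_candidates[which.max(avg_sil)]

# Refit with the selected k and report the cluster sizes
best_fit <- kmeans(x, centers = best_k, nstart = 25)
cluster_sizes <- table(best_fit$cluster)

print(data.frame(k = k_candidates, avg_sil_width = round(avg_sil, 3)))
print(cluster_sizes)
```

Average silhouette widths close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping or poorly assigned observations.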

Examples

## Not run: 
df$cluster <- clusterizer(df, docid_field = "documents", text_field = "texts", k = 10)
## End(Not run)

nicolarighetti/textools documentation built on Oct. 16, 2021, 11:20 p.m.