clusterizer: Text clustering
In nicolarighetti/textools: an R toolbox for text mining tasks

View source: R/clusterizer.R

Given a data frame of texts, returns a clustering of its documents (or features).

clusterizer(
  df,
  docid_field = NULL,
  text_field = NULL,
  min_docfreq = 0.5,
  max_docfreq = 99,
  tfidf = TRUE,
  element = "documents",
  k = NULL,
  k.max = NULL,
  nstart = 25,
  method = "kmeans",
  hc_method = NULL,
  return_fit = FALSE
)

`df`	a data frame with at least a column with textual data and a column with documents' ID
`docid_field`	name of the column (in quotation marks) containing the IDs of the documents (default NULL)
`text_field`	name of the column (in quotation marks) containing textual data
`min_docfreq`	minimum document frequency of a feature, expressed as a percentile; features below it are removed (default 0.5, i.e. the 0.5th percentile)
`max_docfreq`	maximum document frequency of a feature, expressed as a percentile; features above it are removed (default 99, i.e. the 99th percentile)
`tfidf`	term frequency inverse document frequency weighting (default TRUE)
`element`	elements to cluster. Available options are "documents" and "features" (default "documents")
`k`	desired number of clusters (default NULL). If NULL, the silhouette method is used to estimate the appropriate number of clusters
`k.max`	maximum number of clusters to consider if k is not specified (default NULL)
`nstart`	number of initial configurations (default 25)
`method`	clustering method. Available options are "kmeans" and "hclust" (default "kmeans")
`hc_method`	the agglomeration method to be used in case of "hclust" method. This should be one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). See hclust
`return_fit`	whether to return the fitted cluster_model in the main environment (default FALSE)
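
To make the effect of `min_docfreq`, `max_docfreq` and `tfidf` more concrete, here is a minimal base R sketch on a toy document-term matrix. It is not the package's implementation: the matrix `dtm`, the reading of the defaults as quantile cut-offs over the features' document frequencies, and the tf-idf formula (raw count times log10 of N over document frequency) are all assumptions made for illustration.

```r
# Toy document-term matrix (hypothetical data): rows are documents,
# columns are features, cells are raw counts.
dtm <- matrix(
  c(2, 0, 1, 0,
    1, 1, 0, 0,
    3, 0, 0, 1,
    1, 2, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("the", "cat", "dog", "fish"))
)

# Document frequency: number of documents in which each feature occurs.
docfreq <- colSums(dtm > 0)

# Assumed reading of the defaults: 0.5th and 99th percentiles of the
# document frequencies are used as lower and upper cut-offs.
bounds <- quantile(docfreq, probs = c(0.005, 0.99))
keep <- docfreq >= bounds[1] & docfreq <= bounds[2]
trimmed <- dtm[, keep, drop = FALSE]

# One common tf-idf variant (an assumption here): count * log10(N / df).
idf <- log10(nrow(trimmed) / colSums(trimmed > 0))
tfidf <- sweep(trimmed, 2, idf, `*`)

print(colnames(trimmed))
print(round(tfidf, 3))
```

In this toy case the feature that occurs in every document ("the") exceeds the upper cut-off and is dropped, which is the kind of near-ubiquitous term the `max_docfreq` threshold is meant to remove.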

The function is essentially a wrapper around functions available in quanteda and factoextra. Please refer to the documentation of textstat_simil and eclust. The fitted cluster_model can be used to create k-means plots with fviz_cluster or dendrograms (hclust) with fviz_dend.
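
As a rough illustration of the "hclust" route, the following base R sketch clusters documents from a similarity matrix. It is a sketch under stated assumptions, not the wrapper's actual code: the toy matrix `m`, the choice of cosine similarity, the conversion to a distance as 1 minus similarity, and the "average" agglomeration method are all picked here for the example.

```r
# Toy weighted document-feature matrix (hypothetical data).
m <- matrix(
  c(1, 1, 0, 0,
    2, 1, 0, 0,
    0, 0, 1, 2,
    0, 0, 2, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), paste0("f", 1:4))
)

# Cosine similarity between documents (rows).
norms <- sqrt(rowSums(m^2))
sim <- (m %*% t(m)) / (norms %o% norms)

# Turn similarity into a distance object (assumption: 1 - similarity).
d <- as.dist(1 - sim)

# Agglomerative clustering; "average" is one of the hclust methods.
fit <- hclust(d, method = "average")

# Cut the tree into k = 2 groups to obtain one cluster ID per document.
clusters <- cutree(fit, k = 2)
print(clusters)
```

The named integer vector returned by `cutree` has the same shape as the vector of cluster IDs described for this function, and `fit` is the kind of object a dendrogram plot is drawn from.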

A vector of cluster IDs. Silhouette information, including cluster sizes and average silhouette widths, is printed to the console.
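
The silhouette method used when `k` is NULL can be sketched as follows: fit the clustering for several candidate values of k and keep the one with the highest average silhouette width. This is a generic illustration using `stats::kmeans` and `cluster::silhouette`, not the package's internal code; the simulated matrix `x` and the candidate range 2 to 6 are assumptions standing in for real weighted text data and `k.max`.

```r
library(cluster)  # recommended package, normally installed with R

set.seed(1)
# Simulated data (hypothetical): three well-separated groups of 20 points.
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)
d <- dist(x)

# Average silhouette width for each candidate number of clusters.
candidates <- 2:6
avg_sil <- sapply(candidates, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})
names(avg_sil) <- candidates

# The candidate with the largest average width is selected.
best_k <- candidates[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```

An average silhouette width close to 1 indicates compact, well-separated clusters, while values near 0 suggest overlapping clusters.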

## Not run: 
df$cluster <- clusterizer(df, docid_field = "documents", text_field = "texts", k = 10)
## End(Not run)

nicolarighetti/textools documentation built on Oct. 16, 2021, 11:20 p.m.
