similR: similR
In rushkin/similR: Document similarity via word vectorization


View source: R/similR.R

Document similarity based on word vectorization

similR(toks1, toks2, vec = NULL, hotspots = NULL,
  window_weights = 1/(1:5), word_vectors_size = 300, x_max = 10,
  n_iter = 30, ik = 100, soft_ik = FALSE,
  clustering_algorithm = c("MacQueen", "Hartigan-Wong", "Lloyd",
  "Forgy"), clustering_itermax = 1000, similarity_method = c("cosine",
  "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamann",
  "faith", "correlation", "jsd"), keep_vec = FALSE,
  keep_hotspots = FALSE, ...)
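
The defaults in the signature above can be explored in plain R. The sketch below evaluates the default `window_weights` and shows how a vector of choices such as `clustering_algorithm` usually resolves to its first element. It assumes that the function resolves such arguments with `match.arg()` and that the i-th weight applies to tokens i positions apart; neither is confirmed by this page, and `pick_algorithm` is a hypothetical helper, not part of similR.

```r
# Default window weights from the signature: 1, 1/2, 1/3, 1/4, 1/5.
# Assumption: the i-th weight is applied to tokens i positions apart.
window_weights <- 1 / (1:5)
print(round(window_weights, 3))

# Hypothetical helper (not part of similR) showing the usual R idiom:
# when the caller omits the argument, match.arg() returns the first choice.
pick_algorithm <- function(clustering_algorithm = c("MacQueen", "Hartigan-Wong",
                                                    "Lloyd", "Forgy")) {
  match.arg(clustering_algorithm)
}

default_algorithm <- pick_algorithm()        # argument omitted
lloyd_algorithm <- pick_algorithm("Lloyd")   # explicit choice
```

Under these assumptions, omitting `clustering_algorithm` selects 'MacQueen' and omitting `similarity_method` selects 'cosine'.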

`toks1`	tokens from a document corpus
`toks2`	tokens from another document corpus (if NULL, assumes toks2=toks1)
`vec`	if NULL, the vector embedding of words is calculated. To skip this lengthy calculation, pass a prepared matrix of word vectors here (for instance, one taken from a previous run of this function with keep_vec=TRUE).
`window_weights`	weights of the window to use for co-occurrence of tokens.
`word_vectors_size`	dimensionality of vector space
`x_max`	maximum number of co-occurrences to use in the weighting function
`n_iter`	number of GloVe iterations
`ik`	initial number of clusters (if soft_ik is TRUE, the final number will be greater than or equal to this)
`soft_ik`	whether to use xmeans instead of kmeans for hotspots
`clustering_algorithm`	clustering algorithm, as implemented in kmeans(). Note that 'Hartigan-Wong' fails for large data, so 'MacQueen' is the default here.
`clustering_itermax`	maximum number of clustering iterations, as implemented in kmeans()
`similarity_method`	as implemented in quanteda::textstat_simil(), plus an extra method 'jsd' for Jensen-Shannon divergence.
`keep_vec`	whether to return the matrix of word vectors
`keep_hotspots`	whether to return the hotspots
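
The 'jsd' option of `similarity_method` is the one method not provided by quanteda::textstat_simil(). As a rough illustration of what Jensen-Shannon divergence measures, here is a minimal base-R sketch for two count vectors. The function name `jsd`, its arguments, and the base-2 logarithm are assumptions made for this example; the package's own implementation may differ, for instance in how it turns the divergence into a similarity score.

```r
# Hypothetical helper (not the package's code): Jensen-Shannon divergence
# between two non-negative count vectors of equal length.
jsd <- function(p, q) {
  # Normalise counts to probability distributions.
  p <- p / sum(p)
  q <- q / sum(q)
  m <- (p + q) / 2
  # Kullback-Leibler divergence in bits; entries with a == 0 contribute 0.
  kl <- function(a, b) {
    keep <- a > 0
    sum(a[keep] * log2(a[keep] / b[keep]))
  }
  (kl(p, m) + kl(q, m)) / 2
}

same <- jsd(c(1, 2, 3), c(2, 4, 6))   # same distribution after normalisation
disjoint <- jsd(c(1, 0), c(0, 1))     # no shared support
```

With a base-2 logarithm the divergence lies between 0 (identical distributions) and 1 (distributions with no overlap), and it is symmetric in its two arguments.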