similR: similR


View source: R/similR.R

Description

Document similarity based on word vectorization

Usage

similR(toks1, toks2, vec = NULL, hotspots = NULL,
  window_weights = 1/(1:5), word_vectors_size = 300, x_max = 10,
  n_iter = 30, ik = 100, soft_ik = FALSE,
  clustering_algorithm = c("MacQueen", "Hartigan-Wong", "Lloyd",
  "Forgy"), clustering_itermax = 1000, similarity_method = c("cosine",
  "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamann",
  "faith", "correlation", "jsd"), keep_vec = FALSE,
  keep_hotspots = FALSE, ...)

Arguments

toks1

tokens from a document corpus

toks2

tokens from another document corpus (if NULL, assumes toks2=toks1)

vec

if NULL, the vector embedding of words is calculated from the tokens. To skip this lengthy calculation, pass a prepared matrix of word vectors here (for instance, one taken from a previous run of this function with keep_vec=TRUE).

window_weights

weights of the window to use for co-occurrence of tokens.

word_vectors_size

dimensionality of vector space

x_max

max number of co-occurrences to use in the weighting function
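To make the role of x_max more concrete, here is a minimal sketch of the weighting function from the original GloVe formulation, in which co-occurrence counts below x_max are down-weighted and counts at or above x_max all receive weight 1. The function name glove_weight and the exponent alpha = 0.75 are assumptions for illustration (0.75 is the value suggested in the GloVe paper); this page does not state which exponent similR uses internally.

```r
# Hypothetical helper (not part of similR): GloVe-style weighting function.
# Counts below x_max get weight (x / x_max)^alpha; larger counts are capped at 1.
glove_weight <- function(x, x_max = 10, alpha = 0.75) {
  pmin((x / x_max)^alpha, 1)
}

# Weights for a few co-occurrence counts with the default x_max = 10.
counts <- c(1, 5, 10, 50)
weights <- glove_weight(counts)
print(weights)
```

With this form, raising x_max lets frequent word pairs keep gaining influence up to a higher count before the cap applies.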

n_iter

number of GloVe iterations

ik

initial number of clusters (if soft_ik is TRUE, the final number will be greater than or equal to this)

soft_ik

whether to use xmeans instead of kmeans for hotspots

clustering_algorithm

as implemented in kmeans(). Note that 'Hartigan-Wong' fails for large data, so 'MacQueen' is the default here.

clustering_itermax

as implemented in kmeans()
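Since clustering_algorithm and clustering_itermax are described as following kmeans(), the sketch below shows the corresponding base R call on made-up data. The random matrix and the choice of 4 centers are purely illustrative, and how similR forwards these arguments internally is an assumption.

```r
# Illustrative data only: 200 random points in 5 dimensions.
set.seed(1)
x <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

# stats::kmeans with the MacQueen algorithm and a raised iteration cap,
# mirroring the defaults listed in the Usage section.
fit <- kmeans(x, centers = 4, iter.max = 1000, algorithm = "MacQueen")

print(dim(fit$centers))    # one row per cluster, one column per dimension
print(table(fit$cluster))  # cluster sizes
```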

similarity_method

as implemented in quanteda::textstat_simil(), plus an extra method 'jsd' for Jensen-Shannon divergence.
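For readers unfamiliar with the extra 'jsd' option, the following is a minimal sketch of the Jensen-Shannon divergence between two discrete distributions. The helper name jsd, the base-2 logarithm, and the normalization of the inputs are assumptions; this page does not say which log base similR uses or how the divergence is converted into a similarity score.

```r
# Hypothetical helper (not part of similR): Jensen-Shannon divergence
# between two non-negative vectors of equal length, using log base 2.
jsd <- function(p, q) {
  p <- p / sum(p)  # normalize to probability distributions
  q <- q / sum(q)
  m <- (p + q) / 2 # mixture distribution
  # Kullback-Leibler divergence, skipping zero-probability terms of a.
  kl <- function(a, b) {
    nz <- a > 0
    sum(a[nz] * log2(a[nz] / b[nz]))
  }
  (kl(p, m) + kl(q, m)) / 2
}

same <- jsd(c(1, 2, 3), c(1, 2, 3))  # identical distributions
disjoint <- jsd(c(1, 0), c(0, 1))    # no overlap at all
print(c(same = same, disjoint = disjoint))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with no overlap), and it is symmetric in its two arguments.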

keep_vec

whether to return the matrix of the word-vectors

keep_hotspots

whether to return the hotspots

Value

a list with simmat (the similarity matrix) and, when requested via keep_vec and keep_hotspots, vec and hotspots.


rushkin/similR documentation built on Sept. 26, 2019, 10:42 p.m.