Description

Document similarity based on word vectorization.
Usage

similR(toks1, toks2, vec = NULL, hotspots = NULL,
       window_weights = 1/(1:5), word_vectors_size = 300, x_max = 10,
       n_iter = 30, ik = 100, soft_ik = FALSE,
       clustering_algorithm = c("MacQueen", "Hartigan-Wong", "Lloyd", "Forgy"),
       clustering_itermax = 1000,
       similarity_method = c("cosine", "jaccard", "ejaccard", "dice", "edice",
                             "simple matching", "hamann", "faith",
                             "correlation", "jsd"),
       keep_vec = FALSE, keep_hotspots = FALSE, ...)
Arguments

toks1: tokens from a document corpus.

toks2: tokens from another document corpus (if NULL, toks2 = toks1 is assumed).

vec: if NULL, the vector embedding of words is calculated. To skip this lengthy calculation, pass a prepared matrix of word vectors here (for instance, one taken from a previous run of this function with keep_vec = TRUE).

window_weights: weights of the window used for token co-occurrence.

word_vectors_size: dimensionality of the word-vector space.

x_max: maximum number of co-occurrences used in the weighting function.

n_iter: number of GloVe iterations.

ik: initial number of clusters (if soft_ik is TRUE, the final number will be greater than or equal to this).

soft_ik: whether to use x-means instead of k-means for the hotspots.

clustering_algorithm: as implemented in kmeans(); note that "Hartigan-Wong" can fail on large data, so "MacQueen" is the default here.

clustering_itermax: as implemented in kmeans().

similarity_method: as implemented in quanteda::textstat_simil(), plus an extra method "jsd" for Jensen-Shannon divergence.

keep_vec: whether to return the matrix of word vectors.

keep_hotspots: whether to return the hotspots.
Value

A list with simmat (the similarity matrix), vec, and hotspots.
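A minimal usage sketch, assuming similR() is attached from its package and the input is quanteda tokens; the two tiny documents and all parameter values (ik = 2 in particular) are illustrative only, not recommended settings:

```r
library(quanteda)

# Illustrative two-document corpus.
toks <- tokens(c(d1 = "word vectors capture meaning",
                 d2 = "vectors of words capture semantics"),
               remove_punct = TRUE)

# Compare the corpus with itself; keep the word vectors for reuse.
res <- similR(toks, toks2 = NULL, ik = 2, keep_vec = TRUE)
res$simmat   # the document similarity matrix

# Reuse the precomputed embedding to skip the GloVe step on a second run.
res2 <- similR(toks, vec = res$vec, ik = 2)
```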