textmodel_lsa (R Documentation)
Train a Latent Semantic Analysis model (Deerwester et al., 1990) on a quanteda::tokens object.
textmodel_lsa(
x,
dim = 50,
min_count = 5L,
engine = c("RSpectra", "irlba", "rsvd"),
weight = "count",
verbose = FALSE,
...
)
Arguments:

x: a quanteda::tokens object.
dim: the size of the word vectors.
min_count: the minimum frequency of the words. Words less frequent than this in x are removed before training.
engine: the engine used to perform SVD to generate the word vectors.
weight: the weighting scheme applied to the document-feature matrix before the SVD.
verbose: if TRUE, print the progress of training.
...: additional arguments.
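The arguments above correspond to the usual LSA pipeline: build a document-feature matrix from the tokens, drop words rarer than min_count, apply the weighting scheme, and take a truncated SVD with the chosen engine. The code below is only a conceptual sketch of that pipeline under these assumptions, not the package's actual implementation; the toks_demo and wv names and the use of data_corpus_inaugural are purely illustrative.

library(quanteda)
library(RSpectra)
toks_demo <- tokens(data_corpus_inaugural, remove_punct = TRUE)  # any tokens object
dfmat <- dfm(toks_demo) %>%                  # document-feature matrix
    dfm_trim(min_termfreq = 5) %>%           # analogous to min_count = 5
    dfm_weight(scheme = "count")             # analogous to weight = "count"
sv <- svds(as(dfmat, "dgCMatrix"), k = 50)   # truncated SVD (engine = "RSpectra")
wv <- sv$v %*% diag(sv$d)                    # one 50-dimensional vector per word
rownames(wv) <- colnames(dfmat)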
Returns a textmodel_wordvector object with the following elements:
values: a matrix of word vector values.
weights: a matrix of word vector weights.
frequency: the frequency of the words in x.
engine: the SVD engine used.
weight: the weighting scheme.
concatenator: the concatenator in x.
call: the command used to execute the function.
version: the version of the wordvector package.
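Assuming the fitted object can be inspected like a plain list (element access via $ is an assumption made here for illustration, reusing the lsa object trained in the Examples below), the elements listed above can be examined directly:

dim(lsa$values)       # number of words x dim
head(lsa$frequency)   # training frequencies of the words
lsa$engine            # which SVD engine was used
lsa$weight            # weighting scheme recorded in the fitted object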
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391–407.
library(quanteda)
library(wordvector)
# pre-processing
corp <- corpus_reshape(data_corpus_news2014)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>%
    tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>%
    tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                  padding = TRUE) %>%
    tokens_tolower()
# train LSA
lsa <- textmodel_lsa(toks, dim = 50, min_count = 5, verbose = TRUE)
# find similar words
head(similarity(lsa, c("berlin", "germany", "france"), mode = "words"))
head(similarity(lsa, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))
head(similarity(lsa, analogy(~ berlin - germany + france)))