textmodel_lsa: Latent Semantic Analysis model
In wordvector: Word and Document Vector Models

View source: R/lsa.R

Description

Train a Latent Semantic Analysis model (Deerwester et al., 1990) on a quanteda::tokens object.

Usage

textmodel_lsa(
  x,
  dim = 50,
  min_count = 5L,
  engine = c("RSpectra", "irlba", "rsvd"),
  weight = "count",
  tolower = TRUE,
  verbose = FALSE,
  ...
)

Arguments

`x`	a quanteda::tokens or quanteda::tokens_xptr object.
`dim`	the number of dimensions of the word vectors.
`min_count`	the minimum frequency of the words. Words less frequent than this in `x` are removed before training.
`engine`	the engine used to perform the SVD that generates the word vectors.
`weight`	weighting scheme passed to `quanteda::dfm_weight()`.
`tolower`	if `TRUE`, lower-case all the tokens before fitting the model.
`verbose`	if `TRUE`, print the progress of training.
`...`	additional arguments.
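
To make the roles of `dim`, `min_count`, `weight` and `tolower` more concrete, the following base-R sketch builds a count-weighted document-feature matrix from a toy corpus and reduces it with a truncated SVD. It is an illustration under assumptions only: the toy documents, the variable names (`docs`, `k`, `word_vectors`) and the use of `base::svd()` are not taken from the package, and `textmodel_lsa()` may differ in its details.

```r
# Toy corpus (hypothetical data): each element is one tokenized document.
docs <- list(
  c("Berlin", "is", "the", "capital", "of", "Germany"),
  c("Paris", "is", "the", "capital", "of", "France"),
  c("Berlin", "and", "Paris", "are", "cities"),
  c("Germany", "and", "France", "are", "countries")
)

# Analogue of tolower = TRUE: lower-case every token.
docs <- lapply(docs, tolower)

# Analogue of weight = "count": a documents x features matrix of raw counts.
vocab <- sort(unique(unlist(docs)))
counts <- vapply(
  docs,
  function(d) as.numeric(tabulate(match(d, vocab), nbins = length(vocab))),
  numeric(length(vocab))
)
dfmat <- t(counts)
colnames(dfmat) <- vocab

# Analogue of min_count: drop features whose total frequency is too low.
min_count <- 2L
dfmat <- dfmat[, colSums(dfmat) >= min_count, drop = FALSE]

# Analogue of dim: keep only the first k right singular vectors,
# giving one k-dimensional vector per remaining feature.
k <- 2L
decomp <- svd(dfmat, nu = 0, nv = k)
word_vectors <- decomp$v
rownames(word_vectors) <- colnames(dfmat)

dim(word_vectors)
```

In this sketch, "cities" and "countries" occur only once and are therefore removed before the decomposition, which mirrors the description of `min_count` above.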

Value

Returns a textmodel_wordvector object with the following elements:

`values`	a matrix of word vector values.
`weights`	a matrix of word vector weights.
`frequency`	the frequency of words in `x`.
`engine`	the SVD engine used.
`weight`	the weighting scheme used.
`min_count`	the value of `min_count`.
`concatenator`	the concatenator in `x`.
`call`	the command used to execute the function.
`version`	the version of the wordvector package.

References

Deerwester, S. C., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

Examples


library(quanteda)
library(wordvector)

# pre-processing
corp <- corpus_reshape(data_corpus_news2014)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>% 
   tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>% 
   tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                 padding = TRUE) %>% 
   tokens_tolower()

# train LSA
lsa <- textmodel_lsa(toks, dim = 50, min_count = 5, verbose = TRUE)

# find similar words
head(similarity(lsa, c("berlin", "germany", "france"), mode = "words"))
head(similarity(lsa, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))
head(similarity(lsa, analogy(~ berlin - germany + france)))
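
The analogy query in the last example (`berlin - germany + france`) can be read as vector arithmetic followed by a nearest-neighbour search. The base-R sketch below illustrates that idea on a small hand-made matrix. The vectors are invented for illustration, the helper `cosine()` is hypothetical, and the use of cosine similarity is an assumption, so this is not necessarily how `similarity()` or `analogy()` are implemented.

```r
# Hypothetical 3-dimensional word vectors, invented for illustration only.
vecs <- rbind(
  berlin  = c(1, 0, 1),
  germany = c(1, 0, 0),
  france  = c(0, 1, 0),
  paris   = c(0, 1, 1)
)

# Cosine similarity between every row of m and a query vector q
# (hypothetical helper, not part of the wordvector package).
cosine <- function(m, q) {
  drop(m %*% q) / (sqrt(rowSums(m^2)) * sqrt(sum(q^2)))
}

# berlin - germany + france as plain vector arithmetic.
query <- vecs["berlin", ] - vecs["germany", ] + vecs["france", ]

# Rank all words by their similarity to the query vector.
sims <- sort(cosine(vecs, query), decreasing = TRUE)
sims
```

With these made-up vectors the query coincides with the vector for "paris", so "paris" is ranked first.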

wordvector documentation built on June 20, 2025, 9:08 a.m.