textmodel_lsa: Latent Semantic Analysis model

textmodel_lsa    R Documentation

Latent Semantic Analysis model

Description

Train a Latent Semantic Analysis model (Deerwester et al., 1990) on a quanteda::tokens object.
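
As a rough illustration of the idea behind LSA, the following base R sketch builds a small document-by-word count matrix and derives word vectors from a truncated SVD. It is not this package's implementation: the toy documents, the variable names, and the choice to scale the right singular vectors by the singular values are all assumptions made for the example.

```r
# Toy corpus (hypothetical data): each element is one tokenised document.
docs <- list(
  c("berlin", "germany", "capital"),
  c("paris", "france", "capital"),
  c("berlin", "germany", "city"),
  c("paris", "france", "city")
)

# Document-by-word count matrix built with base R only.
vocab <- sort(unique(unlist(docs)))
counts <- t(vapply(
  docs,
  function(d) tabulate(match(d, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(counts) <- vocab

# Truncated SVD: keep only the first k singular vectors.
k <- 2
s <- svd(counts, nu = k, nv = k)

# Word vectors: right singular vectors scaled by the singular values
# (one common convention, assumed here for illustration).
word_vectors <- s$v %*% diag(s$d[seq_len(k)], nrow = k)
rownames(word_vectors) <- vocab

# Cosine similarity between two word vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# "berlin" and "germany" occur in exactly the same documents,
# so their vectors point in the same direction.
sim_bg <- cosine(word_vectors["berlin", ], word_vectors["germany", ])
sim_bg
```

Words that appear in the same documents end up with similar vectors, which is what the similarity queries in the Examples section rely on.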

Usage

textmodel_lsa(
  x,
  dim = 50,
  min_count = 5L,
  engine = c("RSpectra", "irlba", "rsvd"),
  weight = "count",
  tolower = TRUE,
  verbose = FALSE,
  ...
)

Arguments

x

a quanteda::tokens or quanteda::tokens_xptr object.

dim

the size of the word vectors.

min_count

the minimum frequency of the words. Words less frequent than this in x are removed before training.

engine

the engine used to perform SVD to generate word vectors.

weight

weighting scheme passed to quanteda::dfm_weight().

tolower

if TRUE, lower-case all the tokens before fitting the model.

verbose

if TRUE, print the progress of training.

...

additional arguments.
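
The effect of min_count can be pictured with a small base R sketch. This only illustrates frequency-based filtering on a plain character vector; the toy tokens and variable names are hypothetical, and the package may apply the threshold differently internally.

```r
# Toy token stream (hypothetical data).
toy_tokens <- c("a", "b", "a", "c", "a", "b")

# Count how often each word occurs.
freq <- table(toy_tokens)

# Keep only words that occur at least min_count times.
min_count <- 2L
keep <- names(freq)[freq >= min_count]
filtered <- toy_tokens[toy_tokens %in% keep]
filtered
```

Here "c" occurs only once, so it is dropped, while "a" and "b" are retained.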

Value

Returns a textmodel_wordvector object with the following elements:

values

a matrix of word vector values.

weights

a matrix of word vector weights.

frequency

the frequency of words in x.

engine

the SVD engine used.

weight

weighting scheme.

min_count

the value of min_count.

concatenator

the concatenator in x.

call

the command used to execute the function.

version

the version of the wordvector package.

References

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391–407.

Examples


library(quanteda)
library(wordvector)

# pre-processing
corp <- corpus_reshape(data_corpus_news2014)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>% 
   tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>% 
   tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                 padding = TRUE) %>% 
   tokens_tolower()

# train LSA
lsa <- textmodel_lsa(toks, dim = 50, min_count = 5, verbose = TRUE)

# find similar words
head(similarity(lsa, c("berlin", "germany", "france"), mode = "words"))
head(similarity(lsa, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))
head(similarity(lsa, analogy(~ berlin - germany + france)))


wordvector documentation built on June 20, 2025, 9:08 a.m.