textmodel_lsa: Latent Semantic Analysis model


textmodel_lsa    R Documentation

Latent Semantic Analysis model

Description

Train a Latent Semantic Analysis model (Deerwester et al., 1990) on a quanteda::tokens object.
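As a rough illustration of the idea behind LSA, the sketch below applies a truncated singular value decomposition to a tiny made-up document-feature matrix using only base R. The toy counts, the matrix orientation, and the choice to scale the right singular vectors by the singular values are assumptions for illustration; the package may construct and scale its vectors differently.

```r
# Toy document-feature matrix: rows are documents, columns are words.
# The counts are made up for illustration only.
m <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 3, 1,
    0, 0, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("berlin", "germany", "paris", "france"))
)

k <- 2                       # number of latent dimensions to keep
s <- svd(m, nu = k, nv = k)  # base R SVD; keep the first k singular vectors

# One common choice (assumed here): take the right singular vectors,
# scaled by the singular values, as k-dimensional word vectors.
word_vectors <- s$v %*% diag(s$d[1:k])
rownames(word_vectors) <- colnames(m)

print(dim(word_vectors))
```

Each row of word_vectors is then a dense representation of one word, with k playing the role of the dim argument.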

Usage

textmodel_lsa(
  x,
  dim = 50,
  min_count = 5L,
  engine = c("RSpectra", "irlba", "rsvd"),
  weight = "count",
  verbose = FALSE,
  ...
)

Arguments

x

a quanteda::tokens object.

dim

the size of the word vectors.

min_count

the minimum frequency of the words. Words less frequent than this in x are removed before training.

engine

the engine used to perform SVD when generating word vectors.

weight

weighting scheme passed to quanteda::dfm_weight().

verbose

if TRUE, print the progress of training.

...

additional arguments.
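To make the effect of min_count and weight more concrete, here is a minimal base R sketch on made-up data. It mimics frequency filtering and proportional weighting by hand; the toy words and counts are hypothetical, and the function itself works on a quanteda::tokens object and delegates weighting to quanteda::dfm_weight() rather than using this code.

```r
# Hypothetical toy tokens; in practice x would be a quanteda::tokens object.
words <- c("tax", "tax", "tax", "vote", "vote", "zeitgeist")
min_count <- 2L

# min_count: keep only words that occur at least min_count times.
freq <- table(words)
keep <- names(freq)[freq >= min_count]

# weight: "count" leaves raw counts as they are; "prop" (one of the schemes
# offered by quanteda::dfm_weight()) divides each row by its row total.
counts <- matrix(c(3, 1, 1, 1), nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), keep))
props <- counts / rowSums(counts)

print(keep)
print(props)
```

In this toy case "zeitgeist" occurs only once and is dropped, and every row of props sums to 1.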

Value

Returns a textmodel_wordvector object with the following elements:

values

a matrix of word vector values.

weights

a matrix of word vector weights.

frequency

the frequency of words in x.

engine

the SVD engine used.

weight

the weighting scheme used.

concatenator

the concatenator in x.

call

the command used to execute the function.

version

the version of the wordvector package.

References

Deerwester, S. C., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

Examples


library(quanteda)
library(wordvector)

# pre-processing
corp <- corpus_reshape(data_corpus_news2014)
# the native pipe |> (R >= 4.1) avoids depending on magrittr being attached
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) |>
  tokens_remove(stopwords("en", "marimo"), padding = TRUE) |>
  tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                padding = TRUE) |>
  tokens_tolower()

# train LSA
lsa <- textmodel_lsa(toks, dim = 50, min_count = 5, verbose = TRUE)

# find similar words
head(similarity(lsa, c("berlin", "germany", "france"), mode = "words"))
head(similarity(lsa, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))
head(similarity(lsa, analogy(~ berlin - germany + france)))
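The queries above combine word vectors and rank other words by how close they are to the result. The base R sketch below shows that idea with cosine similarity on made-up three-dimensional vectors. The vectors and the helper cosine() are hypothetical, and the use of cosine similarity is an assumption about how similarity() scores words, not something stated on this page.

```r
# Made-up 3-dimensional word vectors, for illustration only.
vecs <- rbind(
  berlin  = c(1.0, 0.9, 0.1),
  germany = c(0.1, 1.0, 0.1),
  paris   = c(1.0, 0.1, 0.9),
  france  = c(0.1, 0.1, 1.0)
)

# Hypothetical helper: cosine similarity between a query vector q
# and every row of the matrix m.
cosine <- function(q, m) {
  drop(m %*% q) / (sqrt(rowSums(m^2)) * sqrt(sum(q^2)))
}

# Weighted sum of word vectors: berlin - germany + france
query <- vecs["berlin", ] - vecs["germany", ] + vecs["france", ]

# Rank all words by similarity to the query, most similar first.
sims <- sort(cosine(query, vecs), decreasing = TRUE)
print(names(sims)[1])
```

With these toy vectors the top-ranked word is "paris", which is the kind of result the analogy berlin - germany + france is meant to retrieve.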


wordvector documentation built on April 12, 2025, 2:23 a.m.