View source: R/textmodel_lsa.R
textmodel_lsa (R Documentation)
Fit the Latent Semantic Analysis scaling model to a dfm, which may be weighted (for instance using quanteda::dfm_tfidf()).
textmodel_lsa(x, nd = 10, margin = c("both", "documents", "features"))
x: the dfm on which the model will be fit
nd: the number of dimensions to be included in the output
margin: the margin to be smoothed by the SVD; one of "both" (the default), "documents", or "features"
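As the description notes, the dfm can be weighted before fitting. A minimal sketch, assuming tf-idf weighting via quanteda::dfm_tfidf(); the nd value is only illustrative:
library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
dfmat_tfidf <- dfm_tfidf(dfmat)               # weight the dfm before fitting
tmod_w <- textmodel_lsa(dfmat_tfidf, nd = 5)  # fit LSA on the weighted dfm
head(tmod_w$docs)                             # document coordinates in the weighted space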
The svds() function from the RSpectra package is used to compute the truncated SVD efficiently.
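A minimal sketch of the underlying computation on a toy matrix rather than a real dfm; RSpectra::svds() returns only the leading k singular triplets, which is what makes the truncated SVD fast on large sparse inputs:
library("RSpectra")
set.seed(10)
m <- matrix(rpois(200, 1), nrow = 10)  # toy 10 x 20 document-feature matrix
sv <- svds(m, k = 5)                   # only the 5 leading singular triplets
sv$d                                   # leading singular values (the role played by sk)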
A textmodel_lsa class object, a list containing:
sk: a numeric vector containing the d values from the SVD
docs: document coordinates from the SVD (u)
features: feature coordinates from the SVD (v)
matrix_low_rank: the low-rank approximation of the input, i.e. the product u d v'
data: the input data as a CsparseMatrix from the Matrix package
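A minimal sketch of how these components fit together, assuming docs is u (documents by nd), sk holds the singular values d, and features is v (features by nd); under those assumptions the low-rank matrix is the product u d v':
library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))[1:10, ]
tmod <- textmodel_lsa(dfmat, nd = 5)
reconstructed <- tmod$docs %*% diag(tmod$sk) %*% t(tmod$features)
reconstructed[1:3, 1:3]
tmod$matrix_low_rank[1:3, 1:3]  # should closely match the reconstruction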
The number of dimensions nd retained in LSA is an empirical issue. Reducing the number of dimensions can remove much of the noise, but keeping too few dimensions or factors may lose important information.
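One common heuristic, not a rule prescribed by the package: inspect the decay of the singular values in sk and keep the dimensions before the curve flattens. A minimal sketch:
library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
tmod <- textmodel_lsa(dfmat, nd = 10)
plot(tmod$sk, type = "b", xlab = "dimension", ylab = "singular value")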
Haiyan Wang and Kohei Watanabe
Rosario, B. (2000). Latent Semantic Indexing: An Overview. Technical report INFOSYS 240 Spring Paper, University of California, Berkeley.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6): 391.
predict.textmodel_lsa(), coef.textmodel_lsa()
library("quanteda")
dfmat <- dfm(tokens(data_corpus_irishbudget2010))
# create an LSA space and return its truncated representation in the low-rank space
tmod <- textmodel_lsa(dfmat[1:10, ])
head(tmod$docs)
# matrix in the low-rank LSA space
tmod$matrix_low_rank[,1:5]
# fold queries into the space generated by dfmat[1:10,]
# and return their truncated representations in the new low-rank space
pred <- predict(tmod, newdata = dfmat[11:14, ])
pred$docs_newspace
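For comparison, a minimal sketch of the textbook fold-in computation from Deerwester et al. (1990), projecting the query rows as q v d^-1; this illustrates the idea only and is not necessarily the exact internal computation used by predict():
# q_hat = q %*% v %*% diag(1/d): illustrative fold-in, not taken from the package source
manual <- as.matrix(dfmat[11:14, ]) %*% tmod$features %*% diag(1 / tmod$sk)
head(manual)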