In this vignette, we show how to perform Latent Semantic Analysis (LSA) using the quanteda package, based on the example in Grossman and Frieder's *Information Retrieval: Algorithms and Heuristics*.
LSA decomposes the document-feature matrix into a reduced vector space that is assumed to reflect semantic structure.
New documents or queries can be 'folded-in' to this constructed latent semantic space for downstream tasks.
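The decomposition underlying LSA is a truncated singular value decomposition (SVD). As a minimal base-R sketch of the idea, using a small made-up term-document matrix (not the example used below):

```r
# Toy term-document matrix (rows = features, columns = documents); values invented
tdm <- matrix(c(1, 0, 1,
                0, 2, 0,
                1, 1, 1), nrow = 3, byrow = TRUE)
s <- svd(tdm)
k <- 2  # retain the first k singular values/vectors

# Rank-k approximation: U_k %*% diag(d_k) %*% t(V_k)
tdm_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# This toy matrix happens to have rank 2, so the rank-2 reconstruction is exact
max(abs(tdm - tdm_k))

# Scaled right singular vectors give document coordinates in the reduced space
doc_coords <- s$v[, 1:k] %*% diag(s$d[1:k])
```

In a real corpus the document-feature matrix has much higher rank, and truncating to a small k discards noise while retaining the dominant co-occurrence structure.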
library("quanteda")
txt <- c(d1 = "Shipment of gold damaged in a fire",
         d2 = "Delivery of silver arrived in a silver truck",
         d3 = "Shipment of gold arrived in a truck")
dfmat <- txt |>
    tokens() |>
    dfm()
dfmat
library("quanteda.textmodels")
tmod_lsa <- textmodel_lsa(dfmat)
The document vector coordinates in the reduced 2-dimensional space are:
tmod_lsa$docs[, 1:2]
An unseen document can now be represented in the same reduced 2-dimensional space. The unseen query document:
dfmat_test <- tokens("gold silver truck") |>
    dfm() |>
    dfm_match(features = featnames(dfmat))
dfmat_test
pred_lsa <- predict(tmod_lsa, newdata = dfmat_test)
pred_lsa$docs_newspace[, 1:2]
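Folding-in follows the standard projection for a query vector q, namely q %*% U_k %*% diag(1/sigma_k). Assuming the fitted object exposes the feature loadings and singular values as `$features` and `$sk` (as the `textmodel_lsa` return value does in current quanteda.textmodels versions), the prediction above can be sketched manually:

```r
# Manual fold-in: project the query into the latent space
# Assumption: tmod_lsa$features holds U and tmod_lsa$sk the singular values
q <- as.matrix(dfmat_test)
q_hat <- q %*% tmod_lsa$features %*% diag(1 / tmod_lsa$sk)
q_hat[, 1:2]  # under this convention, matches pred_lsa$docs_newspace[, 1:2]
```

This makes explicit that folding-in does not refit the model; the query is simply mapped through the existing decomposition.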