
In this vignette, we show how to perform Latent Semantic Analysis (LSA) using the quanteda and quanteda.textmodels packages, based on Grossman and Frieder's Information Retrieval: Algorithms and Heuristics.

LSA decomposes a document-feature matrix into a lower-dimensional vector space that is assumed to reflect semantic structure.
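
In matrix terms, the decomposition behind LSA is a truncated singular value decomposition. The sketch below assumes that X denotes the n-by-m document-feature matrix with documents in rows (the orientation a dfm uses) and that k is the number of retained dimensions. The symbols are introduced here for illustration only.

```latex
X = U \Sigma V^{\top},
\qquad
X \approx X_k = U_k \Sigma_k V_k^{\top}
```

Here U_k holds the first k left singular vectors (one row of coordinates per document), V_k the first k right singular vectors (one row per feature), and \Sigma_k the diagonal matrix of the k largest singular values.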

New documents or queries can then be "folded in" to this latent semantic space for downstream tasks.
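
Folding in is usually written as a projection onto the retained singular vectors. The formula below is the standard one from the LSA literature, stated under the assumption that q is a 1-by-m row vector of feature counts for the new document, that V_k is the m-by-k matrix of right singular vectors of the document-feature matrix, and that \Sigma_k is the diagonal matrix of the k largest singular values. This notation is illustrative and is not taken from the package.

```latex
\hat{q} = q \, V_k \, \Sigma_k^{-1}
```

The result \hat{q} is a 1-by-k vector that can be compared directly with the coordinates of the original documents, for example by cosine similarity.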

library("quanteda")

Create a document-feature matrix

txt <- c(d1 = "Shipment of gold damaged in a fire",
         d2 = "Delivery of silver arrived in a silver truck",
         d3 = "Shipment of gold arrived in a truck" )

dfmat <- txt |> 
    tokens() |> 
    dfm()
dfmat

Construct the LSA model

library("quanteda.textmodels")
tmod_lsa <- textmodel_lsa(dfmat)

The document vector coordinates in the reduced two-dimensional space are:

tmod_lsa$docs[, 1:2]
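
As a cross-check, similar coordinates can be sketched with base R's svd() alone. The count matrix below is typed in by hand from the three example texts, assuming lowercased tokens and features ordered by first appearance. The column order is an assumption and does not affect the document coordinates. The names `X`, `dec`, `doc_coords`, and `X_k` are illustrative. Singular vectors are only determined up to sign, so the values may differ in sign from those of any other SVD routine.

```r
# Document-feature counts for d1-d3, entered by hand (an assumption about
# how the texts tokenize: lowercased, one feature per distinct word).
feats <- c("shipment", "of", "gold", "damaged", "in", "a", "fire",
           "delivery", "silver", "arrived", "truck")
X <- rbind(d1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
           d2 = c(0, 1, 0, 0, 1, 1, 0, 1, 2, 1, 1),
           d3 = c(1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1))
colnames(X) <- feats

# Full SVD: X = U diag(d) V'
dec <- svd(X)

# Keep the first k = 2 dimensions; rows of U are document coordinates
k <- 2
doc_coords <- dec$u[, 1:k]
rownames(doc_coords) <- rownames(X)
doc_coords

# Rank-2 approximation of the original count matrix
X_k <- dec$u[, 1:k] %*% diag(dec$d[1:k]) %*% t(dec$v[, 1:k])
dimnames(X_k) <- dimnames(X)
round(X_k, 2)
```

Using base svd() here keeps the sketch free of extra dependencies. It is not a claim about which SVD routine textmodel_lsa() uses internally.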

Apply the constructed LSA model to new data

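A new, unseen document can now be represented in the same reduced two-dimensional space. First, build the dfm for the query document, using dfm_match() so that its features align with those of the original dfm: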

dfmat_test <- tokens("gold silver truck") |> 
    dfm() |> 
    dfm_match(features = featnames(dfmat))
dfmat_test

pred_lsa <- predict(tmod_lsa, newdata = dfmat_test)
pred_lsa$docs_newspace[, 1:2]
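
The fold-in step can also be sketched in base R, assuming it follows the standard projection of the new count vector onto the retained right singular vectors, scaled by the inverse singular values. The hand-entered matrix `X` and the helper `fold_in()` are hypothetical names used for illustration. They are not part of quanteda. Signs may differ from other SVD routines.

```r
# Hand-entered document-feature counts for d1-d3 (assumed tokenization:
# lowercased words, features in order of first appearance).
feats <- c("shipment", "of", "gold", "damaged", "in", "a", "fire",
           "delivery", "silver", "arrived", "truck")
X <- rbind(d1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
           d2 = c(0, 1, 0, 0, 1, 1, 0, 1, 2, 1, 1),
           d3 = c(1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1))
colnames(X) <- feats

dec <- svd(X)
k <- 2
V_k <- dec$v[, 1:k]              # feature coordinates (11 x 2)
S_k_inv <- diag(1 / dec$d[1:k])  # inverse of the top-k singular values

# Hypothetical helper: project rows of a count matrix into the latent space
fold_in <- function(newx) newx %*% V_k %*% S_k_inv

# Query "gold silver truck" as a 1 x 11 count vector over the same features
q <- matrix(as.numeric(feats %in% c("gold", "silver", "truck")),
            nrow = 1, dimnames = list("query", feats))
q_hat <- fold_in(q)
q_hat

# Sanity check: folding in the original documents recovers their coordinates
stopifnot(isTRUE(all.equal(fold_in(X), dec$u[, 1:k],
                           check.attributes = FALSE)))
```

The final check works because X V_k = U_k \Sigma_k for any SVD, so projecting a training document returns its own row of U_k.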