In this vignette, we show how to perform Latent Semantic Analysis (LSA) using the quanteda and quanteda.textmodels packages, based on Grossman and Frieder's Information Retrieval: Algorithms and Heuristics.

LSA decomposes a document-feature matrix into a reduced vector space that is assumed to reflect semantic structure.
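As a rough illustration of what "decomposing into a reduced vector space" means, the base R sketch below truncates the singular value decomposition of a small, made-up count matrix to k = 2 dimensions. It does not claim to show how textmodel_lsa() is implemented; the matrix and the names x, k, u_k, d_k, v_k and x_k are assumptions introduced only for this example.

```r
# Illustrative sketch only: base R, hypothetical toy data.
# Rows are documents, columns are features (counts).
x <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("gold", "silver", "truck", "fire")))

k <- 2            # number of latent dimensions to keep
dec <- svd(x)     # x = U %*% diag(d) %*% t(V)

u_k <- dec$u[, 1:k, drop = FALSE]   # document coordinates in the latent space
d_k <- dec$d[1:k]                   # the k largest singular values
v_k <- dec$v[, 1:k, drop = FALSE]   # feature coordinates in the latent space

# Rank-k approximation of the original matrix.
x_k <- u_k %*% diag(d_k) %*% t(v_k)
round(x_k, 2)
```

Keeping only the leading singular vectors discards the directions that explain the least variation in the counts, which is the sense in which the reduced space is a compressed view of the original matrix.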

New documents or queries can then be "folded in" to this latent semantic space for downstream tasks.
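Folding in is commonly written as projecting a new count vector q onto the retained feature vectors and rescaling by the inverse singular values, i.e. q V_k D_k^{-1}. The base R sketch below applies that formula to the same kind of made-up matrix; the data and all object names (x, q, q_k, doc1_k) are assumptions for illustration, and the sketch is not presented as the internals of predict().

```r
# Illustrative sketch only: base R, hypothetical toy data.
x <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("gold", "silver", "truck", "fire")))

k <- 2
dec <- svd(x)
u_k <- dec$u[, 1:k, drop = FALSE]
d_k <- dec$d[1:k]
v_k <- dec$v[, 1:k, drop = FALSE]

# A new document expressed over the same features, in the same column order.
q <- matrix(c(1, 1, 1, 0), nrow = 1,
            dimnames = list("query", colnames(x)))

# Fold-in: project onto the feature vectors, then rescale by 1 / singular value.
q_k <- q %*% v_k %*% diag(1 / d_k)

# Sanity check: folding in an existing document recovers its own coordinates.
doc1_k <- x["doc1", , drop = FALSE] %*% v_k %*% diag(1 / d_k)
all.equal(as.numeric(doc1_k), u_k[1, ])
```

The sanity check is the useful part of the design: because the fold-in reproduces the coordinates of documents that were in the original matrix, new documents land in the same space and can be compared with them directly.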

library("quanteda")

Create a document-feature matrix

txt <- c(d1 = "Shipment of gold damaged in a fire",
         d2 = "Delivery of silver arrived in a silver truck",
         d3 = "Shipment of gold arrived in a truck" )

dfmat <- txt |> 
    tokens() |> 
    dfm()
dfmat

Construct the LSA model

library("quanteda.textmodels")
tmod_lsa <- textmodel_lsa(dfmat)

The coordinates of the document vectors in the reduced 2-dimensional space are:

tmod_lsa$docs[, 1:2]

Apply the constructed LSA model to new data

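A new, unseen document can now be represented in the same reduced 2-dimensional space. First, build a document-feature matrix for the unseen query document, matching its features to those of the original matrix: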

dfmat_test <- tokens("gold silver truck") |> 
    dfm() |> 
    dfm_match(features = featnames(dfmat))
dfmat_test

pred_lsa <- predict(tmod_lsa, newdata = dfmat_test)
pred_lsa$docs_newspace[, 1:2]
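One possible downstream task is ranking the original documents against the folded-in query by cosine similarity. The sketch below does this in base R, recomputing the decomposition by hand instead of reusing tmod_lsa and pred_lsa, so individual coordinates may differ in sign from the output above (cosine similarities are unaffected by that). The whitespace tokenizer and the helpers count_row() and cosine() are hypothetical and exist only for this example.

```r
# Illustrative sketch only: base R, no quanteda functions involved.
txt <- c(d1 = "Shipment of gold damaged in a fire",
         d2 = "Delivery of silver arrived in a silver truck",
         d3 = "Shipment of gold arrived in a truck")

# Naive whitespace tokenization (an assumption; tokens() is more careful).
toks <- strsplit(tolower(txt), " ", fixed = TRUE)
feats <- sort(unique(unlist(toks)))

# Count the known features in one token vector; unknown tokens are ignored.
count_row <- function(tk) {
  as.numeric(tabulate(match(tk, feats), nbins = length(feats)))
}

# Count matrix with documents in rows and features in columns.
x <- t(vapply(toks, count_row, numeric(length(feats))))
colnames(x) <- feats

# Truncated SVD with k = 2 latent dimensions.
k <- 2
dec <- svd(x)
docs_k <- dec$u[, 1:k, drop = FALSE]
rownames(docs_k) <- rownames(x)
v_k <- dec$v[, 1:k, drop = FALSE]
d_k <- dec$d[1:k]

# Fold the query into the latent space.
q <- matrix(count_row(c("gold", "silver", "truck")), nrow = 1)
q_k <- q %*% v_k %*% diag(1 / d_k)

# Cosine similarity between the query and each document, highest first.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
sims <- apply(docs_k, 1, cosine, b = as.numeric(q_k))
ranking <- sort(sims, decreasing = TRUE)
ranking
```

With these three documents the query is closest to d2, followed by d3 and then d1.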


quanteda/quanteda documentation built on May 5, 2024, 8:33 p.m.