In this vignette, we show how to perform Latent Semantic Analysis (LSA) using the quanteda package, following the example in Grossman and Frieder's Information Retrieval: Algorithms and Heuristics.

LSA decomposes a document-feature matrix into a reduced vector space that is assumed to reflect the semantic structure of the documents.
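
As a rough illustration of this decomposition, the sketch below applies base R's svd() to a small hand-built count matrix and keeps the first two dimensions. The matrix m, its feature names, and the choice k = 2 are assumptions made for illustration; this is not necessarily how textmodel_lsa() computes the model internally.

```r
# Toy document-feature matrix (assumed data): 3 documents x 4 features.
m <- matrix(c(1, 1, 0, 0,
              0, 1, 1, 0,
              0, 0, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"),
                            c("f1", "f2", "f3", "f4")))

k <- 2          # number of latent dimensions to keep (assumed)
dec <- svd(m)   # m = U %*% diag(d) %*% t(V)

U_k <- dec$u[, 1:k, drop = FALSE]   # document coordinates
S_k <- diag(dec$d[1:k], nrow = k)   # k largest singular values
V_k <- dec$v[, 1:k, drop = FALSE]   # feature coordinates

# Rank-k approximation of the original matrix
m_k <- U_k %*% S_k %*% t(V_k)
dim(m_k)
```

The rows of U_k play the role of the document coordinates in the reduced space.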

New documents or queries can then be "folded in" to this latent semantic space for downstream tasks.
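
One common way to fold in a new document is to multiply its feature counts by the feature coordinates and the inverse of the singular values. The base R sketch below illustrates that idea on toy data; the matrix m, the query q, and the helper fold_in() are hypothetical and are not part of quanteda.

```r
# Toy document-feature matrix (assumed data): 3 documents x 4 features.
m <- matrix(c(1, 1, 0, 0,
              0, 1, 1, 0,
              0, 0, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"),
                            c("f1", "f2", "f3", "f4")))

k <- 2
dec <- svd(m)
S_k <- diag(dec$d[1:k], nrow = k)
V_k <- dec$v[, 1:k, drop = FALSE]

# Hypothetical helper: project a 1 x n_features count vector
# into the k-dimensional latent space.
fold_in <- function(x, V_k, S_k) x %*% V_k %*% solve(S_k)

# Sanity check: folding in an original document recovers its row of U.
d1_hat <- fold_in(m["d1", , drop = FALSE], V_k, S_k)
all.equal(as.numeric(d1_hat), dec$u[1, 1:k])

# A new query that uses features f2 and f3 once each.
q <- matrix(c(0, 1, 1, 0), nrow = 1)
q_hat <- fold_in(q, V_k, S_k)
q_hat
```

The sanity check works because the query is projected with the same feature coordinates and singular values that produced the original document coordinates.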

library(quanteda)

Create a document-feature matrix

txt <- c(d1="Shipment of gold damaged in a fire",
         d2="Delivery of silver arrived in a silver truck",
         d3="Shipment of gold arrived in a truck" )

mydfm <- dfm(txt)
mydfm

Construct the LSA model

mylsa <- textmodel_lsa(mydfm)

The coordinates of the document vectors in the reduced 2-dimensional space are:

mylsa$docs[, 1:2]

Apply the constructed LSA model to new data

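A new, unseen document can now be represented in the same reduced 2-dimensional space. The query document is first converted to a document-feature matrix whose features are selected to match those of mydfm, and is then projected with predict():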

querydfm <- dfm(c("gold silver truck")) %>%
    dfm_select(pattern = mydfm)
querydfm
newq <- predict(mylsa, querydfm)
newq$docs_newspace[, 1:2]
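
A typical downstream task is to rank the original documents by their cosine similarity to the folded-in query. The following base R sketch shows one way this might be done; the helper cosine_to_query() and the coordinates used here are made up for illustration and are not output from the model above.

```r
# Hypothetical helper: cosine similarity between each row of `docs`
# (documents x dimensions) and a single query vector.
cosine_to_query <- function(docs, query) {
  query <- as.numeric(query)
  drop(docs %*% query) / (sqrt(rowSums(docs^2)) * sqrt(sum(query^2)))
}

# Made-up 2-dimensional coordinates, for illustration only.
docs <- matrix(c(1, 0,
                 0, 1,
                 1, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("a", "b", "c"), NULL))
query <- c(1, 1)

sims <- cosine_to_query(docs, query)
sort(sims, decreasing = TRUE)   # most similar document first
```

With the objects from this vignette, the helper could presumably be called as cosine_to_query(mylsa$docs[, 1:2], newq$docs_newspace[, 1:2]), possibly after converting the inputs with as.matrix().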


koheiw/quanteda.core documentation built on Sept. 21, 2020, 3:44 p.m.