lsa | R Documentation |
Calculates a latent semantic space from a given document-term matrix.
lsa( x, dims=dimcalc_share() )
x |
a document-term matrix (recommended to be of class textmatrix), containing documents in columns, terms in rows, and occurrence frequencies in the cells. |
dims |
either the number of dimensions or a configuring function. |
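To illustrate what a configuring function for dims might look like, here is a minimal base-R sketch. It assumes that such a function receives the descending singular values and returns the number of dimensions to keep, and that a share-based rule picks the first position where the cumulative share of the singular values reaches a threshold. The helper share_rule() and the values in s are hypothetical; this is not the package's dimcalc_share() code.

```r
# Hypothetical share-based rule (an illustration, not dimcalc_share() itself):
# returns a function that maps singular values to a number of dimensions.
share_rule <- function(share = 0.5) {
  function(s) {
    # first position where the cumulative share of singular values
    # meets or exceeds the requested share
    which(cumsum(s) / sum(s) >= share)[1]
  }
}

s <- c(4, 3, 2, 1)            # made-up singular values, in descending order
k_half <- share_rule(0.5)(s)  # cumulative shares: 0.4, 0.7, 0.9, 1.0
k_most <- share_rule(0.8)(s)
```

With a fixed number, the same choice would simply be passed directly, e.g. dims=2.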
LSA combines the classical vector space model (well known in text mining) with a singular value decomposition (SVD), a two-mode factor analysis. Bag-of-words representations of texts can thereby be mapped into a modified vector space that is assumed to reflect semantic structure.
With lsa(), a new latent semantic space can be constructed over a given document-term matrix. To ease comparisons of terms and documents with common correlation measures, the space can be converted into a textmatrix of the same format as x by calling as.textmatrix().
To add more documents or queries to this latent semantic space without letting them influence the original factor distribution (i.e., the latent semantic structure calculated from a primary text corpus), they can be ‘folded in’ later on with the function fold_in().
Background information (see also Deerwester et al., 1990):
A document-term matrix M is constructed with textmatrix() from a given text base of n documents containing m terms. This matrix M of size m × n is then decomposed via a singular value decomposition into the term vector matrix T (constituting the left singular vectors), the document vector matrix D (constituting the right singular vectors), both being orthonormal, and the diagonal matrix S (constituting the singular values).
M = T S t(D)
These matrices are then reduced to the given number of dimensions k = dims, resulting in the truncated matrices Tk, Sk and Dk, which together form the latent semantic space.
Mk = T[,1:k] S[1:k,1:k] t(D[,1:k])
If these matrices Tk, Sk, Dk were multiplied, they would give a new matrix Mk (of the same format as M, i.e., rows are the same terms, columns are the same documents), which is the least-squares best fit approximation of M with k singular values.
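The decomposition and truncation described above can be sketched with base R's svd(). This is a minimal illustration on a made-up 4 × 3 matrix, not the internals of lsa(); the variable names and the toy values are assumptions chosen for the example.

```r
# Toy term-by-document matrix M (m = 4 terms, n = 3 documents); values are made up.
M <- matrix(c(1, 0, 1,
              1, 0, 0,
              1, 1, 0,
              0, 1, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("dog", "cat", "mouse", "pet"),
                            c("D1", "D2", "D3")))

dec <- svd(M)   # dec$u plays the role of T, dec$d holds the singular values, dec$v is D
k   <- 2        # number of dimensions to keep

# Truncated matrices Tk, Sk, Dk
Tk <- dec$u[, 1:k, drop = FALSE]
Sk <- diag(dec$d[1:k], nrow = k)
Dk <- dec$v[, 1:k, drop = FALSE]

# Mk has the same format as M: same terms in rows, same documents in columns
Mk <- Tk %*% Sk %*% t(Dk)
dimnames(Mk) <- dimnames(M)

# Using all singular values reproduces M (up to rounding)
M_full   <- dec$u %*% diag(dec$d) %*% t(dec$v)
err_full <- max(abs(M - M_full))

# The least-squares (Frobenius) error of Mk equals the norm of the dropped singular values
err_k   <- sqrt(sum((M - Mk)^2))
dropped <- sqrt(sum(dec$d[-(1:k)]^2))
```

Comparing err_k with dropped shows why Mk is the best fit with k singular values: the only information lost is what the discarded singular values carried.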
In the case of folding-in, i.e., multiplying new documents into a given latent semantic space, the matrices Tk and Sk remain unchanged and an additional Dk is created (without replacing the old one). All three are multiplied together to return a (new and appendable) document-term matrix Mnew in the term-order of M.
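The folding-in step can be sketched in base R as follows. This is an illustration consistent with the description above, assuming new document vectors are projected as t(Mq) Tk Sk^-1; it is not necessarily how fold_in() is implemented, and the matrices M and Mq are made-up examples.

```r
# Toy term-by-document matrix and its truncated space (values are made up)
M <- matrix(c(1, 0, 1,
              1, 0, 0,
              1, 1, 0,
              0, 1, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("dog", "cat", "mouse", "pet"),
                            c("D1", "D2", "D3")))
dec <- svd(M)
k   <- 2
Tk  <- dec$u[, 1:k, drop = FALSE]
Sk  <- diag(dec$d[1:k], nrow = k)
Dk  <- dec$v[, 1:k, drop = FALSE]

# New documents over the same terms, in the same term order as M
Mq <- matrix(c(1, 0,
               0, 0,
               1, 1,
               0, 2),
             nrow = 4, byrow = TRUE,
             dimnames = list(rownames(M), c("Q1", "Q2")))

# Tk and Sk stay unchanged; only an additional document matrix is computed
Dnew <- t(Mq) %*% Tk %*% solve(Sk)

# Multiply all three to obtain the folded-in matrix, in the term order of M
Mnew <- Tk %*% Sk %*% t(Dnew)
dimnames(Mnew) <- dimnames(Mq)

# Sanity check: folding in the original documents gives back Dk
Dcheck <- t(M) %*% Tk %*% solve(Sk)
```

Because Mnew shares its rows with M, it can be appended column-wise to the converted original space, for instance with cbind().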
LSAspace |
a list with components (Tk, Sk, Dk), representing the latent semantic space. |
Fridolin Wild fridolin.wild@wu-wien.ac.at
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990) Indexing by Latent Semantic Analysis. In: Journal of the American Society for Information Science 41(6), pp. 391–407.
Landauer, T., Foltz, P., and Laham, D. (1998) Introduction to Latent Semantic Analysis. In: Discourse Processes 25, pp. 259–284.
as.textmatrix, fold_in, textmatrix, gw_idf, dimcalc_share
# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("ham", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "pet", "pet"), file=paste(td, "D3", sep="/") )

# LSA
data(stopwords_en)
myMatrix = textmatrix(td, stopwords=stopwords_en)
myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)
myLSAspace = lsa(myMatrix, dims=dimcalc_share())
as.textmatrix(myLSAspace)

# clean up
unlink(td, recursive=TRUE)