# Create a vector space with Latent Semantic Analysis (LSA)

### Description

Calculates a latent semantic space from a given document-term matrix.

### Usage

```
lsa( x, dims=dimcalc_share() )
```

### Arguments

`x`: a document-term matrix (recommended to be of class textmatrix), containing documents in columns, terms in rows, and occurrence frequencies in the cells.

`dims`: either the number of dimensions to keep or a configuring function that determines it (the default is `dimcalc_share()`).
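To illustrate what a configuring function for `dims` might look like, here is a minimal base R sketch. It assumes that such a function receives the vector of singular values and returns the number of dimensions to keep. The helper `share_dims()` is hypothetical and is not part of the package; it only mimics the idea of a share-based rule.

```r
# Hypothetical helper (not part of the lsa package): build a rule that picks
# the smallest k whose leading singular values reach a given share of the
# sum of all singular values.
share_dims <- function(share = 0.5) {
  function(s) {
    which(cumsum(s) / sum(s) >= share)[1]
  }
}

s <- c(4, 3, 2, 1)           # toy singular values, in descending order
pick_k <- share_dims(0.5)    # the configured rule
k <- pick_k(s)               # k is 2 here: the first two values hold 70% of the total
print(k)
```

Passing a plain number instead of such a function fixes the dimensionality directly, without looking at the singular values.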

### Details

LSA combines the classical vector space model, well known in text mining, with a singular value decomposition (SVD), a two-mode factor analysis. This maps bag-of-words representations of texts into a modified vector space that is assumed to reflect semantic structure.

With `lsa()`, a new latent semantic space can be constructed over a given document-term matrix. To ease comparisons of terms and documents with common correlation measures, the space can be converted into a textmatrix of the same format as `x` by calling `as.textmatrix()`.

To add more documents or queries to this latent semantic space without letting them influence the original factor distribution (i.e., the latent semantic structure calculated from a primary text corpus), they can be ‘folded in’ later on with the function `fold_in()`.

Background information (see also Deerwester et al., 1990):

A document-term matrix *M* is constructed with `textmatrix()` from a given text base of *n* documents containing *m* terms. This matrix *M* of size *m* × *n* is then decomposed via a singular value decomposition into the term vector matrix *T* (the left singular vectors), the document vector matrix *D* (the right singular vectors), both orthonormal, and the diagonal matrix *S* (the singular values).

*M = T S t(D)*

These matrices are then reduced to the given number of dimensions *k = dims*, resulting in the truncated matrices *Tk*, *Sk* and *Dk*, which together form the latent semantic space.

*Mk = T\[,1:k\] S\[1:k,1:k\] t(D\[,1:k\])*

If these matrices *Tk, Sk, Dk* were multiplied, they would give a new
matrix *Mk* (of the same format as *M*, i.e., rows are the
same terms, columns are the same documents), which is the least-squares best
fit approximation of *M* with *k* singular values.
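The decomposition and truncation described above can be sketched with base R's `svd()` on a toy matrix. This is an illustration under assumptions, not the package's implementation: the matrix values and the choice *k = 2* are made up for the example.

```r
# Toy term-by-document matrix: 4 terms (rows) x 3 documents (columns).
M <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 1, 1,
              0, 0, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("dog", "cat", "mouse", "pet"),
                            c("D1", "D2", "D3")))

dec <- svd(M)                                    # dec$u = T, dec$d = diag(S), dec$v = D
M_full <- dec$u %*% diag(dec$d) %*% t(dec$v)     # M = T S t(D)

k  <- 2                                          # number of dimensions to keep
Tk <- dec$u[, 1:k, drop = FALSE]                 # truncated term vectors
Sk <- diag(dec$d[1:k], nrow = k)                 # truncated singular values
Dk <- dec$v[, 1:k, drop = FALSE]                 # truncated document vectors

Mk <- Tk %*% Sk %*% t(Dk)                        # rank-k approximation of M
dimnames(Mk) <- dimnames(M)                      # same terms, same documents

# Least-squares (Frobenius) error of the approximation; with one singular
# value dropped, it equals that dropped singular value.
err <- sqrt(sum((M - Mk)^2))
print(round(Mk, 3))
```

*Mk* has the same shape and labels as *M*, so it can be compared with the original cell by cell.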

In the case of folding-in, i.e., multiplying new documents into a given
latent semantic space, the matrices *Tk* and *Sk* remain unchanged
and an additional *Dk* is created (without replacing the old one).
All three are multiplied together to return a (new and appendable)
document-term matrix *Mnew* in the term-order of *M*.
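A minimal base R sketch of folding in is given below. It assumes the standard projection from Deerwester et al. (1990), in which the additional *Dk* is obtained as *t(v) Tk Sk⁻¹* for a new document vector *v*. The toy matrix, the new document, and all variable names are made up for illustration, and the package's own `fold_in()` may differ in detail.

```r
# Existing space: same toy term-by-document matrix (terms in rows).
M <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 1, 1,
              0, 0, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("dog", "cat", "mouse", "pet"),
                            c("D1", "D2", "D3")))
dec <- svd(M)
k   <- 2
Tk  <- dec$u[, 1:k, drop = FALSE]
Sk  <- diag(dec$d[1:k], nrow = k)

# New document, expressed over the same terms in the same row order as M.
v <- matrix(c(1, 0, 0, 1), ncol = 1,
            dimnames = list(rownames(M), "Dnew"))

# Additional document vectors for the new document; Tk and Sk stay unchanged.
Dk_new <- t(v) %*% Tk %*% solve(Sk)              # 1 x k

# Multiply all three to get an appendable matrix in the term order of M.
M_new <- Tk %*% Sk %*% t(Dk_new)                 # 4 x 1
dimnames(M_new) <- dimnames(v)

print(round(cbind(M, M_new), 3))
```

Because *Sk* cancels out, *Mnew* is the projection of *v* onto the space spanned by the columns of *Tk*, which is why the original factor distribution is not affected.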

### Value

`LSAspace`: a list with components holding the truncated matrices *Tk*, *Sk* and *Dk* that make up the latent semantic space.

### Author(s)

Fridolin Wild fridolin.wild@wu-wien.ac.at

### References

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990) *Indexing by Latent Semantic Analysis*. In: Journal of the American Society for Information Science 41(6), pp. 391–407.

Landauer, T., Foltz, P., and Laham, D. (1998) *Introduction to Latent Semantic Analysis*. In: Discourse Processes 25, pp. 259–284.

### See Also

`as.textmatrix`, `fold_in`, `textmatrix`, `gw_idf`, `dimcalc_share`

### Examples

```
library(lsa)

# create some files, one document per file
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("ham", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "pet", "pet"), file=paste(td, "D3", sep="/") )

# LSA: build the document-term matrix, weight it, and compute the space
data(stopwords_en)
myMatrix = textmatrix(td, stopwords=stopwords_en)
myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)
myLSAspace = lsa(myMatrix, dims=dimcalc_share())

# convert the space back into a textmatrix
as.textmatrix(myLSAspace)

# clean up
unlink(td, recursive=TRUE)
```