Ex-post folding-in of textmatrices into an existing latent semantic space

Description

Additional documents can be mapped into a pre-existing latent semantic space without influencing the factor distribution of the space. Use this when the additional documents must not alter the latent semantic factor structure that has already been calculated.

Usage

fold_in( docvecs, LSAspace )

Arguments

LSAspace

a latent semantic space generated by lsa().

docvecs

a textmatrix.

Details

To keep additional documents from influencing the factor distribution calculated previously from a particular text basis, they can be folded-in after the singular value decomposition performed in lsa().

Background information: for folding-in, a pseudo document vector mi is calculated for each new document as shown in equations (1) and (2) (cf. Berry et al., 1995):

(1) di = t(v) Tk Sk^(-1)

(2) mi = Tk Sk t(di)

The document vector v, which appears transposed as t(v) in equation (1), is identical to an additional column of an input textmatrix M holding the term frequencies of the document to be folded in. Tk and Sk are the truncated matrices from the SVD that lsa() applies to a given text collection to construct the latent semantic space. The resulting vector mi from equation (2) is identical to an additional column in the textmatrix representation of the latent semantic space (as produced by as.textmatrix()). Be careful when using weighting schemes: you will usually want to apply the global weights of the training textmatrix to the new data that you fold in as well.
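Equations (1) and (2) can be traced by hand with base R alone. The following is a minimal sketch, not the package's implementation: it assumes a small term-by-document matrix M, uses svd() from base R in place of lsa(), and fixes the number of dimensions k at 2 purely for illustration. The names M, k, v, di and mi mirror the symbols above; everything else is chosen for the example.

```r
# Toy term-by-document matrix M (rows: terms, columns: training documents).
terms <- c("dog", "cat", "mouse", "hamster", "sushi", "monster")
M <- matrix(c(1, 1, 1, 0, 0, 0,   # D1: dog cat mouse
              0, 0, 1, 1, 1, 0,   # D2: hamster mouse sushi
              1, 0, 0, 0, 0, 2),  # D3: dog monster monster
            nrow = 6, dimnames = list(terms, c("D1", "D2", "D3")))

# Truncated SVD; k = 2 is an arbitrary choice for this sketch.
k  <- 2
s  <- svd(M)
Tk <- s$u[, 1:k, drop = FALSE]   # truncated term matrix
Sk <- diag(s$d[1:k], nrow = k)   # truncated singular values

# Term frequencies of a new document ("cat mouse mouse").
v <- c(dog = 0, cat = 1, mouse = 2, hamster = 0, sushi = 0, monster = 0)

di <- t(v) %*% Tk %*% solve(Sk)  # equation (1): 1 x k document vector
mi <- Tk %*% Sk %*% t(di)        # equation (2): pseudo document, one value per term
rownames(mi) <- terms

# Sanity check: folding in a training document reproduces its column
# in the rank-k reconstruction of M.
Mk <- Tk %*% Sk %*% t(s$v[, 1:k, drop = FALSE])
d1 <- Tk %*% Sk %*% t(t(M[, "D1"]) %*% Tk %*% solve(Sk))
stopifnot(isTRUE(all.equal(as.vector(d1), as.vector(Mk[, 1]))))
```

Substituting (1) into (2) gives mi = Tk t(Tk) v, so the singular values cancel. Folding-in therefore amounts to projecting the new document onto the term space spanned by Tk, which is why Tk and Sk themselves remain unchanged.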

Value

textmatrix

a textmatrix representation of the additional documents in the latent semantic space.

Author(s)

Fridolin Wild f.wild@open.ac.uk

See Also

textmatrix, lsa, as.textmatrix

Examples

library(lsa)

# create a first textmatrix with some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/") )
matrix1 = textmatrix(td, minWordLength=1)
unlink(td, recursive=TRUE)

# create a second textmatrix with some more files
td = tempfile()
dir.create(td)
write( c("cat", "mouse", "mouse"), file=paste(td, "A1", sep="/") )
write( c("nothing", "mouse", "monster"), file=paste(td, "A2", sep="/") )
write( c("cat", "monster", "monster"), file=paste(td, "A3", sep="/") )
matrix2 = textmatrix(td, vocabulary=rownames(matrix1), minWordLength=1)
unlink(td, recursive=TRUE)

# create an LSA space from matrix1
space1 = lsa(matrix1, dims=dimcalc_share())
as.textmatrix(space1)

# fold matrix2 into the space generated by matrix1
fold_in( matrix2, space1)
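The weighting caveat from the Details section can be illustrated in base R. This is a hedged sketch under stated assumptions: the matrices are typed in by hand to mirror the documents above, idf_weights() is a hypothetical helper written for this example (a stand-in for a global weighting function, not part of the package), and the folding-in step uses svd() with k = 2 rather than lsa() and fold_in().

```r
terms <- c("dog", "cat", "mouse", "hamster", "sushi", "monster")

# Raw term frequencies of the training documents (D1..D3).
train_tf <- matrix(c(1, 1, 1, 0, 0, 0,
                     0, 0, 1, 1, 1, 0,
                     1, 0, 0, 0, 0, 2),
                   nrow = 6, dimnames = list(terms, c("D1", "D2", "D3")))

# Raw term frequencies of the new documents (A1..A3), same vocabulary.
new_tf <- matrix(c(0, 1, 2, 0, 0, 0,
                   0, 0, 1, 0, 0, 1,
                   0, 1, 0, 0, 0, 2),
                 nrow = 6, dimnames = list(terms, c("A1", "A2", "A3")))

# Hypothetical helper: inverse document frequency, one weight per term.
idf_weights <- function(m) log2(ncol(m) / rowSums(m > 0)) + 1

# Global weights are estimated once, from the training data only.
gw <- idf_weights(train_tf)

# Log-scaled local weights times the training global weights.
# The vector gw is recycled down the rows, so row i is scaled by gw[i].
train_w <- log2(train_tf + 1) * gw
new_w   <- log2(new_tf + 1) * gw

# Build the space from the weighted training matrix, then fold in new_w
# following equations (1) and (2).
k  <- 2
s  <- svd(train_w)
Tk <- s$u[, 1:k, drop = FALSE]
Sk <- diag(s$d[1:k], nrow = k)
di <- t(new_w) %*% Tk %*% solve(Sk)
folded <- Tk %*% Sk %*% t(di)
dimnames(folded) <- dimnames(new_w)
```

Estimating the weights from new_tf instead would put the new documents on a different scale from the space they are folded into. In this toy case it would also produce infinite weights for terms such as "hamster" that occur in none of the new documents.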