multidocs: Comparison of sentence sets

multidocs    R Documentation

Comparison of sentence sets

Description

Computes cosine values between sets of sentences and/or documents.

Usage

multidocs(x, y = x, chars = 10, tvectors = tvectors, remove.punctuation = TRUE,
          stopwords = NULL, method = "Add")

Arguments

x

a character vector containing different sentences/documents

y

a character vector containing different sentences/documents (y = x by default)

chars

an integer specifying how many characters (counted from the start) of each sentence/document are printed in the row.names and col.names of the output matrix

tvectors

the semantic space in which the computation is to be done (a numeric matrix where every row is a word vector)

remove.punctuation

logical; if TRUE, punctuation is removed from x and y before the computation (TRUE by default)

stopwords

a character vector defining a list of words that are not used to compute the document/sentence vector for x and y

method

the compositional model to compute the document vector from its word vectors. The default option method = "Add" computes the document vector as the vector sum. With method = "Multiply", the document vector is computed via element-wise multiplication (see compose).

Details

In the traditional LSA approach, the vector D for a document (or a sentence) consisting of the words (t1, ..., tn) is computed as

D = \sum\limits_{i=1}^n t_i

This is the default method (method = "Add") for this function. Alternatively, this function provides the option of computing the document vector from its word vectors by element-wise multiplication (see Mitchell & Lapata, 2010, and compose).
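
The two composition methods can be illustrated with a minimal base-R sketch. It does not call LSAfun: toy_space is a made-up three-word semantic space and compose_doc() is a hypothetical helper written only for this illustration, so it may differ in detail from what multidocs and compose do internally.

```r
# Toy semantic space (hypothetical values): one row per word, one column per dimension
toy_space <- matrix(
  c(1, 2, 0,
    0, 1, 3,
    2, 2, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("alice", "queen", "hatter"), NULL)
)

# compose_doc() is a hypothetical helper, not part of LSAfun
compose_doc <- function(doc, space, method = c("Add", "Multiply")) {
  method <- match.arg(method)
  words <- strsplit(tolower(doc), "[[:space:]]+")[[1]]
  words <- words[words %in% rownames(space)]   # keep only words found in the space
  vecs <- space[words, , drop = FALSE]         # one row per remaining word
  if (method == "Add") {
    colSums(vecs)                              # vector sum of the word vectors
  } else {
    apply(vecs, 2, prod)                       # element-wise product of the word vectors
  }
}

d_add <- compose_doc("alice queen hatter", toy_space, "Add")
d_mul <- compose_doc("alice queen hatter", toy_space, "Multiply")
```

With these toy values, the additive vector is (3, 5, 4) and the multiplicative vector is (0, 4, 0). A single zero in any word vector zeroes that dimension of the product, which is one reason the two methods can give quite different cosines.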

This function computes the cosines between two sets of documents (or sentences).
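
For reference, the cosine between two document vectors a and b is the standard measure below; the symbols a, b and the dimension index k are introduced here only for illustration:

\cos(a, b) = \frac{\sum\limits_{k} a_k b_k}{\sqrt{\sum\limits_{k} a_k^2} \, \sqrt{\sum\limits_{k} b_k^2}}

Each entry of the resulting matrix is then the cosine between the document vector of one element of x (row) and one element of y (column).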

The format of x (or y) should be of the kind x <- c("this is the first text","here is another text") (or y <- c("this is a third text","and here is yet another text"))
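
As a rough sketch of how such input could be turned into a cosine matrix, the following base-R code uses a made-up two-dimensional space and additive composition. toy_space and doc_vector() are hypothetical stand-ins for illustration and are not taken from the LSAfun source; real use would pass an actual semantic space as tvectors.

```r
# Hypothetical toy space; words not listed here are treated as "not found"
toy_space <- matrix(
  c(1, 0,
    0, 1,
    1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("first", "another", "text"), NULL)
)

# doc_vector(): additive document vector from the words found in the space
# (hypothetical helper, assumed for this sketch)
doc_vector <- function(doc, space) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  colSums(space[words[words %in% rownames(space)], , drop = FALSE])
}

x <- c("this is the first text", "here is another text")
y <- c("this is a third text", "and here is yet another text")

# Stack the document vectors: one row per document
X <- t(vapply(x, doc_vector, numeric(ncol(toy_space)), space = toy_space))
Y <- t(vapply(y, doc_vector, numeric(ncol(toy_space)), space = toy_space))

# Dot products divided by the products of the vector lengths
cosmat <- (X %*% t(Y)) / (sqrt(rowSums(X^2)) %o% sqrt(rowSums(Y^2)))
```

Rows of cosmat correspond to the elements of x and columns to the elements of y, matching the layout described under Value.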

A note will be displayed whenever not all words of one input string are found in the semantic space. Caution: In that case, the function will still produce a result, by omitting the words not found in the semantic space. Depending on the specific requirements of a task, this may compromise the results. Please check your input when you receive this message.

A warning message will be displayed whenever no word of one input string is found in the semantic space.
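
Before calling the function, it can be useful to check which words of an input string are covered by the semantic space. The sketch below mimics the note/warning behaviour described above in base R; check_coverage() and the message texts are hypothetical and do not reproduce the package's actual wording, and vocab stands in for rownames(tvectors).

```r
# check_coverage(): hypothetical helper, not part of LSAfun
check_coverage <- function(doc, vocab) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  found <- words %in% vocab
  if (!any(found)) {
    # no word is covered: signal a warning
    warning("no word of '", doc, "' was found in the semantic space")
  } else if (!all(found)) {
    # partial coverage: only a note, the remaining words are still used
    message("not all words of '", doc, "' were found: ",
            paste(words[!found], collapse = ", "))
  }
  words[found]   # words that would enter the document vector
}

vocab <- c("alice", "queen", "hatter")   # stand-in for rownames(tvectors)
kept <- check_coverage("alice met the hatter", vocab)
```

Here kept contains only "alice" and "hatter", so a document vector built from this string would silently ignore "met" and "the".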

Value

A list of three elements:

cosmat

A numeric matrix giving the cosines between the input sentences/documents

xdocs

A legend for the row.names of cosmat

ydocs

A legend for the col.names of cosmat

Author(s)

Fritz Guenther

References

Landauer, T.K., & Dumais, S.T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104, 211-240.

Dennis, S. (2007). How to use the LSA Web Site. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 35-56). Mahwah, NJ: Erlbaum.

Mitchell, J., & Lapata, M. (2010). Composition in Distributional Models of Semantics. Cognitive Science, 34, 1388-1429.

http://wordvec.colorado.edu/

See Also

cosine, Cosine, multicos, costring

Examples

library(LSAfun)
data(wonderland)
multidocs(x = c("alice was beginning to get very tired.",
                "the red queen greeted alice."),
          y = c("the mad hatter and the march hare are having a party.",
                "the hatter sliced the cup of tea in half."),
          tvectors = wonderland)

LSAfun documentation built on Nov. 18, 2023, 1:10 a.m.