multicostring: Sentence x Vector Comparison

View source: R/multicostring.r

multicostring    R Documentation

Sentence x Vector Comparison

Description

Computes cosines between a sentence/document and multiple words.

Usage

multicostring(x, y, tvectors = tvectors, split = " ", remove.punctuation = TRUE,
              stopwords = NULL, method = "Add")

Arguments

x

a character vector specifying a sentence/document (or also a single word)

y

a character vector specifying multiple single words

tvectors

the semantic space in which the computation is to be done (a numeric matrix where every row is a word vector)

split

a character vector defining the character used to split the documents into words (white space by default)

remove.punctuation

removes punctuation from x and y; TRUE by default

stopwords

a character vector defining a list of words that are not used to compute the document/sentence vector for x

method

the compositional model to compute the document vector from its word vectors. The default option method = "Add" computes the document vector as the vector sum. With method = "Multiply", the document vector is computed via element-wise multiplication (see compose).

Details

The format of x (or y) can be of the kind x <- "word1 word2 word3", but also of the kind x <- c("word1", "word2", "word3"). This allows for simple copy-and-paste of text, but also for using character vectors, e.g. the output of neighbors.
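The equivalence of the two input formats can be illustrated with a small base R sketch. The helper tokenize() is hypothetical: it only mirrors the documented arguments split, remove.punctuation and stopwords, and is not the package's actual preprocessing code, so details may differ.

```r
# Hypothetical helper (an assumption for illustration, not LSAfun code):
# turn a string or a character vector into a vector of single words.
tokenize <- function(x, split = " ", remove.punctuation = TRUE,
                     stopwords = NULL) {
  if (remove.punctuation) x <- gsub("[[:punct:]]", "", x)
  words <- unlist(strsplit(x, split = split, fixed = TRUE))
  words <- words[words != ""]          # drop empty tokens
  words[!(words %in% stopwords)]       # drop stopwords, if any were given
}

# One pasted sentence and a ready-made character vector give the same words
a <- tokenize("alice was beginning to get very tired.")
b <- tokenize(c("alice", "was", "beginning", "to", "get", "very", "tired"))
identical(a, b)

# Stopwords are left out before any document vector would be computed
tokenize("alice was tired", stopwords = c("was"))
```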

Both x and y can also consist of a single word. In the traditional LSA approach, the vector D for the document (or sentence) x consisting of the words (t1, ..., tn) is computed as

D = \sum\limits_{i=1}^n t_i

This is the default method (method = "Add") for this function. Alternatively, this function provides the possibility of computing the document vector from its word vectors using element-wise multiplication (see Mitchell & Lapata, 2010, and compose). See also costring.
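The two compositional models, and the comparison against several single words, can be sketched in base R as follows. The matrix toy_space and the helpers cosine_sim() and compose_doc() are assumptions made up for this illustration; they are not part of LSAfun and only follow the description above.

```r
# Toy semantic space: every row is a word vector (values invented)
toy_space <- matrix(
  c(1, 0, 2,
    0, 1, 1,
    2, 1, 0,
    1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("alice", "rabbit", "cat", "queen"), NULL)
)

# Cosine between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical helper: build a document vector from the rows named in `words`
compose_doc <- function(words, space, method = "Add") {
  vecs <- space[words, , drop = FALSE]
  if (method == "Add") colSums(vecs) else apply(vecs, 2, prod)
}

doc_add <- compose_doc(c("alice", "rabbit"), toy_space, "Add")       # 1 1 3
doc_mul <- compose_doc(c("alice", "rabbit"), toy_space, "Multiply")  # 0 0 2

# One cosine per comparison word, using the additive document vector
targets <- c("cat", "queen")
cos_add <- sapply(targets, function(w) cosine_sim(doc_add, toy_space[w, ]))
cos_add
```

In this sketch the result has one entry per word in targets, which matches the idea of comparing one document vector with multiple word vectors.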

A note will be displayed whenever not all words of one input string are found in the semantic space. Caution: In that case, the function will still produce a result, by omitting the words not found in the semantic space. Depending on the specific requirements of a task, this may compromise the results. Please check your input when you receive this message.

A warning message will be displayed whenever no word of one input string is found in the semantic space.
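Before calling the function, it can help to check which input words are actually covered by the semantic space. The following base R sketch assumes a toy matrix toy_space with words as row names; the message and warning texts are invented here and are not the ones LSAfun prints.

```r
# Toy semantic space with two known words (values invented)
toy_space <- matrix(
  c(1, 0, 2,
    0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("alice", "rabbit"), NULL)
)

words <- c("alice", "jabberwock", "rabbit")

# Which input words have a row in the semantic space?
found <- words %in% rownames(toy_space)
missing_words <- words[!found]

if (!any(found)) {
  warning("no input word was found in the semantic space")
} else if (!all(found)) {
  message("words not found and omitted: ",
          paste(missing_words, collapse = ", "))
}

# The document vector is then built from the found words only
doc_vec <- colSums(toy_space[words[found], , drop = FALSE])
doc_vec
```

Here "jabberwock" is dropped, so the document vector reflects only "alice" and "rabbit". Whether that is acceptable depends on the task.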

Value

Numeric values giving the cosine between the sentence/document x and each of the words in y

Author(s)

Fritz Guenther

References

Landauer, T.K., & Dumais, S.T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104, 211-240.

Dennis, S. (2007). How to use the LSA Web Site. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 35-56). Mahwah, NJ: Erlbaum.

Mitchell, J., & Lapata, M. (2010). Composition in Distributional Models of Semantics. Cognitive Science, 34, 1388-1429.

http://wordvec.colorado.edu/

See Also

cosine, Cosine, multicos, costring

Examples

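library(LSAfun)

# Load the example semantic space shipped with LSAfun
data(wonderland)

# Cosines between a sentence and each word of a second string
multicostring("alice was beginning to get very tired.",
              "a white rabbit with a clock ran close to her.",
              tvectors = wonderland)

# Cosines between a sentence and the 20 nearest neighbors of "cheshire"
multicostring("suddenly, a cat appeared in the woods",
              names(neighbors("cheshire", n = 20, tvectors = wonderland)),
              tvectors = wonderland)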

LSAfun documentation built on Nov. 18, 2023, 1:10 a.m.