cs.matrix: Cosine Similarity Matrix


View source: R/cs.matrix.R

Description

Creates the design matrix of cosine similarities from textual observations and a vector of words.

Usage

cs.matrix(
  x,
  words,
  word_embeddings,
  method = "max",
  parallel = FALSE,
  n.cluster = NULL,
  sparse = FALSE
)

Arguments

x

a tibble with two columns, line and word. The 'line' column holds the number of the observation in which the word in the 'word' column appears. See 'Examples'.

words

a character vector of words that will represent the columns of the resulting matrix.

word_embeddings

named list of word embeddings. See formatWordEmbeddings.

method

aggregation function to apply across each column. Options include "max", "sum", and "mean".

parallel

logical, indicating if the matrix should be calculated in parallel.

n.cluster

integer, the number of clusters to use if parallel=TRUE.

sparse

logical, indicating if a sparse matrix should be returned.

Details

cs.matrix creates a design matrix of cosine similarities from textual observations and a vector of words. The resulting matrix has dimension length(unique(x$line)) \times length(words): one row per observation and one column per element of words.

Consider two words with word embedding representations a and b. The cosine similarity is defined as

sim_cos(a, b) = (a \cdot b) / (||a||_2 \cdot ||b||_2).
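The cosine similarity can be sketched in base R as follows. This is only an illustration of the formula; the helper name cosine_similarity and the vectors a and b are hypothetical and are not part of the package.

```r
# Cosine similarity between two numeric vectors of equal length:
# the dot product divided by the product of the Euclidean (L2) norms.
# `cosine_similarity` is a hypothetical helper, not a package function.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

a <- c(1, 0, 1)
b <- c(1, 1, 0)

cosine_similarity(a, b)  # 1 / (sqrt(2) * sqrt(2)) = 0.5
cosine_similarity(a, a)  # a vector compared with itself gives 1
```

If both vectors are already scaled to unit length, the denominator equals 1 and the similarity reduces to the dot product.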

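If method = "max", then for a given line containing m words with embeddings b_1, ..., b_m, the entry in the column for the j-th element of words, with embedding a_j, is max_{i=1,...,m}(sim_cos(a_j, b_i)). method = "sum" and method = "mean" are defined analogously, replacing the maximum with the sum or the mean.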
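To make the aggregation concrete, here is a minimal base R sketch that computes one row of such a matrix for a single line. It is not the package's implementation: toy_embeddings, cosine_similarity, and line_score are hypothetical names, the two-dimensional embeddings are invented for illustration, and the sketch assumes every word has an embedding (how cs.matrix treats words without one is not covered here).

```r
# Hypothetical toy embeddings: a named list of numeric vectors.
toy_embeddings <- list(
  stats      = c(1, 0),
  statistics = c(0.9, 0.1),
  dog        = c(0, 1),
  cat        = c(0.1, 0.9)
)

# Hypothetical helper: cosine similarity of two numeric vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Aggregate the similarities between one target word (a_j) and every
# word in a line (b_1, ..., b_m) using max, sum, or mean.
line_score <- function(target, line_words, embeddings, method = "max") {
  sims <- vapply(
    line_words,
    function(w) cosine_similarity(embeddings[[target]], embeddings[[w]]),
    numeric(1)
  )
  switch(method,
         max  = max(sims),
         sum  = sum(sims),
         mean = mean(sims),
         stop("unknown method: ", method))
}

# One line containing m = 2 words, scored against two target words.
line_words <- c("statistics", "dog")
targets    <- c("stats", "cat")

# One row of the matrix: one aggregated similarity per target word.
row_max <- vapply(targets, line_score, numeric(1),
                  line_words = line_words, embeddings = toy_embeddings)
row_max
```

Repeating this for every distinct value of line and stacking the rows gives a matrix with one row per observation and one column per target word.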

Value

a (sparse) matrix of cosine similarities

References

Goldberg, Y. (2017) Neural Network Methods for Natural Language Processing. San Rafael, CA: Morgan & Claypool Publishers.

See Also

cs, formatWordEmbeddings

Examples

## Not run: 
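library(dplyr)     # tibble() and %>%
library(tidytext)  # unnest_tokens()

# Convert the example embedding matrix into a named list of word embeddings
word_embeddings <- formatWordEmbeddings(embedding_matrix_example, normalize = TRUE)

sentences <- data.frame("Description" = c("Statistics is great!",
                                          "My dog is fluffy.",
                                          "What is your favorite class?"),
                        stringsAsFactors = FALSE)

# Tokenize into one row per (line, word) pair
x <- tibble(line = seq_len(nrow(sentences)), text = sentences$Description) %>%
  unnest_tokens(word, text)

# One row per sentence, one column per element of `words`
cs.matrix(x, words = c("stats", "cat"), word_embeddings)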

## End(Not run)

scottmanski/TAGAM documentation built on Aug. 3, 2020, 10:50 a.m.