cs.matrix: Cosine Similarity Matrix
In scottmanski/TAGAM:


View source: R/cs.matrix.R

Creates the design matrix of cosine similarities from textual observations and a vector of words.

cs.matrix(
  x,
  words,
  word_embeddings,
  method = "max",
  parallel = FALSE,
  n.cluster = NULL,
  sparse = FALSE
)

`x`	a tibble containing two columns: `line` and `word`. The `line` column gives the number of the observation in which the word in the `word` column appears. See 'Examples'.
`words`	a character vector of words that will represent the columns of the resulting matrix.
`word_embeddings`	named list of word embeddings. See `formatWordEmbeddings`.
`method`	function to apply across each column. Options include `c("max", "sum", "mean")`.
`parallel`	logical, indicating if the matrix should be calculated in parallel.
`n.cluster`	integer, the number of clusters to use if `parallel=TRUE`.
`sparse`	logical, indicating if a sparse matrix should be returned.

A function to create a design matrix of cosine similarities from textual observations and a vector of words. The resulting matrix has dimension length(unique(x$line)) \times length(words): one row per observation and one column per entry of `words`.

Consider two words with word embedding representations a and b. Then the cosine similarity is defined as

sim_cos(a, b) = (a \cdot b) / (||a||_2 \cdot ||b||_2).
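The definition above can be checked by hand on small vectors. The following is a minimal sketch; `cosine_similarity` is a hypothetical helper written for illustration and is not a function exported by the package.

```r
# Hypothetical helper (not part of TAGAM): cosine similarity of two numeric vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

a <- c(1, 0)
b <- c(1, 1)

cosine_similarity(a, b)        # 1 / sqrt(2), about 0.707
cosine_similarity(a, a)        # 1: same direction
cosine_similarity(a, c(0, 3))  # 0: orthogonal vectors
```

Because the norms appear in the denominator, rescaling either vector leaves the result unchanged.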

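If method = "max", then for a given line with m words whose embeddings are b_1, ..., b_m, the entry in column j of that line's row is max_{i=1,...,m} sim_cos(a_j, b_i), where a_j is the embedding of words[j]. method = "sum" and method = "mean" are defined analogously, with the maximum replaced by the sum or the mean over i.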
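To make the per-line aggregation concrete, here is a rough base-R sketch of the computation described above. It is not the package's implementation: the embedding values are made up, each embedding is assumed to be a plain numeric vector in a named list, and `unit` and `aggregate_by_line` are hypothetical helpers introduced only for this illustration.

```r
# Toy embeddings with made-up values, stored as a named list of numeric vectors.
toy_embeddings <- list(
  stats      = c(1, 0),
  cat        = c(0, 1),
  statistics = c(0.9, 0.1),
  dog        = c(0.2, 0.8),
  fluffy     = c(0.1, 0.9)
)

# Tokenised observations in the same shape as `x`: one row per (line, word) pair.
x <- data.frame(
  line = c(1, 2, 2),
  word = c("statistics", "dog", "fluffy"),
  stringsAsFactors = FALSE
)
words <- c("stats", "cat")

# Scale each embedding to unit length so that a dot product is a cosine similarity.
unit <- function(v) v / sqrt(sum(v^2))
A <- t(sapply(toy_embeddings[words], unit))   # length(words) x d
B <- t(sapply(toy_embeddings[x$word], unit))  # nrow(x) x d

# Entry (i, j) is the cosine similarity between token i and words[j].
sims <- B %*% t(A)

# Hypothetical helper: aggregate token-level similarities within each line.
aggregate_by_line <- function(sims, line, fun) {
  lines <- sort(unique(line))
  out <- matrix(NA_real_, nrow = length(lines), ncol = ncol(sims),
                dimnames = list(as.character(lines), colnames(sims)))
  for (k in seq_along(lines)) {
    out[k, ] <- apply(sims[line == lines[k], , drop = FALSE], 2, fun)
  }
  out
}

res_max  <- aggregate_by_line(sims, x$line, max)   # analogue of method = "max"
res_mean <- aggregate_by_line(sims, x$line, mean)  # analogue of method = "mean"
dim(res_max)  # length(unique(x$line)) rows, length(words) columns
```

With "max", a line scores highly on a column as soon as any one of its words is close to that column's word; "mean" dilutes that signal across all words in the line.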

A matrix of cosine similarities, with one row per line and one column per word; returned as a sparse matrix if `sparse = TRUE`.

Goldberg, Y. (2017) Neural Network Methods for Natural Language Processing. San Rafael, CA: Morgan & Claypool Publishers.

cs, formatWordEmbeddings

## Not run: 
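library(dplyr)     # provides tibble() and %>%
library(tidytext)  # provides unnest_tokens()

# Convert the example embedding matrix into the named-list format used by cs.matrix()
word_embeddings <- formatWordEmbeddings(embedding_matrix_example, normalize = TRUE)

sentences <- data.frame("Description" = c("Statistics is great!",
                                          "My dog is fluffy.",
                                          "What is your favorite class?"),
                        stringsAsFactors = FALSE)

# Tokenise into one row per (line, word) pair
x <- tibble(line = seq_len(nrow(sentences)), text = sentences$Description) %>%
  unnest_tokens(word, text)

# One row per line in x, one column per entry of words
cs.matrix(x, words = c("stats", "cat"), word_embeddings)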

## End(Not run)