View source: R/utils-textnets.R
doc_similarity | R Documentation |
Given a document-term matrix (DTM), this function returns the similarities between documents using a specified method (see Details). The result is a square document-by-document similarity matrix (DSM), equivalent to a weighted adjacency matrix in network analysis.
doc_similarity(x, y = NULL, method, wv = NULL)
x |
Document-term matrix with terms as columns. |
y |
Optional second matrix (default = NULL). |
method |
Character vector indicating the similarity method: projection, cosine, jaccard, wmd, or centroid (see Details). |
wv |
Matrix of word embedding vectors (a.k.a. an embedding model) with rows as words. Required for "wmd" and "centroid" similarities. |
Document similarity methods include:
projection: finds the one-mode projection matrix from the two-mode DTM using tcrossprod(), which measures the shared vocabulary overlap
cosine: compares row vectors using cosine similarity
jaccard: compares proportion of common words to unique words in both documents
wmd: compares documents using word mover's distance, computed with the linear-complexity relaxed variant (requires word embedding vectors)
centroid: represents each document as the centroid of its vocabulary, then uses cosine similarity to compare centroid vectors (requires word embedding vectors)
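For intuition, the three count-based methods above can be sketched in base R on a small dense matrix. This is an illustrative sketch, not the package's implementation: `toy_dtm` and its counts are made up, and the Jaccard version assumes a binary presence/absence reading of the description, which may differ from how the function treats term counts.

```r
# Hypothetical toy DTM: 3 documents (rows) by 4 terms (columns); counts are made up
toy_dtm <- matrix(
  c(2, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("alpha", "beta", "gamma", "delta"))
)

# projection: one-mode projection; entry [i, j] sums the products of
# the counts of terms that documents i and j share
prj <- tcrossprod(toy_dtm)

# cosine: the same dot products, divided by the product of the row norms
norms <- sqrt(rowSums(toy_dtm^2))
cos_sim <- tcrossprod(toy_dtm) / outer(norms, norms)

# jaccard (assumed binary version): number of shared terms divided by
# the number of distinct terms across both documents
bin <- (toy_dtm > 0) * 1
shared <- tcrossprod(bin)
n_terms <- rowSums(bin)
jac <- shared / (outer(n_terms, n_terms, "+") - shared)
```

Each result is a symmetric 3-by-3 matrix with documents on both margins, which is the shape of the DSM described above.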
Dustin Stoltz
# load example word embeddings
data(ft_wv_sample)
# load example text
data(jfk_speech)
# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)
dsm_prj <- doc_similarity(dtm, method = "projection")
dsm_cos <- doc_similarity(dtm, method = "cosine")
dsm_wmd <- doc_similarity(dtm, method = "wmd", wv = ft_wv_sample)
dsm_cen <- doc_similarity(dtm, method = "centroid", wv = ft_wv_sample)
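To illustrate what the centroid method computes, here is a rough base R sketch on toy data. It is not the package's code: `toy_wv` and `toy_dtm` are hypothetical objects with made-up values, and the count-weighted average is an assumption about how a document centroid is formed.

```r
# Hypothetical toy embeddings: 4 words (rows) in 2 dimensions; values are made up
toy_wv <- matrix(
  c( 1, 0,
     1, 1,
     0, 1,
    -1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("alpha", "beta", "gamma", "delta"), NULL)
)

# Hypothetical toy DTM with terms as columns
toy_dtm <- matrix(
  c(1, 1, 0, 0,
    0, 2, 0, 0,
    0, 0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), rownames(toy_wv))
)

# Align vocabulary: keep only terms present in both objects, in the same order
vocab <- intersect(colnames(toy_dtm), rownames(toy_wv))
toy_dtm <- toy_dtm[, vocab, drop = FALSE]
toy_wv <- toy_wv[vocab, , drop = FALSE]

# Centroid (assumed): count-weighted average of each document's word vectors
# (documents with no matching terms would need separate handling)
centroids <- (toy_dtm %*% toy_wv) / rowSums(toy_dtm)

# Cosine similarity between the centroid vectors
norms <- sqrt(rowSums(centroids^2))
cen_sim <- tcrossprod(centroids) / outer(norms, norms)
```

Because embedding values can be negative, centroid similarities may fall below zero, unlike cosine similarities computed on raw term counts.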