doc_similarity: Find similarities between documents

View source: R/utils-textnets.R

doc_similarity    R Documentation

Find similarities between documents

Description

Given a document-term matrix (DTM), this function returns the similarities between documents using a specified method (see Details). The result is a square document-by-document similarity matrix (DSM), equivalent to a weighted adjacency matrix in network analysis.

Usage

doc_similarity(x, y = NULL, method, wv = NULL)

Arguments

x

Document-term matrix with terms as columns.

y

Optional second matrix (default = NULL).

method

Character vector indicating the similarity method, including projection, cosine, jaccard, wmd, and centroid (see Details).

wv

Matrix of word embedding vectors (a.k.a. embedding model) with rows as words. Required for "wmd" and "centroid" similarities.

Details

Document similarity methods include:

  • projection: finds the one-mode projection matrix from the two-mode DTM using tcrossprod(), which measures the shared vocabulary overlap

  • cosine: compares row vectors using cosine similarity

  • jaccard: compares the proportion of words common to both documents to the unique words across both documents

  • wmd: uses word mover's distance to compare documents (requires word embedding vectors)

  • centroid: represents each document as the centroid of its vocabulary, then uses cosine similarity to compare centroid vectors (requires word embedding vectors)
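
The methods above can be illustrated with a minimal base R sketch. This is not the package's implementation. The toy matrices dtm and wv and the object names (proj, cos_sim, jac, cen_sim) are made up for illustration, Jaccard is computed on term presence/absence, and the centroid is assumed to be the count-weighted mean of a document's word vectors, which may differ from how doc_similarity() weights terms. The wmd method is omitted because it needs an optimal-transport solver.

```r
# Toy document-term matrix: 3 documents (rows) by 4 terms (columns).
dtm <- matrix(
  c(2, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("cat", "dog", "fish", "bird"))
)

# projection: one-mode (document-by-document) projection of the DTM.
proj <- tcrossprod(dtm)

# cosine: row dot products divided by the product of the row norms.
norms <- sqrt(rowSums(dtm^2))
cos_sim <- tcrossprod(dtm) / outer(norms, norms)

# jaccard: shared terms divided by distinct terms across both documents,
# computed on presence/absence rather than counts.
bin <- (dtm > 0) * 1
shared <- tcrossprod(bin)
n_terms <- rowSums(bin)
jac <- shared / (outer(n_terms, n_terms, "+") - shared)

# centroid: average each document's word vectors, then compare with cosine.
# `wv` is a hypothetical 2-dimensional embedding matrix with rows as words.
wv <- matrix(
  c( 1, 0,
     1, 1,
     0, 1,
    -1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("cat", "dog", "fish", "bird"), NULL)
)
# Count-weighted mean vector per document (an assumption of this sketch).
cen <- (dtm %*% wv[colnames(dtm), ]) / rowSums(dtm)
cen_norms <- sqrt(rowSums(cen^2))
cen_sim <- tcrossprod(cen) / outer(cen_norms, cen_norms)
```

Each result is a 3-by-3 matrix with documents on both dimensions, matching the DSM layout given in the Description.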

Author(s)

Dustin Stoltz


text2map documentation built on July 9, 2023, 6:35 p.m.