most_similar: Measure similarities over matrices

View source: R/most_similar.R

most_similar    R Documentation

Measure similarities over matrices

Description

Measures similarities over term-document and word-context matrices.

Usage

most_similar(mat, vec, method = "cosine", margin = 1, fullResults = F)

Arguments

mat

A term-document or word-context matrix.

vec

The vector or term to evaluate. A single term (typically a keyword or document name) must be entered in quotation marks. A vector you have already defined is passed by name, without quotes.

method

A character string naming the similarity measure to compute: 'cosine', 'euclidean', 'pearson' or 'covariance'. Default is 'cosine'. The most common alternative is 'euclidean'.

margin

Numeric value: 1 or 2. If 1, calculations are performed over the rows. If 2, they are performed over the columns. Default is 1.

fullResults

Logical value. If TRUE, all results are returned; if FALSE, only the 12 most similar terms are shown. Default is FALSE.

Value

If fullResults is TRUE, all results are returned in a full-length vector. If FALSE, only the 12 most similar terms are displayed. (In general, request the full results when you plan to use the vector for further evaluation. Display partial results when you only want to glance over the top hits.)
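The difference between the two return shapes can be sketched in base R. This is only an illustration under stated assumptions: the scores are invented, and the idea that the short result is the 12 highest scores sorted in decreasing order is an assumption, not something this page confirms about the function's internals.

```r
set.seed(1)

# Invented similarity scores for 20 hypothetical terms.
scores <- setNames(runif(20), paste0("term", 1:20))

# Assumed shape when fullResults = FALSE: the 12 highest scores.
top_hits <- head(sort(scores, decreasing = TRUE), 12)

# Assumed shape when fullResults = TRUE: every score, one per term.
full <- scores

length(top_hits)
length(full)
```

The full-length vector is the one to keep if you intend to plot, rank, or combine the scores later.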

What it does

Each of these similarity measurements captures slightly different relationships and will generate somewhat different output. In general, cosine similarity and Pearson correlation (which are very similar functions) are best for estimating synonyms. Covariance and Euclidean distance tend to find a wider variety of relationships. Keep in mind that the relationship between semantic similarity measures and qualitative assumptions about word meaning remains underdetermined in the research. Whether and how similarity scores can map contours of meaning in a document collection should be considered an open question. empson was designed to help humanists think this problem through. The statistical tests included here were chosen for their simplicity.
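To make the differences between the four measures concrete, here is a minimal base R sketch that scores one vector against every row of a toy matrix. The matrix values and the helper row_scores() are hypothetical and written only for this illustration; they are not the package's implementation.

```r
# Toy word-context matrix: rows are keywords, columns are context words.
# All values are invented for illustration.
toy <- matrix(
  c(2, 0, 1,
    4, 0, 2,
    0, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("mind", "soul", "body"), c("c1", "c2", "c3"))
)

# Hypothetical helper (not part of the package): scores the vector v
# against every row of mat using the named measure.
row_scores <- function(mat, v, method = "cosine") {
  apply(mat, 1, function(r) {
    switch(method,
      cosine     = sum(r * v) / (sqrt(sum(r^2)) * sqrt(sum(v^2))),
      pearson    = cor(r, v),
      covariance = cov(r, v),
      euclidean  = sqrt(sum((r - v)^2)),
      stop("unknown method: ", method)
    )
  })
}

v <- toy["mind", ]
cos_scores <- row_scores(toy, v, "cosine")
euc_scores <- row_scores(toy, v, "euclidean")

cos_scores
euc_scores
```

In this toy data the "soul" row is exactly twice the "mind" row, so cosine and Pearson treat the two as identical while Euclidean distance does not. Note also that Euclidean distance is a distance: smaller values mean closer vectors, whereas for the other three measures larger values mean greater similarity.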

When working with a term-document matrix, margin = 1 finds similar words and margin = 2 finds similar documents. When working with a word-context matrix, margin = 2 reads across the columns, and so is limited to those words for which empson built concordances.
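The effect of margin can be sketched with a small term-document matrix in base R. The matrix, its labels, and the cosine() helper below are invented for illustration and stand in for whatever the function does internally.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# All counts are invented for illustration.
tdm <- matrix(
  c(5, 4, 0,
    3, 3, 1,
    0, 1, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("love", "king", "sword"), c("docA", "docB", "docC"))
)

# Hypothetical helper: cosine similarity between two numeric vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Analogous to margin = 1: compare the "love" row against every row (terms).
term_scores <- apply(tdm, 1, cosine, b = tdm["love", ])

# Analogous to margin = 2: compare the "docA" column against every column (documents).
doc_scores <- apply(tdm, 2, cosine, b = tdm[, "docA"])

sort(term_scores, decreasing = TRUE)
sort(doc_scores, decreasing = TRUE)
```

The names on each result show which dimension was compared: term_scores is labeled by term, doc_scores by document.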

Examples

# For most similar words in a word-context matrix
data(eebo)
most_similar(mat = eebo, vec = "rights") 
most_similar(mat = eebo, vec = "rights", method = "euclidean")

# For most similar words of a composite vector
compvec <- eebo["mind", ] - eebo["soul", ]
most_similar(mat = eebo, vec = compvec)

# For most similar documents in a term-document matrix
data(shakespeare)
most_similar(mat = shakespeare, vec = "TN", margin = 2)

# For full results
most_similar(mat = eebo, vec = "rights", fullResults = TRUE)


michaelgavin/litmath documentation built on Oct. 20, 2023, 9:20 a.m.