similarity: Measure similarities over matrices


View source: R/similarity.R

Description

Measures similarities over term-document and word-context matrices.

Usage

similarity(mat, vec, method = "cosine", margin = 1, fullResults = FALSE,
  threshold = 0)

Arguments

mat

A term-document or word-context matrix.

vec

The vector or term you are evaluating. If you enter a single term (typically a keyword or document name), that term must be entered with quotation marks. If you have defined a new vector to analyze, enter the name of the vector without quotes.

method

A character string naming the similarity measure to be computed: 'cosine', 'euclidean', 'pearson', or 'covariance'. Default is 'cosine'. The most common alternative is 'euclidean'.

margin

Numeric value: 1 or 2. If 1, calculations are performed over the rows. If 2, over the columns.

fullResults

Logical value. If TRUE, the complete vector of similarity scores is returned; if FALSE, only the top results are displayed (see Value below). Default is FALSE.

threshold

Numeric value from 0 to 100, interpreted as a frequency percentile: words below that percentile are excluded from the displayed results (see 'What it does' below). Default is 0 (no filtering).

Value

If fullResults = TRUE, all results are returned in a full-length vector. If FALSE, only the 12 most similar terms with frequency above the threshold will be displayed. (In general, request the full results when you plan to use the vector for further evaluation; display partial results when you're just glancing over the top hits.)
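
Because the full-length output is a named vector of scores, it can be stored and manipulated directly. A minimal sketch of that workflow, assuming (as the Examples below do) the eebo matrix and a named numeric return value:

scores <- similarity(mat = eebo, vec = "rights", fullResults = TRUE)
# Sort the full vector to inspect more than the default 12 hits
head(sort(scores, decreasing = TRUE), 20)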

What it does

Each of these similarity measures captures slightly different relationships and will generate somewhat different output. In general, cosine similarity and Pearson correlation (which are closely related functions) are best for estimating synonyms. Covariance and Euclidean distance tend to surface a wider variety of relationships. Keep in mind that the relationship between semantic-similarity measures and qualitative assumptions about word meaning remains underdetermined in the research. Whether and how similarity scores can map the contours of meaning in a document collection should be treated as an open question. empson was designed to help humanists think this problem through. The statistical tests included here were chosen for their simplicity.
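
For readers who want to see the arithmetic, here is a minimal base-R sketch of the four measures applied to two toy vectors. This illustrates the underlying formulas only; it is not necessarily how empson implements them internally.

x <- c(1, 0, 2, 3)
y <- c(0, 1, 2, 2)

sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))  # cosine similarity
sqrt(sum((x - y)^2))                            # Euclidean distance (smaller = closer)
cor(x, y)                                       # Pearson correlation
cov(x, y)                                       # covariance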

When working with a term-document matrix, selecting margin = 1 will find similar words, and margin = 2 will find similar documents. When working with a word-context matrix, margin = 2 will read across the columns, and so will be limited to those words for which empson built concordances.
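
A toy illustration of the margin argument, assuming similarity() will accept any numeric matrix with row and column names:

# Rows are terms, columns are documents
tdm <- matrix(c(2, 0, 1,
                0, 3, 1,
                1, 1, 4),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("mind", "soul", "body"),
                              c("doc1", "doc2", "doc3")))

similarity(mat = tdm, vec = "mind", margin = 1)  # over rows: similar words
similarity(mat = tdm, vec = "doc1", margin = 2)  # over columns: similar documents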

The threshold parameter filters out low-frequency words for human reading, limiting the displayed results to words at or above the given frequency percentile. For example, if threshold = 50, only words with above-median frequency will be included in the displayed results. If set to 0, there is no threshold and all results are returned. Raising the threshold to 90 or 95 will limit results to only higher-frequency words. Often this is desirable if you're looking for human-readable output, though it's worth keeping in mind that the conceptual relations at play among high-frequency and low-frequency terms are underdetermined. Filtering out low-frequency words often 'improves' the outputs of topic models and similarity measurements, in that it restricts the output to words people use often enough to feel comfortable interpreting across contexts. Whether that comfort is trustworthy or misleading is not known.
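
To make the percentile arithmetic concrete, here is one way to reproduce the cutoff by hand in base R. The sketch assumes total row frequencies are what the threshold is measured against, which may differ from the package's internal bookkeeping:

freqs  <- rowSums(eebo)                  # total frequency of each word
cutoff <- quantile(freqs, probs = 0.95)  # the 95th-percentile frequency
keep   <- names(freqs)[freqs > cutoff]   # words above the cutoff

scores <- similarity(mat = eebo, vec = "rights", fullResults = TRUE)
head(sort(scores[keep], decreasing = TRUE), 12)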

Examples

# For most similar words in a word-context matrix
data(eebo)
similarity(mat = eebo, vec = "rights") 
similarity(mat = eebo, vec = "rights", threshold = 95)
similarity(mat = eebo, vec = "rights", method = "euclidean")

# For most similar words of a composite vector
compvec <- eebo["mind", ] - eebo["soul", ]
similarity(mat = eebo, vec = compvec)

# For most similar documents in a term-document matrix
data(shakespeare)
similarity(mat = shakespeare, vec = "TN", margin = 2)

# For full results
similarity(mat = eebo, vec = "rights", fullResults = TRUE)
