most_similar | R Documentation |
Measures similarities over term-document and word-context matrices.
most_similar(mat, vec, method = "cosine", margin = 1, fullResults = F)
mat |
A term-document or word-context matrix. |
vec |
The vector or term you are evaluating. If you enter a single term (typically a keyword or document name), that term must be entered with quotation marks. If you have defined a new vector to analyze, enter the name of the vector without quotes. |
method |
A character string: 'cosine', 'euclidean', 'pearson' or 'covariance', which names the mathematical similarity test to be performed. Default is 'cosine'. The most common other method is 'euclidean'. |
margin |
Numeric value: 1 or 2. If 1, calculations are performed over the rows. If 2, over the columns. |
fullResults |
Logical value. Default is false. |
If fullResults is true, all results are included in a full-length vector. If false, only the 12 most similar terms are displayed. In general, include the full results when you plan to use the vector for further evaluation, and display partial results when you are just glancing over the top hits.
Each of these similarity measures captures slightly different relationships
and will generate somewhat different output. In general, cosine similarity and Pearson
correlation (which are very similar functions) are best for estimating synonyms. Covariance
and Euclidean distance tend to find a wider variety of relationships. Keep in mind that
the relationship between semantic similarity measures and qualitative assumptions
about word meaning remains underdetermined in the research. Whether and how similarity
scores can map contours of meaning in a document collection should be considered an
open question. empson was designed to help humanists think this problem through. The
statistical tests included here were chosen for their simplicity.
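As a rough illustration of how the four measures relate, the sketch below computes each one for two toy vectors using only base R. The vectors a and b and the helper cosine_sim are hypothetical and are not part of empson. How most_similar computes and orders its scores internally is not shown here.

```r
# Toy vectors standing in for two rows of a matrix (hypothetical data).
a <- c(1, 2, 3, 4)
b <- c(2, 4, 6, 9)

# Cosine similarity: dot product divided by the product of the vector norms.
# cosine_sim is an illustrative helper, not an empson function.
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

cos_ab <- cosine_sim(a, b)

# Pearson correlation equals the cosine similarity of the mean-centered vectors,
# which is why the two measures tend to behave alike.
pear_ab <- cor(a, b, method = "pearson")
pear_check <- cosine_sim(a - mean(a), b - mean(b))

# Covariance is not normalized, so it is sensitive to the scale of the counts.
cov_ab <- cov(a, b)

# Euclidean distance: smaller values mean the vectors are closer together.
euc_ab <- sqrt(sum((a - b)^2))
```

Cosine and Pearson are both normalized, while covariance and Euclidean distance also respond to the magnitude of the values. That difference is one plausible reason the two groups surface different kinds of relationships.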
When working with a term-document matrix, margin = 1 finds similar words, and margin = 2 finds similar documents. When working with a word-context matrix, margin = 2 reads across the columns, and so is limited to those words for which empson built concordances.
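The effect of margin can be sketched in base R with a small, made-up term-document matrix. The matrix tdm, its row and column names, and the helper cosine_sim are assumptions for illustration only. The sketch mimics the row-wise and column-wise comparisons described above, not the internals of most_similar.

```r
# Hypothetical term-document matrix: rows are words, columns are documents.
tdm <- matrix(
  c(3, 0, 1,
    2, 0, 1,
    0, 4, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("king", "queen", "ship"), c("docA", "docB", "docC"))
)

# Illustrative helper, not an empson function.
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Rows (like margin = 1): compare the "king" row with every row, word to word.
word_sims <- apply(tdm, 1, cosine_sim, y = tdm["king", ])

# Columns (like margin = 2): compare the "docA" column with every column,
# document to document.
doc_sims <- apply(tdm, 2, cosine_sim, y = tdm[, "docA"])

# Sort so the closest matches come first.
sort(word_sims, decreasing = TRUE)
sort(doc_sims, decreasing = TRUE)
```

In both cases the query vector is compared against itself as well, so it appears at the top of the sorted results with a similarity of 1.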
# For most similar words in a word-context matrix
data(eebo)
most_similar(mat = eebo, vec = "rights")
most_similar(mat = eebo, vec = "rights", method = "euclidean")
# For most similar words of a composite vector
compvec <- eebo["mind", ] - eebo["soul", ]
most_similar(mat = eebo, vec = compvec)
# For most similar documents in a term-document matrix
data(shakespeare)
most_similar(mat = shakespeare, vec = "TN", margin = 2)
# For full results
most_similar(mat = eebo, vec = "rights", fullResults = TRUE)