compute similarities between documents and/or features

Description

Compute similarities between documents and/or features from a dfm. Uses the similarity measures defined in simil (proxy package). See pr_DB for the available similarity and distance measures, and for how to register your own.
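
As a rough illustration of the lookup described above, the sketch below checks whether the method names used in the examples on this page are registered in pr_DB, then calls simil directly on a small matrix. It assumes the proxy package is installed; the toy matrix and the variable names are invented for illustration.

```r
if (requireNamespace("proxy", quietly = TRUE)) {
  # Method names that appear in the examples on this page
  wanted <- c("correlation", "cosine", "eJaccard")
  known <- vapply(wanted, function(m) proxy::pr_DB$entry_exists(m), logical(1))
  print(known)

  # Toy count matrix (invented): rows are documents, columns are features
  toy <- matrix(c(2, 0, 1,
                  1, 3, 0,
                  4, 1, 2),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c")))

  # Similarities between rows (documents), then between columns (features)
  print(proxy::simil(toy, method = "cosine"))
  print(proxy::simil(toy, method = "cosine", by_rows = FALSE))
}
```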

Usage

similarity(x, selection = NULL, n = NULL, margin = c("documents",
  "features"), method = "correlation", sorted = TRUE, normalize = FALSE)

## S4 method for signature 'dfm'
similarity(x, selection = NULL, n = NULL,
  margin = c("documents", "features"), method = "correlation",
  sorted = TRUE, normalize = FALSE)

## S3 method for class 'similMatrix'
as.matrix(x, ...)

## S3 method for class 'similMatrix'
print(x, digits = 4, ...)

Arguments

x

a dfm object

selection

character or character vector of document names or feature labels from the dfm

n

the top n most similar items will be returned, sorted in descending order. If n is NULL, return all items.

margin

identifies the margin of the dfm on which similarity will be computed: documents for documents or features for word/term features.

method

a valid method for computing similarity from pr_DB

sorted

sort results in descending order if TRUE

normalize

a deprecated argument retained (temporarily) for legacy reasons. To compute similarity on a "normalized" dfm object (e.g. x), wrap it in weight(x, "relFreq").

...

unused

digits

decimal places to display similarity values
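
To make the margin and normalize arguments above more concrete, here is a minimal base R sketch on an invented count matrix. It uses cor() as a stand-in for method = "correlation" and assumes that weight(x, "relFreq") divides each document's counts by that document's total; neither is taken from the package source.

```r
# Toy document-feature counts (invented for illustration):
# rows are documents, columns are features.
m <- matrix(c(2, 0, 1, 4,
              1, 3, 0, 2,
              4, 1, 2, 8),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("tax", "health", "war", "peopl")))

# margin = "documents": compare rows, giving a 3 x 3 matrix
doc_sim <- cor(t(m))

# margin = "features": compare columns, giving a 4 x 4 matrix
feat_sim <- cor(m)

# Relative frequencies: each row divided by its row total
# (assumed to correspond to weight(x, "relFreq"))
rel <- m / rowSums(m)

# Pearson correlation between documents is unaffected by document length
same_docs <- isTRUE(all.equal(doc_sim, cor(t(rel))))

# Feature correlations, in contrast, can change after normalizing
same_feats <- isTRUE(all.equal(feat_sim, cor(rel)))

print(c(same_docs = same_docs, same_feats = same_feats))
```

Under these assumptions, normalizing matters mainly for feature similarities or for measures that are sensitive to document length.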

Value

a named list with one element per selection label; each element is a named vector of similarity values, sorted in descending order when sorted = TRUE.
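
The shape of the return value can be sketched in base R. The helper top_similar() below is hypothetical and is not the package's implementation; the similarity values are invented, and dropping the selected item from its own result is an assumption of this sketch.

```r
# Toy symmetric similarity matrix between four documents (values invented)
labels <- c("docA", "docB", "docC", "docD")
sim <- matrix(c(1.00, 0.62, 0.58, 0.71,
                0.62, 1.00, 0.80, 0.55,
                0.58, 0.80, 1.00, 0.49,
                0.71, 0.55, 0.49, 1.00),
              nrow = 4, dimnames = list(labels, labels))

# Hypothetical helper: returns one named vector per selected label
top_similar <- function(sim, selection, n = NULL, sorted = TRUE) {
  out <- lapply(selection, function(s) {
    v <- sim[s, colnames(sim) != s]          # drop the item itself (assumption)
    if (sorted) v <- sort(v, decreasing = TRUE)
    if (!is.null(n)) v <- head(v, n)         # keep only the top n
    v
  })
  names(out) <- selection
  out
}

res <- top_similar(sim, c("docB", "docC"), n = 2)
str(res)
```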

Note

The method for computing feature similarities can be quite slow when there are large numbers of feature types. Future implementations will hopefully speed this up.
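
One way to see why feature similarities become slow: if every pair of features is compared (an assumption about the worst case, not a statement about the implementation), the number of pairs grows quadratically with the number of feature types.

```r
# Number of distinct feature pairs for a few vocabulary sizes
n_features <- c(1000, 10000, 100000)
n_pairs <- choose(n_features, 2)
print(data.frame(n_features, n_pairs))
```

Passing a short selection limits the comparisons that need to be returned to those involving the selected features.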

Examples

# create a dfm from inaugural addresses from Reagan onwards
presDfm <- dfm(subset(inaugCorpus, Year > 1980), ignoredFeatures = stopwords("english"),
               stem = TRUE)

# compute some document similarities
(tmp <- similarity(presDfm, margin = "documents"))
# output as a matrix
as.matrix(tmp)
# for specific comparisons
similarity(presDfm, "1985-Reagan", n = 5, margin = "documents")
similarity(presDfm, c("2009-Obama", "2013-Obama"), n = 5, margin = "documents")
similarity(presDfm, c("2009-Obama", "2013-Obama"), margin = "documents")
similarity(presDfm, c("2009-Obama", "2013-Obama"), margin = "documents", method = "cosine")
similarity(presDfm, "2005-Bush", margin = "documents", method = "eJaccard", sorted = FALSE)

# compute some term similarities
similarity(presDfm, c("fair", "health", "terror"), n = 20, margin = "features", method = "cosine")

## Not run: 
# compare to tm
require(tm)
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
tdm <- TermDocumentMatrix(crude)
findAssocs(tdm, c("oil", "opec", "xyz"), c(0.75, 0.82, 0.1))
# in quanteda
quantedaDfm <- new("dfmSparse", Matrix::Matrix(t(as.matrix(tdm))))
similarity(quantedaDfm, c("oil", "opec", "xyz"), margin = "features", n = 14)
corMat <- as.matrix(proxy::simil(as.matrix(quantedaDfm), by_rows = FALSE))
round(head(sort(corMat[, "oil"], decreasing = TRUE), 14), 2)
round(head(sort(corMat[, "opec"], decreasing = TRUE), 9), 2)

## End(Not run)
