textstat_simil: Similarity and distance computation between documents or...


View source: R/textstat_simil.R

Description

These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected). They are fast and robust because they operate directly on the sparse dfm objects.

Usage

textstat_simil(x, selection = NULL, margin = c("documents",
  "features"), method = c("correlation", "cosine", "jaccard", "ejaccard",
  "dice", "edice", "hamman", "simple matching", "faith"), upper = FALSE,
  diag = FALSE)

textstat_dist(x, selection = NULL, margin = c("documents", "features"),
  method = c("euclidean", "kullback", "manhattan", "maximum", "canberra",
  "minkowski"), upper = FALSE, diag = FALSE, p = 2)

Arguments

x

a dfm object

selection

a valid index for document or feature names from x, to be selected for comparison

margin

identifies the margin of the dfm on which similarity or difference will be computed: "documents" for documents or "features" for word/term features.

method

the similarity or distance measure to be used; see Details.

upper

whether the upper triangle of the symmetric V × V matrix is recorded. Only used when value = "dist".

diag

whether the diagonal of the distance matrix should be recorded. Only used when value = "dist".

p

The power of the Minkowski distance.
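
To make the role of p concrete, here is a minimal base R sketch of the Minkowski distance between two rows of counts. It is not quanteda's implementation; the helper minkowski() and the vectors doc1 and doc2 are hypothetical names introduced only for illustration.

```r
# Minkowski distance of order p between two numeric vectors.
# Illustrative helper, not part of quanteda.
minkowski <- function(a, b, p = 2) {
  sum(abs(a - b)^p)^(1 / p)
}

# Two hypothetical rows of term counts.
doc1 <- c(2, 0, 1, 3)
doc2 <- c(1, 1, 0, 3)

manhattan_like <- minkowski(doc1, doc2, p = 1)  # equals sum(abs(doc1 - doc2))
euclidean_like <- minkowski(doc1, doc2, p = 2)  # equals sqrt(sum((doc1 - doc2)^2))

# Cross-check against stats::dist(), which also accepts a power p.
base_p3 <- dist(rbind(doc1, doc2), method = "minkowski", p = 3)
```

With p = 1 the measure reduces to the Manhattan distance and with p = 2 to the Euclidean distance, which is why p = 2 is a natural default.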

Details

textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamman", and "faith".

textstat_dist options are: "euclidean" (default), "kullback", "manhattan", "maximum", "canberra", and "minkowski".
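
As a rough illustration of what the two most common similarity options measure, the following base R sketch computes cosine similarity and Pearson correlation between the rows of a small dense matrix. The matrix m and the helper cosine_sim() are hypothetical; quanteda itself works on sparse dfm objects and may differ in implementation details.

```r
# Toy document-feature matrix: rows are documents, columns are features.
m <- rbind(d1 = c(2, 0, 1, 3),
           d2 = c(4, 0, 2, 6),
           d3 = c(0, 5, 1, 0))

# Cosine similarity between all pairs of rows (illustrative helper).
cosine_sim <- function(m) {
  cross <- tcrossprod(m)        # pairwise dot products between rows
  norms <- sqrt(diag(cross))    # Euclidean norm of each row
  cross / outer(norms, norms)   # divide each dot product by both norms
}

cos_docs <- cosine_sim(m)

# Pearson correlation between rows: cor() works on columns, so transpose.
cor_docs <- cor(t(m))
```

Because d2 is exactly twice d1, both measures treat the pair as identical in direction, while d3 shares almost no vocabulary with d1 and scores close to zero on cosine.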

Value

By default, textstat_simil and textstat_dist return dist class objects if selection is NULL. Otherwise, they return a matrix of distances or similarities for the documents or features identified in the selection.

These can be transformed into a list format using as.list.dist, if that format is preferred.
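
The difference between the compact dist form and the matrix and list forms can be sketched with base R alone. The example below uses stats::dist() on a hypothetical toy matrix and builds the per-document list by hand; it only mimics the idea of a list format and does not call quanteda's own as.list method.

```r
# Hypothetical toy counts: three documents, three features.
m <- rbind(d1 = c(2, 0, 1),
           d2 = c(1, 1, 0),
           d3 = c(0, 3, 1))

d <- dist(m, method = "euclidean")  # dist object: stores the lower triangle only
full <- as.matrix(d)                # full symmetric matrix with a zero diagonal

# Hand-rolled list format: for each document, the sorted distances to the others.
per_doc <- lapply(rownames(full), function(i) {
  sort(full[i, rownames(full) != i])
})
names(per_doc) <- rownames(full)
```

Each element of per_doc is a named vector ordered from the nearest to the farthest document, which is often easier to read than a large matrix.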

Note

If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "prop").
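
A small base R sketch shows why this matters for length-sensitive measures. Two hypothetical documents with identical word proportions but different lengths are far apart on raw counts and identical after converting each row to proportions, which is the same idea as the "prop" weighting mentioned above. The vectors short and long are made up for illustration.

```r
short <- c(2, 1, 0, 1)
long  <- short * 10            # same word proportions, ten times as long

m <- rbind(short = short, long = long)

raw_dist <- dist(m)            # Euclidean distance on raw counts

# Row-wise proportions: each count divided by its document's total.
prop <- m / rowSums(m)
prop_dist <- dist(prop)        # Euclidean distance on proportions
```

Here raw_dist is large purely because of the length difference, while prop_dist is zero because the two documents use words in the same proportions.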

References

"kullback" is the Kullback-Leibler distance, which assumes that P(x_i) = 0 implies P(y_i) = 0. If either P(x_i) or P(y_i) equals zero, the term P(x_i) * log(P(x_i) / P(y_i)) is taken to be zero, its limit value. The formula is:

∑_i P(x_i) * log(P(x_i) / P(y_i))

All other measures are described in the proxy package.
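
The Kullback-Leibler formula and its zero convention can be sketched in base R as follows. This is an illustration under stated assumptions, not quanteda's code: the helper kl_distance() is hypothetical, and it assumes that raw counts are first converted to probabilities by dividing by their sum.

```r
# Kullback-Leibler distance between two count vectors (illustrative helper).
kl_distance <- function(x, y) {
  p <- x / sum(x)          # assumed normalization of counts to probabilities
  q <- y / sum(y)
  keep <- p > 0 & q > 0    # terms where either probability is zero contribute 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

x <- c(1, 1)
y <- c(1, 3)

kl_xy <- kl_distance(x, y)   # 0.5 * log(4 / 3)
kl_yx <- kl_distance(y, x)   # differs from kl_xy: the measure is not symmetric
kl_xx <- kl_distance(x, x)   # 0 for identical distributions
```

Because the measure is asymmetric, the order of the two documents matters when interpreting a result.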

See Also

textstat_dist, as.list.dist, dist, as.dist

Examples

# similarities for documents
mt <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords("english"))
(s1 <- textstat_simil(mt, method = "cosine", margin = "documents"))
as.matrix(s1)
as.list(s1)

# similarities for specific documents
textstat_simil(mt, "2017-Trump", margin = "documents")
textstat_simil(mt, "2017-Trump", method = "cosine", margin = "documents")
textstat_simil(mt, c("2009-Obama" , "2013-Obama"), margin = "documents")

# compute some term similarities
s2 <- textstat_simil(mt, c("fair", "health", "terror"), method = "cosine",
                      margin = "features")
head(as.matrix(s2), 10)
as.list(s2, n = 8)

# create a dfm from inaugural addresses delivered after 1990
mt <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990),
          remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
               
# distances for documents 
(d1 <- textstat_dist(mt, margin = "documents"))
as.matrix(d1)

# distances for specific documents
textstat_dist(mt, "2017-Trump", margin = "documents")
(d2 <- textstat_dist(mt, c("2009-Obama" , "2013-Obama"), margin = "documents"))
as.list(d1)

quanteda documentation built on Nov. 2, 2018, 1:05 a.m.