textstat_simil: Similarity and distance computation between documents or...

View source: R/textstat_simil.R

Description

These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected). They are fast and robust because they operate directly on the sparse dfm objects.

Usage

textstat_simil(x, selection = NULL, margin = c("documents",
  "features"), method = c("correlation", "cosine", "jaccard", "ejaccard",
  "dice", "edice", "hamman", "simple matching", "faith"), upper = FALSE,
  diag = FALSE)

textstat_dist(x, selection = NULL, margin = c("documents", "features"),
  method = c("euclidean", "kullback", "manhattan", "maximum", "canberra",
  "minkowski"), upper = FALSE, diag = FALSE, p = 2)

Arguments

x

a dfm object

selection

a valid index for document or feature names (depending on margin) from x, to be selected for comparison

margin

identifies the margin of the dfm on which similarity or distance will be computed: "documents" for documents or "features" for word/term features.

method

the similarity or distance measure to be used; see Details.

upper

whether the upper triangle of the symmetric V × V matrix is recorded. Only used when value = "dist".

diag

whether the diagonal of the distance matrix should be recorded. Only used when value = "dist".

p

The power of the Minkowski distance.
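
To make the role of p concrete, here is a minimal base R sketch. It works on a small made-up count matrix and uses stats::dist rather than textstat_dist itself, so it illustrates only the formula, not the package's implementation. The helper minkowski() is hypothetical. With p = 2 the Minkowski distance coincides with the Euclidean distance, and with p = 1 with the Manhattan distance.

```r
# Toy document-by-feature counts (made-up values, for illustration only)
m <- rbind(doc1 = c(1, 0, 3), doc2 = c(2, 2, 1))

# Hypothetical helper: Minkowski distance, (sum |x_i - y_i|^p)^(1/p)
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

d_p2 <- minkowski(m["doc1", ], m["doc2", ], p = 2)
d_p1 <- minkowski(m["doc1", ], m["doc2", ], p = 1)

# Cross-check against base R's dist()
all.equal(d_p2, as.numeric(dist(m, method = "euclidean")))
all.equal(d_p1, as.numeric(dist(m, method = "manhattan")))
all.equal(d_p2, as.numeric(dist(m, method = "minkowski", p = 2)))
```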

Details

textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamman", and "faith".

textstat_dist options are: "euclidean" (default), "kullback", "manhattan", "maximum", "canberra", and "minkowski".

Value

By default, textstat_simil and textstat_dist return dist class objects if selection is NULL; otherwise, they return a matrix of similarities or distances for the documents or features identified in selection.

These can be transformed into a list format using as.list.dist, if that format is preferred.

Note

If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "prop").
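
Why document length matters can be seen in a small base R sketch. It assumes that proportion weighting divides each document's counts by that document's total, and it uses a made-up matrix with stats::dist in place of a real dfm and textstat_dist. Euclidean distance is used here because its dependence on document length is easy to verify by hand.

```r
# Two toy "documents" with identical word proportions but different lengths
counts <- rbind(short = c(1, 2, 3), long = c(2, 4, 6))

# Row-wise proportions: each document's counts divided by its own total
props <- counts / rowSums(counts)

# On raw counts the distance reflects only the difference in length
d_raw <- as.numeric(dist(counts))

# On proportions the two documents are (numerically) identical
d_prop <- as.numeric(dist(props))

d_raw
d_prop
```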

References

"kullback" is the Kullback-Leibler distance, which assumes that P(x_i) = 0 implies P(y_i) = 0. If either P(x_i) or P(y_i) equals zero, the term P(x_i) * log(P(x_i)/P(y_i)) is taken to be zero, its limit value. The formula is:

∑_i {P(x_i) * log(P(x_i)/P(y_i))}

All other measures are described in the proxy package.
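
The Kullback-Leibler convention described above can be sketched in base R as follows. The function kl_distance() is a hypothetical helper, not part of quanteda, and it assumes that the probabilities are obtained by dividing each count vector by its sum; the package's own computation may differ in detail.

```r
# Hypothetical helper: Kullback-Leibler distance between two count vectors
kl_distance <- function(x, y) {
  # Assumption: probabilities are the counts divided by their total
  p <- x / sum(x)
  q <- y / sum(y)
  # Terms where either probability is zero contribute zero (the limit value)
  keep <- p > 0 & q > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Identical distributions give a distance of zero
kl_distance(c(1, 1), c(2, 2))

# The zero entry in the first vector is skipped rather than producing NaN
kl_distance(c(1, 0), c(1, 1))
```

Note that the measure is not symmetric: swapping the two arguments generally gives a different value.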

See Also

textstat_dist, as.matrix.simil, as.list.dist, dist, as.dist

Examples

# similarities for documents
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980), 
          remove_punct = TRUE, remove = stopwords("english"))
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)
as.list(tstat1)

# similarities for specific documents
textstat_simil(dfmat, selection = "2017-Trump", margin = "documents")
textstat_simil(dfmat, selection = "2017-Trump", method = "cosine", margin = "documents")
textstat_simil(dfmat, selection = c("2009-Obama", "2013-Obama"), margin = "documents")

# compute some term similarities
tstat2 <- textstat_simil(dfmat, selection = c("fair", "health", "terror"), method = "cosine",
                      margin = "features")
head(as.matrix(tstat2), 10)
as.list(tstat2, n = 8)

# create a stemmed dfm from inaugural addresses delivered after 1990
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990),
             remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)

# distances for documents 
(tstat1 <- textstat_dist(dfmat, margin = "documents"))
as.matrix(tstat1)

# distances for specific documents
textstat_dist(dfmat, "2017-Trump", margin = "documents")
(tstat2 <- textstat_dist(dfmat, c("2009-Obama", "2013-Obama"), margin = "documents"))
as.list(tstat2)

quanteda/quanteda documentation built on Feb. 16, 2019, 5:45 a.m.