View source: R/textstat_simil.R
These functions compute matrices of similarities or distances between
documents or features from a dfm() and return them in a sparse format.
These methods are fast and robust because they operate directly on the
sparse dfm objects. The output can easily be coerced to an ordinary
matrix, a data.frame of pairwise comparisons, or a dist format.
textstat_simil(
x,
y = NULL,
selection = NULL,
margin = c("documents", "features"),
method = c("correlation", "cosine", "jaccard", "ejaccard", "dice", "edice", "hamman",
"simple matching"),
min_simil = NULL,
...
)
textstat_dist(
x,
y = NULL,
selection = NULL,
margin = c("documents", "features"),
method = c("euclidean", "manhattan", "maximum", "canberra", "minkowski"),
p = 2,
...
)
## S3 method for class 'textstat_proxy'
as.list(x, sorted = TRUE, n = NULL, diag = FALSE, ...)
## S3 method for class 'textstat_proxy'
as.data.frame(
x,
row.names = NULL,
optional = FALSE,
diag = FALSE,
upper = FALSE,
...
)
x, y |
a dfm objects; |
selection |
(deprecated - use |
margin |
identifies the margin of the dfm on which similarity or
distance will be computed: "documents" or "features" |
method |
character; the method identifying the similarity or distance measure to be used; see Details. |
min_simil |
numeric; a threshold for the similarity values below which similarity values will not be returned |
... |
unused |
p |
The power of the Minkowski distance. |
sorted |
sort results in descending order if |
n |
the top |
diag |
logical; if |
row.names |
|
optional |
logical. If |
upper |
logical; if |
textstat_simil options are: "correlation" (default), "cosine", "jaccard",
"ejaccard", "dice", "edice", "simple matching", and "hamman".
textstat_dist options are: "euclidean" (default), "manhattan", "maximum",
"canberra", and "minkowski".
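To make the difference between a similarity and a distance measure concrete, here is a minimal base-R sketch on a toy count matrix. It does not use quanteda and is not the package's implementation; the matrix `m` and the names `norms`, `cosine` and `euclid` are made up for illustration.

```r
# Toy document-feature counts: 3 documents (rows) by 4 features (columns).
# These values are invented for illustration only.
m <- matrix(c(2, 1, 0, 0,
              4, 2, 0, 0,
              0, 0, 3, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c", "d")))

# Cosine similarity between rows: dot products scaled by the row norms.
norms <- sqrt(rowSums(m^2))
cosine <- tcrossprod(m) / outer(norms, norms)

# Euclidean distance between rows, via base R's dist().
euclid <- as.matrix(dist(m, method = "euclidean"))

cosine["d1", "d2"]  # d2 is d1 doubled, so the two point in the same direction
euclid["d1", "d2"]  # the raw counts nevertheless differ
cosine["d1", "d3"]  # d1 and d3 share no features
```

In this toy case the cosine measure treats `d1` and `d2` as identical, while the Euclidean distance between them is non-zero because it is computed on the raw counts.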
A sparse matrix from the Matrix package that will be symmetric
unless y is specified.
It can easily be converted into a list using as.list(), which
returns a list for each unique element of the second of the pairs;
into a dist object using as.dist(); or into an ordinary matrix
using as.matrix().
as.list for a textstat_simil or textstat_dist object returns a list
equal in length to the columns of the simil or dist object, with the
rows and their values as named elements. By default, this list excludes
same-item pairs (when diag = FALSE) and sorts the values in descending
order (when sorted = TRUE).
as.data.frame for a textstat_simil or textstat_dist object returns a
data.frame of pairwise combinations and their similarity or distance
values.
If you want to compute similarity on a "normalized" dfm object
(controlling for variable document lengths, for methods such as correlation
for which different document lengths matter), then wrap the input dfm in
dfm_weight(x, "prop").
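The effect of proportional weighting can be sketched in base R, assuming it amounts to dividing each document's counts by that document's total. This is a stand-in for dfm_weight(x, "prop"), not quanteda's own code, and the toy matrix `m` and the names `prop`, `d_raw` and `d_prop` are hypothetical.

```r
# Toy counts: d2 is d1 repeated twice, so it is simply a longer document.
m <- matrix(c(2, 1, 0,
              4, 2, 0,
              1, 1, 2),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c")))

# Proportional weighting: each row is divided by its own total.
# (Assumed here to mirror what the "prop" scheme does.)
prop <- m / rowSums(m)

# Euclidean distances before and after weighting.
d_raw <- as.matrix(dist(m))
d_prop <- as.matrix(dist(prop))

d_raw["d1", "d2"]   # non-zero: driven only by document length
d_prop["d1", "d2"]  # zero: the relative frequencies are identical
```

After weighting, every row sums to 1, so two documents with the same relative feature frequencies are no longer separated by their length alone.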
# similarities for documents
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 2000),
remove_punct = TRUE, remove = stopwords("english"))
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)
as.list(tstat1)
as.list(tstat1, diag = TRUE)
# min_simil
(tstat2 <- textstat_simil(dfmat, method = "cosine", margin = "documents", min_simil = 0.6))
as.matrix(tstat2)
# similarities for specific documents
textstat_simil(dfmat, dfmat["2017-Trump", ], margin = "documents")
textstat_simil(dfmat, dfmat["2017-Trump", ], method = "cosine", margin = "documents")
textstat_simil(dfmat, dfmat[c("2009-Obama", "2013-Obama"), ], margin = "documents")
# compute some term similarities
tstat3 <- textstat_simil(dfmat, dfmat[, c("fair", "health", "terror")], method = "cosine",
margin = "features")
head(as.matrix(tstat3), 10)
as.list(tstat3, n = 6)
# distances for documents
(tstat4 <- textstat_dist(dfmat, margin = "documents"))
as.matrix(tstat4)
as.list(tstat4)
as.dist(tstat4)
# distances for specific documents
textstat_dist(dfmat, dfmat["2017-Trump", ], margin = "documents")
(tstat5 <- textstat_dist(dfmat, dfmat[c("2009-Obama" , "2013-Obama"), ], margin = "documents"))
as.matrix(tstat5)
as.list(tstat5)
## Not run:
# plot a dendrogram after converting the object into distances
plot(hclust(as.dist(tstat4)))
## End(Not run)