A set of functions that take two sets or bags of words and measure their similarity or dissimilarity.
a: The first set (or bag) to be compared. The origin bag for directional comparisons.
b: The second set (or bag) to be compared. The destination bag for directional comparisons.
The functions jaccard_similarity and jaccard_dissimilarity provide the
Jaccard measures of similarity or dissimilarity for two sets. The
coefficients will be numbers between 0 and 1. For the similarity
coefficient, the higher the number, the more similar the two sets are. When
applied to two documents of class TextReuseTextDocument, the hashes in those
documents are compared. However, these functions can be passed objects of any
class accepted by the set functions in base R. So it is possible, for
instance, to pass them two character vectors composed of word, line,
sentence, or paragraph tokens, or those character vectors hashed as integers.
The Jaccard similarity coefficient is defined as follows:
J(A, B) = |A ∩ B| / |A ∪ B|
which in R terms is length(intersect(a, b)) / length(union(a, b)).
The Jaccard dissimilarity is simply
1 - J(A, B)
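The definition above can be checked by hand with base R set functions. The following is a minimal sketch, not the package's implementation; the helper names jaccard_sketch and jaccard_dissimilarity_sketch are hypothetical and exist only for illustration.

```r
# Hypothetical helper (not part of textreuse): Jaccard similarity
# computed directly from the set definition.
jaccard_sketch <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical helper: the dissimilarity is one minus the similarity.
jaccard_dissimilarity_sketch <- function(a, b) {
  1 - jaccard_sketch(a, b)
}

# 1:6 and 3:10 share the four elements 3, 4, 5, 6;
# their union is 1:10, which has ten elements.
jaccard_sketch(1:6, 3:10)                # 4 / 10
jaccard_dissimilarity_sketch(1:6, 3:10)  # 1 - 4 / 10
```

Because intersect and union discard duplicates, repeated tokens in either input do not change the result.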
The function jaccard_bag_similarity treats a and b as
bags rather than sets, so that the result is a fraction where the numerator
is the sum of each matching element counted the minimum number of times it
appears in each bag, and the denominator is the sum of the lengths of both
bags. The maximum value for the Jaccard bag similarity is 0.5, which is
reached when the two bags are identical.
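The bag computation described above can be sketched in base R as follows. This is an illustration under the stated definition, not the package's own code; the helper name bag_similarity_sketch is hypothetical.

```r
# Hypothetical helper (not part of textreuse): Jaccard bag similarity
# following the description above.
bag_similarity_sketch <- function(a, b) {
  counts_a <- table(a)   # how often each element occurs in a
  counts_b <- table(b)   # how often each element occurs in b
  shared <- intersect(names(counts_a), names(counts_b))
  # Each shared element counts the minimum number of times it
  # appears in the two bags.
  matches <- sum(pmin(as.vector(counts_a[shared]),
                      as.vector(counts_b[shared])))
  # The denominator is the combined length of both bags.
  matches / (length(a) + length(b))
}

a <- c("a", "a", "a", "b")
b <- c("a", "a", "b", "b", "c")

# "a" matches min(3, 2) = 2 times and "b" matches min(1, 2) = 1 time,
# so the numerator is 3; the denominator is 4 + 5 = 9.
bag_similarity_sketch(a, b)

# Identical bags reach the maximum: 4 / (4 + 4).
bag_similarity_sketch(a, a)
```

The numerator can never exceed the length of the shorter bag, which is why the value is capped at one half.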
The function ratio_of_matches finds the ratio between the number of
items in b that are also in a and the total number of items in b.
Note that this similarity measure is directional: it measures how much
b borrows from a, but says nothing about how much of a borrows from b.
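The directional nature of this measure is easiest to see by swapping the arguments. The sketch below follows the description above using base R only; the helper name ratio_sketch is hypothetical and is not the package's implementation.

```r
# Hypothetical helper (not part of textreuse): share of the items
# in b that also occur somewhere in a.
ratio_sketch <- function(a, b) {
  sum(b %in% a) / length(b)
}

a <- c("a", "a", "a", "b")
b <- c("a", "a", "b", "b", "c")

# Four of the five items in b ("a", "a", "b", "b") occur in a.
ratio_sketch(a, b)   # 4 / 5

# Every one of the four items in a occurs in b.
ratio_sketch(b, a)   # 4 / 4
```

The two calls give different values because the denominator is always the length of the second argument, the destination bag.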
Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011).
jaccard_similarity(1:6, 3:10)
jaccard_dissimilarity(1:6, 3:10)

a <- c("a", "a", "a", "b")
b <- c("a", "a", "b", "b", "c")
jaccard_similarity(a, b)
jaccard_bag_similarity(a, b)
ratio_of_matches(a, b)
ratio_of_matches(b, a)

ny         <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
ca_match   <- system.file("extdata/legal/ca1851-match.txt", package = "textreuse")
ca_nomatch <- system.file("extdata/legal/ca1851-nomatch.txt", package = "textreuse")

ny         <- TextReuseTextDocument(file = ny, meta = list(id = "ny"))
ca_match   <- TextReuseTextDocument(file = ca_match, meta = list(id = "ca_match"))
ca_nomatch <- TextReuseTextDocument(file = ca_nomatch, meta = list(id = "ca_nomatch"))

# These two should have higher similarity scores
jaccard_similarity(ny, ca_match)
ratio_of_matches(ny, ca_match)

# These two should have lower similarity scores
jaccard_similarity(ny, ca_nomatch)
ratio_of_matches(ny, ca_nomatch)