Duplicates | R Documentation
Detect Duplicates
Class for duplicate detection.
The class implements a procedure described by Fritz Kliche, Andre Blessing, Ulrich Heid and Jonathan Sonntag in the paper "The eIdentity Text Exploration Workbench", presented at LREC 2014 (see http://www.lrec-conf.org/proceedings/lrec2014/pdf/332_Paper.pdf).
To detect duplicates, choices are made as follows: if two similar articles have been published on the same day, the shorter article is considered the duplicate; if two similar articles were published on different days, the article that appeared later is considered the duplicate.
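This tie-breaking rule can be sketched in base R. The function name pick_duplicate and its inputs are hypothetical illustrations, not part of the class API:

```r
# Given two similar articles, decide which one is the duplicate,
# following the rule described above:
# - same publication date: the shorter article is the duplicate
# - different dates: the article published later is the duplicate
pick_duplicate <- function(date_a, date_b, size_a, size_b) {
  if (date_a == date_b) {
    if (size_a <= size_b) "a" else "b"
  } else {
    if (date_a > date_b) "a" else "b"
  }
}

pick_duplicate(as.Date("2000-01-01"), as.Date("2000-01-01"), 120L, 340L) # "a" (shorter)
pick_duplicate(as.Date("2000-01-03"), as.Date("2000-01-01"), 500L, 500L) # "a" (later)
```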
Different partition_bundle objects can be passed into the detect() method successively; duplicates that are newly detected are appended to the field duplicates.
corpus
ID of the CWB corpus (derived from the partition_bundle).

char_regex
Regular expression defining the characters to keep.

char_count
Count of the characters in the partition_bundle.

n
Number of days before and after a document was published.

p_attribute
The p-attribute used (defaults to "word").

s_attribute
The s-attribute of the date of a text in the corpus.

sample
Size of the sample of the partition_bundle that the character count is based on.

threshold
Minimum similarity value to consider two texts as duplicates.

duplicates
A data.table with documents considered as duplicates.

similarities
A simple_triplet_matrix with similarities of texts.

date_preprocessor
Function to rework dates if they are not in the DD-MM-YYYY standard format.

annotation
A data.table with corpus positions.
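As a sketch of what a date_preprocessor function might look like: it receives the raw date strings extracted from the s-attribute and returns them in a standard format. The German-style DD.MM.YYYY input assumed here is purely illustrative, and the ISO output format follows the dates used in the Examples section:

```r
# Hypothetical date_preprocessor: turn raw dates such as "01.03.2000"
# (DD.MM.YYYY) into ISO format as used in the Examples section
preprocess_date <- function(x) {
  format(as.Date(x, format = "%d.%m.%Y"), "%Y-%m-%d")
}

preprocess_date("01.03.2000") # "2000-03-01"
```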
new()
Initialize an object of class Duplicates.

Duplicates$new(
  corpus,
  char_regex = "[a-zA-Z]",
  p_attribute = "word",
  s_attribute = "text_date",
  date_preprocessor = NULL,
  sample = 1000L,
  n = 1L,
  threshold = 0.9
)
corpus
ID of the CWB corpus that will be explored.

char_regex
A regular expression defining the characters to keep.

p_attribute
The p-attribute to evaluate.

s_attribute
The s-attribute providing the date.

date_preprocessor
A function used to preprocess dates as extracted from the s_attribute.

sample
Number of documents sampled from the partition_bundle to speed up the character count.

n
Number of days before and after a document was published.

threshold
Numeric (0 < x < 1), the minimum similarity to qualify two documents as duplicates.
get_comparisons()
Identify documents that will be compared (based on the date of documents).

Duplicates$get_comparisons(
  x,
  reduce = TRUE,
  verbose = FALSE,
  progress = TRUE,
  mc = FALSE
)
x
A partition_bundle object defining the documents that will be compared to detect duplicates.

reduce
A logical value, whether to drop one half of the (symmetric) matrix.

verbose
Logical, whether to be verbose.

progress
Logical, whether to show a progress bar.

mc
Logical, whether to use multicore.
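The date-based selection of comparison pairs can be illustrated in plain R. The function, the sample dates, and the window size below are made-up inputs, not the method's actual implementation:

```r
# Enumerate document pairs whose publication dates are at most n days
# apart; only these pairs need to be compared for similarity
get_pairs <- function(dates, n = 1L) {
  idx <- seq_along(dates)
  pairs <- expand.grid(a = idx, b = idx)
  pairs <- pairs[pairs$a < pairs$b, ]  # drop one half of the matrix (reduce = TRUE)
  keep <- abs(as.integer(dates[pairs$a] - dates[pairs$b])) <= n
  pairs[keep, ]
}

dates <- as.Date(c("2000-01-01", "2000-01-02", "2000-01-05"))
get_pairs(dates, n = 1L) # only the pair (1, 2) is within one day
```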
similarities_matrix_to_dt()
Turn similarities of documents into a data.table that identifies the original document and the duplicate.

Duplicates$similarities_matrix_to_dt(
  x,
  similarities,
  mc = FALSE,
  progress = TRUE,
  verbose = TRUE
)
x
A partition_bundle object defining the documents that will be compared to detect duplicates.

similarities
A TermDocumentMatrix with cosine similarities.

mc
Logical, whether to use multicore.

progress
Logical, whether to show a progress bar.

verbose
Logical, whether to be verbose.
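In spirit, this conversion keeps the cells of a symmetric similarity matrix that exceed the threshold and turns them into rows of a table. A base-R sketch with a dense matrix and a data.frame (the actual method works on a sparse simple_triplet_matrix and returns a data.table):

```r
# Turn a symmetric similarity matrix into a table of candidate
# duplicate pairs whose similarity exceeds the threshold
similarities_to_table <- function(m, threshold = 0.9) {
  hits <- which(upper.tri(m) & m >= threshold, arr.ind = TRUE)
  data.frame(
    name = rownames(m)[hits[, "row"]],
    duplicate_name = colnames(m)[hits[, "col"]],
    similarity = m[hits]
  )
}

m <- matrix(
  c(1, 0.95, 0.2,
    0.95, 1, 0.1,
    0.2, 0.1, 1),
  nrow = 3, dimnames = list(c("a", "b", "c"), c("a", "b", "c"))
)
similarities_to_table(m, threshold = 0.9) # one row: a / b / 0.95
```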
detect()
Wrapper that implements the entire workflow for duplicate detection.

Duplicates$detect(
  x,
  n = 5L,
  character_selection = 1:12,
  how = "coop",
  verbose = TRUE,
  mc = FALSE,
  progress = TRUE
)
x
A partition_bundle or subcorpus_bundle object.

n
The number of characters to use for shingling (an integer value), passed as argument n into polmineR::ngrams(). Defaults to 5, in line with Kliche et al. 2014: 695.

character_selection
Numeric/integer vector used for indexing $char_count to select the characters to keep. Defaults to 1:12, in line with Kliche et al. 2014: 695.

how
Implementation used to compute similarities; passed into cosine_similarity().

verbose
Logical, whether to be verbose.

mc
Logical, whether to use multicore.

progress
Logical, whether to show a progress bar.

The updated content of the slot $duplicates is returned invisibly.
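The core of the workflow — character n-gram shingles compared by cosine similarity, cf. Kliche et al. 2014 — can be sketched in base R. The helper names below are hypothetical; the real implementation uses polmineR::ngrams() for shingling and the cosine_similarity() implementation selected via how:

```r
# n-character shingles of a text, keeping only selected characters
shingles <- function(x, n = 5L) {
  x <- gsub("[^a-zA-Z]", "", x)  # keep selected characters only
  if (nchar(x) < n) return(character())
  substring(x, 1L:(nchar(x) - n + 1L), n:nchar(x))
}

# cosine similarity of two texts based on shingle counts
cosine_sim <- function(a, b, n = 5L) {
  ta <- table(shingles(a, n))
  tb <- table(shingles(b, n))
  vocab <- union(names(ta), names(tb))
  va <- as.numeric(ta[vocab]); va[is.na(va)] <- 0
  vb <- as.numeric(tb[vocab]); vb[is.na(vb)] <- 0
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

cosine_sim("the quick brown fox", "the quick brown fox jumps")
```

Two texts with identical shingle profiles yield a similarity of 1; texts sharing no shingles yield 0. The threshold field then decides which pairs count as duplicates.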
annotate()
Turn the data.table with duplicates into a file with corpus positions and an annotation of duplicates, generate the cwb-s-encode command, and execute it, if wanted.

Duplicates$annotate(s_attribute)

s_attribute
The s-attribute providing the date.
encode()
Add structural attributes to the CWB corpus based on the annotation data that has been generated (the data.table in the field annotation).

Duplicates$encode(
  exec = FALSE,
  filenames = list(duplicate = tempfile(), original = tempfile())
)

exec
A logical value, whether to execute the system command.

filenames
List of filenames.
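Conceptually, this step writes region files (corpus-position start, end, and annotation value per line, tab-separated) and calls the CWB tool cwb-s-encode. A hedged sketch of what the generated command might look like — the corpus positions, the data-directory path, and the attribute name are illustrative; consult the cwb-s-encode manual for the authoritative options:

```r
# Write a minimal region file: cpos_start TAB cpos_end TAB value
regions <- data.frame(
  cpos_left = c(0L, 150L),
  cpos_right = c(99L, 209L),
  value = c("FALSE", "TRUE")  # is the region a duplicate?
)
f <- tempfile()
write.table(
  regions, file = f,
  sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE
)

# Assemble (but do not run) a cwb-s-encode call that would add the
# annotation as a structural attribute (illustrative path and name)
cmd <- sprintf("cwb-s-encode -d /path/to/data_dir -f %s -V article_is_duplicate", f)
cat(cmd, "\n")
```

With exec = TRUE, a method like this would hand the assembled command to system(); with exec = FALSE only the files and the command string are prepared.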
clone()
The objects of this class are cloneable with this method.
Duplicates$clone(deep = FALSE)
deep
Whether to make a deep clone.
Examples

library(polmineR)

if ("NADIRASZ" %in% corpus()$corpus){
  D <- Duplicates$new(
    corpus = "NADIRASZ",
    char_regex = "[a-zA-ZäöüÄÖÜ]",
    p_attribute = "word",
    s_attribute = "article_date",
    date_preprocessor = NULL,
    sample = 50L,
    n = 1L,
    threshold = 0.6 # default is 0.9
  )

  article_bundle <- corpus("NADIRASZ") |>
    subset(article_date == "2000-01-01") |>
    split(s_attribute = "article_id")

  D$detect(x = article_bundle, mc = 3L)

  # To inspect result
  D$duplicates

  if (interactive()){
    for (i in 1L:nrow(D$duplicates)){
      print(i)
      corpus("NADIRASZ") %>%
        subset(article_id == !!D$duplicates[i][["name"]]) %>%
        read() %>%
        show()
      readline()
      corpus("NADIRASZ") %>%
        subset(article_id == !!D$duplicates[i][["duplicate_name"]]) %>%
        read() %>%
        show()
      readline()
    }
  }
}