| Duplicates | R Documentation |
Detect Duplicates
Class for duplicate detection.
The class implements the procedure described by Fritz Kliche, Andre Blessing, Ulrich Heid and Jonathan Sonntag in the paper "The eIdentity Text Exploration Workbench" presented at LREC 2014 (see http://www.lrec-conf.org/proceedings/lrec2014/pdf/332_Paper.pdf).
To detect duplicates, choices are made as follows:
If two similar articles were published on the same day, the shorter article is considered the duplicate;
if two similar articles were published on different days, the article that appeared later is considered the duplicate.
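The decision rule above can be sketched as a small helper; the function name and signature are illustrative only and are not part of the class API:

```r
# Illustrative tie-breaking rule: given two similar documents, return the
# name of the one to flag as the duplicate (assumed helper, not package code).
pick_duplicate <- function(name_a, date_a, size_a, name_b, date_b, size_b){
  if (date_a == date_b){
    # published on the same day: the shorter document is the duplicate
    if (size_a <= size_b) name_a else name_b
  } else {
    # published on different days: the later document is the duplicate
    if (date_a > date_b) name_a else name_b
  }
}

pick_duplicate("a", as.Date("2000-01-01"), 500L, "b", as.Date("2000-01-01"), 800L)
```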
Different partition_bundle objects can be passed into the detect() method successively; newly detected duplicates are appended to the duplicates field.
Public fields

corpus: ID of the CWB corpus (derived from the partition_bundle).
char_regex: Regular expression defining the characters to keep.
char_count: Count of the characters in the partition_bundle.
n: Number of days before and after a document was published.
p_attribute: The p-attribute used (defaults to "word").
s_attribute: The s-attribute with the date of a text in the corpus.
sample: Size of the sample of the partition_bundle that the character count is based on.
threshold: Minimum similarity value to consider two texts duplicates.
duplicates: A data.table with documents considered duplicates.
similarities: A simple_triplet_matrix with similarities of texts.
date_preprocessor: A function to rework dates if they are not in the DD-MM-YYYY standard format.
annotation: A data.table with corpus positions.
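A date_preprocessor can be any function that maps the raw date strings of the s-attribute to the expected format. A minimal sketch, assuming the corpus stores German-style "DD.MM.YYYY" dates (that input format is an assumption):

```r
# Hypothetical date_preprocessor: turn "DD.MM.YYYY" strings into "YYYY-MM-DD"
date_fix <- function(x) format(as.Date(x, format = "%d.%m.%Y"), "%Y-%m-%d")

date_fix("24.12.2001")
```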
Methods

new(): Initialize an object of class Duplicates.
Duplicates$new( corpus, char_regex = "[a-zA-Z]", p_attribute = "word", s_attribute = "text_date", date_preprocessor = NULL, sample = 1000L, n = 1L, threshold = 0.9 )
corpus: ID of the CWB corpus that will be explored.
char_regex: A regex defining the characters to keep.
p_attribute: The p-attribute to evaluate.
s_attribute: The s-attribute providing the date.
date_preprocessor: A function used to preprocess dates as extracted from the s_attribute.
sample: Number of documents in the subset of the partition_bundle used to speed up the character count.
n: Number of days before and after a document was published.
threshold: Numeric (0 < x < 1), the minimum similarity to qualify two documents as duplicates.
get_comparisons(): Identify the documents that will be compared (based on document dates).
Duplicates$get_comparisons( x, reduce = TRUE, verbose = FALSE, progress = TRUE, mc = FALSE )
x: A partition_bundle object defining the documents that will be compared to detect duplicates.
reduce: A logical value, whether to drop one half of the matrix.
verbose: A logical value, whether to output verbose messages.
progress: A logical value, whether to show a progress bar.
mc: A logical value, whether to use multiple cores.
similarities_matrix_to_dt(): Turn the document similarities into a data.table that identifies the original document and its duplicate.
Duplicates$similarities_matrix_to_dt( x, similarities, mc = FALSE, progress = TRUE, verbose = TRUE )
x: A partition_bundle object defining the documents that will be compared to detect duplicates.
similarities: A TermDocumentMatrix with cosine similarities.
mc: A logical value, whether to use multiple cores.
progress: A logical value, whether to show a progress bar.
verbose: A logical value, whether to output verbose messages.
detect(): Wrapper that implements the entire duplicate detection workflow.
Duplicates$detect( x, n = 5L, character_selection = 1:12, how = "coop", verbose = TRUE, mc = FALSE, progress = TRUE )
x: A partition_bundle or subcorpus_bundle object.
n: The number of characters to use for shingling (an integer value), passed as argument n into polmineR::ngrams(). Defaults to 5, in line with Kliche et al. 2014: 695.
character_selection: A numeric/integer vector used for indexing $char_count to select the characters to keep. Defaults to 1:12, in line with Kliche et al. 2014: 695.
how: The implementation used to compute similarities; passed into cosine_similarity().
verbose: A logical value, whether to output verbose messages.
mc: A logical value, whether to use multiple cores.
progress: A logical value, whether to show a progress bar.
The updated content of slot $duplicates is returned invisibly.
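The core computation that detect() chains together, character shingling followed by cosine similarity of the shingle counts, can be sketched in plain base R. These helpers are illustrative stand-ins, not polmineR::ngrams() or the package's cosine_similarity():

```r
# Split a string into overlapping character n-grams (shingles); n = 5
# matches the default of detect() and Kliche et al. 2014.
char_ngrams <- function(x, n = 5L){
  chars <- strsplit(x, "")[[1L]]
  vapply(
    seq_len(max(length(chars) - n + 1L, 0L)),
    function(i) paste(chars[i:(i + n - 1L)], collapse = ""),
    character(1L)
  )
}

# Cosine similarity of two numeric vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity of two documents based on their shingle count vectors.
shingle_similarity <- function(doc1, doc2, n = 5L){
  t1 <- table(char_ngrams(doc1, n))
  t2 <- table(char_ngrams(doc2, n))
  vocab <- union(names(t1), names(t2))
  v1 <- as.numeric(t1[vocab]); v1[is.na(v1)] <- 0
  v2 <- as.numeric(t2[vocab]); v2[is.na(v2)] <- 0
  cosine(v1, v2)
}

shingle_similarity("the quick brown fox", "the quick brown dog")
```

Two identical documents yield a similarity of 1; documents sharing no shingles yield 0, and the threshold field decides from which value onward a pair counts as a duplicate.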
annotate(): Turn the data.table with duplicates into a file with corpus positions and duplicate annotations, generate the cwb-s-encode command, and execute it if desired.
Duplicates$annotate(s_attribute)
s_attribute: The s-attribute providing the date.
encode(): Add structural attributes to the CWB corpus based on the annotation data that has been generated (the data.table in the annotation field).
Duplicates$encode( exec = FALSE, filenames = list(duplicate = tempfile(), original = tempfile()) )
execA logical value, whether to execute system command.
filenamesList of filenames.
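For orientation, the generated system command is a cwb-s-encode call and could look roughly like the following; the data directory, input file, and attribute name here are hypothetical, and the actual command is assembled by the method from the annotation data and the filenames argument:

```shell
# Hypothetical cwb-s-encode call: read regions ("cpos_start TAB cpos_end
# TAB value" lines) from a file and add them to the indexed corpus as a
# structural attribute (paths and attribute name are illustrative).
cwb-s-encode -d /path/to/corpus/data -f duplicate_regions.tsv -V text_is_duplicate
```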
clone(): The objects of this class are cloneable with this method.
Duplicates$clone(deep = FALSE)
deep: Whether to make a deep clone.

Examples
library(polmineR)
if ("NADIRASZ" %in% corpus()$corpus){
D <- Duplicates$new(
corpus = "NADIRASZ",
char_regex = "[a-zA-ZäöüÄÖÜ]",
p_attribute = "word",
s_attribute = "article_date",
date_preprocessor = NULL,
sample = 50L,
n = 1L,
threshold = 0.6 # default is 0.9
)
article_bundle <- corpus("NADIRASZ") |>
subset(article_date == "2000-01-01") |>
split(s_attribute = "article_id")
D$detect(x = article_bundle, mc = 3L)
# To inspect result
D$duplicates
if (interactive()){
for (i in 1L:nrow(D$duplicates)){
print(i)
corpus("NADIRASZ") %>%
subset(article_id == !!D$duplicates[i][["name"]]) %>%
read() %>%
show()
readline()
corpus("NADIRASZ") %>%
subset(article_id == !!D$duplicates[i][["duplicate_name"]]) %>%
read() %>%
show()
readline()
}
}
}