Duplicates: Detect Duplicates

Duplicates R Documentation

Detect Duplicates

Description

Detect Duplicates

Details

Class for duplicate detection.

The class implements a procedure described by Fritz Kliche, Andre Blessing, Ulrich Heid and Jonathan Sonntag in the paper "The eIdentity Text Exploration Workbench" presented at LREC 2014 (see http://www.lrec-conf.org/proceedings/lrec2014/pdf/332_Paper.pdf).

To detect duplicates, choices are made as follows:

  • If two similar articles have been published on the same day, the shorter article will be considered the duplicate;

  • if two similar articles were published on different days, the article that appeared later will be considered the duplicate.
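
The two rules above can be sketched in base R. This is a minimal illustration only: flag_duplicate() and its arguments are hypothetical and not part of polmineR.misc, and the handling of two documents of equal size on the same day is an assumption.

```r
# Hypothetical helper (not part of polmineR.misc): decide which of two
# similar documents is flagged as the duplicate.
flag_duplicate <- function(name_a, date_a, size_a, name_b, date_b, size_b) {
  date_a <- as.Date(date_a)
  date_b <- as.Date(date_b)
  if (date_a == date_b) {
    # Same day: the shorter document is considered the duplicate.
    # Assumption: on a tie in size, the second document is flagged.
    if (size_a < size_b) name_a else name_b
  } else {
    # Different days: the document that appeared later is the duplicate.
    if (date_a > date_b) name_a else name_b
  }
}

# Same day, "B" is shorter
same_day <- flag_duplicate("A", "2000-01-01", 1200L, "B", "2000-01-01", 800L)
# Different days, "A" appeared later
other_day <- flag_duplicate("A", "2000-01-05", 1200L, "B", "2000-01-03", 2000L)
```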

Different partition_bundle objects can be passed into the detect() method successively. Newly detected duplicates are appended to the field duplicates.

Public fields

corpus

ID of the CWB corpus (derived from partition_bundle).

char_regex

Regular expression defining the characters to keep.

char_count

Count of the characters in the partition_bundle.

n

Number of days before and after the publication date of a document within which other documents are considered for comparison.

p_attribute

the p-attribute used (defaults to "word")

s_attribute

the s-attribute of the date of a text in the corpus

sample

size of the sample of the partition_bundle that the character count is based on

threshold

Minimum similarity value to consider two texts as duplicates.

duplicates

A data.table with documents considered as duplicates.

similarities

a simple_triplet_matrix with similarities of texts

date_preprocessor

function to rework dates if not in the YYYY-MM-DD standard format

annotation

a data.table with corpus positions.

Methods

Public methods


Method new()

Initialize object of class Duplicates.

Usage
Duplicates$new(
  corpus,
  char_regex = "[a-zA-Z]",
  p_attribute = "word",
  s_attribute = "text_date",
  date_preprocessor = NULL,
  sample = 1000L,
  n = 1L,
  threshold = 0.9
)
Arguments
corpus

ID of the CWB corpus that will be explored.

char_regex

a regex defining the characters to keep

p_attribute

The p-attribute to evaluate.

s_attribute

the s-attribute providing the date

date_preprocessor

A function used to preprocess dates as extracted from s_attribute.

sample

number of documents to define a subset of partition_bundle to speed up character count

n

number of days before and after the publication date of a document within which other documents are considered for comparison

threshold

numeric (0 < x < 1), the minimum similarity to qualify two documents as duplicates
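
How char_regex and the character count may interact can be illustrated in base R. The sketch below is not the class's implementation: count_chars() and the sample texts are made up, and sorting the counts by decreasing frequency is an assumption.

```r
# Made-up texts standing in for a sample of documents.
texts <- c("Der Bundestag tagt heute.", "Der Bundesrat tagt morgen!")

# Hypothetical helper: count the characters that match char_regex.
count_chars <- function(texts, char_regex = "[a-zA-Z]") {
  chars <- unlist(strsplit(texts, ""))
  chars <- chars[grepl(char_regex, chars)]  # drop blanks, digits, punctuation
  # Assumption: counts are sorted so that frequent characters come first.
  sort(table(chars), decreasing = TRUE)
}

char_count <- count_chars(texts)

# Select a subset of characters by position, analogous to indexing the
# character count with a vector such as 1:12.
selected_chars <- names(char_count)[1:12]
```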


Method get_comparisons()

Identify documents that will be compared (based on date of documents).

Usage
Duplicates$get_comparisons(
  x,
  reduce = TRUE,
  verbose = FALSE,
  progress = TRUE,
  mc = FALSE
)
Arguments
x

a partition_bundle object defining the documents that will be compared to detect duplicates

reduce

A logical value, whether to drop one half of the matrix.

verbose

logical, whether to be verbose

progress

logical, whether to show progress bar

mc

logical, whether to use multicore
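
The idea of selecting comparisons by date can be sketched in base R as follows. candidate_pairs() and the metadata are hypothetical; the sketch assumes that two documents are compared if their dates are at most n days apart and that reduce keeps each pair only once.

```r
# Made-up document metadata.
docs <- data.frame(
  name = c("a1", "a2", "a3", "a4"),
  date = as.Date(c("2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pairs of documents published within n days.
candidate_pairs <- function(docs, n = 1L, reduce = TRUE) {
  days <- as.integer(docs$date)
  # Absolute distance in days between all pairs of documents.
  day_diff <- abs(outer(days, days, "-"))
  keep <- day_diff <= n
  diag(keep) <- FALSE                          # no self-comparisons
  if (reduce) keep[lower.tri(keep)] <- FALSE   # keep each pair only once
  idx <- which(keep, arr.ind = TRUE)
  data.frame(
    x = docs$name[idx[, "row"]],
    y = docs$name[idx[, "col"]],
    stringsAsFactors = FALSE
  )
}

pairs <- candidate_pairs(docs, n = 1L)
pairs_full <- candidate_pairs(docs, n = 1L, reduce = FALSE)
```

Dropping one half of the symmetric matrix halves the number of comparisons without losing any pair.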


Method similarities_matrix_to_dt()

Turn similarities of documents into a data.table that identifies original document and duplicate.

Usage
Duplicates$similarities_matrix_to_dt(
  x,
  similarities,
  mc = FALSE,
  progress = TRUE,
  verbose = TRUE
)
Arguments
x

a partition_bundle object defining the documents that will be compared to detect duplicates

similarities

A TermDocumentMatrix with cosine similarities.

mc

logical, whether to use multicore

progress

logical, whether to show progress bar

verbose

logical, whether to be verbose


Method detect()

Wrapper that implements the entire workflow for duplicate detection.

Usage
Duplicates$detect(
  x,
  n = 5L,
  character_selection = 1:12,
  how = "coop",
  verbose = TRUE,
  mc = FALSE,
  progress = TRUE
)
Arguments
x

A partition_bundle or subcorpus_bundle object.

n

The number of characters to use for shingling (integer value), passed as argument n into polmineR::ngrams(). Defaults to 5, in line with Kliche et al. 2014: 695.

character_selection

Numeric/integer vector used for indexing $char_count to select the characters to keep. Defaults to 1:12, in line with Kliche et al. 2014: 695.

how

Implementation used to compute similarities - passed into cosine_similarity().

verbose

logical, whether to be verbose

mc

logical, whether to use multicore

progress

logical, whether to show progress bar

Returns

The updated content of slot $duplicates is returned invisibly.
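
Character shingling and cosine similarity, the two building blocks named in the arguments above, can be illustrated with base R. This is a sketch under assumptions, not the code used by detect(): char_shingles() and shingle_cosine() are hypothetical helpers, and the example texts are made up.

```r
# Hypothetical helper: overlapping character n-grams (shingles) of a text,
# after dropping all characters that do not match char_regex.
char_shingles <- function(text, n = 5L, char_regex = "[a-zA-Z]") {
  chars <- strsplit(text, "")[[1]]
  chars <- chars[grepl(char_regex, chars)]
  if (length(chars) < n) return(character(0))
  starts <- seq_len(length(chars) - n + 1L)
  vapply(
    starts,
    function(i) paste(chars[i:(i + n - 1L)], collapse = ""),
    character(1)
  )
}

# Hypothetical helper: cosine similarity of the shingle counts of two texts.
shingle_cosine <- function(a, b, n = 5L) {
  sa <- char_shingles(a, n)
  sb <- char_shingles(b, n)
  vocab <- union(sa, sb)
  # Count vectors over the shared vocabulary of shingles.
  va <- as.numeric(table(factor(sa, levels = vocab)))
  vb <- as.numeric(table(factor(sb, levels = vocab)))
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

a <- "The parliament adopted the budget on Monday."
b <- "The parliament adopted the budget on Monday evening."
sim <- shingle_cosine(a, b)

# Two texts qualify as duplicates if the similarity reaches the threshold.
is_duplicate <- sim >= 0.9
```

If both texts are shorter than n characters, no shingles exist and the sketch returns NaN.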


Method annotate()

Turn the data.table with duplicates into a file with corpus positions and annotation of duplicates, generate the cwb-s-encode command and, if wanted, execute it.

Usage
Duplicates$annotate(s_attribute)
Arguments
s_attribute

the s-attribute providing the date


Method encode()

Add structural attributes to CWB corpus based on the annotation data that has been generated (data.table in field annotation).

Usage
Duplicates$encode(
  exec = FALSE,
  filenames = list(duplicate = tempfile(), original = tempfile())
)
Arguments
exec

A logical value, whether to execute system command.

filenames

List of filenames.
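
What preparing input for cwb-s-encode can look like is sketched below in base R. The annotation table, the data directory and the attribute name text_is_duplicate are placeholders, and the layout of the class's own annotation field may differ. The command is only assembled, not executed.

```r
# Made-up annotation data: corpus positions of regions and a value.
annotation <- data.frame(
  cpos_left = c(0L, 120L, 310L),
  cpos_right = c(119L, 309L, 455L),
  duplicate = c("FALSE", "TRUE", "FALSE"),
  stringsAsFactors = FALSE
)

# cwb-s-encode reads one region per line: start, end and value,
# separated by tabs, without header.
infile <- tempfile(fileext = ".tsv")
write.table(
  annotation, file = infile, sep = "\t",
  quote = FALSE, row.names = FALSE, col.names = FALSE
)

# Assemble the command. Data directory and attribute name are placeholders.
cmd <- paste(
  "cwb-s-encode",
  "-d", shQuote("/path/to/data_dir"),
  "-f", shQuote(infile),
  "-V", "text_is_duplicate"
)
```

Running the command would be a separate step, for example with system(cmd), which corresponds to the choice offered by the exec argument.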


Method clone()

The objects of this class are cloneable with this method.

Usage
Duplicates$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

library(polmineR)

if ("NADIRASZ" %in% corpus()$corpus){
  D <- Duplicates$new(
    corpus = "NADIRASZ",
    char_regex = "[a-zA-ZäöüÄÖÜ]",
    p_attribute = "word",
    s_attribute = "article_date",
    date_preprocessor = NULL,
    sample = 50L,
    n = 1L,
    threshold = 0.6 # default is 0.9
  )

  article_bundle <- corpus("NADIRASZ") |>
    subset(article_date == "2000-01-01") |> 
    split(s_attribute = "article_id")

  D$detect(x = article_bundle, mc = 3L)
  
  # To inspect result
  D$duplicates
  
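  if (interactive()){
    # seq_len() avoids iterating over 1:0 if no duplicates were detected
    for (i in seq_len(nrow(D$duplicates))){
    
      print(i)
      
      # Show the document listed in column "name" ...
      corpus("NADIRASZ") |>
        subset(article_id == !!D$duplicates[i][["name"]]) |>
        read() |>
        show()
        
      readline()
  
      # ... and then the document listed in column "duplicate_name"
      corpus("NADIRASZ") |>
        subset(article_id == !!D$duplicates[i][["duplicate_name"]]) |>
        read() |>
        show()
        
      readline()
    }
  }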
}

PolMine/polmineR.misc documentation built on Nov. 23, 2022, 9:01 p.m.