Duplicates | R Documentation
Detect Duplicates
Class for duplicate detection.
The class implements a procedure described by Fritz Kliche, Andre Blessing, Ulrich Heid and Jonathan Sonntag in the paper "The eIdentity Text Exploration Workbench", presented at LREC 2014 (see http://www.lrec-conf.org/proceedings/lrec2014/pdf/332_Paper.pdf).
To detect duplicates, choices are made as follows: if two similar articles have been published on the same day, the shorter article is considered the duplicate; if two similar articles were published on different days, the article that appeared later is considered the duplicate.
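This tie-breaking rule can be sketched in base R. The function name pick_duplicate and its inputs are hypothetical illustrations, not part of the class API:

```r
# Given two similar articles, decide which one is the duplicate,
# following the rule described above:
# - same publication date: the shorter article is the duplicate
# - different dates: the article published later is the duplicate
pick_duplicate <- function(date_a, date_b, size_a, size_b) {
  if (date_a == date_b) {
    if (size_a <= size_b) "a" else "b"
  } else {
    if (date_a > date_b) "a" else "b"
  }
}

pick_duplicate(as.Date("2000-01-01"), as.Date("2000-01-01"), 120L, 340L) # "a" (shorter)
pick_duplicate(as.Date("2000-01-03"), as.Date("2000-01-01"), 500L, 500L) # "a" (later)
```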
Different partition_bundle objects can be passed into the detect() method successively; duplicates that are newly detected are appended to the field duplicates.
corpus
ID of the CWB corpus (derived from the partition_bundle).

char_regex
Regular expression defining the characters to keep.

char_count
Count of the characters in the partition_bundle.

n
Number of days before and after a document was published.

p_attribute
The p-attribute used (defaults to "word").

s_attribute
The s-attribute of the date of a text in the corpus.

sample
Size of the sample of the partition_bundle that the character count is based on.

threshold
Minimum similarity value to consider two texts as duplicates.

duplicates
A data.table with documents considered as duplicates.

similarities
A simple_triplet_matrix with similarities of texts.

date_preprocessor
Function to rework dates if they are not in the DD-MM-YYYY standard format.

annotation
A data.table with corpus positions.
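As a sketch of what a date_preprocessor function might look like: it receives the raw date strings extracted from the s-attribute and returns them in a standard format. The German-style DD.MM.YYYY input assumed here is purely illustrative, and the ISO output format follows the dates used in the Examples section:

```r
# Hypothetical date_preprocessor: turn raw dates such as "01.03.2000"
# (DD.MM.YYYY) into ISO format as used in the Examples section
preprocess_date <- function(x) {
  format(as.Date(x, format = "%d.%m.%Y"), "%Y-%m-%d")
}

preprocess_date("01.03.2000") # "2000-03-01"
```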
new()
Initialize an object of class Duplicates.

Duplicates$new(
  corpus,
  char_regex = "[a-zA-Z]",
  p_attribute = "word",
  s_attribute = "text_date",
  date_preprocessor = NULL,
  sample = 1000L,
  n = 1L,
  threshold = 0.9
)
corpus
ID of the CWB corpus that will be explored.

char_regex
A regular expression defining the characters to keep.

p_attribute
The p-attribute to evaluate.

s_attribute
The s-attribute providing the date.

date_preprocessor
A function used to preprocess dates as extracted from the s_attribute.

sample
Number of documents sampled from the partition_bundle to speed up the character count.

n
Number of days before and after a document was published.

threshold
Numeric (0 < x < 1), the minimum similarity to qualify two documents as duplicates.
get_comparisons()
Identify documents that will be compared (based on the date of documents).

Duplicates$get_comparisons(
  x,
  reduce = TRUE,
  verbose = FALSE,
  progress = TRUE,
  mc = FALSE
)
x
A partition_bundle object defining the documents that will be compared to detect duplicates.

reduce
A logical value, whether to drop one half of the (symmetric) matrix.

verbose
Logical, whether to be verbose.

progress
Logical, whether to show a progress bar.

mc
Logical, whether to use multicore.
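The date-based selection of comparison pairs can be illustrated in plain R. The function, the sample dates, and the window size below are made-up inputs, not the method's actual implementation:

```r
# Enumerate document pairs whose publication dates are at most n days
# apart; only these pairs need to be compared for similarity
get_pairs <- function(dates, n = 1L) {
  idx <- seq_along(dates)
  pairs <- expand.grid(a = idx, b = idx)
  pairs <- pairs[pairs$a < pairs$b, ]  # drop one half of the matrix (reduce = TRUE)
  keep <- abs(as.integer(dates[pairs$a] - dates[pairs$b])) <= n
  pairs[keep, ]
}

dates <- as.Date(c("2000-01-01", "2000-01-02", "2000-01-05"))
get_pairs(dates, n = 1L) # only the pair (1, 2) is within one day
```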
similarities_matrix_to_dt()
Turn similarities of documents into a data.table that identifies the original document and the duplicate.

Duplicates$similarities_matrix_to_dt(
  x,
  similarities,
  mc = FALSE,
  progress = TRUE,
  verbose = TRUE
)
x
A partition_bundle object defining the documents that will be compared to detect duplicates.

similarities
A TermDocumentMatrix with cosine similarities.

mc
Logical, whether to use multicore.

progress
Logical, whether to show a progress bar.

verbose
Logical, whether to be verbose.
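In spirit, this conversion keeps the cells of a symmetric similarity matrix that exceed the threshold and turns them into rows of a table. A base-R sketch with a dense matrix and a data.frame (the actual method works on a sparse simple_triplet_matrix and returns a data.table):

```r
# Turn a symmetric similarity matrix into a table of candidate
# duplicate pairs whose similarity exceeds the threshold
similarities_to_table <- function(m, threshold = 0.9) {
  hits <- which(upper.tri(m) & m >= threshold, arr.ind = TRUE)
  data.frame(
    name = rownames(m)[hits[, "row"]],
    duplicate_name = colnames(m)[hits[, "col"]],
    similarity = m[hits]
  )
}

m <- matrix(
  c(1, 0.95, 0.2,
    0.95, 1, 0.1,
    0.2, 0.1, 1),
  nrow = 3, dimnames = list(c("a", "b", "c"), c("a", "b", "c"))
)
similarities_to_table(m, threshold = 0.9) # one row: a / b / 0.95
```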
detect()
Wrapper that implements the entire workflow for duplicate detection.

Duplicates$detect(
  x,
  n = 5L,
  character_selection = 1:12,
  how = "coop",
  verbose = TRUE,
  mc = FALSE,
  progress = TRUE
)
x
A partition_bundle or subcorpus_bundle object.

n
The number of characters to use for shingling (an integer value), passed as argument n into polmineR::ngrams(). Defaults to 5, in line with Kliche et al. 2014: 695.

character_selection
Numeric/integer vector used for indexing $char_count to select the characters to keep. Defaults to 1:12, in line with Kliche et al. 2014: 695.

how
Implementation used to compute similarities; passed into cosine_similarity().

verbose
Logical, whether to be verbose.

mc
Logical, whether to use multicore.

progress
Logical, whether to show a progress bar.

The updated content of the slot $duplicates is returned invisibly.
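The core of the workflow — character n-gram shingles compared by cosine similarity, cf. Kliche et al. 2014 — can be sketched in base R. The helper names below are hypothetical; the real implementation uses polmineR::ngrams() for shingling and the cosine_similarity() implementation selected via how:

```r
# n-character shingles of a text, keeping only selected characters
shingles <- function(x, n = 5L) {
  x <- gsub("[^a-zA-Z]", "", x)  # keep selected characters only
  if (nchar(x) < n) return(character())
  substring(x, 1L:(nchar(x) - n + 1L), n:nchar(x))
}

# cosine similarity of two texts based on shingle counts
cosine_sim <- function(a, b, n = 5L) {
  ta <- table(shingles(a, n))
  tb <- table(shingles(b, n))
  vocab <- union(names(ta), names(tb))
  va <- as.numeric(ta[vocab]); va[is.na(va)] <- 0
  vb <- as.numeric(tb[vocab]); vb[is.na(vb)] <- 0
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

cosine_sim("the quick brown fox", "the quick brown fox jumps")
```

Two texts with identical shingle profiles yield a similarity of 1; texts sharing no shingles yield 0. The threshold field then decides which pairs count as duplicates.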
annotate()
Turn the data.table with duplicates into a file with corpus positions and an annotation of duplicates, generate the cwb-s-encode command, and execute it, if wanted.

Duplicates$annotate(s_attribute)

s_attribute
The s-attribute providing the date.
encode()
Add structural attributes to the CWB corpus based on the annotation data that has been generated (the data.table in the field annotation).

Duplicates$encode(
  exec = FALSE,
  filenames = list(duplicate = tempfile(), original = tempfile())
)

exec
A logical value, whether to execute the system command.

filenames
List of filenames.
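Conceptually, this step writes region files (corpus-position start, end, and annotation value per line, tab-separated) and calls the CWB tool cwb-s-encode. A hedged sketch of what the generated command might look like — the corpus positions, the data-directory path, and the attribute name are illustrative; consult the cwb-s-encode manual for the authoritative options:

```r
# Write a minimal region file: cpos_start TAB cpos_end TAB value
regions <- data.frame(
  cpos_left = c(0L, 150L),
  cpos_right = c(99L, 209L),
  value = c("FALSE", "TRUE")  # is the region a duplicate?
)
f <- tempfile()
write.table(
  regions, file = f,
  sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE
)

# Assemble (but do not run) a cwb-s-encode call that would add the
# annotation as a structural attribute (illustrative path and name)
cmd <- sprintf("cwb-s-encode -d /path/to/data_dir -f %s -V article_is_duplicate", f)
cat(cmd, "\n")
```

With exec = TRUE, a method like this would hand the assembled command to system(); with exec = FALSE only the files and the command string are prepared.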
clone()
The objects of this class are cloneable with this method.
Duplicates$clone(deep = FALSE)
deep
Whether to make a deep clone.
Examples

library(polmineR)

if ("NADIRASZ" %in% corpus()$corpus){
  D <- Duplicates$new(
    corpus = "NADIRASZ",
    char_regex = "[a-zA-ZäöüÄÖÜ]",
    p_attribute = "word",
    s_attribute = "article_date",
    date_preprocessor = NULL,
    sample = 50L,
    n = 1L,
    threshold = 0.6 # default is 0.9
  )

  article_bundle <- corpus("NADIRASZ") |>
    subset(article_date == "2000-01-01") |>
    split(s_attribute = "article_id")

  D$detect(x = article_bundle, mc = 3L)

  # To inspect result
  D$duplicates

  if (interactive()){
    for (i in 1L:nrow(D$duplicates)){
      print(i)
      corpus("NADIRASZ") %>%
        subset(article_id == !!D$duplicates[i][["name"]]) %>%
        read() %>%
        show()
      readline()
      corpus("NADIRASZ") %>%
        subset(article_id == !!D$duplicates[i][["duplicate_name"]]) %>%
        read() %>%
        show()
      readline()
    }
  }
}