delete_duplicates: Delete duplicate (or similar) documents from a document term...

View source: R/deduplicate.r

delete_duplicates R Documentation

Delete duplicate (or similar) documents from a document term matrix

Description

Delete duplicate (or similar) documents from a document term matrix. Documents are treated as duplicates if they have high content similarity, occur within a given time distance of each other, and are published by the same source.

Usage

delete_duplicates(
  dtm,
  date_var = NULL,
  hour_window = c(-24, 24),
  group_var = NULL,
  measure = c("cosine", "overlap_pct"),
  similarity = 1,
  keep = "first",
  tf_idf = FALSE,
  dup_csv = NULL,
  verbose = F
)

Arguments

dtm

A quanteda dfm.

date_var

The name of the column in docvars(dtm) that specifies the document date. The values should be of type POSIXlt or POSIXct.

hour_window

A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours.

group_var

Optionally, the name of a column in docvars(dtm) that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared.

measure

The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity) and the asymmetrical measure "overlap_pct" (percentage of term scores in the document that also occur in the other document).

similarity

A threshold for similarity. Documents whose similarity is equal to or higher than this threshold are deleted.

keep

A character string indicating whether to keep the 'first' or 'last' published document of each set of duplicates.

tf_idf

If TRUE, weight the dtm with tf_idf before comparing documents. The original (non-weighted) DTM is returned.

dup_csv

Optionally, a path for writing a csv file with the duplicates edgelist. For each duplicate pair, the file notes whether "from" or "to" is the duplicate, or whether "both" are duplicates (of other documents).

verbose

If TRUE, report progress.
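
As a rough illustration of the two options for measure, the following base R sketch computes both scores for two hypothetical term-count vectors. It is not the package's implementation: the vectors x and y, the helper overlap_pct(), and the reading of "overlap_pct" as the share of a document's term weight that falls on terms also present in the other document are assumptions made for illustration.

```r
# Hypothetical term counts for two documents over the same vocabulary
x <- c(news = 2, vote = 1, rain = 1, goal = 0)
y <- c(news = 1, vote = 1, rain = 0, goal = 2)

# Cosine similarity: symmetrical, so swapping x and y gives the same score
cosine <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cosine             # 0.5

# Assumed reading of "overlap_pct": the share of a's term weight that
# sits on terms also occurring in b (illustrative helper, not package code)
overlap_pct <- function(a, b) sum(a[b > 0]) / sum(a)
overlap_pct(x, y)  # 0.75
overlap_pct(y, x)  # 0.5
```

Because the overlap score depends on the direction of the comparison, a short document contained in a longer one can score high in one direction and low in the other, whereas the cosine score is the same both ways.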

Details

Note that this can also be used to delete "updates" of articles (e.g., on news sites, news agencies). This should be considered if the temporal order of publications is relevant for the analysis.
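
The effect of hour_window can be sketched in base R for a handful of hypothetical timestamps. The dates below and the inclusive treatment of the window boundaries are assumptions made for illustration; the package may handle boundaries differently.

```r
# Hypothetical publication times (not taken from the package data)
dates <- as.POSIXct(c("2023-01-01 08:00:00", "2023-01-01 20:00:00",
                      "2023-01-02 12:00:00", "2023-01-03 09:00:00"),
                    tz = "UTC")
hour_window <- c(-10, 36)

# Hours between the first document and every document (including itself)
hourdiff <- as.numeric(difftime(dates, dates[1], units = "hours"))

# Documents that fall inside the window around the first document
# (boundaries assumed inclusive)
in_window <- hourdiff >= hour_window[1] & hourdiff <= hour_window[2]
in_window
```

Here the first three documents fall inside the window around the first document, while the fourth, published 49 hours later, does not. Only pairs inside the window are candidates for comparison.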

Value

A dtm with the duplicate documents deleted.

Examples

## example with very low similarity threshold (normally not recommended!)
dtm2 = delete_duplicates(rnewsflow_dfm, similarity = 0.5, keep='first', tf_idf = TRUE)

RNewsflow documentation built on May 31, 2023, 6:53 p.m.