tCorpus-cash-deduplicate: Deduplicate documents
In corpustools: Managing, Querying and Analyzing Tokenized Text

tCorpus$deduplicate

R Documentation

Deduplicate documents

Description

Deduplicate documents based on similarity scores. Can be used to filter out identical documents, but also similar documents.

Note that deduplication occurs by reference (see tCorpus_modify_by_reference) unless copy is set to TRUE.

Usage

## R6 method for class tCorpus. Use as tc$method (where tc is a tCorpus object).

deduplicate(feature='token', date_col=NULL, meta_cols=NULL, hour_window=NULL,
            min_docfreq=2, max_docfreq_pct=0.5, measure=c('cosine','overlap_pct'),
            similarity=1, keep=c('first','last', 'random'),
            weight=c('norm_tfidf', 'tfidf', 'termfreq','docfreq'), ngrams=NA,
            print_duplicates=F, copy=F)

Arguments

`feature`	The column name of the feature used for the comparison.
`date_col`	The name of a column with a date vector (in POSIXct). If given together with hour_window, only documents within the given hour_window are compared.
`meta_cols`	A vector with names of columns in the meta data. If given, documents are only considered duplicates if the values of these columns are identical (in addition to having a high similarity score).
`hour_window`	A vector of length 1 or 2. If length is 1, the same value is used for the left and right side of the window. If length is 2, the first and second value determine the left and right side. For example, the value 12 will compare each document to all documents between the previous and next 12 hours, and c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours.
`min_docfreq`	A minimum document frequency for features. This is mostly to lighten the computational load. Default is 2, because terms that occur only once cannot overlap across documents.
`max_docfreq_pct`	A maximum document frequency, as a proportion of all documents, for features. High-frequency terms contain little information for identifying duplicates. Default is 0.5 (i.e. terms that occur in more than 50 percent of documents are ignored).
`lowercase`	If TRUE, make the feature lowercase.
`measure`	The similarity measure. Currently supports cosine similarity (symmetric) and overlap_pct (asymmetric).
`similarity`	The similarity threshold used to determine whether two documents are duplicates. Default is 1, meaning 100 percent identical.
`keep`	Select either 'first', 'last' or 'random'. Determines which document in a set of duplicates is kept. If a date is given, 'first' and 'last' specify whether the earliest or latest document is kept.
`weight`	A weighting scheme for the document-term matrix. Default is 'norm_tfidf': term frequency-inverse document frequency with normalized rows (document length).
`ngrams`	An integer. If given, ngrams of this length are used.
`print_duplicates`	If TRUE, print the ids of duplicates that are deleted.
`verbose`	If TRUE, report progress.
`copy`	If TRUE, the method returns a new tCorpus object instead of deduplicating the current one by reference.
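
The difference between the symmetric cosine measure and the asymmetric overlap_pct measure can be illustrated with a small base R sketch. This is not the corpustools implementation: it uses raw term frequencies rather than the default 'norm_tfidf' weighting, and the helper functions cosine and overlap_pct are hypothetical, with overlap_pct assumed here to be the share of a document's term weight that also occurs in the other document.

```r
# Tokenize two short documents on whitespace
doc_a = strsplit('a b c d e', ' ')[[1]]
doc_b = strsplit('a b c', ' ')[[1]]

# Term-frequency vectors over a shared vocabulary
vocab = union(doc_a, doc_b)
tf_a = as.numeric(table(factor(doc_a, levels = vocab)))
tf_b = as.numeric(table(factor(doc_b, levels = vocab)))

# Hypothetical helpers (assumptions for illustration, not package functions)
cosine = function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
overlap_pct = function(x, y) sum(x[y > 0]) / sum(x)

cosine(tf_a, tf_b)       # same value in both directions
cosine(tf_b, tf_a)
overlap_pct(tf_a, tf_b)  # 3 of the 5 terms in doc_a occur in doc_b
overlap_pct(tf_b, tf_a)  # all 3 terms in doc_b occur in doc_a
```

Under these assumptions, an asymmetric measure can flag a short document as fully contained in a longer one even when the longer document is not a duplicate of the shorter one.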

Examples

d = data.frame(text = c('a b c d e',
                        'e f g h i j k',
                        'a b c'),
               date = as.POSIXct(c('2010-01-01','2010-01-01','2012-01-01')))
tc = create_tcorpus(d)

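tc$meta

## keep = 'first' (the default): with date_col given, the earliest document is kept
dedup = tc$deduplicate(feature='token', date_col = 'date', similarity = 0.8, copy=TRUE)
dedup$meta

## keep = 'last': with date_col given, the latest document is kept
dedup = tc$deduplicate(feature='token', date_col = 'date', similarity = 0.8, keep = 'last',
                       copy=TRUE)
dedup$meta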
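
The calls above pass date_col without hour_window. As a rough illustration of how hour_window, as described under Arguments, limits which documents are compared, the following base R sketch builds the comparison window by hand. It is not the corpustools implementation: the helper in_window, the example timestamps and the inclusive window boundaries are all assumptions made for illustration.

```r
# Hypothetical timestamps: 0, 8 and 30 hours after the first document
dates = as.POSIXct(c('2010-01-01 00:00', '2010-01-01 08:00', '2010-01-02 06:00'),
                   tz = 'UTC')

# Hypothetical helper: TRUE where the column document falls inside the
# window around the row document
in_window = function(dates, hour_window) {
  # Length 1: same value on the left and right side, e.g. 12 becomes c(-12, 12)
  if (length(hour_window) == 1) hour_window = c(-abs(hour_window), abs(hour_window))
  hours = as.numeric(dates) / 3600
  # Hours from each row document to each column document
  diff_hours = outer(hours, hours, function(from, to) to - from)
  # Assumption: both ends of the window are inclusive
  diff_hours >= hour_window[1] & diff_hours <= hour_window[2]
}

in_window(dates, 12)
in_window(dates, c(-10, 36))
```

In this sketch, a window of 12 pairs only the first two documents with each other. With c(-10, 36) the window is asymmetric: the first document is compared with the third (30 hours later), but the third is not compared with the first (30 hours earlier).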

corpustools documentation built on May 31, 2023, 8:45 p.m.