remove_similar: Removes similar documents based on text similarity


View source: R/deduplication_functions.R

Description

Removes documents from a data frame when their distance scores show that they are highly similar to other documents in the same data frame.

Usage

remove_similar(data, distance_data, id_column, distance_column, cutoff)

Arguments

data

the data frame containing all documents

distance_data

a data frame with document identification and distance information

id_column

the name or index of the column in the distance dataset that contains document IDs

distance_column

the name or index of the column in the distance dataset that contains distance scores

cutoff

the maximum distance at which documents should be considered duplicates

Value

the documents data frame with duplicate documents removed
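
Examples

The page gives no example, so the following is a minimal base R sketch of the behaviour described above rather than the package's own code. The helper remove_similar_sketch, the id column in docs, and the column names doc_id and distance are all hypothetical. It also assumes that document IDs are matched against an id column in data and that a distance equal to the cutoff counts as a duplicate; the actual function may match IDs differently.

```r
# Hypothetical stand-in for illustration only, NOT the synthesisr implementation.
remove_similar_sketch <- function(data, distance_data, id_column,
                                  distance_column, cutoff) {
  # Flag rows of the distance data whose score is at or below the cutoff
  is_duplicate <- distance_data[[distance_column]] <= cutoff
  duplicate_ids <- unique(distance_data[[id_column]][is_duplicate])
  # Assumption: `data` has an `id` column holding the same document IDs
  data[!(data$id %in% duplicate_ids), , drop = FALSE]
}

# Toy documents; document 2 is a near-copy of document 1
docs <- data.frame(
  id = 1:4,
  title = c("Bird diversity in forests", "Bird diversity in forest",
            "Fish ecology", "Soil microbes"),
  stringsAsFactors = FALSE
)

# Toy distance scores: one row per document flagged against an earlier one
distances <- data.frame(
  doc_id = c(2, 3, 4),
  distance = c(1, 20, 18)
)

deduplicated <- remove_similar_sketch(docs, distances,
                                      id_column = "doc_id",
                                      distance_column = "distance",
                                      cutoff = 2)
print(deduplicated$id)
```

With these toy inputs only document 2 falls within the cutoff, so the sketch drops it and keeps the other three. A call to the real function would take the same five arguments shown under Usage.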


elizagrames/synthesisr documentation built on May 26, 2019, 10:34 a.m.