dupes_find: Identify duplicate entries


Description

For a tabular set of publication records, identifies potential sets of duplicate entries and labels each set with a unique identifier.

Usage

dupes_find(x, match_cols, approx_match = FALSE, string_dist = 5,
  min_length = 10, simplify_match = TRUE)

Arguments

x

The dataset in which duplicate entries will be identified

match_cols

Column(s) that will be used to search for duplicate records

approx_match

Logical; whether to search for duplicates using string distances (TRUE) or exact values (FALSE)

string_dist

When using approximate matching, the string distance cutoff at which records will be assumed duplicated

min_length

The minimum length that the combined matching string built from match_cols must have for a record to be considered for matching

simplify_match

Whether to perform duplicate searches after removing all non-alphanumeric characters from the reference string generated from match_cols
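
How these arguments interact can be sketched in base R. This is only an illustration of the ideas described above, not the package's implementation: the data frame recs, its column names, and the variables key, key_simple, eligible, exact_dupe and approx_dupe are hypothetical, Levenshtein distance from utils::adist() stands in for whatever string distance dupes_find actually uses, and treating the string_dist and min_length cutoffs as inclusive is an assumption.

```r
# Hypothetical records; column names and values are illustrative only.
recs <- data.frame(
  title = c("Deep Learning: A Review", "Deep learning - a review", "Short"),
  year = c("2015", "2015", "2001"),
  stringsAsFactors = FALSE
)

# Combine the chosen columns into one reference string per record
# (the role played by match_cols).
key <- paste(recs$title, recs$year)

# simplify_match-style step: drop every non-alphanumeric character.
# Case is left unchanged here; whether dupes_find also changes case is unknown.
key_simple <- gsub("[^[:alnum:]]", "", key)

# min_length-style step: only strings of at least 10 characters are eligible
# (an inclusive cutoff is assumed).
eligible <- nchar(key_simple) >= 10

# Exact matching (approx_match = FALSE): eligible keys that occur more than once.
exact_dupe <- eligible &
  (duplicated(key_simple) | duplicated(key_simple, fromLast = TRUE))

# Approximate matching (approx_match = TRUE): pairwise Levenshtein distances,
# compared against a cutoff of 5 (an inclusive cutoff is assumed).
d <- utils::adist(key_simple)
approx_dupe <- eligible[1] && eligible[2] && d[1, 2] <= 5
```

In this sketch the first two records differ only in letter case once punctuation and spaces are removed, so exact matching does not pair them while approximate matching does. The third record's simplified string is shorter than 10 characters and is excluded from matching altogether.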

Value

An updated version of x, with one column specifying the final string used to search for duplicates (matching_col) and another column containing unique identifiers for each set of duplicates (match_ID).
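
A returned data frame of this shape could be inspected with base R as sketched below. The data frame out is a hand-built mock, not real dupes_find output: its values are invented, and it assumes that records in the same duplicate set share a match_ID value. How the package labels records without duplicates is not stated here, so the sketch simply counts how often each identifier occurs.

```r
# Mock output with the two added columns; all values are invented.
out <- data.frame(
  matching_col = c("smith2010trialofx", "smith2010trialofx",
                   "jones2012cohortstudy"),
  match_ID = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# Count how many records carry each identifier.
set_sizes <- table(out$match_ID)

# Identifiers shared by more than one record mark a set of duplicates.
dupe_ids <- names(set_sizes)[as.vector(set_sizes) > 1]

# Keep only the records that belong to such a set.
dupe_rows <- out[as.character(out$match_ID) %in% dupe_ids, ]
```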

Examples

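## Not run: 
# Stack form_mm_recs on itself so that every record appears twice
test <- rbind(form_mm_recs, form_mm_recs)
# Search for duplicates using columns 1 and 3 as match_cols
test <- dupes_find(test, c(1, 3))
dupes <- dupes_return(test)
out <- dupes_rm(test)

## End(Not run)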

graggsd/sysreviewR documentation built on May 16, 2019, 2:52 a.m.