View source: R/deduplication_functions.R
find_duplicates | R Documentation |
Identifies duplicate bibliographic entries using different duplicate detection methods.
find_duplicates(
  data,
  method = "exact",
  group_by,
  threshold,
  to_lower = FALSE,
  rm_punctuation = FALSE
)
data |
A character vector containing duplicate bibliographic entries. |
method |
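A string indicating how matching should be calculated. Either
"exact" for exact matching (the default), or the name of a function for
calculating string distance, such as "string_osa". |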
group_by |
An optional vector, data.frame or list containing data to use
as 'grouping' variables; that is, categories within which duplicates should
be sought. Defaults to NULL, in which case all entries are compared against
all others. Ignored if method = "exact". |
threshold |
Numeric: the cutoff threshold for deciding if two strings
are duplicates. Sensible values depend on the method chosen. |
to_lower |
Logical: Should all entries be converted to lower case before
calculating string distance? Defaults to FALSE. |
rm_punctuation |
Logical: Should punctuation be removed before
calculating string distance? Defaults to FALSE. |
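The effect of the to_lower, rm_punctuation and threshold arguments can be illustrated with a minimal base R sketch. It is not the package's implementation: normalise_titles is a hypothetical helper, and utils::adist (Levenshtein distance) is used here as a stand-in for whichever string distance function is passed to method. The threshold of 5 is likewise an arbitrary value chosen for illustration.

```r
# Hypothetical helper, for illustration only (not part of the package).
normalise_titles <- function(x, to_lower = FALSE, rm_punctuation = FALSE) {
  if (to_lower) {
    x <- tolower(x)
  }
  if (rm_punctuation) {
    x <- gsub("[[:punct:]]", "", x)
  }
  x
}

titles <- c(
  "EviAtlas: a tool for visualising evidence synthesis databases",
  "eviatlas:tool for visualizing evidence synthesis databases."
)

# Edit distance on the raw strings.
raw_dist <- utils::adist(titles[1], titles[2])[1, 1]

# Edit distance after lower-casing and stripping punctuation.
clean <- normalise_titles(titles, to_lower = TRUE, rm_punctuation = TRUE)
clean_dist <- utils::adist(clean[1], clean[2])[1, 1]

# Two strings count as duplicates when their distance does not exceed
# the cutoff (assumed rule; 5 is an illustrative value).
threshold <- 5
is_duplicate <- clean_dist <= threshold
```

Normalising first removes differences in case and punctuation that would otherwise inflate the distance, so a smaller threshold is enough to catch near-identical titles.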
Returns a vector of duplicate matches, with attributes listing the methods used.
See string_ or fuzz_ for suitable functions to pass to method; see extract_unique_references and deduplicate for higher-level functions.
# example data: the last two titles are near-duplicates of the first two
my_df <- data.frame(
  title = c(
    "EviAtlas: a tool for visualising evidence synthesis databases",
    "revtools: An R package to support article screening for evidence synthesis",
    "An automated approach to identifying search terms for systematic reviews",
    "Reproducible, flexible and high-throughput data extraction from primary literature",
    "eviatlas:tool for visualizing evidence synthesis databases.",
    "REVTOOLS a package to support article-screening for evidence synthsis"
  ),
  year = c("2019", "2019", "2019", "2019", NA, NA),
  authors = c("Haddaway et al", "Westgate",
              "Grames et al", "Pick et al", NA, NA),
  stringsAsFactors = FALSE
)
# run deduplication
dups <- find_duplicates(
  my_df$title,
  method = "string_osa",
  rm_punctuation = TRUE,
  to_lower = TRUE
)

# keep one row per group of matched titles
extract_unique_references(my_df, matches = dups)

# or, in one line:
deduplicate(my_df, "title",
  method = "string_osa",
  rm_punctuation = TRUE,
  to_lower = TRUE)
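Before removing anything, it can be useful to check which entries were grouped together. The base R sketch below assumes, without confirmation from this page, that the match vector holds one integer per entry and that entries judged duplicates share the same integer. The matches values are made up for illustration and are not output from find_duplicates.

```r
# Toy titles and a hypothetical match vector (assumed format:
# one integer per entry, shared by entries judged duplicates).
titles <- c("Title A", "Title B", "Title C", "title a")
matches <- c(1, 2, 3, 1)

# List the entries that fall into each match group.
groups <- split(titles, matches)

# Keep only groups with more than one member, i.e. the suspected duplicates.
suspected <- groups[lengths(groups) > 1]

# Keep the first entry from each group.
unique_titles <- titles[!duplicated(matches)]
```

Reviewing suspected by eye is a cheap check that the chosen method and threshold are not merging distinct references.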