find_duplicates: Identify potential duplicates within a data.frame

View source: R/find_duplicates.R

Description

Identify potential duplicates within a data.frame.
Usage

find_duplicates(data, match_variable, group_variables,
  match_function, method, threshold,
  to_lower = FALSE, remove_punctuation = FALSE)
Arguments

data
a data.frame containing bibliographic information.
match_variable |
a length-1 integer or string giving the column in which duplicates should be sought. Defaults to doi where available, followed by title. If neither is found the function will fail.
group_variables |
an optional vector listing the columns to use as grouping variables; that is, categories within which duplicates should be sought (see 'Note'). Optionally NULL to compare all entries against one another.
match_function |
a function to calculate dissimilarity between strings. Defaults to "exact" if DOIs are available or "stringdist" otherwise.
method |
the 'method' option passed to match_function; see the documentation of that function for available values.
threshold |
an upper limit above which similar articles are not recognized as duplicates. Defaults to 5 for "stringdist" and 0.1 for "fuzzdist". Ignored if match_function is "exact".
to_lower |
logical: should text be made lower case prior to searching? Defaults to FALSE.
remove_punctuation |
logical: should punctuation be removed prior to searching? Defaults to FALSE.
Value

An integer vector, in which entries with the same integer have been identified as duplicates by the selected algorithm.
Note

find_duplicates runs a while loop: it checks the first entry of data against every other entry for potential duplicates, excludes any matches from further consideration, then continues until all entries have been checked. Only the data and match_variable arguments are required; the remaining arguments affect how duplicates are identified and can strongly influence the speed of the search.

The group_variables argument specifies variables containing supplementary information that can reduce the number of entries to be searched. For example, you might want to match article titles only when they occur within the same journal, or in the same year. The more variables you specify, the fewer pairs of entries need to be tested to locate duplicates, greatly increasing the speed of the algorithm. Conversely, if no variables are specified, each entry is checked against every other entry that has not yet been excluded. This is fine for small datasets, but massively increases computation time for large ones.
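As an illustration of grouping, the call below restricts title comparisons to entries that share a journal and year. This is a sketch, not taken from the package examples: it assumes a data.frame x (such as one returned by read_bibliography) with columns named title, journal and year, and uses "osa", the default method of the stringdist package.

```r
# Hypothetical call: only compare titles within the same journal and year.
# Column names 'title', 'journal' and 'year' are assumed to exist in x.
matches <- find_duplicates(
  data = x,
  match_variable = "title",
  group_variables = c("journal", "year"),
  match_function = "stringdist",
  method = "osa",
  threshold = 5
)
```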
Missing values are handled in two different ways. Entries that are NA for match_variable are always labelled as unique and are never checked for duplicates against the rest of the dataset. Entries of group_variables that are NA, by contrast, are included in every comparison.
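A toy dataset (invented here, not from the package) makes the two NA behaviours concrete:

```r
# Row 3 has NA in the match variable, so it would always receive its own
# group. Row 2 has NA in the grouping variable 'year', so it is still
# compared against every year group and can match row 1.
toy <- data.frame(
  title = c("A study of birds", "A study of birds", NA),
  year  = c(2001, NA, 2001)
)
# find_duplicates(toy, match_variable = "title",
#   group_variables = "year", match_function = "exact")
```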
find_duplicates contains three 'built-in' methods for string matching. "stringdist" calls the function of the same name from the package stringdist; "fuzzdist" likewise calls the revtools function of that name, which is based on the Python library fuzzywuzzy; and "exact" simply searches for exact matches. In principle you can supply any function for string matching, so long as it accepts the arguments a, b and method (see the stringdist documentation for details) and returns a measure of distance (i.e. not similarity).
Finally, to_lower and remove_punctuation specify whether to transform the text prior to searching for duplicates.
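For example, normalizing case and punctuation lets "exact" matching catch near-identical titles. This sketch assumes the data.frame x from the Examples section:

```r
# Treat titles differing only in case or punctuation as exact matches.
matches <- find_duplicates(
  data = x,
  match_variable = "title",
  match_function = "exact",
  to_lower = TRUE,
  remove_punctuation = TRUE
)
```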
See Also

screen_duplicates and extract_unique_references for manual and automated screening (respectively) of results from this function.
Examples

# import data
file_location <- system.file(
  "extdata",
  "avian_ecology_bibliography.ris",
  package = "revtools")
x <- read_bibliography(file_location)

# generate then locate some 'fake' duplicates
x_duplicated <- rbind(x, x[1:5,])
x_check <- find_duplicates(x_duplicated)
# returns a vector of potential matches
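The returned vector can be inspected with base R; the snippet below assumes x_check is an integer vector of group IDs as described under 'Value':

```r
# Groups containing more than one row are the flagged duplicate sets.
counts <- table(x_check)
counts[counts > 1]
# Rows marked as repeats of an earlier entry:
which(duplicated(x_check))
```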