find_duplicates: Identify potential duplicates within a data.frame

View source: R/find_duplicates.R

Description

Identify potential duplicates within a data.frame.
Usage

find_duplicates(data, match_variable, group_variables,
  match_function, method, threshold,
  to_lower = FALSE, remove_punctuation = FALSE)
Arguments

data
a data.frame containing bibliographic information.
match_variable |
a length-1 integer or string giving the column in which duplicates should be sought. Defaults to doi where available, followed by title. If neither is found the function will fail.
group_variables |
an optional vector listing the columns to use as grouping variables; that is, categories within which duplicates should be sought (see 'Note'). Optionally NULL to compare all entries against one another.
match_function |
a function to calculate dissimilarity between strings. Defaults to "exact" if DOIs are available or "stringdist" otherwise.
method |
the 'method' option passed to match_function; see the documentation of that function for available values.
threshold |
an upper limit above which similar articles are not recognized as duplicates. Defaults to 5 for "stringdist" and 0.1 for "fuzzdist". Ignored if match_function is "exact".
to_lower |
logical: should text be made lower case prior to searching? Defaults to FALSE.
remove_punctuation |
logical: should punctuation be removed prior to searching? Defaults to FALSE.
Value

An integer vector, in which entries with the same integer have been identified as duplicates by the selected algorithm.
Note

find_duplicates runs a while loop: it checks the first entry of data against every other entry for potential duplicates, excludes any matches from further consideration, then continues until all entries have been checked. Only the data and match_variable arguments are required; the remaining arguments affect how duplicates are identified and can strongly influence the speed of the search.

The group_variables argument specifies variables containing supplementary information that can reduce the number of entries to be searched. For example, you might want to match article titles only when they occur within the same journal, or in the same year. The more variables you specify, the fewer pairs of entries need to be tested to locate duplicates, greatly increasing the speed of the algorithm. Conversely, if no variables are specified, each entry is checked against every other entry that has not yet been excluded. This is fine for small datasets, but massively increases computation time for large ones.
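As an illustration of grouping, the call below restricts title comparisons to entries that share a journal and year. This is a sketch, not taken from the package examples: it assumes a data.frame x (such as one returned by read_bibliography) with columns named title, journal and year, and uses "osa", the default method of the stringdist package.

```r
# Hypothetical call: only compare titles within the same journal and year.
# Column names 'title', 'journal' and 'year' are assumed to exist in x.
matches <- find_duplicates(
  data = x,
  match_variable = "title",
  group_variables = c("journal", "year"),
  match_function = "stringdist",
  method = "osa",
  threshold = 5
)
```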
Missing values are handled in two different ways. Entries that are NA for match_variable are always labelled as unique and are never checked for duplicates against the rest of the dataset. Entries of group_variables that are NA, by contrast, are included in every comparison.
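A toy dataset (invented here, not from the package) makes the two NA behaviours concrete:

```r
# Row 3 has NA in the match variable, so it would always receive its own
# group. Row 2 has NA in the grouping variable 'year', so it is still
# compared against every year group and can match row 1.
toy <- data.frame(
  title = c("A study of birds", "A study of birds", NA),
  year  = c(2001, NA, 2001)
)
# find_duplicates(toy, match_variable = "title",
#   group_variables = "year", match_function = "exact")
```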
find_duplicates contains three 'built-in' methods for string matching. "stringdist" calls the function of the same name from the package stringdist; "fuzzdist" likewise calls the revtools function of that name, which is based on the Python library fuzzywuzzy; and "exact" simply searches for exact matches. In principle you can supply any function for string matching, so long as it accepts the arguments a, b and method (see the stringdist documentation for details) and returns a measure of distance (i.e. not similarity).
Finally, to_lower and remove_punctuation specify whether to transform the text prior to searching for duplicates.
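For example, normalizing case and punctuation lets "exact" matching catch near-identical titles. This sketch assumes the data.frame x from the Examples section:

```r
# Treat titles differing only in case or punctuation as exact matches.
matches <- find_duplicates(
  data = x,
  match_variable = "title",
  match_function = "exact",
  to_lower = TRUE,
  remove_punctuation = TRUE
)
```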
See Also

screen_duplicates and extract_unique_references for manual and automated screening (respectively) of results from this function.
Examples

# import data
file_location <- system.file(
  "extdata",
  "avian_ecology_bibliography.ris",
  package = "revtools")
x <- read_bibliography(file_location)

# generate then locate some 'fake' duplicates
x_duplicated <- rbind(x, x[1:5,])
x_check <- find_duplicates(x_duplicated)
# returns a vector of potential matches
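The returned vector can be inspected with base R; the snippet below assumes x_check is an integer vector of group IDs as described under 'Value':

```r
# Groups containing more than one row are the flagged duplicate sets.
counts <- table(x_check)
counts[counts > 1]
# Rows marked as repeats of an earlier entry:
which(duplicated(x_check))
```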