find_duplicates: Locate duplicated information within a data.frame



Description

Identify potential duplicates within a data.frame.

Usage

find_duplicates(data, match_variable, group_variables,
  match_function, method, threshold,
  to_lower = FALSE, remove_punctuation = FALSE)

Arguments

data

a data.frame containing data to be matched

match_variable

a length-1 integer or string naming the column in which duplicates should be sought. Defaults to doi where available, followed by title. If neither is found, the function will fail.

group_variables

an optional vector listing the columns to use as grouping variables; that is, categories within which duplicates should be sought (see 'Note'). Optionally NULL to compare all entries against one another.

match_function

a function to calculate dissimilarity between strings. Defaults to "exact" if DOIs are available or "stringdist" otherwise.

method

the required 'method' option that corresponds with match_function. Defaults to NULL if match_function is "exact", "osa" for match_function == "stringdist", or "fuzz_m_ratio" for match_function == "fuzzdist".

threshold

an upper limit on dissimilarity: pairs of articles whose distance exceeds this value are not recognized as duplicates. Defaults to 5 for stringdist and 0.1 for fuzzdist. Ignored if match_function == "exact".

to_lower

logical: should text be made lower case prior to searching? Defaults to FALSE.

remove_punctuation

logical: should punctuation be removed prior to searching? Defaults to FALSE.

Value

an integer vector, in which entries with the same integer have been selected as duplicates by the selected algorithm.
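As a rough illustration of how such a vector can be read, the sketch below uses a made-up result for five entries (the values are invented for the example, not produced by revtools) and base R only.

```r
# Made-up grouping vector: entries 1 and 4 share a label, as do 2 and 5
groups <- c(1L, 2L, 3L, 1L, 2L)

# Positions flagged as repeats of an earlier entry
repeats <- which(duplicated(groups))

# Row indices collected per group label
members <- split(seq_along(groups), groups)

# Number of entries in each group; counts above 1 indicate duplicates
group_sizes <- table(groups)
```

Here repeats holds the positions of the second and later members of each group, while members lists every row index under its group label.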

Note

find_duplicates runs a while loop. It starts by checking the first entry of data against every other entry for potential duplicates. If any matches are found, those entries are excluded from further consideration. The loop then continues with the next unchecked entry until all entries have been checked. To work, this function requires that the data and match_variable arguments be specified. The remaining arguments affect how duplicates are identified, and can also strongly influence how quickly the function runs.
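The loop described above can be sketched in base R as follows. This is a minimal illustration assuming exact matching only; greedy_groups is a hypothetical helper written for this example and is not the revtools source.

```r
# Hypothetical helper: greedy grouping by exact match (not the revtools code)
greedy_groups <- function(x) {
  group <- rep(NA_integer_, length(x))   # NA means "not yet checked"
  next_id <- 1L
  while (anyNA(group)) {
    i <- which(is.na(group))[1]          # first entry not yet assigned
    group[i] <- next_id
    if (!is.na(x[i])) {
      # compare entry i only against entries that are still unchecked
      unchecked <- which(is.na(group))
      matches <- unchecked[!is.na(x[unchecked]) & x[unchecked] == x[i]]
      group[matches] <- next_id          # matched entries leave the pool
    }
    next_id <- next_id + 1L
  }
  group
}

titles <- c("a", "b", "a", NA, "b", NA)
result <- greedy_groups(titles)
```

In this sketch, entries with a missing value each receive their own label and are never compared, mirroring the behaviour described for match_variable.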

The argument group_variables specifies variables that contain supplementary information that can reduce the number of entries that need to be searched. For example, you might want to match article titles only if they occur within the same journal, or in the same year. The more variables you specify, the fewer pairs of entries have to be tested to locate duplicates, which greatly increases the speed of the algorithm. Conversely, if no variables are specified, then each entry is checked against every other entry that has yet to be excluded from the dataset. This is fine for small datasets, but massively increases computation time for large datasets.
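To give a feel for the saving, the sketch below counts the maximum number of pairwise comparisons with and without a grouping variable. The data and column names are invented for illustration, and the counts are upper bounds, since entries excluded early are not compared again.

```r
# Toy data; titles and years are invented for this example
refs <- data.frame(
  title = c("A", "A", "B", "C", "C", "D"),
  year  = c(2001, 2001, 2001, 2002, 2002, 2003)
)

# Number of distinct pairs among n entries
n_pairs <- function(n) n * (n - 1) / 2

pairs_ungrouped <- n_pairs(nrow(refs))             # every entry vs every other
pairs_by_year   <- sum(n_pairs(table(refs$year)))  # only within the same year
```

With these six rows, grouping by year cuts the worst case from 15 comparisons to 4. In revtools the equivalent request would be along the lines of group_variables = "year" in the call to find_duplicates.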

Missing values are handled differently depending on the argument. Entries that are NA for match_variable are always labelled as unique values, and are not checked for duplicates against the rest of the dataset. However, entries of group_variables that are NA are included in every comparison.

find_duplicates contains three 'built-in' methods for string matching. "stringdist" calls the function of the same name from the package stringdist; likewise, "fuzzdist" calls the function of the same name in revtools, which is based on the Python library fuzzywuzzy. "exact" simply searches for exact matches. In principle you could call any function for string matching, so long as it accepts the arguments a, b and method (see documentation on stringdist for details), and returns a measure of distance (i.e. not similarity).
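A custom matching function of the kind described above might look like the following sketch. scaled_distance is a hypothetical name; it assumes that a is a single string and b a character vector, and it accepts method only to satisfy the expected signature. It uses adist from the base utils package rather than stringdist.

```r
# Hypothetical custom matcher: edit distance scaled by the longer string.
# Returns a distance (0 = identical), not a similarity.
scaled_distance <- function(a, b, method = NULL) {
  d <- as.vector(utils::adist(a, b))   # generalized Levenshtein distance
  d / pmax(nchar(a), nchar(b))         # rescale to the range 0-1
}

d <- scaled_distance("kitten", c("kitten", "sitting"))
```

Because the result lies between 0 and 1, a threshold on that scale would be needed if such a function were supplied as match_function. How find_duplicates passes method to a user-supplied function is an assumption here and should be checked against the revtools source.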

Finally, to_lower and remove_punctuation specify whether to transform the text prior to searching for duplicates.
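The effect of these two transformations can be approximated in base R as below. This is a sketch of the general idea using tolower and gsub; the exact cleaning rules applied inside revtools may differ.

```r
# Two made-up titles that differ only in case and punctuation
titles <- c("Bird Song: A Review.", "bird song a review")

lowered <- tolower(titles)                  # analogous to to_lower = TRUE
cleaned <- gsub("[[:punct:]]", "", lowered) # analogous to remove_punctuation = TRUE

same_raw     <- titles[1] == titles[2]      # compared as given
same_cleaned <- cleaned[1] == cleaned[2]    # compared after cleaning
```

The titles differ as given but become identical once cleaned, which is why these options can matter when match_function is "exact".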

See Also

screen_duplicates and extract_unique_references for manual and automated screening (respectively) of results from this function.

Examples

# import data
file_location <- system.file(
  "extdata",
  "avian_ecology_bibliography.ris",
  package = "revtools")
x <- read_bibliography(file_location)

# generate then locate some 'fake' duplicates
x_duplicated <- rbind(x, x[1:5,])
x_check <- find_duplicates(x_duplicated)
# returns a vector of potential matches


revtools documentation built on Jan. 8, 2020, 5:10 p.m.