find_potential_dups: Identify potential duplicates based on title and year

View source: R/find_potential_dups.R

find_potential_dups    R Documentation

Identify potential duplicates based on title and year

Description

Identify potential duplicates based on title and year

Usage

find_potential_dups(
  CitDat,
  minSimilarity = 0.6,
  potDupAfterObvDup = TRUE,
  maxNumberOfComp = 1e+06,
  quiet = FALSE
)

Arguments

CitDat

A dataframe/tibble returned by find_obvious_dups or handle_obvious_dups.

minSimilarity

Minimum similarity (a value between 0 and 1) for two clean_titles to count as potential duplicates. Default is 0.6. (TO DO)
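To give a feel for what a threshold of 0.6 means, here is a minimal sketch of one common similarity measure, a normalized Levenshtein similarity built on base R's utils::adist(). This is an assumption for illustration only: this page does not state which measure find_potential_dups() uses, and title_similarity() as well as both example titles are hypothetical.

```r
# Hypothetical helper (not part of CitaviR): normalized Levenshtein similarity.
# 1 means identical strings, 0 means nothing in common.
title_similarity <- function(a, b) {
  # utils::adist() returns the edit distance between the two strings
  d <- as.numeric(utils::adist(a, b))
  1 - d / max(nchar(a), nchar(b))
}

# Two made-up clean_titles that differ by one substituted character
t1 <- "a study of duplicate detection"
t2 <- "a study of duplicate detektion"

sim <- title_similarity(t1, t2)
sim >= 0.6  # would this pair reach the default minSimilarity?
```

Under this measure, a single typo in a 30-character title leaves the similarity far above 0.6, so lowering or raising minSimilarity mainly affects pairs that differ in several places.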

potDupAfterObvDup

If TRUE (default), the newly created column pot_dup_id is moved right next to the obv_dup_id column.
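The column move described here can be sketched in base R as follows. The data frame and its values are made up; only the column names obv_dup_id and pot_dup_id come from this page, and the code is not taken from the package source.

```r
# Made-up data frame; values are purely illustrative
df <- data.frame(
  clean_title_id = c("a", "b"),
  obv_dup_id     = c("x", "x"),
  other_column   = c(1, 2),
  pot_dup_id     = c("p", "p")
)

# Move pot_dup_id to the position directly after obv_dup_id
cols <- setdiff(names(df), "pot_dup_id")   # all columns except pot_dup_id
pos  <- match("obv_dup_id", cols)          # position of obv_dup_id
df   <- df[, append(cols, "pot_dup_id", after = pos)]

names(df)
```

With dplyr (version 1.0.0 or later) the same reordering can be written as dplyr::relocate(df, pot_dup_id, .after = obv_dup_id).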

maxNumberOfComp

Maximum number of clean_title similarity calculations to be made. Default is 1,000,000 (1e+06), which corresponds to roughly 1,414 clean_titles. TO DO: Document while-loop.
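The figure of about 1,414 clean_titles is consistent with comparing every pair of distinct titles once, which takes n(n-1)/2 calculations for n titles. The sketch below checks that arithmetic. The pairwise assumption is an inference from the numbers, not something this page states, and n_pairs() and max_titles() are hypothetical helpers.

```r
# Number of unordered pairs among n titles (assumes one comparison per pair)
n_pairs <- function(n) n * (n - 1) / 2

n_pairs(1414)  # 998991, just under the default limit of 1e+06
n_pairs(1415)  # 1000405, just over the default limit

# Largest n whose pair count stays within a given limit,
# from solving n * (n - 1) / 2 <= limit for n
max_titles <- function(limit) floor((1 + sqrt(1 + 8 * limit)) / 2)

max_titles(1e6)
```

Because the pair count grows quadratically, doubling the number of titles roughly quadruples the number of calculations.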

quiet

If TRUE, all output will be suppressed. Default is FALSE.

Details

Lifecycle: maturing.
Currently, this only works for files that were generated while Citavi was set to "English", so that the column names are "Short Title", etc.

Value

A tibble containing one new column: pot_dup_id.

Examples

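library(CitaviR)
library(magrittr) # provides %>% in case it is not already attached

# load the example file, then flag obvious and potential duplicates
example_path <- example_file("3dupsin5refs/3dupsin5refs.ctv6")
CitDat <- read_Citavi_ctv6(example_path) %>%
  find_obvious_dups() %>%
  find_potential_dups()

# compare the ID columns side by side
CitDat %>%
  dplyr::select(clean_title_id, obv_dup_id, pot_dup_id)

# check similarity yourself - it's a single typo:
CitDat %>%
  dplyr::select(clean_title)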


SchmidtPaul/CitaviR documentation built on Jan. 31, 2023, 5 a.m.