find_potential_dups: Identify potential duplicates based on title and year
In SchmidtPaul/CitaviR: A set of tools for dealing with Citavi data

find_potential_dups

R Documentation

Description

Identifies potential duplicates based on title and year and records them in a new column, `pot_dup_id`.

Usage

find_potential_dups(
  CitDat,
  minSimilarity = 0.6,
  potDupAfterObvDup = TRUE,
  maxNumberOfComp = 1e+06,
  quiet = FALSE
)

Arguments

`CitDat`	A dataframe/tibble returned by `find_obvious_dups` or `handle_obvious_dups`.
`minSimilarity`	Minimum similarity (between 0 and 1). Default is 0.6. (TO DO)
`potDupAfterObvDup`	If `TRUE` (default), the newly created column `pot_dup_id` is placed immediately after the `obv_dup_id` column.
`maxNumberOfComp`	Maximum number of `clean_title` similarity calculations to perform. Default is 1,000,000, which corresponds to roughly 1,414 `clean_title`s. TO DO: Document while-loop.
`quiet`	If `TRUE`, all output is suppressed. Default is `FALSE`.
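
The relationship between `maxNumberOfComp` and the number of clean_titles can be checked with a little arithmetic. The sketch below assumes that every unordered pair of distinct clean_titles is compared exactly once, giving n(n-1)/2 comparisons for n titles. That pairing scheme is inferred from the ~1414 figure above and is not confirmed by the package source; `n_comparisons` and `max_titles` are hypothetical helper names.

```r
# Assumption: each unordered pair of distinct clean_titles is compared once.
n_comparisons <- function(n_titles) choose(n_titles, 2)

n_comparisons(1414)  # 998991, just under the default of 1e6
n_comparisons(1415)  # 1000405, just over the default

# Largest number of titles whose pairwise comparisons fit within a budget
# (hypothetical helper, solves n * (n - 1) / 2 <= max_comp for n)
max_titles <- function(max_comp) floor((1 + sqrt(1 + 8 * max_comp)) / 2)

max_titles(1e6)      # 1414
```

Under this assumption, doubling the number of titles roughly quadruples the number of comparisons, so `maxNumberOfComp` may need to grow much faster than the dataset itself.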

Details

Currently, this works only for files generated while Citavi was set to "English", so that the column names are "Short Title", etc.

Value

A tibble containing one new column: `pot_dup_id`.

Examples

library(CitaviR)
library(magrittr) # provides %>% in case it is not already attached

example_path <- example_file("3dupsin5refs/3dupsin5refs.ctv6")
CitDat <- read_Citavi_ctv6(example_path) %>%
   find_obvious_dups() %>%
   find_potential_dups()

CitDat %>%
   dplyr::select(clean_title_id, obv_dup_id, pot_dup_id)

# check similarity yourself - it's a single typo:
CitDat %>%
   dplyr::select(clean_title)
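
To get a feel for how a single typo relates to `minSimilarity`, here is a minimal sketch using base R only. It assumes a normalized Levenshtein similarity (1 minus the edit distance divided by the length of the longer string); the measure actually used by `find_potential_dups` may differ. The two titles and the helper `lev_sim` are hypothetical and not taken from the example file.

```r
# Hypothetical titles differing by a single typo (not from the example file)
a <- "estimating broad sense heritability"
b <- "estimating broad sense heritabilty"

# Assumed measure: 1 - Levenshtein distance / length of the longer string
lev_sim <- function(x, y) {
  1 - as.numeric(utils::adist(x, y)) / max(nchar(x), nchar(y))
}

sim <- lev_sim(a, b)

# TRUE: under this assumed measure the pair clears the default minSimilarity
sim >= 0.6
```

With one missing character in a 35-character title, the assumed similarity is 1 - 1/35, well above the default threshold of 0.6.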

SchmidtPaul/CitaviR documentation built on Jan. 31, 2023, 5 a.m.