orm_dedup: Automatic deduplication of bibliographic records

View source: R/orm_dedup.R

orm_dedup R Documentation

Description

orm_dedup() removes duplicate records using a three-step progressive pipeline:

  1. Exact DOI match: the most reliable signal, and decisive for records that have a DOI.

  2. Normalised title match: strips punctuation and accents, ignores case, and collapses extra spaces before comparing. This catches the same article listed with minor typographic differences across databases.

  3. Fuzzy match: compares title + year + first author using Optimal String Alignment (OSA) distance. This catches near-identical records that escape exact matching (e.g. different journal abbreviations, truncated author lists).

Only records that remain ambiguous after all three steps are flagged for optional manual review. When save_log = TRUE, they are written to dedup_log.csv in the working directory.
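The two exact-matching steps can be illustrated with a minimal base R sketch. This is not the orisma implementation: the data frame, its column names (doi, title), the helper normalise_title(), the case-folding of DOIs, and the small accent table are all assumptions made for illustration.

```r
# Hypothetical records; the column names `doi` and `title` are assumptions.
refs <- data.frame(
  doi   = c("10.1000/ABC", "10.1000/abc", NA, NA, NA),
  title = c("Deep learning: a review", "Deep Learning - A Review",
            "Revisi\u00f3n  sistem\u00e1tica.", "Revision sistematica",
            "An unrelated study"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: reduce a title to a comparison key.
# The accent table only covers a few Spanish characters (illustrative, not complete).
normalise_title <- function(x) {
  accented <- "\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc\u00f1\u00c1\u00c9\u00cd\u00d3\u00da\u00dc\u00d1"
  plain    <- "aeiouunAEIOUUN"
  x <- chartr(accented, plain, x)       # strip accents
  x <- tolower(x)                       # ignore case
  x <- gsub("[[:punct:]]", " ", x)      # punctuation becomes a space
  x <- gsub("[[:space:]]+", " ", x)     # collapse runs of whitespace
  trimws(x)
}

# Step 1: drop later records that repeat a DOI (assumed case-insensitive).
# Records without a DOI are never matched at this step.
doi_key <- tolower(refs$doi)
dup_doi <- duplicated(doi_key) & !is.na(doi_key)
step1   <- refs[!dup_doi, ]

# Step 2: drop later records whose normalised title repeats an earlier one.
title_key <- normalise_title(step1$title)
step2     <- step1[!duplicated(title_key), ]
```

In this toy data, step 1 removes the second record (same DOI in a different case) and step 2 removes the unaccented copy of the Spanish title, leaving three records for the fuzzy step.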

Usage

orm_dedup(
  refs,
  fuzzy_threshold = 0.9,
  lang = getOption("orisma.lang", "en"),
  verbose = getOption("orisma.verbose", TRUE),
  save_log = TRUE
)

Arguments

refs

An orisma_refs object returned by orm_load().

fuzzy_threshold

Numeric between 0 and 1. Similarity threshold for fuzzy matching. Default 0.9: records with a similarity of 0.9 (90%) or higher are treated as duplicates. Raise it for stricter matching; lower it for more aggressive deduplication.
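To make the threshold concrete, here is a self-contained base R sketch of an OSA-based similarity score. The documentation does not state how the distance is converted to a 0 to 1 similarity, so the formula below (1 minus distance divided by the longer string's length) is an assumption, as are the function names and the example keys.

```r
# Optimal String Alignment distance: the minimum number of insertions,
# deletions, substitutions, and swaps of two adjacent characters.
osa_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  d <- matrix(0L, n + 1L, m + 1L)
  d[, 1] <- 0:n                     # cost of deleting every character of a prefix of a
  d[1, ] <- 0:m                     # cost of inserting every character of a prefix of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,   # deletion
                             d[i + 1, j] + 1L,   # insertion
                             d[i, j] + cost)     # match or substitution
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        # swap of two adjacent characters
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1L)
      }
    }
  }
  d[n + 1, m + 1]
}

# ASSUMPTION: similarity = 1 - distance / length of the longer string.
osa_similarity <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - osa_distance(a, b) / longest
}

# Hypothetical comparison keys built from title + year + first author.
key_a <- "deep learning for image recognition 2015 smith"
key_b <- "deep learning for image recogniton 2015 smith"

sim          <- osa_similarity(key_a, key_b)
is_duplicate <- sim >= 0.9          # same comparison with fuzzy_threshold = 0.9
```

The two keys differ by a single dropped letter, so under this scoring they fall well above a 0.9 threshold; shorter strings are penalised more heavily for the same number of edits.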

lang

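Character. Language for messages: "en" or "es". Defaults to the orisma.lang option, or "en" if that option is unset; a value passed here overrides the option.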
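The default shown in Usage, getOption("orisma.lang", "en"), resolves in the order sketched below. The function resolve_lang() is a hypothetical stand-in that only mimics the argument default; it is not part of orisma.

```r
# Hypothetical stand-in with the same default expression as the `lang` argument.
resolve_lang <- function(lang = getOption("orisma.lang", "en")) lang

old <- options(orisma.lang = NULL)        # start with the option unset

res_default <- resolve_lang()             # option unset: falls back to "en"

options(orisma.lang = "es")
res_option  <- resolve_lang()             # option set: picks up "es"
res_arg     <- resolve_lang(lang = "en")  # explicit argument wins over the option

options(old)                              # restore the previous setting
```

Setting options(orisma.lang = "es") once per session therefore changes the default for every later call, while an explicit lang argument affects only that call.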

verbose

Logical. Print progress messages? Defaults to the orisma.verbose option, or TRUE if that option is unset.

save_log

Logical. Save dedup_log.csv to the working directory? Default TRUE.

Value

An orisma_refs tibble with duplicates removed. Attributes record deduplication statistics for inclusion in the PRISMA log.

Examples

## Not run: 
refs    <- orm_load("my_references/")
deduped <- orm_dedup(refs)

# More aggressive fuzzy matching
deduped <- orm_dedup(refs, fuzzy_threshold = 0.85)

# Spanish messages, no log file
deduped <- orm_dedup(refs, lang = "es", save_log = FALSE)

## End(Not run)


orisma documentation built on May 19, 2026, 1:07 a.m.