normalize_citations: Normalize and match bibliographic citations

View source: R/apply_citation_matching.R

normalize_citations    R Documentation

Description

This function performs advanced normalization and fuzzy matching of bibliographic citations to identify and group citations that refer to the same work but are formatted differently. It uses a multi-phase approach combining string normalization, blocking strategies, hierarchical clustering, and post-processing to achieve both speed and accuracy on large citation datasets.

Usage

normalize_citations(CR_vector, threshold = 0.9, method = "jw", min_chars = 20)

Arguments

CR_vector

Character vector containing bibliographic citations to be normalized and matched.

threshold

Numeric value between 0 and 1 giving the similarity threshold for matching citations. Higher values (e.g., 0.90-0.95) produce more conservative matching, while lower values (e.g., 0.75-0.80) produce more aggressive matching. Default is 0.9, as shown in Usage.

method

String distance method to use for fuzzy matching. Options include:

  • "jw" (default): Jaro-Winkler distance, optimized for bibliographic strings

  • "lv": Levenshtein distance

  • Other methods supported by stringdistmatrix
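As a rough illustration of how threshold and method interact, the sketch below scores two formatting variants of one citation with stringsim from the stringdist package and compares the scores against a threshold. It assumes that stringdist is installed and that the function compares similarities in this way; the two citation strings are made up for the example.

```r
library(stringdist)

# Two hypothetical variants of the same cited reference
a <- "ARIA M, 2017, J INFORMETR, V11, P959"
b <- "ARIA M., 2017, J. INFORMETR., V11, P959"

# Similarity in [0, 1]; 1 means identical strings
sim_jw <- stringsim(a, b, method = "jw")  # Jaro-based similarity
sim_lv <- stringsim(a, b, method = "lv")  # 1 - Levenshtein distance / longer length

threshold <- 0.9

# TRUE where the pair would count as a match under this threshold
c(jw = sim_jw, lv = sim_lv) >= threshold
```

Different methods can score the same pair differently, so a threshold tuned for one method may need adjusting for another.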

min_chars

Minimum number of characters a citation must contain to be treated as valid. Default is 20.

Details

The function implements a five-phase matching algorithm, with an additional journal normalization step (Phase 1.5) between Phases 1 and 2:

Phase 1: Normalization and Feature Extraction

  • Converts text to uppercase

  • Removes issue numbers and page numbers (which often contain typos)

  • Removes punctuation and normalizes whitespace

  • Expands common journal abbreviations (e.g., "J. CLEAN. PROD." -> "JOURNAL OF CLEANER PRODUCTION")

  • Extracts key features: first author, year, journal, volume, pages
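The normalization steps listed above can be sketched in base R as follows. This is a simplified illustration, not the package's internal code: the helper normalize_text, the example citation, and the regular expressions are all assumptions made for the example, and it does not cover abbreviation expansion or the removal of issue and page numbers.

```r
# Hypothetical helper: uppercase, strip punctuation, collapse whitespace
normalize_text <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

raw <- "Aria M., 2017, J. Informetr., V11, P959"

# Comma-separated fields of the citation
fields <- trimws(strsplit(raw, ",", fixed = TRUE)[[1]])

# First author surname: first word of the first field
first_author <- sub("[[:space:]].*$", "", normalize_text(fields[1]))

# Year: first four-digit number starting with 18, 19, or 20
year_match <- regmatches(raw, regexpr("(1[89]|20)[0-9]{2}", raw))
year <- if (length(year_match) == 1) year_match else NA_character_

normalized <- normalize_text(raw)
```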

Phase 1.5: Journal Normalization

The function uses the LTWA (List of Title Word Abbreviations) database from ISO 4 standards to normalize journal names. This ensures that abbreviated forms (e.g., "J. Clean. Prod.") and full forms (e.g., "Journal of Cleaner Production") are recognized as the same journal and matched together.

The LTWA database ships with the bibliometrix package. If it is not found locally, the function attempts to download it from ISSN.org. There is no argument to turn journal normalization off; it is skipped only when the LTWA database is not available.
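The idea behind word-level journal normalization can be sketched with a toy lookup table. The table ltwa_toy, the stop-word list, and the helper abbreviate_title below are hypothetical stand-ins for illustration only; they are not real LTWA entries and not the package's implementation.

```r
# Toy word-to-abbreviation table (illustrative, not taken from the LTWA)
ltwa_toy <- c(JOURNAL = "J", CLEANER = "CLEAN", PRODUCTION = "PROD")

# Short function words that are dropped before abbreviating
stop_words <- c("OF", "THE", "AND")

# Hypothetical helper: map each title word to its abbreviated form
abbreviate_title <- function(title) {
  clean <- trimws(toupper(gsub("[[:punct:]]", "", title)))
  words <- strsplit(clean, "[[:space:]]+")[[1]]
  words <- words[!words %in% stop_words]
  hit <- words %in% names(ltwa_toy)
  words[hit] <- ltwa_toy[words[hit]]
  paste(words, collapse = " ")
}

full_form  <- abbreviate_title("Journal of Cleaner Production")
short_form <- abbreviate_title("J. Clean. Prod.")

# Both spellings reduce to the same key, so they can be matched
identical(full_form, short_form)
```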

Phase 2: Blocking

Citations are grouped into blocks by first author and year. This dramatically reduces computational complexity from O(n^2) to approximately O(k*m^2), where k is the number of blocks and m is the average block size.
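The effect of blocking on the number of pairwise comparisons can be checked with a small base R sketch. The data and the column name block_key are invented for the example; the key here combines only first author and year, as described above.

```r
# Six hypothetical citations reduced to their blocking features
citations <- data.frame(
  first_author = c("ARIA", "ARIA", "ARIA", "GARFIELD", "GARFIELD", "SMALL"),
  year         = c("2017", "2017", "2017", "1972", "1972", "1973"),
  stringsAsFactors = FALSE
)

# Illustrative blocking key: first author and year
citations$block_key <- paste(citations$first_author, citations$year, sep = "_")

# Row indices grouped by block
blocks <- split(seq_len(nrow(citations)), citations$block_key)

# Comparisons without blocking: every pair of citations
pairs_all <- choose(nrow(citations), 2)

# Comparisons with blocking: pairs inside each block only
pairs_blocked <- sum(choose(lengths(blocks), 2))
```

With block sizes 3, 2, and 1, the pair count drops from 15 to 4; the saving grows quickly as the number of citations increases.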

Phase 3: Within-Block Matching

Within each block, citations are compared using string distance metrics and hierarchical clustering. For blocks larger than 500 citations, exact matching on normalized strings is used instead to maintain performance.
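A minimal sketch of within-block matching, assuming the stringdist package is installed. The helper cluster_block, the average linkage, and cutting the tree at a height of 1 - threshold are assumptions made for illustration; the source does not state which linkage or cut rule the function uses.

```r
library(stringdist)

# Hypothetical helper: cluster the normalized citations of one block
cluster_block <- function(block, threshold = 0.9, max_block = 500) {
  if (length(block) < 2) return(seq_along(block))
  if (length(block) > max_block) {
    # Large block: fall back to exact matching on normalized strings
    return(match(block, unique(block)))
  }
  d  <- stringdistmatrix(block, method = "jw")  # pairwise distances (dist object)
  hc <- hclust(d, method = "average")           # assumed linkage
  unname(cutree(hc, h = 1 - threshold))         # cut at distance 1 - threshold
}

block <- c(
  "ARIA M 2017 J INFORMETR V11",
  "ARIA M 2017 J INFORMETRICS V11",
  "ARIA M 2017 SCIENTOMETRICS V112"
)

ids <- cluster_block(block)
```

In this example the first two strings differ only in the journal spelling and end up in the same cluster, while the third refers to a different journal and stays separate.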

Phase 4: Canonical Representative Selection

For each cluster, the most complete citation (prioritizing those with volume and page information) is selected as the canonical representative.
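One way to picture the selection of a canonical representative is a completeness score per citation. In the sketch below, the helper pick_canonical, the scoring rule (one point each for a volume and a page field, ties broken by string length), and the sample data are assumptions for illustration, not the package's actual rule.

```r
clusters <- data.frame(
  CR_original = c(
    "ARIA M, 2017, J INFORMETR",
    "ARIA M, 2017, J INFORMETR, V11, P959",
    "ARIA M, 2017, J INFORMETR, V11",
    "SMALL H, 1973, J AM SOC INFORM SCI, V24"
  ),
  cluster_id = c(1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: prefer citations with volume and page fields
pick_canonical <- function(x) {
  score <- grepl(", V[0-9]+", x) + grepl(", P[0-9]+", x)
  x[order(-score, -nchar(x))][1]
}

# One representative per cluster, then mapped back to every row
canon <- vapply(split(clusters$CR_original, clusters$cluster_id),
                pick_canonical, character(1))
clusters$CR_canonical <- unname(canon[as.character(clusters$cluster_id)])
```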

Phase 5: Post-Processing

Citations sharing the same first author, year, journal, and volume are merged into a single cluster, even if they weren't matched in Phase 3. This catches cases where minor title variations prevented matching.
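The merge step can be sketched in base R by relabeling clusters that share the same author, year, journal, and volume. The data, the column name cluster_id_merged, and the single-pass relabeling are illustrative assumptions; a full implementation would also need to handle clusters that are linked only through a chain of shared keys.

```r
matched <- data.frame(
  cluster_id   = c(1, 2, 3),
  first_author = c("ARIA", "ARIA", "SMALL"),
  year         = c("2017", "2017", "1973"),
  journal      = c("J INFORMETR", "J INFORMETR", "J AM SOC INFORM SCI"),
  volume       = c("11", "11", "24"),
  stringsAsFactors = FALSE
)

# Merge key built from the four shared features
key <- paste(matched$first_author, matched$year,
             matched$journal, matched$volume, sep = "|")

# Smallest cluster id among all rows sharing a key
target <- ave(matched$cluster_id, key, FUN = min)

# Map each old cluster id to its new (merged) id
relabel <- tapply(target, matched$cluster_id, min)
matched$cluster_id_merged <- as.vector(relabel[as.character(matched$cluster_id)])
```

Here clusters 1 and 2 collapse into one cluster because all four features agree, while cluster 3 is left unchanged.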

Value

A data frame with the following columns:

  • CR_original: Original citation string

  • CR_canonical: Canonical (representative) citation for the cluster

  • cluster_id: Unique identifier for each citation cluster

  • n_cluster: Number of citations in the cluster

  • first_author: First author surname

  • year: Publication year

  • journal_iso4: Journal name normalized to ISO4 abbreviated form

  • journal_original: Original journal name as extracted from citation

  • volume: Volume number

  • doi: Digital Object Identifier (when available)

  • blocking_key: Internal key used for blocking (author_year_journal)

References

Aria, M. & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959-975.

See Also

applyCitationMatching for direct application to bibliometrix data frames

Examples

## Not run: 
# Load bibliometrix data
data(scientometrics, package = "bibliometrixData")

# Extract and normalize citations
CR_vector <- unlist(strsplit(scientometrics$CR, ";"))
CR_vector <- trimws(CR_vector)

# Perform normalization with default threshold
matched <- normalize_citations(CR_vector)

# View matching statistics
table(matched$n_cluster)

# Find all variants of a specific citation
subset(matched, cluster_id == matched$cluster_id[1])

# Use more conservative matching (above the 0.9 default)
matched_conservative <- normalize_citations(CR_vector, threshold = 0.95)

## End(Not run)


bibliometrix documentation built on Nov. 8, 2025, 5:06 p.m.