applyCitationMatching: Apply citation normalization to bibliometrix data frame


applyCitationMatching    R Documentation

Apply citation normalization to bibliometrix data frame

Description

A convenience wrapper that applies normalize_citations to a bibliometrix data frame (typically loaded with convert2df). It extracts citations from the CR field, normalizes and matches them, and returns the results as per-paper citation lists together with summary statistics.

Usage

applyCitationMatching(M, threshold = 0.9, method = "jw", min_chars = 20)

Arguments

M

A bibliometrix data frame, typically created by convert2df. It must contain the SR and CR columns; the DB column is optional:

  • SR: Short reference identifier for each document

  • CR: Cited references field (citations separated by semicolons)

  • DB: (Optional) Database source identifier for format detection
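Before calling the function it can help to confirm that the input has the documented columns. The following is a minimal sketch in base R; check_citation_input and the toy data frame M_toy are hypothetical names used only for illustration and are not part of bibliometrix.

```r
# Hypothetical helper (not part of bibliometrix): check that a data frame
# has the columns applyCitationMatching is documented to require.
check_citation_input <- function(M) {
  required <- c("SR", "CR")
  missing_cols <- setdiff(required, names(M))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  # DB is optional; report whether it is present
  invisible("DB" %in% names(M))
}

# Toy data frame standing in for the output of convert2df (made-up records)
M_toy <- data.frame(
  SR = c("DOE J, 2020, J TEST", "ROE R, 2021, J TEST"),
  CR = c("SMITH A, 2010, J EX, V4, P1; JONES B, 2012, J EX, V9, P2",
         "SMITH A, 2010, J EX, V4, P1"),
  stringsAsFactors = FALSE
)

has_db <- check_citation_input(M_toy)
```

Here has_db only records whether the optional DB column exists; a missing SR or CR column stops with an error.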

threshold

Numeric value between 0 and 1 indicating the similarity threshold for matching citations. Default is 0.9. See normalize_citations for details on selecting appropriate thresholds.

method

String distance method to use for fuzzy matching. Options include:

  • "jw" (default): Jaro-Winkler distance, optimized for bibliographic strings

  • "lv": Levenshtein distance

  • Any other method supported by stringdist::stringdistmatrix
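To give a feel for how a similarity threshold separates formatting variants from distinct references, here is a rough sketch. It does not reproduce the package's matching: it swaps in Levenshtein distance from base R (utils::adist) instead of Jaro-Winkler, and both the normalization by the longer string's length and the "at or above the threshold" rule are assumptions made for illustration. The reference strings are made up.

```r
# Two formatting variants of one made-up reference, plus an unrelated one
refs <- c(
  "SMITH A, 2010, J INFORMETR, V4, P100",
  "SMITH A., 2010, J INFORMETR, V4, P100",
  "JONES B, 2012, SCIENTOMETRICS, V90, P200"
)

# Pairwise Levenshtein distances (base R, no extra packages needed)
d <- utils::adist(refs)

# Assumed normalization: divide by the longer string's length so that
# similarity falls in [0, 1], with 1 meaning identical strings
len <- nchar(refs)
sim <- 1 - d / outer(len, len, pmax)

# Assumed rule: pairs at or above the threshold count as the same reference
threshold <- 0.9
is_match <- sim >= threshold
```

With these strings the two SMITH variants differ by a single character and land above the threshold, while the JONES reference does not match either of them. Lowering the threshold merges more aggressively and raises the risk of false matches.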

min_chars

Minimum number of characters a citation string must have to be considered valid. Default is 20.

Details

The function automatically handles the new Scopus citation format (where the year appears at the end in parentheses) by converting it to the classic format before processing.
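The detection part of this can be pictured with a regular expression. The sketch below only flags citations that end with a four-digit year in parentheses and extracts that year; the pattern and the example strings are assumptions for illustration, and the conversion to the classic format performed by the package is not shown.

```r
# Assumed pattern: a four-digit year in parentheses at the very end
trailing_year <- "\\((\\d{4})\\)\\s*$"

# Made-up citation strings: the first has a trailing "(year)", the second does not
cites <- c(
  "Smith A., Some title, Journal of Examples, 4, pp. 1-10, (2010)",
  "SMITH A, 2010, J EXAMPLES, V4, P1"
)

has_trailing_year <- grepl(trailing_year, cites, perl = TRUE)

# Extract the year where the pattern matched, NA otherwise
year <- ifelse(
  has_trailing_year,
  sub(paste0(".*", trailing_year), "\\1", cites, perl = TRUE),
  NA_character_
)
```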

The function performs the following steps:

  1. Splits the CR field by semicolons to extract individual citations

  2. Detects and converts new Scopus format citations to classic format

  3. Trims whitespace from each citation

  4. Applies normalize_citations to identify duplicate citations

  5. Links normalized citations back to source documents (SR)

  6. Generates summary statistics and reconstructs normalized CR fields
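The data flow of these steps can be sketched in base R. This is not the package's implementation: the toy data frame is made up, the Scopus conversion (step 2) is omitted, and upper-casing is used purely as a placeholder for normalize_citations in step 4.

```r
# Made-up input with the documented SR and CR columns
M_toy <- data.frame(
  SR = c("DOE J, 2020, J TEST", "ROE R, 2021, J TEST"),
  CR = c("SMITH A, 2010, J EX, V4, P1; Smith A, 2010, J Ex, V4, P1;JONES B, 2012, J EX, V9, P2",
         " SMITH A, 2010, J EX, V4, P1 "),
  stringsAsFactors = FALSE
)

# Steps 1 and 3: split CR on semicolons and trim whitespace
cr_list <- lapply(strsplit(M_toy$CR, ";", fixed = TRUE), trimws)

# Step 5: long format, one row per (source document, citation) pair
long <- data.frame(
  SR = rep(M_toy$SR, lengths(cr_list)),
  CR = unlist(cr_list),
  stringsAsFactors = FALSE
)

# Step 4 placeholder: upper-casing stands in for normalize_citations
long$CR_canonical <- toupper(long$CR)

# Step 6: rebuild one semicolon-separated CR string per document
per_doc <- split(long$CR_canonical, long$SR)
CR_normalized <- data.frame(
  SR = names(per_doc),
  CR = unname(vapply(per_doc, function(x) paste(unique(x), collapse = ";"), character(1))),
  n_references = unname(vapply(per_doc, function(x) length(unique(x)), integer(1))),
  stringsAsFactors = FALSE
)
```

In this toy run the first document lists the SMITH reference twice with different capitalization, so its three raw citations collapse to two unique references.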

The normalized CR field can be used to replace the original CR field in subsequent bibliometric analyses, ensuring that citation counts and network analyses are not inflated by duplicate citations with minor formatting differences.

Value

A list with four elements:

full_data

A data frame with columns:

  • SR: Source document identifier

  • CR: Original citation string

  • CR_canonical: Canonical (normalized) citation

  • cluster_id: Unique cluster identifier

  • n_cluster: Size of the citation cluster

  • first_author, year, journal, volume: Extracted metadata

summary

A data frame summarizing citation frequencies with columns:

  • CR_canonical: The canonical citation for each cluster

  • n: Total number of times this work was cited

  • n_variants: Number of different formatting variants found

  • variants_example: Sample of variant formats (up to 3 examples)

Sorted by citation frequency (n) in descending order.
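As an illustration of how such a summary relates to full_data, the sketch below derives the same four columns from a small made-up table holding only CR and CR_canonical. The " | " separator for variants_example is an assumption; the package may format that column differently.

```r
# Made-up stand-in for results$full_data (only the columns needed here)
full_data <- data.frame(
  CR = c("SMITH A, 2010, J EX, V4, P1", "SMITH A., 2010, J EX, V4, P1",
         "SMITH A, 2010, J EX, V4, P1", "JONES B, 2012, J EX, V9, P2"),
  CR_canonical = c(rep("SMITH A, 2010, J EX, V4, P1", 3),
                   "JONES B, 2012, J EX, V9, P2"),
  stringsAsFactors = FALSE
)

# Group the original strings by their canonical citation
by_canon <- split(full_data$CR, full_data$CR_canonical)

summary_tbl <- data.frame(
  CR_canonical = names(by_canon),
  # total number of citing occurrences in each cluster
  n = unname(lengths(by_canon)),
  # number of distinct original spellings in each cluster
  n_variants = unname(vapply(by_canon, function(x) length(unique(x)), integer(1))),
  # up to three distinct spellings, joined with an assumed separator
  variants_example = unname(vapply(
    by_canon,
    function(x) paste(utils::head(unique(x), 3), collapse = " | "),
    character(1)
  )),
  stringsAsFactors = FALSE
)

# Most frequently cited work first
summary_tbl <- summary_tbl[order(-summary_tbl$n), ]
```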

matched_citations

Complete output from normalize_citations, useful for detailed analysis of the matching process.

CR_normalized

A data frame with columns:

  • SR: Source document identifier

  • CR: Reconstructed CR field with normalized citations (semicolon-separated)

  • n_references: Number of unique references after normalization

This can be merged back with M to replace the original CR field.
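A base R sketch of that merge, for readers who prefer not to load dplyr. Both data frames are made-up stand-ins for M and results$CR_normalized, and the TI column is included only to show that other fields are carried along. The case of a document missing from CR_normalized is an assumption added to show what a left join does with it.

```r
# Made-up stand-in for M
M_toy <- data.frame(
  SR = c("DOE J, 2020, J TEST", "ROE R, 2021, J TEST"),
  TI = c("FIRST TITLE", "SECOND TITLE"),
  CR = c("raw refs 1", "raw refs 2"),
  stringsAsFactors = FALSE
)

# Made-up stand-in for results$CR_normalized (one document absent on purpose)
CR_normalized <- data.frame(
  SR = "DOE J, 2020, J TEST",
  CR = "SMITH A, 2010, J EX, V4, P1",
  n_references = 1L,
  stringsAsFactors = FALSE
)

# Keep the original field under a new name, then left-join on SR
names(M_toy)[names(M_toy) == "CR"] <- "CR_orig"
M_norm <- merge(M_toy, CR_normalized, by = "SR", all.x = TRUE, sort = FALSE)
```

With all.x = TRUE every row of the original data frame is kept, and a document without a row in CR_normalized gets NA in CR and n_references.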

References

Aria, M. & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959-975.

See Also

normalize_citations for the underlying normalization algorithm

citations for citation analysis

localCitations for local citation analysis

Examples

## Not run: 
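# Load the packages used below (dplyr provides %>%, rename and left_join)
library(bibliometrix)
library(dplyr)

# Load bibliometric data
file <- "https://www.bibliometrix.org/datasets/savedrecs.txt"
M <- convert2df(file, dbsource = "wos", format = "plaintext")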

# Apply citation normalization
results <- applyCitationMatching(M, threshold = 0.85)

# View top cited works (after normalization)
head(results$summary, 20)

# See how many variants were found for the top citation
top_citation <- results$summary$CR_canonical[1]
variants <- subset(results$full_data, CR_canonical == top_citation)
unique(variants$CR)

# Replace original CR with normalized CR in the data frame
M_normalized <- M %>%
  rename(CR_orig = CR) %>%
  left_join(results$CR_normalized, by = "SR")

# Compare citation counts before and after normalization
original_citations <- strsplit(M$CR, ";") %>%
  unlist() %>%
  trimws() %>%
  table() %>%
  length()

normalized_citations <- nrow(results$summary)

cat("Original unique citations:", original_citations, "\n")
cat("After normalization:", normalized_citations, "\n")
cat("Duplicates found:", original_citations - normalized_citations, "\n")

# Use normalized data for further analysis
CR_analysis <- citations(M_normalized, field = "article", sep = ";")

## End(Not run)


bibliometrix documentation built on Nov. 8, 2025, 5:06 p.m.