applyCitationMatching: Apply citation normalization to bibliometrix data frame
In bibliometrix: Comprehensive Science Mapping Analysis

View source: R/apply_citation_matching.R


Description

This convenience wrapper applies normalize_citations to a bibliometrix data frame (typically loaded with convert2df). It extracts citations from the CR field, normalizes and matches them, and returns a list containing the matching output, a per-document list of normalized citations, and summary statistics.

Usage

applyCitationMatching(M, threshold = 0.9, method = "jw", min_chars = 20)

Arguments

`M`	A bibliometrix data frame, typically created by `convert2df`. It is expected to contain the following columns:
	`SR`: Short reference identifier for each document
	`CR`: Cited references field (citations separated by semicolons)
	`DB`: (Optional) Database source identifier, used for format detection
`threshold`	Numeric value between 0 and 1 indicating the similarity threshold for matching citations. Default is 0.9, as shown in Usage. See `normalize_citations` for details on selecting appropriate thresholds.
`method`	String distance method to use for fuzzy matching. Options include:
	"jw" (default): Jaro-Winkler distance, optimized for bibliographic strings
	"lv": Levenshtein distance
	Other methods supported by `stringdistmatrix`
`min_chars`	Minimum characters for valid citations (default: 20)
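
As a rough illustration of how `threshold` acts on a similarity score, the sketch below uses base R's `adist` to compute a normalized Levenshtein similarity between reference strings. It rests on assumptions made only for illustration: the helper `lv_similarity` is hypothetical, the third reference string is made up, and `normalize_citations` may compute and compare similarities differently (its default method is Jaro-Winkler, not Levenshtein).

```r
# Hypothetical helper (not part of bibliometrix): normalized Levenshtein
# similarity in [0, 1], where 1 means the two strings are identical.
lv_similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]            # edit distance between a and b
  1 - d / max(nchar(a), nchar(b))
}

ref_a <- "ARIA M, 2017, J INFORMETR, V11, P959"
ref_b <- "ARIA M., 2017, J INFORMETR, V11, P959"  # same work, extra period
ref_c <- "SMITH J, 2010, J EXAMPLE, V1, P1"       # made-up, unrelated reference

threshold <- 0.9
same_work  <- lv_similarity(ref_a, ref_b) >= threshold  # near-identical strings
other_work <- lv_similarity(ref_a, ref_c) >= threshold  # clearly different strings
```

Under these assumptions, `same_work` is `TRUE` and `other_work` is `FALSE`: a higher threshold demands closer agreement before two strings are treated as the same cited work.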

Details

The function automatically handles the new Scopus citation format (where the year appears at the end in parentheses) by converting it to the classic format before processing.

The function performs the following steps:

Splits the CR field by semicolons to extract individual citations
Detects and converts new Scopus format citations to classic format
Trims whitespace from each citation
Applies normalize_citations to identify duplicate citations
Links normalized citations back to source documents (SR)
Generates summary statistics and reconstructs normalized CR fields
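
The splitting, trimming, and linking steps listed above can be sketched in base R on a toy data frame. This is not the package's implementation: the object names (`M_toy`, `long`) are hypothetical, the Scopus format conversion and the matching itself are omitted, and treating `min_chars` as an inclusive lower bound on string length is an assumption.

```r
# Toy stand-in for a bibliometrix data frame (only SR and CR are used here).
M_toy <- data.frame(
  SR = c("DOC A, 2020, J X", "DOC B, 2021, J Y"),
  CR = c("ARIA M, 2017, J INFORMETR, V11, P959; SMITH J, 2010, NATURE, V1, P1",
         "ARIA M., 2017, J INFORMETR, V11, P959; SHORT"),
  stringsAsFactors = FALSE
)

min_chars <- 20

# Split the CR field on semicolons: one character vector per document.
refs <- strsplit(M_toy$CR, ";", fixed = TRUE)

# One row per (document, citation) pair, with whitespace trimmed.
long <- data.frame(
  SR = rep(M_toy$SR, lengths(refs)),
  CR = trimws(unlist(refs)),
  stringsAsFactors = FALSE
)

# Drop strings too short to be valid citations (assumed: length >= min_chars).
long <- long[nchar(long$CR) >= min_chars, , drop = FALSE]
```

Keeping `SR` alongside each citation is what allows matched citations to be linked back to their source documents in the later steps.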

The normalized CR field can be used to replace the original CR field in subsequent bibliometric analyses, ensuring that citation counts and network analyses are not inflated by duplicate citations with minor formatting differences.

Value

A list with four elements:

full_data

A data frame with columns:

SR: Source document identifier
CR: Original citation string
CR_canonical: Canonical (normalized) citation
cluster_id: Unique cluster identifier
n_cluster: Size of the citation cluster
first_author, year, journal, volume: Extracted metadata

summary

A data frame summarizing citation frequencies with columns:

CR_canonical: The canonical citation for each cluster
n: Total number of times this work was cited
n_variants: Number of different formatting variants found
variants_example: Sample of variant formats (up to 3 examples)

Sorted by citation frequency (n) in descending order.
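
To make the meaning of `n` and `n_variants` concrete, the following base R sketch derives them from a toy table shaped like `full_data`. The data and object names are hypothetical, and the function may compute its summary differently.

```r
# Toy stand-in for results$full_data (only the two columns needed here).
full_toy <- data.frame(
  CR = c("ARIA M, 2017", "ARIA M., 2017", "ARIA M, 2017", "SMITH J, 2010"),
  CR_canonical = c("ARIA M, 2017", "ARIA M, 2017", "ARIA M, 2017", "SMITH J, 2010"),
  stringsAsFactors = FALSE
)

# n: how many times each canonical citation occurs overall.
n <- table(full_toy$CR_canonical)

# n_variants: how many distinct original strings map to each canonical citation.
n_variants <- tapply(full_toy$CR, full_toy$CR_canonical,
                     function(x) length(unique(x)))

summary_toy <- data.frame(
  CR_canonical = names(n),
  n = as.integer(n),
  n_variants = as.integer(n_variants[names(n)]),
  stringsAsFactors = FALSE
)

# Most frequently cited works first.
summary_toy <- summary_toy[order(-summary_toy$n), ]
```

Here the first row has `n = 3` and `n_variants = 2`: the work was cited three times using two different spellings.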

matched_citations

Complete output from normalize_citations, useful for detailed analysis of the matching process.

CR_normalized

A data frame with columns:

SR: Source document identifier
CR: Reconstructed CR field with normalized citations (semicolon-separated)
n_references: Number of unique references after normalization

This can be merged back with M to replace the original CR field.
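
The shape of `CR_normalized`, and one way to merge it back, can be sketched in base R as follows. All data and object names are hypothetical; in particular, the ";" separator and the use of `merge` are illustrative choices, not a description of the package internals.

```r
# Toy stand-in for results$full_data: one row per (document, citation) pair.
full_toy <- data.frame(
  SR = c("DOC A", "DOC A", "DOC B", "DOC B"),
  CR_canonical = c("REF X", "REF X", "REF X", "REF Y"),
  stringsAsFactors = FALSE
)

# Unique canonical citations per source document.
per_doc <- lapply(split(full_toy$CR_canonical, full_toy$SR), unique)

# Rebuild a semicolon-separated CR field and count the unique references.
CR_norm <- data.frame(
  SR = names(per_doc),
  CR = unname(vapply(per_doc, paste, character(1), collapse = ";")),
  n_references = unname(lengths(per_doc)),
  stringsAsFactors = FALSE
)

# Merge back into a toy document table that has no CR column of its own.
M_toy <- data.frame(SR = c("DOC A", "DOC B"), TI = c("TITLE 1", "TITLE 2"),
                    stringsAsFactors = FALSE)
M_new <- merge(M_toy, CR_norm, by = "SR", all.x = TRUE)
```

If the target data frame already has a `CR` column, rename or drop it before merging, otherwise `merge` will suffix the clashing columns (`CR.x`, `CR.y`).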

References

Aria, M. & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959-975.

Examples

## Not run: 
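# Load the packages used below (dplyr provides %>%, rename() and left_join())
library(bibliometrix)
library(dplyr)

# Load bibliometric data
file <- "https://www.bibliometrix.org/datasets/savedrecs.txt"
M <- convert2df(file, dbsource = "wos", format = "plaintext")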

# Apply citation normalization
results <- applyCitationMatching(M, threshold = 0.85)

# View top cited works (after normalization)
head(results$summary, 20)

# List the distinct formatting variants found for the top citation
top_citation <- results$summary$CR_canonical[1]
variants <- subset(results$full_data, CR_canonical == top_citation)
unique(variants$CR)

# Replace original CR with normalized CR in the data frame
M_normalized <- M %>%
  rename(CR_orig = CR) %>%
  left_join(results$CR_normalized, by = "SR")

# Compare citation counts before and after normalization
original_citations <- strsplit(M$CR, ";") %>%
  unlist() %>%
  trimws() %>%
  table() %>%
  length()

normalized_citations <- nrow(results$summary)

cat("Original unique citations:", original_citations, "\n")
cat("After normalization:", normalized_citations, "\n")
cat("Duplicates found:", original_citations - normalized_citations, "\n")

# Use normalized data for further analysis
CR_analysis <- citations(M_normalized, field = "article", sep = ";")

## End(Not run)

bibliometrix documentation built on Nov. 8, 2025, 5:06 p.m.