Example session for Weight-based deduplication

knitr::opts_chunk$set(message = FALSE, warning = FALSE)
options(width = 60)
backup_options <- options()

This document shows an example session using the package RecordLinkage. A single data set is deduplicated using an EM algorithm for weight calculation. Conducting linkage of two data sets differs only in the step of generating record pairs.

Generating record pairs

library(RecordLinkage)

The data to be deduplicated is expected to reside in a data frame or matrix, with each row containing one record. Example data sets of 500 and 10000 records are included in the package as RLdata500 and RLdata10000.

data(RLdata500)
RLdata500[1:5,]

Use compare.dedup to generate record pairs for deduplication. In this example, blocking (argument blockfld) returns only record pairs that agree in at least two components of the subdivided date of birth (columns 5 to 7: by, bm, bd), resulting in 810 pairs. The argument identity preserves the true matching status for later evaluation.

pairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                       blockfld = list(c(5,6), c(6,7), c(5,7)))
summary(pairs)
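
The blocking rule described above can be illustrated without the package. The following is a minimal base R sketch on a made-up data frame (the records and the names candidates, agree and blocked are hypothetical); it is not how compare.dedup is implemented, and it ignores missing values.

```r
# Hypothetical toy records; column names mirror the date-of-birth
# fields (by, bm, bd) of RLdata500.
records <- data.frame(
  by = c(1949, 1949, 1968, 1968, 1949),
  bm = c(7, 7, 3, 4, 7),
  bd = c(22, 22, 27, 27, 1)
)

# All unordered pairs of row indices, one pair per row.
candidates <- t(combn(nrow(records), 2))
colnames(candidates) <- c("id1", "id2")

# Field-wise agreement for every candidate pair (logical matrix).
agree <- records[candidates[, "id1"], ] == records[candidates[, "id2"], ]

# Keep only pairs that agree on at least two of the three components.
blocked <- candidates[rowSums(agree) >= 2, , drop = FALSE]
blocked
```

Requiring agreement on at least two of three fields is equivalent to taking the union of the three two-field blocking criteria passed as blockfld.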

Weight calculation

Weights are calculated by means of an EM algorithm. This step is computationally intensive and might take a while. Calling hist with plot = FALSE returns the bin breaks and counts of the resulting weight distribution instead of drawing the histogram.

pairs <- emWeights(pairs)
hist(pairs$Wdata, plot = FALSE)
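
To give an idea of what such a weight expresses, here is a minimal sketch of a Fellegi-Sunter style log-ratio weight for a single comparison pattern. It assumes conditionally independent fields, and the m and u probabilities are made up for illustration (emWeights estimates such quantities from the data); the function pattern_weight is hypothetical and not part of the package.

```r
# Hypothetical per-field probabilities (made up for illustration):
# m = P(field agrees | pair is a match)
# u = P(field agrees | pair is a non-match)
m <- c(fname = 0.95, lname = 0.90, by = 0.98)
u <- c(fname = 0.05, lname = 0.02, by = 0.10)

# Weight of one comparison pattern (1 = agree, 0 = disagree),
# assuming the fields are conditionally independent.
pattern_weight <- function(pattern, m, u) {
  sum(ifelse(pattern == 1, log2(m / u), log2((1 - m) / (1 - u))))
}

w_all_agree  <- pattern_weight(c(1, 1, 1), m, u)
w_all_differ <- pattern_weight(c(0, 0, 0), m, u)
c(w_all_agree, w_all_differ)
```

Agreement on a field that rarely agrees by chance (small u) contributes a large positive term, while disagreement contributes a negative one, so likely matches end up at the upper end of the weight distribution.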

Classification

To help determine thresholds, record pairs within a given range of weights can be printed using getPairs^[The output of getPairs is shortened in this document.]. Here, the upper threshold is set to 24 and the lower threshold to -7, which divides the pairs into links, possible links and non-links. The summary shows the resulting contingency table and error measures.

getPairs(pairs, 30, 20)
getPairs(pairs, 30, 20)[23:36,]
pairs <- emClassify(pairs, threshold.upper = 24, threshold.lower = -7)
summary(pairs)
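
The effect of the two thresholds can be sketched in base R on a few made-up weights. The treatment of weights lying exactly on a threshold (upper threshold inclusive for links, lower threshold inclusive for possible links) is an assumption here and should be checked against the emClassify documentation.

```r
# Hypothetical weights; thresholds as in the example above.
weights <- c(31.2, 24, 10.5, -7, -12.3)
threshold_upper <- 24
threshold_lower <- -7

# Assumed boundary convention: weight >= upper is a link,
# weight < lower is a non-link, everything in between is a possible link.
prediction <- ifelse(weights >= threshold_upper, "link",
              ifelse(weights >= threshold_lower, "possible", "non-link"))
table(prediction)
```

Moving the thresholds closer together shrinks the group of possible links that needs manual review, at the cost of more automatic misclassifications.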

Review of the record pairs classified as possible links is facilitated by getPairs, whose output can be restricted to possible links via the argument show. A list with the ids of linked pairs can be extracted from the output of getPairs with the argument single.rows set to TRUE.

possibles <- getPairs(pairs, show = "possible")
possibles[1:6,]
links <- getPairs(pairs, show = "links", single.rows = TRUE)
link_ids <- links[, c("id1", "id2")]
link_ids
options(backup_options)
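
One possible use of such an id list is to decide which records to drop. The sketch below works on a made-up pair list and a hypothetical record count n_records; it simply treats the larger id of each pair as the duplicate, which is a simplification that does not group records linked only indirectly (here, records 1 and 2 both link to 4 and are both kept).

```r
# Hypothetical id pairs, shaped like the id1/id2 columns extracted above.
link_ids <- data.frame(id1 = c(1, 2, 7), id2 = c(4, 4, 9))
n_records <- 10  # hypothetical size of the data set

# Simplification: treat the larger id of each linked pair as the duplicate.
duplicates <- unique(pmax(link_ids$id1, link_ids$id2))

# Row indices of the records that remain after dropping the duplicates.
keep <- setdiff(seq_len(n_records), duplicates)
keep
```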


RecordLinkage documentation built on Jan. 25, 2026, 9:06 a.m.