est_block_error: Estimate errors due to blocking in record linkage

View source: R/est_block_error.R

est_block_errorR Documentation

Estimate errors due to blocking in record linkage

Description

Function computes estimators for false positive rate (FPR) and false negative rate (FNR) due to blocking in record linkage, as proposed by Dasylva and Goussanou (2021). Assumes duplicate-free data sources, complete coverage of the reference data set and blocking decisions based solely on record pairs.

Usage

est_block_error(
  x = NULL,
  y = NULL,
  blocking_result = NULL,
  n = NULL,
  N = NULL,
  G,
  alpha = NULL,
  p = NULL,
  lambda = NULL,
  tol = 10^(-4),
  maxiter = 100,
  sample_size = NULL
)

Arguments

x

Reference data (required if n and N are not provided).

y

Query data (required if n is not provided).

blocking_result

data.frame or data.table containing blocking results (required if n is not provided).

n

Integer vector of numbers of accepted pairs formed by each record in the query data set with records in the reference data set, based on blocking criteria (if NULL, derived from blocking_result).

N

Total number of records in the reference data set (if NULL, derived as length(x)).

G

Number of classes in the finite mixture model.

alpha

Numeric vector of initial class proportions (length G; if NULL, initialized as rep(1/G, G)).

p

Numeric vector of initial matching probabilities in each class of the mixture model (length G; if NULL, randomly initialized from runif(G, 0.5, 1)).

lambda

Numeric vector of initial Poisson distribution parameters for non-matching records in each class of the mixture model (length G; if NULL, randomly initialized from runif(G, 0.1, 2)).

tol

Convergence tolerance for the EM algorithm (default 10^(-6)).

maxiter

Maximum number of iterations for the EM algorithm (default 1000).

sample_size

Bootstrap sample (from n) size used for calculations (if NULL, uses all data).

Details

Consider a large finite population that comprises of N individuals, and two duplicate-free data sources: a register and a file. Assume that the register has no undercoverage, i.e. each record from the file corresponds to exactly one record from the same individual in the register. Let n_i denote the number of register records which form an accepted (by the blocking criteria) pair with record i on the file. Assume that:

  • two matched records are neighbours with a probability that is bounded away from 0 regardless of N,

  • two unmatched records are accidental neighbours with a probability of O(\frac{1}{N}).

The finite mixture model n_i \sim \sum_{g=1}^G \alpha_g(\text{Bernoulli}(p_g) \ast \text{Poisson}(\lambda_g)) is assumed. When G is fixed, the unknown model parameters are given by the vector \psi = [(\alpha_g, p_g, \lambda_g)]_{1 \leq g \leq G} that may be estimated with the Expectation-Maximization (EM) procedure.

Let n_i = n_{i|M} + n_{i|U}, where n_{i|M} is the number of matched neighbours and n_{i|U} is the number of unmatched neighbours, and let c_{ig} denote the indicator that record i is from class g. For the E-step of the EM procedure, the equations are as follows

\begin{aligned} P(n_i | c_{ig} = 1) &= I(n_i = 0)(1-p_g)e^{-\lambda_g}+I(n_i > 0)\Bigl(p_g+(1-p_g)\frac{\lambda_g}{n_i}\Bigr)\frac{e^{-\lambda_g}\lambda_g^{n_i-1}}{(n_i-1)!}, \\ P(c_{ig} = 1 | n_i) &= \frac{\alpha_gP(n_i | c_{ig} = 1)}{\sum_{g'=1}^G\alpha_{g'}P(n_i | c_{ig'} = 1)}, \\ P(n_{i|M} = 1 | n_i,c_{ig} = 1) &= \frac{p_gn_i}{p_gn_i + (1-p_g)\lambda_g}, \\ P(n_{i|U} = n_i | n_i,c_{ig} = 1) &= I(n_i = 0) + I(n_i > 0)\frac{(1-p_g)\lambda_g}{p_gn_i + (1-p_g)\lambda_g}, \\ P(n_{i|U} = n_i-1 | n_i,c_{ig} = 1) &= \frac{p_gn_i}{p_gn_i + (1-p_g)\lambda_g}, \\ E[c_{ig}n_{i|M} | n_i] &= P(c_{ig} = 1 | n_i)P(n_{i|M} = 1 | n_i,c_{ig} = 1), \\ E[n_{i|U} | n_i,c_{ig} = 1] &= \Bigl(\frac{p_g(n_i-1) + (1-p_g)\lambda_g}{p_gn_i + (1-p_g)\lambda_g}\Bigr)n_i, \\ E[c_{ig}n_{i|U} | n_i] &= P(c_{ig} = 1 | n_i)E[n_{i|U} | n_i,c_{ig} = 1]. \end{aligned}

The M-step is given by following equations

\begin{aligned} \hat{p}_g &= \frac{\sum_{i=1}^mE[c_{ig}n_{i|M} | n_i;\psi]}{\sum_{i=1}^mE[c_{ig} | n_i; \psi]}, \\ \hat{\lambda}_g &= \frac{\sum_{i=1}^mE[c_{ig}n_{i|U} | n_i; \psi]}{\sum_{i=1}^mE[c_{ig} | n_i; \psi]}, \\ \hat{\alpha}_g &= \frac{1}{m}\sum_{i=1}^mE[c_{ig} | n_i; \psi]. \end{aligned}

As N \to \infty, the error rates and the model parameters are related as follows

\begin{aligned} \text{FNR} &\xrightarrow{p} 1 - E[p(v_i)], \\ (N-1)\text{FPR} &\xrightarrow{p} E[\lambda(v_i)], \end{aligned}

where E[p(v_i)] = \sum_{g=1}^G\alpha_gp_g and E[\lambda(v_i)] = \sum_{g=1}^G\alpha_g\lambda_g.

Value

Returns a list containing:

  • FPR – estimated false positive rate,

  • FNR – estimated false negative rate,

  • iter – number of the EM algorithm iterations performed,

  • convergence – logical, indicating whether the EM algorithm converged within maxiter iterations.

References

Dasylva, A., Goussanou, A. (2021). Estimating the false negatives due to blocking in record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 47, No. 2.

Dasylva, A., Goussanou, A. (2022). On the consistent estimation of linkage errors without training data. Jpn J Stat Data Sci 5, 181–216. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1007/s42081-022-00153-3")}

Examples

## an example proposed by Dasylva and Goussanou (2021)
## we obtain results very close to those reported in the paper

set.seed(111)

neighbors <- rep(0:5, c(1659, 53951, 6875, 603, 62, 5))

errors <- est_block_error(n = neighbors,
                          N = 63155,
                          G = 2,
                          tol = 10^(-3),
                          maxiter = 50)

errors


blocking documentation built on June 18, 2025, 9:16 a.m.