# linkRecords: Bayes Estimates of Bipartite Matchings In BRL: Beta Record Linkage

## Description

Bayes point estimates of bipartite matchings that can be obtained in closed form according to Theorems 1, 2 and 3 of Sadinle (2017).

## Usage

```r
linkRecords(Zchain, n1, lFNM = 1, lFM1 = 1, lFM2 = 2, lR = Inf)
```

## Arguments

| Argument | Description |
| --- | --- |
| `Zchain` | matrix as the output `$Z` of the function `bipartiteGibbs`, with `n2` rows and `nIter` columns containing a chain of draws from a posterior distribution on bipartite matchings. Each column indicates the records in datafile 1 to which the records in datafile 2 are matched according to that draw. |
| `n1` | number of records in datafile 1. |
| `lFNM` | individual loss of a false non-match in the loss functions of Sadinle (2017), default `lFNM=1`. |
| `lFM1` | individual loss of a false match of type 1 in the loss functions of Sadinle (2017), default `lFM1=1`. |
| `lFM2` | individual loss of a false match of type 2 in the loss functions of Sadinle (2017), default `lFM2=2`. |
| `lR` | individual loss of a 'rejection', that is, of declining to make a linkage decision, in the loss functions of Sadinle (2017), default `lR=Inf`. |
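To make the expected shape of `Zchain` concrete, here is a minimal toy sketch in base R. The matrix is made up for illustration and was not produced by `bipartiteGibbs`; it also assumes that draws encode a non-match for record `j` as `n1+j`, the same convention this page documents for the output of `linkRecords`.

```r
# Toy chain (made-up values): n2 = 3 records in datafile 2, nIter = 4 draws.
# Assumption: a value n1 + j in row j means record j has no match in datafile 1.
n1 <- 4
Zchain <- cbind(c(2, 6, 4),
                c(2, 6, 7),
                c(2, 1, 4),
                c(2, 6, 4))

# Rows index records in datafile 2, columns index draws.
dim(Zchain)

# Share of draws linking record 3 of datafile 2 to record 4 of datafile 1.
probLink34 <- mean(Zchain[3, ] == 4)

# Share of draws in which record 2 of datafile 2 is left unmatched.
probNoMatch2 <- mean(Zchain[2, ] > n1)
```

Row-wise frequencies like these are what the point estimate summarizes.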

## Details

Not all combinations of losses `lFNM, lFM1, lFM2, lR` are supported. The losses have to be positive numbers and satisfy one of three conditions:

1. Conditions of Theorem 1 of Sadinle (2017): `(lR == Inf) & (lFNM <= lFM1) & (lFNM + lFM1 <= lFM2)`

2. Conditions of Theorem 2 of Sadinle (2017): `((lFM2 >= lFM1) & (lFM1 >= 2*lR)) | ((lFM1 >= lFNM) & (lFM2 >= lFM1 + lFNM))`

3. Conditions of Theorem 3 of Sadinle (2017): `(lFM2 >= lFM1) & (lFM1 >= 2*lR) & (lFNM >= 2*lR)`

If one of the last two conditions is satisfied, the point estimate might be partial, meaning that there might be some records in datafile 2 for which the point estimate does not include a linkage decision. For combinations of losses not supported here, the linear sum assignment problem outlined by Sadinle (2017) needs to be solved.
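The conditions above can be checked before calling `linkRecords`. The following is a minimal sketch; `checkLossConditions` is a hypothetical helper written for this page, not a function of BRL, and it only transcribes the three documented conditions. It makes no claim about which theorem `linkRecords` applies when several conditions hold at once.

```r
# Hypothetical helper (not part of BRL): report which of the documented
# conditions a combination of losses satisfies.
checkLossConditions <- function(lFNM = 1, lFM1 = 1, lFM2 = 2, lR = Inf) {
  losses <- c(lFNM = lFNM, lFM1 = lFM1, lFM2 = lFM2, lR = lR)
  if (any(is.na(losses)) || any(losses <= 0)) {
    stop("all losses must be positive numbers")
  }
  c(
    theorem1 = (lR == Inf) && (lFNM <= lFM1) && (lFNM + lFM1 <= lFM2),
    theorem2 = ((lFM2 >= lFM1) && (lFM1 >= 2 * lR)) ||
               ((lFM1 >= lFNM) && (lFM2 >= lFM1 + lFNM)),
    theorem3 = (lFM2 >= lFM1) && (lFM1 >= 2 * lR) && (lFNM >= 2 * lR)
  )
}

# Default losses, losses allowing rejections, and an unsupported combination.
defaults    <- checkLossConditions()
partial     <- checkLossConditions(lFNM = 1, lFM1 = 1, lFM2 = 2, lR = 0.5)
unsupported <- checkLossConditions(lFNM = 2, lFM1 = 1, lFM2 = 2, lR = Inf)

# A combination is covered by this page only if at least one condition holds.
any(unsupported)
```

If `any()` returns `FALSE` for a combination, none of the closed-form results listed above covers it.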

## Value

A vector of length `n2` containing the point estimate of the bipartite matching. If `lR != Inf` the output might be a partial estimate. Entry `j` is read as follows: a number less than or equal to `n1` is the record in datafile 1 to which record `j` in datafile 2 gets linked; the number `n1+j` indicates that record `j` does not get linked to any record in datafile 1; and the value `-1` indicates a 'rejection', meaning that the correct linkage decision for record `j` is not clear.
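As an illustration of this encoding, the sketch below decodes a made-up output vector in base R. The values of `n1` and `Zhat` are invented for the example and do not come from a real call to `linkRecords`.

```r
# Made-up example: n1 = 5 records in datafile 1, n2 = 4 records in datafile 2.
n1 <- 5
Zhat <- c(3, 7, -1, 1)   # entry 2 equals n1 + 2, so record 2 is a non-link
j <- seq_along(Zhat)

# Test for -1 first, because -1 is also <= n1.
status <- ifelse(Zhat == -1, "rejected",
          ifelse(Zhat <= n1, "linked", "non-link"))

# Pairs of linked records: row in datafile 2 and its match in datafile 1.
links <- data.frame(file2 = j[status == "linked"],
                    file1 = Zhat[status == "linked"])
links
```

Records with status `"rejected"` are the ones to send to manual review or further analysis.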

## References

Mauricio Sadinle (2017). Bayesian Estimation of Bipartite Matchings for Record Linkage. Journal of the American Statistical Association 112(518), 600-612.

## Examples

```r
data(twoFiles)

myCompData <- compareRecords(df1, df2, flds=c("gname", "fname", "age", "occup"),
                             types=c("lv","lv","bi","bi"))

chain <- bipartiteGibbs(myCompData)

## discard first 100 iterations of Gibbs sampler

## full estimate of bipartite matching (full linkage)
fullZhat <- linkRecords(chain$Z[,-c(1:100)], n1=nrow(df1), lFNM=1, lFM1=1, lFM2=2, lR=Inf)

## partial estimate of bipartite matching (partial linkage), where
## lR=0.5, lFNM=1, lFM1=1 mean that we consider not making a decision for
## a record as being half as bad as a false non-match or a false match of type 1
partialZhat <- linkRecords(chain$Z[,-c(1:100)], n1=nrow(df1), lFNM=1, lFM1=1, lFM2=2, lR=.5)

## for which records the decision is not clear according to this set-up of the losses?
undecided <- which(partialZhat == -1)
df2[undecided,]

## compute frequencies of link options observed in the chain
linkOptions <- apply(chain$Z[undecided, -c(1:100)], 1, table)
linkOptions <- lapply(linkOptions, sort, decreasing=TRUE)
linkOptionsInds <- lapply(linkOptions, names)
linkOptionsInds <- lapply(linkOptionsInds, as.numeric)
linkOptionsFreqs <- lapply(linkOptions, function(x) as.numeric(x)/sum(x))

## first record without decision
df2[undecided[1],]

## options for this record; row of NAs indicates possibility that record has no match in df1
cbind(df1[linkOptionsInds[[1]],], prob = round(linkOptionsFreqs[[1]],3) )
```

BRL documentation built on Jan. 13, 2020, 5:07 p.m.