Introduction to blink Package

We present a small example from "Entity Resolution with Emprically Motivated Priors", Bayesian Analysis, (10),4:849-975. We will be using the RecordLinkage package in R and the RLdata500 data set.

The blink package removes duplicate entries from multiple databases using the method outlined in the paper above. We illustrate an example of using this package using a German dataset comprised of first and last name and full date of birth.

Our goals include

Understanding the RLdata500 dataset

The RLdata500 dataset exists already in the RecordLinkage package in R. We review this data set for the user.

The RLdata500 data consists of 500 records with 10 percent duplication. Thus, there are 450 unique individuals. There is full information on each record containing first name, last name, and full date of birth.

We first load the Record Linkgae package and load the RLdata500 data set. We also, provide the first few lines of the data.

library(blink)
data(RLdata500)
head(RLdata500)

Preparing the data

Next, we prepare the data for working with the blink package. The methods assume that at least one variable is a categorical variable and one is a string variable. To proceed, we treat first and last name as string variables and full date of birth as the categorical variables.

# X.c contains the categorical variables
# X.s contains the string variables 
# p.c is the number of categorical variables
# p.s contains the number of string variables
X.c <- RLdata500[c("by","bm","bd")]
X.c <- as.matrix(RLdata500[,"bd"],ncol=1)
p.c <- ncol(X.c)
X.s <- as.matrix(RLdata500[-c(2,4,7)])
p.s <- ncol(X.s)

Setting the Tuning Parameters

Now, we give a small example for setting the tuning parameters before running the Gibbs sampler. The details regarding our recommendations can be found in the aforementioned referenced paper.

First, we work with a file number identifier.

# File number identifier
# Note: Recall that X.c and X.s include all files "stacked" on top of each other.
# The vector below keeps track of which rows of X.c and X.s are in which files.
file.num <- rep(c(1,2,3),c(200,150,150))

Next, we work with the parameters that tune the prior on the amount of distortion that goes into the model. (This corresponds to the parameters of a Beta(a,b) distribution.)

# Subjective choices for distortion probability prior
a <-1
b <- 999

Then we write a function for the Edit distance between two strings. Other distance functions could be used, such as Jaro-Winkler. Sensitivity of this distance function is discussed in the aforementioned paper.

d <- function(string1,string2){adist(string1,string2)}

For the steepness parameter, we recommend

c <- 1

The Gibbs Sampler for blink

We now run a test version of the Gibbs sampler using blink, with 10 Gibbs iterations and a maximum size of M=500 (assuming the overall known population size is 500). Note that this is a toy test and one should note that the Gibbs sampler has failed to converged at this point, which should not be suprising.

lam.gs <- rl.gibbs(file.num=file.num,X.s=X.s,X.c=X.c,num.gs=2,a=a,b=b,c=c,d=d, M=500)

Summary information of blink

Let's read in the estimate linkage structure using 10 Gibbs iterations.

estLink <- lam.gs
estPopSize <- apply(estLink , 1, function(x) {length(unique(x))})
plot(density(estPopSize),xlim=c(300,500),main="",lty=1, "Observed Population Size", ylim= c(0,1))
abline(v=450,col="red")
abline(v=mean(estPopSize),col="black",lty=2)
mean(estPopSize)
sd(estPopSize)

Next, we illustrate an example of what we did in the paper for this dataset. Show the same example, show the error rates, show the comparisons with other methods. Show convergence rates.

We have run the Gibbs sampler for approximately 100,000 iterations and read in the output. (If you run this on your own, please do this on a server and not on a laptop since the convergence of this data set is slow due to this particular data set.)

temp <- unzip("../inst/extdata/sampleLinkage2.txt.zip")

estLink <- as.matrix(read.table(temp,header=TRUE))
dim(estLink)
estPopSize <- apply(estLink , 1, function(x) {length(unique(x))})
plot(density(estPopSize),xlim=c(300,500),main="",lty=1, "Observed Population Size", ylim= c(0,1))
abline(v=450,col="red")
abline(v=mean(estPopSize),col="black",lty=2)
mean(estPopSize)
sd(estPopSize)

Let us now consider the error rates (false negative rate and false positive rate), which are defined in the paper.

# let's calculated the estimated pairwise links
# using the blink method 
est.links.pair <- links(estLink[85000:90000,])
# let's calulated the true links using the 
# unique identifiers that are stored
true.links <- links(matrix(identity.RLdata500,nrow=1))
true.links.pair <- pairwise(true.links)
# Correct, incorrect, and missing links
comparison <- links.compare(est.links.pair,true.links.pair,counts.only=TRUE)
comparison
missing.links <- comparison$missing
true.links<-comparison$correct 
false.links <- comparison$incorrect

truth.links <- true.links+false.links
fpr = false.links/truth.links
fnr = missing.links/truth.links
fdr = false.links/(true.links+false.links)
c(fpr,fnr,fdr)


Try the blink package in your browser

Any scripts or data that you put into this service are public.

blink documentation built on Aug. 7, 2017, 9:03 a.m.