ActiveRL is installed from GitHub using the devtools package in R. If you have never installed devtools, do so first with install.packages("devtools").
library(devtools)
install_github('kaylafrisoli/ActiveRL')
Record linkage is the process of identifying records corresponding to unique entities across multiple data sets. This is necessary because records within a data set may not share a unique identifier, such as social security number or licence plate number. Record linkage is a common problem in many fields. For example, if two airlines merge they may want to link their flight records to gain information on their customers. They would need to decide if Sam Smith with birth date 05/15/86 is the same person as Samuel Smyth with birth date 05/16/86. Record linkage is especially important for government organizations like the U.S. Census Bureau, which requires accurate linkages of information across its many data sources.
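To make the Sam Smith example concrete, here is a minimal sketch in base R that scores how similar the two records are. It uses Levenshtein edit distance (utils::adist) scaled by string length, which is an illustrative choice and not necessarily what ActiveRL does internally; the variable names are hypothetical.

```r
# Two records that may describe the same person (values from the example above).
rec1 <- c(name = "Sam Smith",    dob = "05/15/86")
rec2 <- c(name = "Samuel Smyth", dob = "05/16/86")

# Levenshtein edit distance from base R (utils::adist).
name.dist <- adist(rec1["name"], rec2["name"])[1, 1]
dob.dist  <- adist(rec1["dob"],  rec2["dob"])[1, 1]

# Scale by the longer string so that 1 means the strings are identical.
name.sim <- 1 - name.dist / max(nchar(rec1["name"]), nchar(rec2["name"]))
dob.sim  <- 1 - dob.dist  / max(nchar(rec1["dob"]),  nchar(rec2["dob"]))

print(c(name.sim = name.sim, dob.sim = dob.sim))
```

Neither field matches exactly, yet both scores are well above zero. A record linkage model has to turn such partial agreement into a match or non-match decision.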
ActiveRL provides the tools to perform record linkage on a data set of records. If you have training data you can utilize current supervised learning methods to create unique ids for your testing data. If not, you can create your own training data and then utilize the same methods to create unique ids.
We build our own training data using an active learning approach that allows users to create their own optimized training datasets for specific record linkage problems. Our algorithm prompts users to label the record pairs for which its predictions are most uncertain, increasing the resulting classifier's predictive power. There is much work still to be done in this area, and we are continually learning more about active learning and about how to handle situations with imbalanced classes (far more record non-matches than record matches).
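The idea of asking for labels where the model is least certain can be sketched with a plain logistic regression. This is only an illustration under stated assumptions: the data are simulated, uncertainty is taken to be the distance of the predicted probability from 0.5, and all names (pairs, to.label, and so on) are hypothetical. The selection rule ActiveRL actually uses may differ.

```r
set.seed(1)

# Hypothetical comparison data: one similarity score per record pair.
pairs <- data.frame(sim = runif(200))
# Simulated truth: pairs with high similarity are more likely to match.
pairs$match <- rbinom(200, 1, plogis(10 * (pairs$sim - 0.7)))

# Start with a small labeled set; the remaining pairs count as unlabeled.
labeled <- sample(nrow(pairs), 30)
fit <- glm(match ~ sim, data = pairs[labeled, ], family = binomial)

# Score the unlabeled pairs and rank them by closeness to probability 0.5.
unlabeled <- setdiff(seq_len(nrow(pairs)), labeled)
p <- predict(fit, newdata = pairs[unlabeled, , drop = FALSE], type = "response")
uncertainty <- abs(p - 0.5)

# The five pairs the user would be asked to label next.
to.label <- unlabeled[order(uncertainty)[1:5]]
print(pairs[to.label, ])
```

After the user labels these pairs, they would be added to the labeled set and the model refitted, repeating until the labeling budget is used up.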
We will walk through an example using a data set built into the RecordLinkage package in R. You will need to install this package if you do not already have it.
library(ActiveRL)
library(RecordLinkage)
data("RLdata500")
ids <- identity.RLdata500
First we will evenly split our data into testing and training data.
splitRL <- SplitIntoTrainTest(RLdata500, ids, seed=16, prob.of.train = .5)
?SplitIntoTrainTest
Then we will block both our training and testing data by birth month. In this example we create 12 different data sets, each corresponding to one month. This reduces the comparison space and makes computation more feasible. Blocks are independent of one another, so we perform record linkage only within each block.
blockTrain <- BlockRlData(splitRL$training.data,
                          var.names = "bm",
                          n.chars = 2,
                          splitRL$train.id)
blockTest <- BlockRlData(splitRL$testing.data,
                         var.names = "bm",
                         n.chars = 2,
                         splitRL$test.id)
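To see why blocking shrinks the comparison space, the following base R sketch counts record pairs with and without blocking on a toy data set. The data and variable names are made up for illustration, and split() stands in for whatever BlockRlData does internally.

```r
# Toy records; 'bm' is birth month, as in RLdata500.
records <- data.frame(
  fname = c("ANNA", "ANA", "KARL", "CARL", "PETER", "PETRA"),
  bm    = c(1, 1, 5, 5, 5, 12)
)

# One data frame per distinct birth month.
blocks <- split(records, records$bm)

# Number of pairs to compare without blocking vs. within blocks only.
n.pairs.all     <- choose(nrow(records), 2)
n.pairs.blocked <- sum(sapply(blocks, function(b) choose(nrow(b), 2)))

print(c(all = n.pairs.all, blocked = n.pairs.blocked))

# For scale: 500 records give this many unblocked pairs.
print(choose(500, 2))
```

The trade-off is that two records for the same entity with different birth months land in different blocks and can never be linked, so the blocking variable should be one that is rarely recorded incorrectly.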
We then need to make comparisons between records within each block. We will first do this for all blocks in a single loop, and then parallelize the process across blocks, which is possible because the blocks are independent. We compare each field in variables.to.match using the corresponding function in string.comparators; the two vectors must be in the same order.
compare.train <- CompareAllBlocksInLoop(blockTrain$DataSplit,
                                        blockTrain$IdSplit,
                                        variables.to.match = c("fname_c1",
                                                               "lname_c1",
                                                               "by",
                                                               "bm",
                                                               "bd"),
                                        string.comparators = c("jarowinkler",
                                                               "jarowinkler",
                                                               "AbsoluteDifference",
                                                               "AbsoluteDistance",
                                                               "AbsoluteDistance"),
                                        record.ids.to.keep = c("PreSplitRecord", "PreBlockRecord"))
compare.test <- CompareAllBlocksInLoop(blockTest$DataSplit,
                                       blockTest$IdSplit,
                                       variables.to.match = c("fname_c1",
                                                              "lname_c1",
                                                              "by",
                                                              "bm",
                                                              "bd"),
                                       string.comparators = c("jarowinkler",
                                                              "jarowinkler",
                                                              "AbsoluteDifference",
                                                              "AbsoluteDistance",
                                                              "AbsoluteDistance"),
                                       record.ids.to.keep = c("PreSplitRecord", "PreBlockRecord"))
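As a rough picture of what a within-block comparison produces, this base R sketch builds one row per unique record pair, with one column per compared field. It is an assumption-laden illustration: the block is made up, the helper lev.sim is hypothetical, and it uses a normalized Levenshtein similarity in place of Jaro-Winkler because base R has no Jaro-Winkler function.

```r
# A toy block of three records.
block <- data.frame(
  fname = c("CARSTEN", "KARSTEN", "GERD"),
  by    = c(1949, 1949, 1968),
  stringsAsFactors = FALSE
)

# All unique unordered pairs of row indices: (1,2), (1,3), (2,3).
idx <- t(combn(nrow(block), 2))

# Hypothetical comparator: normalized Levenshtein similarity (not Jaro-Winkler).
lev.sim <- function(a, b) {
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

# One row per pair, one column per compared field.
comparisons <- data.frame(
  record1   = idx[, 1],
  record2   = idx[, 2],
  fname.sim = lev.sim(block$fname[idx[, 1]], block$fname[idx[, 2]]),
  by.abs    = abs(block$by[idx[, 1]] - block$by[idx[, 2]])
)
print(comparisons)
```

A block of n records yields choose(n, 2) rows, which is why the size of the largest block drives the cost of this step.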
Running the loop across cores:
library(doMC)
options(cores = 6)
registerDoMC()
# Training blocks and their ids
Dsplit <- blockTrain$DataSplit
Idsplit <- blockTrain$IdSplit
variables.to.match <- c("fname_c1",
                        "lname_c1",
                        "by",
                        "bm",
                        "bd")
string.comparators <- c("jarowinkler",
                        "jarowinkler",
                        "AbsoluteDifference",
                        "AbsoluteDistance",
                        "AbsoluteDistance")
record.ids.to.keep <- c("PreSplitRecord", "PreBlockRecord")
# Compare records within each block; blocks are processed in parallel
block.comparison.lists <- foreach(i = seq_along(Dsplit)) %dopar% {
  # Use the ids for this block if they are available
  if (is.null(Idsplit)) {
    ids.for.loop <- NULL
  } else {
    ids.for.loop <- Idsplit[[i]]
  }
  as.data.frame(CompareUniqueCombinations(as.data.frame(Dsplit[[i]]),
                                          as.vector(ids.for.loop),
                                          variables.to.match = variables.to.match,
                                          string.comparators = string.comparators,
                                          record.ids.to.keep = record.ids.to.keep))
}
compare.train <- block.comparison.lists
# Repeat for the testing blocks, reusing the comparison settings defined above
options(cores = 6)
registerDoMC()
Dsplit <- blockTest$DataSplit
Idsplit <- blockTest$IdSplit
block.comparison.lists <- foreach(i = seq_along(Dsplit)) %dopar% {
  # Use the ids for this block if they are available
  if (is.null(Idsplit)) {
    ids.for.loop <- NULL
  } else {
    ids.for.loop <- Idsplit[[i]]
  }
  as.data.frame(CompareUniqueCombinations(as.data.frame(Dsplit[[i]]),
                                          as.vector(ids.for.loop),
                                          variables.to.match = variables.to.match,
                                          string.comparators = string.comparators,
                                          record.ids.to.keep = record.ids.to.keep))
}
compare.test <- block.comparison.lists
We then merge our training comparisons and build a model on the comparison data.
training.merged <- MergeAllBlocks(compare.train)
model <- glm(True_Match ~ fname_c1.jar + lname_c1.jar + by.Abs + bm.Abs + bd.Abs,
             data = training.merged,
             family = binomial)
We can then calculate the probability that records in our testing data match and hierarchically cluster based on these probabilities. We end up with unique ids for our testing data.
get.test.ids.by.block <- AllBlocksHclustCutGLM(model, compare.test, blockTest$DataSplit, .5)
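The step from pairwise match probabilities to unique ids can be sketched in base R with hclust and cutree. The probabilities below are invented, the single-linkage method is an assumption, and the cut height of 0.5 is chosen only for illustration; it may or may not correspond to the .5 argument passed above.

```r
# Hypothetical match probabilities for four records in one block.
p <- matrix(0, 4, 4, dimnames = list(1:4, 1:4))
p[1, 2] <- 0.95   # records 1 and 2 probably match
p[3, 4] <- 0.10
p[1, 3] <- 0.05; p[1, 4] <- 0.02
p[2, 3] <- 0.08; p[2, 4] <- 0.03
p <- p + t(p)     # make the matrix symmetric

# Treat 1 - probability as a distance, cluster, and cut the tree at height 0.5.
d   <- as.dist(1 - p)
hc  <- hclust(d, method = "single")
ids <- cutree(hc, h = 0.5)

# Records that share an id are treated as the same entity.
print(ids)
```

Clustering guarantees that the resulting matches are transitive: if records A and B share an id and B and C share an id, then A and C do too, which independent pairwise thresholds cannot guarantee.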
We may want to check how well we did.
true.match <- get.test.ids.by.block$merged.comparison.data$True_Match
predicted.match <- get.test.ids.by.block$merged.comparison.data$HclustMatch
> evaluation(true.match, predicted.match)
$contingency.table
     predicted
truth    0 1
    0 2596 0
    1    2 8
$false.positive.error
[1] 0
$false.negative.error
[1] 0.2
$false.discovery.rate
[1] 0
$sensitivity
[1] 0.8
$specificity
[1] 1
$precision
[1] 1
$negative.predictive.value
[1] 0.9992302
$accuracy
[1] 0.9992325
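The reported rates follow directly from the contingency table. As a check, this sketch recomputes several of them in base R from the four counts shown above; the variable names are illustrative and are not part of the evaluation function's interface.

```r
# Contingency table from the output above (rows = truth, columns = predicted).
tab <- matrix(c(2596, 2, 0, 8), nrow = 2,
              dimnames = list(truth = c("0", "1"), predicted = c("0", "1")))

tn <- tab["0", "0"]; fp <- tab["0", "1"]
fn <- tab["1", "0"]; tp <- tab["1", "1"]

sensitivity <- tp / (tp + fn)                    # share of true matches found
specificity <- tn / (tn + fp)                    # share of true non-matches kept apart
precision   <- tp / (tp + fp)                    # share of predicted matches that are correct
npv         <- tn / (tn + fn)                    # negative predictive value
accuracy    <- (tp + tn) / sum(tab)

print(round(c(sensitivity = sensitivity, specificity = specificity,
              precision = precision, npv = npv, accuracy = accuracy), 7))
```

With only 10 true matches among 2606 pairs, accuracy is dominated by the non-matches, so sensitivity and precision are the more informative numbers here.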
If you do not have training data, you can build your own by labeling record pairs:
createTraining <- BuildATrainingDataset(RLdata500,
                                        n.pairs.to.test = 10,
                                        variables.to.match = c("fname_c1",
                                                               "lname_c1",
                                                               "by",
                                                               "bm",
                                                               "bd"),
                                        string.comparators = c("jarowinkler",
                                                               "jarowinkler",
                                                               "AbsoluteDifference",
                                                               "AbsoluteDistance",
                                                               "AbsoluteDistance"),
                                        standardized.variables = c("fname_c1",
                                                                   "lname_c1"))
> createTraining$tested.comparisons
       fname_c1.jar lname_c1.jar by.Abs bm.Abs bd.Abs True_Match CurrentRecord1 CurrentRecord2 Active_Match
29        0.0000000    0.0000000     45      6      1         NA              1             30            0
125       0.0000000    0.0000000     22      2      1         NA              1            126            0
121282    1.0000000    1.0000000     32      2     11         NA            417            435            0
122238    1.0000000    1.0000000     16      1      8         NA            429            473            0
124569    1.0000000    1.0000000      9      1      1         NA            481            490            1
57032     0.5555556    0.5873016      7      5     15         NA            132            310            0
85128     0.5972222    0.7703704     39      2     15         NA            218            499            0
103440    0.4179894    0.6583333     19      2      4         NA            294            305            0
41309     0.5396825    0.5619048     42      1     12         NA             91            495            0
66563     0.5000000    0.0000000     38      5     15         NA            159            283            0