knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.align = 'center')
options(width = 60)
backup_options <- options()
This document gives a practical description of a record linkage procedure based on Extreme Value Theory (EVT). No labeled training data are needed, but the user has to select thresholds in a mean residual life plot (also known as a mean excess plot).
In the following, the data set RLdata500 is used. Because classification with EVT is weight-based, weights have to be calculated for the record pairs to be classified. Here they are computed with an EM algorithm (emWeights).
library(RecordLinkage)
data(RLdata500)
bf <- list(1, 3, 5, 6, 7)
rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500, blockfld = bf, strcmp = 1:4)
rpairs <- emWeights(rpairs)
Calling getParetoThreshold opens a mean residual life (MRL) plot for the computed weights, as shown in Figure 1. From this graph, an interval has to be selected where the graph has a relatively long and approximately linear descent. Usually this can be found in the range between 0 and 20 for weights computed with emWeights or between 0.5 and 0.9 for weights computed with epiWeights. Figure 2 shows the same MRL plot with the appropriate segment marked.
The interval is selected by clicking on the endpoints of the desired segment of the graph. If the right endpoint coincides with the right edge of the graph, only the left endpoint needs to be selected. See the documentation of identify for more information on selecting points on a plot.
## Not run: getParetoThreshold(rpairs)
plotMRL(rpairs)
plotMRL(rpairs)
abline(v = c(1.2, 12.8), col = "red", lty = "dashed")
l <- mrl(rpairs$Wdata)
range <- l$x > 1.2 & l$x < 12.8
points(l$x[range], l$y[range], col = "red", type = "l")
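To make the quantity behind the MRL plot more concrete, the following base R sketch computes a mean excess function by hand. It is an illustration only: `mean_excess` is a hypothetical helper, not part of RecordLinkage, and the toy data are an evenly spaced grid on [0, 1] rather than record pair weights. This grid was chosen because its mean excess is approximately (1 - u) / 2, a straight line with negative slope, which is the kind of linear descent to look for in the plot.

```r
# Hypothetical helper (not a RecordLinkage function): for each threshold t,
# the mean amount by which the values above t exceed t.
mean_excess <- function(x, u) {
  vapply(u, function(t) mean(x[x > t] - t), numeric(1))
}

# Toy data standing in for weights: an evenly spaced grid on [0, 1]
x <- seq(0, 1, length.out = 1001)

# Thresholds at which the mean excess is evaluated
u <- seq(0, 0.9, by = 0.1)
e <- mean_excess(x, u)

# The mean excess falls roughly linearly as the threshold grows
print(data.frame(threshold = u, mean_excess = round(e, 3)))
```

The grid used by the package's own plot may differ. The sketch is only meant to show why a segment where the excesses follow a generalized Pareto distribution appears as a straight line.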
As an alternative to interactive selection, the interval can be passed to getParetoThreshold through the interval argument. In either case, the return value is a threshold that can be used directly for classification.
threshold <- getParetoThreshold(rpairs, interval = c(1.2, 12.8))
result <- emClassify(rpairs, threshold)
summary(result)
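As a rough illustration of what weight-based classification with a single threshold amounts to, the base R sketch below labels a few made-up weights. The weights and the threshold value are hypothetical, and the rule that pairs at or above the threshold count as links ("L") while all others count as non-links ("N") is an assumption made for this sketch, not a statement of how emClassify handles every case.

```r
# Hypothetical weights standing in for the values in rpairs$Wdata
weights <- c(-8.2, -1.5, 3.0, 12.8, 17.4, 21.9)

# Hypothetical threshold, e.g. a value returned by getParetoThreshold
threshold <- 12.8

# Assumed rule: at or above the threshold is a link, otherwise a non-link
prediction <- factor(ifelse(weights >= threshold, "L", "N"),
                     levels = c("N", "L"))

# Number of pairs per class
print(table(prediction))
```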
options(backup_options)