knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(reclin) library(dplyr)
Using reclin
for deduplication will be demonstrated using an example
Towns with names containing 'rdam' or 'rdm' have been selected. This should contain most records concerning the two largest cities in The Netherlands: Amsterdam and Rotterdam.
data("town_names") head(town_names)
First, we do a little bit of cleanup of the names. We have a lot of names of the form 'amsterdam z.o.', 'amsterdam zo', etc. Removing non-alphanumeric characters will probably help. Also, some of the O's are written as 0's (zeros).
town_names$clean_name <- gsub("[^[:alnum:]]", "", town_names$name) town_names$clean_name <- gsub("0", "o", town_names$clean_name)
We will now compare all records from the dataset to each other. First, we
generate all possible pairs of records. However, as it is not necessary to
compare the first record to the second and also the second to the first, we will
only select pairs for which the second index is larger than the first. This is
done by filter_pairs_for_deduplication
. We compare the names using
jaro_winkler
and select records for which the Jaro-Winkler similarity is above
0.88. This value is determined by eye-balling the data. Usually values around
0.9 work.
p <- pair_blocking(town_names, town_names) %>% filter_pairs_for_deduplication() %>% compare_pairs("clean_name", default_comparator = jaro_winkler()) %>% score_simsum() %>% select_threshold(0.88) head(p)
We have now selected some town names that we consider the same: records 2 and 3 (record 3 in output above) are the same, and records 3 and 4 (record 6). However, records 2 and 4 are not classified as belonging to the same record (record 5).
In our final step we want to assign each record in our original data set town_names
into a number of groups, each group containing all records with the same town
names. The function deduplicate_equivalance
does that. It will use the 'rules'
derived above: 2 and 3 belong to the same group
,
3 and 4 belong to the same group
, etc., to assign each record to a group. It
will, therefore, also assign records 2 and 4 to the same group. For those
familiar with graph theory: it derives all subgraphs and assigns all nodes in a
subgraph the same identifier.
res <- deduplicate_equivalence(p) head(res)
As we can see records 2 to 6 are assigned to the same group. We can calculate the number of groups and compare that to the original number of town names:
length(unique(res$duplicate_groups)) length(unique(res$duplicate_groups))/nrow(res)
We are only left with r length(unique(res$duplicate_groups))
town names; a
reduction of approximately 90 percent. For this small number of remaining groups
it is possible to manually derive the correct names, or, if that would be
available, we could use the most frequent name in each group as the group name.
Lets assume that we are able to correctly determine the group names. This means that we assign the most frequent official name to each group:
res <- res %>% group_by(duplicate_groups, official_name) %>% mutate(n = n()) %>% group_by(duplicate_groups) %>% mutate(group_name = first(official_name, order_by = desc(n)))
We can then calculate the confusion matrix and calculate the precision and recall:
precision <- res %>% group_by(group_name) %>% summarise(precision = sum(group_name == official_name)/n()) precision_recall <- res %>% group_by(official_name) %>% summarise(recall = sum(group_name == official_name)/n()) %>% left_join(precision, by = c("official_name" = "group_name")) %>% mutate(precision = ifelse(is.na(precision), 0, precision)) precision_recall
Overall precision and recall
summarise(precision_recall, mean(recall), mean(precision))
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.