Deduplication using reclin

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(reclin)
library(dplyr)

Using reclin for deduplication will be demonstrated using an example

Towns with names containing 'rdam' or 'rdm' have been selected. This should contain most records concerning the two largest cities in The Netherlands: Amsterdam and Rotterdam.

data("town_names")
head(town_names)

First, we do a little bit of cleanup of the names. We have a lot of names of the form 'amsterdam z.o.', 'amsterdam zo', etc. Removing non-alphanumeric characters will probably help. Also, some of the O's are written as 0's (zeros).

town_names$clean_name <- gsub("[^[:alnum:]]", "", town_names$name)
town_names$clean_name <- gsub("0", "o", town_names$clean_name)

We will now compare all records from the dataset to each other. First, we generate all possible pairs of records. However, as it is not necessary to compare the first record to the second and also the second to the first, we will only select pairs for which the second index is larger than the first. This is done by filter_pairs_for_deduplication. We compare the names using jaro_winkler and select records for which the Jaro-Winkler similarity is above 0.88. This value is determined by eye-balling the data. Usually values around 0.9 work.

p <- pair_blocking(town_names, town_names) %>% 
  filter_pairs_for_deduplication() %>%
  compare_pairs("clean_name", default_comparator = jaro_winkler()) %>% 
  score_simsum() %>% 
  select_threshold(0.88)
head(p)

We have now selected some town names that we consider the same: records 2 and 3 (record 3 in output above) are the same, and records 3 and 4 (record 6). However, records 2 and 4 are not classified as belonging to the same record (record 5).

In our final step we want to assign each record in our original data set town_names into a number of groups, each group containing all records with the same town names. The function deduplicate_equivalance does that. It will use the 'rules' derived above: 2 and 3 belong to the same group, 3 and 4 belong to the same group, etc., to assign each record to a group. It will, therefore, also assign records 2 and 4 to the same group. For those familiar with graph theory: it derives all subgraphs and assigns all nodes in a subgraph the same identifier.

res <- deduplicate_equivalence(p)
head(res)

As we can see records 2 to 6 are assigned to the same group. We can calculate the number of groups and compare that to the original number of town names:

length(unique(res$duplicate_groups))
length(unique(res$duplicate_groups))/nrow(res)

We are only left with r length(unique(res$duplicate_groups)) town names; a reduction of approximately 90 percent. For this small number of remaining groups it is possible to manually derive the correct names, or, if that would be available, we could use the most frequent name in each group as the group name.

Lets assume that we are able to correctly determine the group names. This means that we assign the most frequent official name to each group:

res <- res %>% group_by(duplicate_groups, official_name) %>% mutate(n = n()) %>% 
  group_by(duplicate_groups) %>%
  mutate(group_name = first(official_name, order_by = desc(n)))

We can then calculate the confusion matrix and calculate the precision and recall:

precision <- res %>% group_by(group_name) %>% 
  summarise(precision = sum(group_name == official_name)/n())

precision_recall <- res %>% group_by(official_name) %>% 
  summarise(recall = sum(group_name == official_name)/n()) %>%
  left_join(precision, by = c("official_name" = "group_name")) %>% 
  mutate(precision = ifelse(is.na(precision), 0, precision))

precision_recall

Overall precision and recall

summarise(precision_recall, mean(recall), mean(precision))


Try the reclin package in your browser

Any scripts or data that you put into this service are public.

reclin documentation built on Nov. 23, 2021, 9:09 a.m.