
# Deduplicate

When working with real-life data, I have often needed to determine whether it contains duplicates. It is also common to ask whether new samples are actually new or are updates of records already in a dataset.

To tackle these issues I created deduplicate, which is nothing more than a set of wrapper functions around dplyr's joins and fuzzyjoin.
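As a rough illustration of the kind of join this builds on, the sketch below uses dplyr directly rather than any deduplicate function. The data frames, column names, and the `key_cols` vector are all made up for the example, and exact matching on every key column is an assumption.

```r
library(dplyr)

# Hypothetical data; column names are illustrative only.
existing <- data.frame(
  firstname = c("Ana", "Juan"),
  lastname  = c("Perez", "Diaz"),
  city      = c("Bogota", "Cali"),
  stringsAsFactors = FALSE
)
incoming <- data.frame(
  firstname = c("Ana", "Luis"),
  lastname  = c("Perez", "Mora"),
  city      = c("Bogota", "Quito"),
  stringsAsFactors = FALSE
)

# Columns assumed to identify a record
key_cols <- c("firstname", "lastname", "city")

# Rows of `incoming` with no match in `existing`: candidates for new records
new_records <- anti_join(incoming, existing, by = key_cols)

# Rows of `incoming` that match an existing record: candidates for updates
possible_updates <- semi_join(incoming, existing, by = key_cols)
```

Exact joins like these miss near-matches such as typos, which is where an approximate join from fuzzyjoin would come in.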

So far I have the following functions:

All these functions use the naïve approach of creating a unique id out of multiple columns of the dataset, e.g. FIRSTNAME_LASTNAME_CITY.
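The idea can be sketched in base R as follows. This is not the package's implementation: `make_naive_id` is a hypothetical helper, the sample data is invented, and upper-casing and trimming whitespace before pasting is an assumption about how values might be normalized.

```r
# Hypothetical sample data; column names are illustrative only.
d <- data.frame(
  firstname = c("Ana", "Juan", "ana", "Luis"),
  lastname  = c("Perez", "Diaz", "Perez ", "Mora"),
  city      = c("Bogota", "Cali", "Bogota", "Quito"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of deduplicate): builds an id such as
# FIRSTNAME_LASTNAME_CITY from the chosen columns.
make_naive_id <- function(data, id_cols, sep = "_") {
  # Normalize each column: coerce to character, trim whitespace, upper-case
  parts <- lapply(data[id_cols], function(x) toupper(trimws(as.character(x))))
  # Paste the normalized columns together row by row
  do.call(paste, c(unname(parts), sep = sep))
}

d$custom_id <- make_naive_id(d, c("firstname", "lastname", "city"))

# Flag every row whose id appears more than once in the data
d$is_duplicate <- duplicated(d$custom_id) |
  duplicated(d$custom_id, fromLast = TRUE)
```

With this data, rows 1 and 3 collapse to the same id and are both flagged, while a single-character typo in any column would still produce a different id.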

Planned features include:

Todo's

Fix the "custom_id" name conflict: `dids <- create_idcols(d, id_cols)` followed by `add_approx_unique_id(dids, col = "custom_id")`

exclusive_ids for more than 2 ids. Use mutate_all() with do()



jpmarindiaz/deduplicate documentation built on May 19, 2019, 10:46 p.m.