dedup: Deduplicate records
In ropensci/scrubr: Clean Biological Occurrence Records

dedup

R Documentation

Deduplicate records

Description

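Deduplicate records in a data.frame: identify records that match according to a similarity score, then keep one record per group of duplicates or drop them all.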

Usage

dedup(x, how = "one", tolerance = 0.9)

Arguments

`x`	(data.frame) A data.frame, tibble, or data.table
`how`	(character) How to deal with duplicates. The default, "one", keeps one record from each group of duplicates and drops the others, storing the dropped records in the `dups` attribute. "all" drops all duplicates, e.g., when you don't want to keep any duplicated records because it is hard to tell which one to remove.
`tolerance`	(numeric) Score (0 to 1) at which two records are considered a match. Default: 0.9. Results vary with the data, so inspect the output closely and adjust this value accordingly.

Value

A data.frame, optionally with attributes. With how = "one", the dropped duplicates are stored in the `dups` attribute.

Examples

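library(scrubr)

# Take the first 20 rows and append a copy of row 10 as a duplicate
df <- sample_data_1
smalldf <- df[1:20, ]
smalldf <- rbind(smalldf, smalldf[10, ])
# Give the appended row a different key
smalldf[21, "key"] <- 1088954555
NROW(smalldf)
# Deduplicate, then inspect the records that were dropped
dp <- dframe(smalldf) %>% dedup()
NROW(dp)
attr(dp, "dups")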

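# Another example: more than one set of duplicates
df <- sample_data_1
twodups <- df[1:10, ]
# Append copies of rows 9 and 10, giving two sets of duplicates
twodups <- rbind(twodups, twodups[c(9, 10), ])
# Reset row names so the appended rows are numbered sequentially
rownames(twodups) <- NULL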
NROW(twodups)
dp <- dframe(twodups) %>% dedup()
NROW(dp)
attr(dp, "dups")
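
# A further sketch, not part of the original examples: the `how` and
# `tolerance` arguments described under Arguments. It assumes `twodups` from
# the example above and that scrubr is loaded. Results depend on the data,
# so inspect them rather than expecting particular row counts.
dp_all <- dframe(twodups) %>% dedup(how = "all")
NROW(dp_all)
# May be NULL if no `dups` attribute is set for how = "all"
attr(dp_all, "dups")

# A stricter match threshold; 0.99 is an arbitrary illustrative value
dp_strict <- dframe(twodups) %>% dedup(tolerance = 0.99)
NROW(dp_strict)
attr(dp_strict, "dups")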

ropensci/scrubr documentation built on Sept. 12, 2022, 2:12 p.m.