dedup: Deduplicate records
In scrubr: Clean Biological Occurrence Records


Deduplicate records

dedup(x, how = "one", tolerance = 0.9)

`x`	(data.frame) A data.frame, tibble, or data.table
`how`	(character) How to handle duplicates. The default, "one", keeps one record from each group of duplicates and drops the others, storing them in the `dups` attribute. "all" drops every record in each group of duplicates, which is useful when it is hard to tell which record to remove and you would rather not keep any of them.
`tolerance`	(numeric) Similarity score (0 to 1) at which two records are considered a match. Results vary with the data, so inspect the output closely and adjust this value as needed.
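
To make the interaction between `how` and `tolerance` concrete, here is a minimal base R sketch. It is not scrubr's implementation: `sketch_dedup` is a hypothetical helper, the similarity measure (edit distance from `utils::adist`, scaled to 0-1) is an assumed stand-in for whatever scoring `dedup()` uses, and storing dropped rows in `dups` for `how = "all"` is a choice made here for illustration. The third record in `recs` is made up.

```r
# Hypothetical helper, NOT part of scrubr: flags near-duplicate rows by
# collapsing each row to one string and comparing the strings pairwise.
sketch_dedup <- function(x, how = c("one", "all"), tolerance = 0.9) {
  how <- match.arg(how)
  # Collapse every row into a single string so rows can be compared as text
  keys <- do.call(paste, c(lapply(x, as.character), sep = "|"))
  # Edit distance scaled to a 0-1 similarity (1 means identical strings);
  # this scoring is an assumption, not scrubr's documented method
  longest <- outer(nchar(keys), nchar(keys), pmax)
  sim <- 1 - utils::adist(keys) / pmax(longest, 1)
  # Two different rows match when their similarity reaches the tolerance
  is_match <- sim >= tolerance
  diag(is_match) <- FALSE
  # A "later copy" is a row that matches at least one row above it
  later_copy <- vapply(
    seq_len(nrow(x)),
    function(i) any(is_match[i, seq_len(i - 1)]),
    logical(1)
  )
  # "one" drops only the later copies; "all" drops every row with a match
  drop <- if (how == "one") later_copy else rowSums(is_match) > 0
  out <- x[!drop, , drop = FALSE]
  attr(out, "dups") <- x[drop, , drop = FALSE]
  out
}

# Rows 1 and 2 differ only in the last digit of `key`; row 3 is unrelated
recs <- data.frame(
  name = c("Ursus americanus", "Ursus americanus", "Ursus arctos"),
  longitude = c(-76.78671, -76.78671, -120.5),
  latitude = c(35.53079, 35.53079, 48.2),
  key = c(1088954559, 1088954555, 1065588311),
  stringsAsFactors = FALSE
)

one <- sketch_dedup(recs, how = "one")          # keeps rows 1 and 3
all_dropped <- sketch_dedup(recs, how = "all")  # keeps row 3 only
strict <- sketch_dedup(recs, tolerance = 1)     # exact matches only, keeps all rows
```

In this sketch, raising `tolerance` to 1 means only identical rows count as duplicates, so the two bear records that differ in `key` are both kept. Lowering it widens what counts as a match, which is why the output is worth inspecting before settling on a value.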

Returns a data.frame, optionally with attributes. With `how = "one"`, the dropped duplicate records are stored in the `dups` attribute.

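library("scrubr")

# Build a 21-row data.frame whose last row is a copy of row 10
df <- sample_data_1
smalldf <- df[1:20, ]
smalldf <- rbind(smalldf, smalldf[10,])
# Change the copy's key so it is a near duplicate, not an exact one
smalldf[21, "key"] <- 1088954555
NROW(smalldf)
dp <- dframe(smalldf) %>% dedup()
NROW(dp)
# Inspect the record that was dropped
attr(dp, "dups")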

# Another example: more than one set of duplicates
df <- sample_data_1
twodups <- df[1:10, ]
# Append exact copies of rows 9 and 10
twodups <- rbind(twodups, twodups[c(9, 10), ])
rownames(twodups) <- NULL
NROW(twodups)
dp <- dframe(twodups) %>% dedup()
NROW(dp)
# Both appended copies end up in the dups attribute
attr(dp, "dups")

[1] 21
[1] 20
<scrubr dframe>
Size: 1 X 5


              name longitude latitude                date        key
             (chr)     (dbl)    (dbl)              (time)      (dbl)
1 Ursus americanus -76.78671 35.53079 2015-04-05 23:00:00 1088954555
[1] 12
[1] 10
<scrubr dframe>
Size: 2 X 5


              name longitude latitude                date        key
             (chr)     (dbl)    (dbl)              (time)      (int)
1 Ursus americanus -78.25027 36.93018 2015-03-20 21:11:24 1088923534
2 Ursus americanus -76.78671 35.53079 2015-04-05 23:00:00 1088954559