dedup: Deduplicate records

Description

Deduplicate records in a data.frame, treating records as duplicates when their similarity score reaches a tolerance.

Usage

dedup(x, how = "one", tolerance = 0.9)

Arguments

x

(data.frame) A data.frame, tibble, or data.table

how

(character) How to handle duplicates. The default, "one", keeps one record from each group of duplicates and drops the rest, storing the dropped records in the dups attribute. "all" drops every record in a group of duplicates, which is useful when you would rather not keep any duplicated record, e.g., because it is hard to tell which one to remove.
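The difference between the two modes can be sketched in base R on exact duplicates. This is only an illustration, not how dedup() is implemented: dedup() matches on a similarity score (see tolerance), whereas this sketch uses duplicated(). The toy data frame recs and all variable names are hypothetical.

```r
# Toy data (hypothetical, not from scrubr): rows 2 and 3 are exact duplicates
recs <- data.frame(
  name = c("Ursus americanus", "Ursus arctos", "Ursus arctos"),
  key = c(1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Analogue of how = "one": keep the first record of each duplicate group
is_repeat <- duplicated(recs)
one_kept <- recs[!is_repeat, ]
one_dropped <- recs[is_repeat, ]  # roughly what would land in the dups attribute

# Analogue of how = "all": drop every record that belongs to a duplicate group
in_group <- is_repeat | duplicated(recs, fromLast = TRUE)
all_kept <- recs[!in_group, ]

nrow(one_kept)
nrow(one_dropped)
nrow(all_kept)
```

In this sketch the "one" analogue keeps two rows and sets one aside, while the "all" analogue keeps only the single row that was never duplicated.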

tolerance

(numeric) Similarity score, from 0 to 1, at which two records are considered a match. Default: 0.9. Inspect the output closely and adjust this value for your data, as results can vary.
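To get a feel for what a threshold of 0.9 means, here is a minimal base R sketch that scores two record strings with a normalized edit similarity. The similarity() helper, the way each record is flattened into a string, and the use of >= for the comparison are all assumptions made for illustration; the measure dedup() uses internally may differ.

```r
# Two record strings that differ only in the last digit of the key
a <- "Ursus americanus -76.78671 35.53079 1088954559"
b <- "Ursus americanus -76.78671 35.53079 1088954555"

# Hypothetical helper: normalized edit similarity in [0, 1], 1 = identical
similarity <- function(s, t) {
  1 - as.numeric(utils::adist(s, t)) / max(nchar(s), nchar(t))
}

tolerance <- 0.9

# One substitution over 46 characters: well above the threshold
near_score <- similarity(a, b)
near_match <- near_score >= tolerance

# A clearly different record falls below the threshold
far_score <- similarity(a, "Puma concolor -120.1 38.2 42")
far_match <- far_score >= tolerance
```

Lowering tolerance makes matching more permissive, so more records are treated as duplicates; raising it toward 1 restricts matches to nearly identical records.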

Value

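Returns a data.frame with duplicate records removed. With how = "one", the dropped records are stored in the dups attribute and can be retrieved with attr(x, "dups").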

Examples

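library("scrubr")

# One near-duplicate: copy row 10 and change its key slightly
df <- sample_data_1
smalldf <- df[1:20, ]
smalldf <- rbind(smalldf, smalldf[10, ])
smalldf[21, "key"] <- 1088954555
NROW(smalldf)     # rows before deduplication
dp <- dframe(smalldf) %>% dedup()
NROW(dp)          # rows after deduplication
attr(dp, "dups")  # the dropped record

# Another example - more than one set of duplicates
df <- sample_data_1
twodups <- df[1:10, ]
twodups <- rbind(twodups, twodups[c(9, 10), ])
rownames(twodups) <- NULL
NROW(twodups)     # rows before deduplication
dp <- dframe(twodups) %>% dedup()
NROW(dp)          # rows after deduplication
attr(dp, "dups")  # the two dropped records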

Example output

[1] 21
[1] 20
<scrubr dframe>
Size: 1 X 5


              name longitude latitude                date        key
             (chr)     (dbl)    (dbl)              (time)      (dbl)
1 Ursus americanus -76.78671 35.53079 2015-04-05 23:00:00 1088954555
[1] 12
[1] 10
<scrubr dframe>
Size: 2 X 5


              name longitude latitude                date        key
             (chr)     (dbl)    (dbl)              (time)      (int)
1 Ursus americanus -78.25027 36.93018 2015-03-20 21:11:24 1088923534
2 Ursus americanus -76.78671 35.53079 2015-04-05 23:00:00 1088954559

scrubr documentation built on June 12, 2021, 9:06 a.m.