dedup_data: Deduplicate Data


View source: R/dedup_data.R

Description

Deduplicates the scored matches generated by scores_data(), using the score column selected with .col_score.

Usage

dedup_data(
  .score,
  .source,
  .target,
  .cols_match,
  .min_sim = NULL,
  .col_score = c("sms", "smw", "smc", "sss", "ssw", "ssc")
)

Arguments

.score

Dataframe generated by scores_data()

.source

The Source Dataframe.
(Must contain a unique id column and the columns you want to match on)

.target

The Target Dataframe.
(Must contain a unique id column and the columns you want to match on)

.cols_match

A character vector of column names to use for fuzzy matching.

.min_sim

Named vector of minimum similarities (default: NULL).

.col_score

Score column generated by scores_data().
Options are:

  • sms: Simple Mean (mean over all fuzzy columns)

  • smw: Weighted Mean (mean over all fuzzy columns, weighted by get_weights())

  • smc: Custom Mean (mean over all fuzzy columns, weighted by custom weights)

  • sss: Simple Mean, squared (mean over all fuzzy columns, scores are squared)

  • ssw: Weighted Mean, squared (mean over all fuzzy columns, scores are squared before weighted by get_weights())

  • ssc: Custom Mean, squared (mean over all fuzzy columns, scores are squared before being weighted by custom weights)
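The score options above can be illustrated with a minimal base R sketch. It assumes that each fuzzy column yields a similarity between 0 and 1, that the weighted variants are ordinary weighted means, and that the squared variants square each similarity before averaging. The similarity values and weights below are hypothetical, and this is not the Rmatch source code.

```r
# Hypothetical per-column similarities for one source/target pair.
sim <- c(name = 0.9, iso3 = 1.0, city = 0.6, address = 0.5)

# Hypothetical weights; in Rmatch they would come from get_weights()
# (smw, ssw) or be supplied by the user (smc, ssc).
w <- c(name = 0.4, iso3 = 0.1, city = 0.2, address = 0.3)

sms <- mean(sim)                # simple mean
smw <- weighted.mean(sim, w)    # weighted mean (same formula for smc)
sss <- mean(sim^2)              # simple mean of squared scores
ssw <- weighted.mean(sim^2, w)  # weighted mean of squared scores (same for ssc)

round(c(sms = sms, smw = smw, sss = sss, ssw = ssw), 4)
```

Under these assumptions, squaring penalizes pairs with a few weak columns more strongly than the plain means do.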

Value

A dataframe
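The page does not describe how duplicates are resolved. As a rough illustration only, one common greedy strategy keeps the highest-scoring row per source id and then per target id. The column names id_s and id_t and the data below are hypothetical, and dedup_data() may work differently.

```r
# Hypothetical scored matches: several candidate targets per source id.
scores <- data.frame(
  id_s = c("s1", "s1", "s2", "s2", "s3"),
  id_t = c("t1", "t2", "t2", "t3", "t3"),
  sms  = c(0.95, 0.70, 0.80, 0.85, 0.60),
  stringsAsFactors = FALSE
)

# Sort by descending score, then keep the first (best) row per source id.
ord    <- scores[order(-scores$sms), ]
best_s <- ord[!duplicated(ord$id_s), ]

# Apply the same rule per target id so each target is used at most once.
best <- best_s[!duplicated(best_s$id_t), ]
best <- best[order(best$id_s), ]
rownames(best) <- NULL
best
```

In this sketch s3 ends up unmatched, because its only candidate target is claimed by a higher-scoring pair.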

Examples

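library(Rmatch)

# table_source and table_target are assumed to be example datasets from the package.
tab_source <- table_source[1:100, ]
tab_target <- table_target[1:999, ]

# Columns used for fuzzy matching, exact matching, and joining.
cols_match <- c("name", "iso3", "city", "address")
cols_exact <- "iso3"
cols_join  <- c("name", "iso3")

# Step 1: find candidate matches between source and target.
tab_match <- match_data(
  .source = tab_source,
  .target = tab_target,
  .cols_match = cols_match,
  .cols_exact = cols_exact,
  .cols_join = cols_join,
  .method = "soundex"
)

# Step 2: score the candidate matches.
tab_score <- scores_data(
  .matches = tab_match,
  .source = tab_source,
  .target = tab_target,
  .cols_match = cols_match,
  .cols_exact = cols_exact
)

# Step 3: deduplicate the scored matches using the simple mean score.
dedup_data(
  .score = tab_score,
  .source = tab_source,
  .target = tab_target,
  .cols_match = cols_match,
  .col_score = "sms"
)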
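The example leaves .min_sim at its default. A minimal base R sketch of what a named vector of minimum similarities could look like, and how such thresholds might be applied, follows. It assumes the names refer to match columns and that each value is a lower bound on that column's similarity; the data and the filtering logic are hypothetical, not taken from Rmatch.

```r
# Hypothetical per-column similarities for three candidate pairs.
sims <- data.frame(
  name = c(0.95, 0.60, 0.85),
  city = c(0.90, 0.95, 0.40)
)

# Named vector of minimum similarities (names assumed to be match columns).
min_sim <- c(name = 0.8, city = 0.5)

# Keep a pair only if every named column meets its threshold.
keep <- rep(TRUE, nrow(sims))
for (col in names(min_sim)) {
  keep <- keep & sims[[col]] >= min_sim[[col]]
}

sims[keep, ]
```

Only the first pair passes both thresholds here; the second fails on name and the third on city.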

MatthiasUckert/Rmatch documentation built on Jan. 3, 2022, 11:09 p.m.