match_complete: Complete Match


View source: R/match_complete.R

Description

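Matches the rows of a source dataframe against a target dataframe: an optional exact join first, followed by fuzzy string matching on the selected columns.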

Usage

match_complete(
  .source,
  .target,
  .cols_match,
  .cols_join = NULL,
  .cols_exact = NULL,
  .max_match = 10,
  .method = "osa",
  .verbose = TRUE,
  .workers = future::availableCores(),
  .char_block = c(Inf, Inf),
  .standardize = TRUE,
  .w_unique = NULL,
  .w_custom = NULL,
  .min_sim = NULL,
  .col_score = c("sms", "smw", "smc", "sss", "ssw", "ssc")
)

Arguments

.source

The source dataframe.
(Must contain a unique ID column and the columns you want to match on)

.target

The target dataframe.
(Must contain a unique ID column and the columns you want to match on)

.cols_match

A character vector of column names to fuzzy-match on.

.cols_join

Columns to perform an exact match on, before fuzzy matching.
(Matched IDs will be excluded from the fuzzy match)
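
The exact-join step can be pictured with base R. This is only a sketch under assumptions: the toy dataframes and the id column name are hypothetical, only source IDs are dropped here, and the package's internal implementation may differ.

```r
# Toy data; column names are assumptions for illustration only
src <- data.frame(id = 1:3, name = c("acme", "bolt", "core"),
                  iso3 = c("USA", "DEU", "FRA"), stringsAsFactors = FALSE)
tgt <- data.frame(id = 11:13, name = c("acme", "bolt gmbh", "core"),
                  iso3 = c("USA", "DEU", "ITA"), stringsAsFactors = FALSE)

# Step 1: exact match on the join columns
cols_join <- c("name", "iso3")
exact <- merge(src, tgt, by = cols_join, suffixes = c("_s", "_t"))

# Step 2: only source rows without an exact match go on to fuzzy matching
src_fuzzy <- src[!src$id %in% exact$id_s, ]
```

Here only the first source row finds an exact partner, so the remaining two rows would be passed to the fuzzy matcher.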

.cols_exact

Columns that must be matched perfectly.
(Data will be partitioned using those columns)
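
Partitioning on an exact column can be illustrated with base R's split(). The data below are made up, and whether the package partitions in exactly this way is an assumption.

```r
# Toy source data; values are illustrative only
src <- data.frame(id = 1:4, name = c("acme", "bolt", "core", "dyna"),
                  iso3 = c("USA", "DEU", "USA", "DEU"), stringsAsFactors = FALSE)

# One partition per distinct value of the exact column
parts <- split(src, src$iso3)
names(parts)
```

Fuzzy comparisons would then only be made between source and target rows that fall into the same partition, which keeps the number of candidate pairs down.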

.max_match

Maximum number of matches to return (Default = 10)

.method

One of "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex".
See stringdist-metrics in the stringdist package.
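
To give a feel for what a method such as "lv" (Levenshtein) measures, the sketch below computes an edit-distance similarity scaled to the range 0 to 1 using only base R. The helper lv_sim() is hypothetical and stands in for the stringdist-based computation; it is not the package's own code.

```r
# Levenshtein similarity in base R: 1 means identical, 0 means nothing in common.
# lv_sim() is a hypothetical helper, not part of Rmatch or stringdist.
lv_sim <- function(a, b) {
  # Edit distance for each pair (a[i], b[i])
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  # Scale by the longer string so the result lies in [0, 1]
  1 - d / pmax(nchar(a), nchar(b))
}

lv_sim("kitten", "sitting")
lv_sim(c("acme", "bolt"), c("acme", "bold"))
```

Note that two empty strings would give NaN in this sketch, since the denominator is zero.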

.verbose

Print additional information?

.workers

Number of cores to use (default: all cores, as determined by future::availableCores())

.char_block

Character Block Size. Used to partition data.

  • First element chunks the source data in ngram-blocks.

  • Second element allows for characters in target below/above block size.
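
One possible reading of the two elements is sketched below: source strings are grouped into blocks by character length, and each block is compared only with target strings whose length lies within the block's range plus or minus a tolerance. This interpretation, the variable names, and the helper candidates() are all assumptions; the package may block differently.

```r
src_names <- c("acme", "bolt gmbh", "core industries")
tgt_names <- c("acme", "acme inc", "bolt", "core industries ltd")

block_size <- 5   # assumed meaning of .char_block[1]
tolerance  <- 5   # assumed meaning of .char_block[2]

# Assign each source string to a length block: 1-5 chars -> 1, 6-10 -> 2, ...
src_block <- ceiling(nchar(src_names) / block_size)

# Candidate targets for one block: lengths within the block range +/- tolerance
candidates <- function(block) {
  lo <- (block - 1) * block_size + 1 - tolerance
  hi <- block * block_size + tolerance
  tgt_names[nchar(tgt_names) >= lo & nchar(tgt_names) <= hi]
}

candidates(1)
```

Under this reading, the default c(Inf, Inf) would put everything into a single block, so no length-based pruning takes place.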

.standardize

Perform String Standardization using standardize_data()?

.w_unique

Weights calculated by get_weights()

.w_custom

A named numeric vector of custom weights whose names match the columns in .cols_match, excluding the columns in .cols_exact
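
The naming rule can be checked before calling the function. The sketch below reuses the column names from the Examples section; the check itself is a hypothetical convenience and not something the package is known to perform.

```r
cols_match <- c("name", "iso3", "city", "address")
cols_exact <- "iso3"
w_custom   <- c(name = .7, city = .2, address = .1)

# Columns that are fuzzy-matched and therefore need a weight
cols_fuzzy <- setdiff(cols_match, cols_exact)

# Hypothetical sanity check: weight names must line up with the fuzzy columns
stopifnot(setequal(names(w_custom), cols_fuzzy))
```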

.min_sim

Named vector with minimum similarities
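
A minimal sketch of how per-column thresholds could filter candidate pairs, assuming the vector is named after the fuzzy-matched columns and that a pair must reach every threshold. The similarity values and ID column names below are made up for illustration.

```r
# Hypothetical per-column similarities for three candidate pairs
pairs <- data.frame(id_s = c(1, 1, 2), id_t = c(11, 12, 13),
                    name = c(0.95, 0.60, 0.85), city = c(0.90, 0.90, 0.40))

min_sim <- c(name = 0.8, city = 0.5)   # assumed form of .min_sim

# Keep a pair only if every listed column reaches its threshold
keep <- Reduce(`&`, lapply(names(min_sim),
                           function(col) pairs[[col]] >= min_sim[[col]]))
pairs[keep, ]
```

Only the first pair survives here: the second fails on name and the third fails on city.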

.col_score

Score column generated by scores_data().
Options are:

  • sms: Simple Mean (mean over all fuzzy columns)

  • smw: Weighted Mean (mean over all fuzzy columns, weighted by get_weights())

  • smc: Custom Mean (mean over all fuzzy columns, weighted by custom weights)

  • sss: Simple Mean, squared (mean over all fuzzy columns, scores are squared)

  • ssw: Weighted Mean, squared (mean over all fuzzy columns, scores are squared before being weighted by get_weights())

  • ssc: Custom Mean, squared (mean over all fuzzy columns, scores are squared before being weighted by custom weights)
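
The difference between the score variants can be seen on a single candidate pair. The similarities below are invented, the weights are taken from the Examples section, and the formulas are a plain reading of the descriptions above rather than the code of scores_data(). The smw and ssw variants would follow the same pattern with the weights from get_weights().

```r
# Hypothetical similarities of one candidate pair, one value per fuzzy column
sims <- c(name = 0.9, city = 0.8, address = 0.5)
w    <- c(name = .7, city = .2, address = .1)   # custom weights

sms <- mean(sims)                  # simple mean
smc <- weighted.mean(sims, w)      # custom weighted mean
sss <- mean(sims^2)                # simple mean of squared scores
ssc <- weighted.mean(sims^2, w)    # custom weighted mean of squared scores

round(c(sms = sms, smc = smc, sss = sss, ssc = ssc), 3)
```

Squaring pulls every score below 1 further down, so the squared variants penalize weak columns more strongly than the plain means.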

Value

A dataframe

Examples

library(Rmatch)

match_complete(
  .source = table_source[1:100, ],
  .target = table_target[1:999, ],
  .cols_match = c("name", "iso3", "city", "address"),
  .cols_join = c("name", "iso3"),
  .cols_exact = "iso3",
  .max_match = 25,
  .method = "soundex",
  .verbose = TRUE,
  .workers = 4,
  .char_block = c(5, 5),
  .standardize = TRUE,
  .w_unique = NULL,
  .w_custom = c(name = .7, city = .2, address = .1),
  .col_score = "sms"
)

MatthiasUckert/Rmatch documentation built on Jan. 3, 2022, 11:09 p.m.