match_name: Match a loanbook to asset-based company data (abcd) by the...

View source: R/match_name.R

match_name (R Documentation)

Match a loanbook to asset-based company data (abcd) by the name_* columns

Description

match_name() scores the match between names in a loanbook dataset (columns name_direct_loantaker, name_intermediate_parent* and name_ultimate_parent) and names in an asset-based company dataset, abcd (column name_company). The raw names are first transformed internally, and aliases are assigned. The similarity between loanbook aliases and abcd aliases is then scored using stringdist::stringsim().
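As a rough illustration of the scoring step, the sketch below calls stringdist::stringsim() directly on a few made-up aliases. The alias strings are assumptions chosen for illustration and are not output of match_name(); the values 0.8 and 0.1 mirror the min_score and p defaults of match_name().

```r
library(stringdist)

# Hypothetical aliases (already lower-cased and stripped of punctuation).
loanbook_alias <- "alpineknitsindiapvtltd"
abcd_alias <- c(
  "alpineknitsindiapvtltd",        # identical alias
  "alpinekniteindiapvtltd",        # one character differs
  "yuamenxinnengthermalpowercoltd" # unrelated company
)

# Jaro similarity (p = 0) versus Jaro-Winkler similarity (p = 0.1).
jaro <- stringsim(loanbook_alias, abcd_alias, method = "jw", p = 0)
jw <- stringsim(loanbook_alias, abcd_alias, method = "jw", p = 0.1)

# Flag the candidates that would pass a threshold of 0.8.
scored <- data.frame(abcd_alias, jaro, jw, passes = jw >= 0.8)
print(scored)
```

Scores lie between 0 and 1, and an identical alias scores exactly 1. With a shared prefix, the Jaro-Winkler score is never lower than the plain Jaro score.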

Usage

match_name(
  loanbook,
  abcd,
  by_sector = TRUE,
  min_score = 0.8,
  method = "jw",
  p = 0.1,
  overwrite = NULL,
  join_id = NULL,
  ...
)

Arguments

loanbook, abcd

data frames structured like r2dii.data::loanbook_demo and r2dii.data::abcd_demo.

by_sector

Should names only be compared if companies belong to the same sector?

min_score

A number between 0 and 1 that sets the minimum score threshold. A score of 1 is a perfect match.

method

Method for distance calculation. One of c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). See stringdist::stringdist-metrics.

p

Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p = 0, the Jaro distance is returned; the default in match_name() is 0.1. Applies only to method = "jw".

overwrite

A data frame used to overwrite the sector and/or name columns of a particular direct loantaker or ultimate parent. To overwrite only sector, the value in the name column should be NA, and vice versa. This data frame can be used to manually match loanbook companies to abcd.

join_id

A join specification passed to dplyr::inner_join(). If a character string, it assumes identical join columns between loanbook and abcd. If a named character vector, it uses the name as the join column of loanbook and the value as the join column of abcd.

...

Arguments passed on to stringdist::stringsim().
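The two forms of join_id described above follow the by argument of dplyr::inner_join(). The sketch below calls inner_join() directly, outside match_name(), to show what each form means. The column lei_direct_loantaker and the values are taken from the Examples section; the object abcd_same is a hypothetical variant created only for this illustration.

```r
library(dplyr)

loanbook <- tibble(
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Won't fuzzy match",
  lei_direct_loantaker = "LEI123"
)

abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power",
  lei = "LEI123"
)

# Named character vector: the name is the loanbook column,
# the value is the abcd column.
joined <- inner_join(loanbook, abcd, by = c(lei_direct_loantaker = "lei"))

# Single character string: both data frames share the same column name.
abcd_same <- rename(abcd, lei_direct_loantaker = lei)
joined_same <- inner_join(loanbook, abcd_same, by = "lei_direct_loantaker")

print(joined)
```

Both calls pair the loan with the abcd company through the identifier alone, so the names do not need to resemble each other.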

Value

A data frame with the same groups (if any) and columns as loanbook, and the additional columns:

  • id_2dii - an id used internally by match_name() to distinguish companies

  • level - the level of granularity that the loan was matched at (e.g., direct_loantaker or ultimate_parent)

  • sector - the sector of the loanbook company

  • sector_abcd - the sector of the abcd company

  • name - the name of the loanbook company

  • name_abcd - the name of the abcd company

  • score - the score of the match (manually set this to 1 prior to calling prioritize() to validate the match)

  • source - the source of the match (equal to loanbook unless the match is from overwrite)

The returned rows depend on the argument min_score and on the value of the column score for each loan:

  • If any row has score equal to 1, match_name() returns all rows where score equals 1, dropping all other rows.

  • If no row has score equal to 1, match_name() returns all rows where score is equal to or greater than min_score.

  • If there is no match, the output is a 0-row tibble with the expected column names, for type stability.
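The row-selection rule can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: keep_rows_sketch() is a hypothetical helper, the candidate scores are made up, and grouping by an id_loan column is an assumption about what "for each loan" means.

```r
# Hypothetical helper, not part of r2dii.match.
keep_rows_sketch <- function(matched, min_score = 0.8) {
  # Collect the row indices that belong to each loan.
  by_loan <- split(seq_len(nrow(matched)), matched$id_loan)

  keep <- unlist(lapply(by_loan, function(rows) {
    score <- matched$score[rows]
    if (any(score == 1)) {
      rows[score == 1]          # perfect matches win; drop everything else
    } else {
      rows[score >= min_score]  # otherwise keep rows at or above the threshold
    }
  }), use.names = FALSE)

  matched[sort(keep), , drop = FALSE]
}

# Made-up candidate matches for three loans.
candidates <- data.frame(
  id_loan = c("L1", "L1", "L2", "L2", "L3"),
  name_abcd = c("a", "b", "c", "d", "e"),
  score = c(1, 0.95, 0.85, 0.70, 0.50),
  stringsAsFactors = FALSE
)

result <- keep_rows_sketch(candidates)
print(result)
```

Here loan L1 keeps only its perfect match, loan L2 keeps the single candidate at or above 0.8, and loan L3 contributes no rows.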

Package options

r2dii.match.sector_classifications: Allows you to use your own sector_classifications instead of the default. This feature is experimental and may be dropped and/or become a new argument to match_name().

Assigning aliases

The transformation process used to compare names between loanbook and abcd datasets applies best practices commonly used in name matching algorithms:

  • Remove special characters.

  • Replace language-specific characters.

  • Abbreviate certain names to reduce their importance in the matching.

  • Spell out numbers to increase their importance.
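The four steps above can be sketched in base R. This is only an illustration under stated assumptions: to_alias_sketch() is a hypothetical function, and the character replacements and abbreviations it uses are examples chosen here, not the rules that match_name() applies internally.

```r
# Hypothetical helper, not the alias function used by r2dii.match.
to_alias_sketch <- function(x) {
  out <- tolower(x)

  # Replace a few language-specific characters (assumed examples).
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00df", "\u00f1", "\u00e9")
  to <- c("a", "o", "u", "ss", "n", "e")
  for (i in seq_along(from)) {
    out <- gsub(from[i], to[i], out, fixed = TRUE)
  }

  # Spell out digits so that numbers weigh more in the comparison.
  digit_words <- c("zero", "one", "two", "three", "four",
                   "five", "six", "seven", "eight", "nine")
  for (d in 0:9) {
    out <- gsub(as.character(d), digit_words[d + 1], out, fixed = TRUE)
  }

  # Abbreviate common legal forms so that they weigh less (assumed examples).
  out <- gsub("\\blimited\\b", "ltd", out)
  out <- gsub("\\bcorporation\\b", "corp", out)

  # Remove special characters, including spaces and punctuation.
  gsub("[^a-z0-9]", "", out)
}

aliases <- to_alias_sketch(c("Alpine Knits India Pvt. Limited", "Company 3"))
print(aliases)
```

Because both datasets would pass through the same transformation, differences in case, punctuation and legal suffixes no longer lower the similarity score.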

Handling grouped data

This function ignores but preserves existing groups.

See Also

Other main functions: prioritize()

Examples

## Not run: 
library(r2dii.data)
library(tibble)

# Small data for examples
loanbook <- head(loanbook_demo, 50)
abcd <- head(abcd_demo, 50)

match_name(loanbook, abcd)

match_name(loanbook, abcd, min_score = 0.9)

# Use your own `sector_classifications`
your_classifications <- tibble(
  sector = "power",
  borderline = FALSE,
  code = "D35.11",
  code_system = "XYZ"
)

# match on LEI
loanbook <- tibble(
  sector_classification_system = "XYZ",
  sector_classification_direct_loantaker = "D35.11",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Won't fuzzy match",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Won't fuzzy match",
  lei_direct_loantaker = "LEI123"
)

abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power",
  lei = "LEI123"
)

match_name(loanbook, abcd, join_id = c(lei_direct_loantaker = "lei"))

restore <- options(r2dii.match.sector_classifications = your_classifications)

loanbook <- tibble(
  sector_classification_system = "XYZ",
  sector_classification_direct_loantaker = "D35.11",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Alpine Knits India Pvt. Limited",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Yuamen Xinneng Thermal Power Co Ltd"
)

abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power"
)

match_name(loanbook, abcd)

# Cleanup
options(restore)

## End(Not run)

r2dii.match documentation built on June 22, 2024, 9:38 a.m.