Using clean_strings

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(fedmatch)

Using clean_strings

clean_strings is the way to prepare strings for name matching, either within tier_match (see the Using-tier-match vignette). There are several useful options that allow for many different options.

Here's the example string we'll be using:

name_vec <- corp_data1[, Company]
name_vec

First, we can use the basic string cleaning defaults:

clean_strings(name_vec)

Without any additional arguments, clean_strings does the following:

Then, we have a few different options we can use.

sp_char_words

sp_char_words is a data.frame with 2 columns: the first column is symbols to replace, and the second is their replacement. fedmatch as a built-in set of symbols:

print(sp_char_words)

But, you can use any data.frame you'd like, to make whatever replacements you'd like:

new_sp_char <- data.table::data.table(character = c("o"), replacement = c("apple"))
clean_strings(name_vec, sp_char_words = new_sp_char)

common_words

common_words is similar, but it respects word boundaries (so you don't replace every usage of 'Corp' with 'Corporation', for example.) fedmatch has a built-in set of 54 words and their replacements:

print(corporate_words[1:5])

But, you can use whatever words you'd like:

clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "almart"),
                                                              replacement = c("bananas", "oranges")))

(bananas motors sounds like a lovely place to work). Note that the 'almart' in 'walmart' didn't get replaced, because common_words respects word boundaries.,

You can also use a related function, word_frequency, to look for the most common strings in your data:

word_frequency(sample(c("hi", "Hello", "bye    "), 1e4, replace = TRUE))

Remove characters and words

remove_words and remove_char are booleans that let you simply remove the words in 'common_words' or specify a set of characters to remove rather than replacing them.

clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "company"),
                                                              replacement = c("bananas", "oranges")),
              remove_words = TRUE)

stem

stem is a boolean that lets you stem words, using SnowballC::wordStem. 'stemming' words means removing common suffixes:

clean_strings(c( "call", "calling", "called"), stem = TRUE)

See the documentation in SnowballC::wordStem for details.



Try the fedmatch package in your browser

Any scripts or data that you put into this service are public.

fedmatch documentation built on Nov. 23, 2021, 1:07 a.m.