knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(fedmatch)
clean_strings is the way to prepare strings for name matching, either on its own or within tier_match (see the Using-tier-match vignette). It has several options that control how the strings are cleaned.
Here are the example strings we'll be using:
name_vec <- corp_data1[, Company]
name_vec
First, we can use the basic string cleaning defaults:
clean_strings(name_vec)
Without any additional arguments, clean_strings
does the following:
Then, we have a few different options we can use.
sp_char_words is a data.frame with 2 columns: the first column holds the symbols to replace, and the second holds their replacements. fedmatch has a built-in set of symbols:
print(sp_char_words)
But, you can use any data.frame you'd like, to make whatever replacements you'd like:
new_sp_char <- data.table::data.table(character = c("o"), replacement = c("apple"))
clean_strings(name_vec, sp_char_words = new_sp_char)
common_words is similar, but it respects word boundaries (so that, for example, replacing 'Corp' doesn't also change the 'Corp' inside 'Corporation'). fedmatch has a built-in set of 54 words and their replacements:
print(corporate_words[1:5])
But, you can use whatever words you'd like:
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "almart"), replacement = c("bananas", "oranges")))
('bananas motors' sounds like a lovely place to work.) Note that the 'almart' in 'walmart' didn't get replaced, because common_words respects word boundaries.
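To make the word-boundary point concrete, here is a minimal base-R sketch. It is not fedmatch's internal code; the input vector and the variable names are made up for illustration, and it only assumes that a boundary-respecting replacement behaves like a regular expression wrapped in `\\b`.

```r
# Illustrative input, already lowercased; not taken from corp_data1.
names_lower <- c("walmart", "general motors")

# Plain substring replacement: "almart" matches anywhere, even inside "walmart".
substring_result <- gsub("almart", "oranges", names_lower, fixed = TRUE)

# Boundary-aware replacement: "\\b" only matches at the edge of a word,
# so the "almart" inside "walmart" is left alone.
boundary_almart <- gsub("\\balmart\\b", "oranges", names_lower, perl = TRUE)

# A whole word such as "general" is still replaced.
boundary_general <- gsub("\\bgeneral\\b", "bananas", names_lower, perl = TRUE)

print(substring_result)
print(boundary_almart)
print(boundary_general)
```

The first call is closer to how a symbol-style replacement would act, while the last two mirror the whole-word behaviour described above.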
You can also use a related function, word_frequency, to look for the most common strings in your data:
word_frequency(sample(c("hi", "Hello", "bye "), 1e4, replace = TRUE))
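For intuition about what a word count like this involves, here is a rough base-R analogue. The helper `count_words` is hypothetical and is not part of fedmatch; it assumes the goal is to count lowercased, whitespace-separated tokens, which may differ from what word_frequency actually returns.

```r
# Hypothetical helper for illustration only; not fedmatch's word_frequency.
count_words <- function(x) {
  # Lowercase, trim, and split each string on runs of whitespace.
  tokens <- unlist(strsplit(tolower(trimws(x)), "\\s+"))
  # Drop empty tokens that can come from blank strings.
  tokens <- tokens[nzchar(tokens)]
  # Tabulate and put the most frequent tokens first.
  counts <- sort(table(tokens), decreasing = TRUE)
  data.frame(
    word = names(counts),
    count = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

example_counts <- count_words(c("hi", "Hello", "bye ", "hi there", "HI"))
print(example_counts)
```

Words that dominate a table like this are natural candidates for a custom common_words replacement table.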
remove_words is a boolean that lets you remove the words in 'common_words' rather than replacing them, and remove_char lets you specify a set of characters to remove.
clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "company"), replacement = c("bananas", "oranges")), remove_words = TRUE)
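The difference between removing characters and removing whole words can be sketched in base R. This is an illustration under assumptions, not fedmatch's implementation: the input strings are invented, and the whitespace tidy-up at the end is a guess at reasonable behaviour.

```r
# Illustrative input, already lowercased; not taken from corp_data1.
example_names <- c("acme corp", "general motors", "acme company")

# Character removal: every "a" and "c" is dropped, wherever it appears.
chars_removed <- gsub("[ac]", "", example_names)

# Word removal: only whole words are dropped, thanks to the "\\b" boundaries.
words_removed <- gsub("\\b(general|company)\\b", "", example_names, perl = TRUE)

# Tidy the leftover whitespace (an assumption about the desired result).
words_removed <- trimws(gsub("\\s+", " ", words_removed))

print(chars_removed)
print(words_removed)
```

Character removal is the blunter of the two, since it also touches letters inside words you may want to keep.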
stem is a boolean that lets you stem words, using SnowballC::wordStem. 'Stemming' words means removing common suffixes:
clean_strings(c("call", "calling", "called"), stem = TRUE)
See the documentation in SnowballC::wordStem for details.