knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(fedmatch)
clean_strings is the way to prepare strings for name matching, either on its own or within tier_match (see the Using-tier-match vignette). It has several options that let you customize how strings are cleaned.
Here are the example strings we'll be using:
name_vec <- corp_data1[, Company]
name_vec
First, we can use the basic string cleaning defaults:
clean_strings(name_vec)
Without any additional arguments, clean_strings does the following:
Then, we have a few different options we can use.
sp_char_words is a data.frame with two columns: the first column holds the symbols to replace, and the second holds their replacements. fedmatch has a built-in set of symbols:
print(sp_char_words)
But you can supply any data.frame you'd like, to make whatever replacements you want:
new_sp_char <- data.table::data.table(character = c("o"), replacement = c("apple"))
clean_strings(name_vec, sp_char_words = new_sp_char)
common_words is similar, but it respects word boundaries (so that, for example, replacing 'Corp' with 'Corporation' doesn't also alter the 'Corp' inside a longer word). fedmatch has a built-in set of 54 words and their replacements:
print(corporate_words[1:5])
But, you can use whatever words you'd like:
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "almart"), replacement = c("bananas", "oranges")))
('bananas motors' sounds like a lovely place to work.) Note that the 'almart' in 'walmart' didn't get replaced, because common_words respects word boundaries.
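To make the word-boundary behavior concrete, here is a minimal base R sketch that contrasts a plain substring replacement with a whole-word replacement. It is only an illustration of the idea, not fedmatch's actual implementation, and the names in names_vec are made up for this example rather than taken from corp_data1.

```r
# Hypothetical example names (not from corp_data1)
names_vec <- c("walmart", "general motors", "almart stores")

# Plain substring replacement: matches "almart" anywhere, even inside "walmart"
substring_result <- gsub("almart", "oranges", names_vec, fixed = TRUE)

# Whole-word replacement: \\b anchors the match at word boundaries
boundary_result <- gsub("\\balmart\\b", "oranges", names_vec, perl = TRUE)

print(substring_result)
print(boundary_result)
```

Only the whole-word version leaves 'walmart' untouched while still replacing the standalone 'almart'.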
You can also use a related function, word_frequency, to look for the most common strings in your data:
word_frequency(sample(c("hi", "Hello", "bye "), 1e4, replace = TRUE))
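If it helps to see the general idea behind counting common words, the following base R sketch tokenizes a few names and tabulates the tokens. It is an assumption-laden illustration and may differ from what word_frequency does internally; company_names is a hypothetical vector invented for this example.

```r
# Hypothetical company names used only for illustration
company_names <- c("Acme Corp", "Beta Corp", "Acme Holdings Inc", "Gamma Inc")

# Lowercase, split on whitespace, and flatten into one vector of tokens
tokens <- unlist(strsplit(tolower(company_names), "\\s+"))

# Count each token and sort from most to least frequent
token_counts <- sort(table(tokens), decreasing = TRUE)
print(token_counts)
```

Tokens that appear in many names, such as 'corp' and 'inc' here, are natural candidates for a custom common_words table.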
remove_words is a boolean that lets you simply remove the words in 'common_words' rather than replacing them, and remove_char is a character vector specifying a set of characters to remove.
clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "company"), replacement = c("bananas", "oranges")), remove_words = TRUE)
stem is a boolean that lets you stem words, using SnowballC::wordStem. Stemming a word means removing common suffixes:
clean_strings(c( "call", "calling", "called"), stem = TRUE)
See the documentation for SnowballC::wordStem for details.
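To see what the stemmer does outside of clean_strings, you can call SnowballC::wordStem directly. This is a small sketch that assumes the SnowballC package is installed; the choice of language = "english" is an assumption made for this example and is not necessarily the setting clean_strings uses.

```r
library(SnowballC)

# Example words; "calling" and "called" share the stem "call"
words <- c("call", "calling", "called")

# language = "english" selects the English Snowball stemmer
stems <- wordStem(words, language = "english")
print(stems)
```

All three inputs reduce to the same stem, which is why stemming can help two differently inflected names match.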