clean_strings: String cleaning for easier matching
In seunglee98/fedmatch: Fast, Flexible, and User-Friendly Record Linkage Methods

clean_strings

R Documentation

Description

clean_strings takes a string vector and cleans it according to user-given options.

Usage

clean_strings(
  string,
  sp_char_words = fedmatch::sp_char_words,
  common_words = NULL,
  remove_char = NULL,
  remove_words = FALSE,
  stem = FALSE
)

Arguments

`string`	character or character vector of strings
`sp_char_words`	data.frame. Data.frame where first column is special characters and second column is full words. The default is fedmatch::sp_char_words.
`common_words`	data.frame. Data.frame where first column is abbreviations and second column is full words.
`remove_char`	character vector. string of specific characters (for example, "letters") to be removed
`remove_words`	logical. If TRUE, removes all abbreviations and replacement words in common_words
`stem`	logical. If TRUE, words are stemmed

Details

This function takes a variety of options, each of which changes the behavior. With the default settings, clean_strings does the following: makes the string lowercase; replaces the special characters &, $, %, and @ with their names ("and", "dollar", "percent", "at"); converts tabs to spaces; and removes extra spaces. This default cleaning puts the strings in a standard format to allow for easier matching.
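
The default steps described above can be approximated in base R. The following is a minimal sketch for illustration only, not the fedmatch implementation: the names default_clean_sketch and special_names are hypothetical, and padding each replacement word with spaces is an assumption made here so that the word does not fuse with its neighbors.

```r
# Illustrative base-R sketch of the default cleaning steps.
# default_clean_sketch and special_names are hypothetical names,
# not part of fedmatch.
special_names <- c("&" = "and", "$" = "dollar", "%" = "percent", "@" = "at")

default_clean_sketch <- function(string) {
  # Step 1: lowercase everything.
  out <- tolower(string)
  # Step 2: swap each special character for its name.
  for (ch in names(special_names)) {
    # fixed = TRUE matches "$" and friends literally, not as regex syntax.
    # Padding with spaces is an assumption of this sketch.
    out <- gsub(ch, paste0(" ", special_names[[ch]], " "), out, fixed = TRUE)
  }
  # Step 3: convert tabs to spaces.
  out <- gsub("\t", " ", out, fixed = TRUE)
  # Step 4: collapse runs of spaces and trim the ends.
  out <- gsub(" +", " ", out)
  trimws(out)
}

default_clean_sketch(c("AT&T\tInc.", "  100%   Juice "))
```

Under this sketch, "AT&T\tInc." becomes "at and t inc." and "  100%   Juice " becomes "100 percent juice". The output of clean_strings itself may differ in details such as spacing.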

The other options allow for the removal or replacement of other words or characters.
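
As a usage sketch of those options, the calls below pass a small abbreviation table through common_words, first on its own and then with remove_words = TRUE. The table abbrevs, its column names, and the sample firm_names are made up for illustration. That common_words on its own replaces each abbreviation with its full word is inferred from the argument descriptions rather than stated outright, so it is worth checking the printed results. The calls are guarded so that the block does nothing if fedmatch is not installed.

```r
# Hypothetical abbreviation table: first column holds abbreviations,
# second column holds the corresponding full words.
abbrevs <- data.frame(
  abbreviation = c("inc", "corp"),
  full_word = c("incorporated", "corporation"),
  stringsAsFactors = FALSE
)

# Made-up sample input.
firm_names <- c("Smith & Sons Inc", "ACME  Corp")

if (requireNamespace("fedmatch", quietly = TRUE)) {
  # Default cleaning plus the abbreviation table.
  replaced <- fedmatch::clean_strings(firm_names, common_words = abbrevs)

  # Same table, but with the abbreviations and full words removed.
  removed <- fedmatch::clean_strings(
    firm_names,
    common_words = abbrevs,
    remove_words = TRUE
  )

  print(replaced)
  print(removed)
}
```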