knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(recordlinkR)

An Example Dataset

Datasets with name fields can come in a variety of formats. some datasets may provide full names whereas others may have separate columns for first and last. Full names can be listed in order of or . Additionally, there may be inconsistencies in how names are capitalized and whether punctuation is included.

Below is an example of a dataset with several name columns following different formats.

# Some sample character and numeric data 
names <- c('Doe, Jon', 'GRAY, Amy**', 'JACKSON, MICHAEL!', 'smith, jane')
mothers.names <- c('Dorian\' Do', 'Brittany John-son Smith', 'Mary JACKSON', 'Sam')
nicknames <- c('JON', 'AMY', 'MIKE', 'JAN')
age <- c(15, 14, 12, 17)

# Create dataframe
data <- data.frame(names, mothers.names, nicknames, age, stringsAsFactors = FALSE)

knitr::kable(data, caption = 'Sample Dataframe')

The function to preprocess a dataframe is cleanNames.

To apply changes to the variables names, mothers.names, and nicknames, add as a vector to the cols parameter of cleanNames. By default, lower and strip.punctuation are TRUE.

data.cleaned <- cleanNames(data,
                           cols = c('names', 'mothers.names', 'nicknames'),
                           lower = T,
                           strip.punctuation = T)
knitr::kable(data.cleaned, caption = 'Cleaned Sample Dataframe')

To train an encoder model on names it is helpful to train models on first or last names alone, rather than in conjunction. String columns can be split using cleanNames by specifying a character split on and names of the newly created variables.

split.pattern <- list('names' = ',', 'mothers.names' = ' ')
new.col.names <- list('names' = c('last', 'first'), 'mothers.names' = c('mothers.first', 'mothers.last'))

data.cleaned.and.split <- cleanNames(data,
                                     cols = c('names', 'mothers.names', 'nicknames'),
                                     split.pattern = split.pattern, 
                                     split.col.names = new.col.names, 
                                     lower = T,
                                     strip.punctuation = T)

knitr::kable(data.cleaned.and.split, caption = 'Cleaned Sample Dataframe with Split Columns')


kailin-lu/recordlinkR documentation built on May 4, 2019, 7:37 a.m.