merge_names: Surname probability merging function.

View source: R/merge_names.R

Description

merge_names merges names in a user-input dataset with corresponding race/ethnicity probabilities derived from the U.S. Census Surname List, the Spanish Surname List, and voter files from states in the Southern U.S.

Usage

merge_names(
  voter.file,
  namesToUse,
  census.surname,
  table.surnames = NULL,
  table.first = NULL,
  table.middle = NULL,
  clean.names = TRUE,
  impute.missing = FALSE,
  model = "BISG"
)

Arguments

voter.file

An object of class data.frame. Must contain a row for each individual being predicted, as well as a field named last containing each individual's surname. If first name is also being used for prediction, the file must also contain a field named first. If middle name is also being used for prediction, the file must also contain a field named middle.

namesToUse

A character vector identifying which names to use for the prediction. The default value is "last", indicating that only the last name will be used. Other options are "last, first", indicating that both last and first names will be used, and "last, first, middle", indicating that last, first, and middle names will all be used.

census.surname

A TRUE/FALSE object. If TRUE, the function calls merge_surnames to merge in Pr(Race | Surname) from the U.S. Census Surname List (2000, 2010, or 2020) and the Spanish Surname List. If FALSE, the user must provide a name dictionary via table.surnames (see below). Default is TRUE.

table.surnames

An object of class data.frame provided by the user as an alternative surname dictionary. It should list U.S. surnames along with the associated probabilities P(name | ethnicity) for the ethnicities White, Black, Hispanic, Asian, and other. Expected columns: last_name for U.S. surnames, p_whi_last for White, p_bla_last for Black, p_his_last for Hispanic, p_asi_last for Asian, and p_oth_last for other. Default is NULL.
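
As an illustration of the expected layout, the following base R sketch builds a small dictionary with the column names listed above. The surnames and counts are made up for the example, and the normalisation (each group's column summing to 1 across names, as P(name | ethnicity) implies) is an assumption about how such a table would be prepared, not a description of how the package builds its own tables.

```r
# Hypothetical counts of three surnames by group (illustrative numbers only).
counts <- data.frame(
  last_name = c("SMITH", "GARCIA", "NGUYEN"),
  whi = c(700, 50, 10),
  bla = c(230, 5, 5),
  his = c(20, 920, 5),
  asi = c(5, 15, 960),
  oth = c(45, 10, 20),
  stringsAsFactors = FALSE
)

# P(name | ethnicity): divide each group's column by that group's total,
# so every probability column sums to 1 across the names in the table.
groups <- c("whi", "bla", "his", "asi", "oth")
probs <- sweep(as.matrix(counts[groups]), 2, colSums(counts[groups]), "/")
colnames(probs) <- paste0("p_", groups, "_last")

# Final dictionary with the documented column names.
my_surnames <- data.frame(
  last_name = counts$last_name,
  probs,
  stringsAsFactors = FALSE
)
```

A table like my_surnames would presumably be supplied as table.surnames; whether census.surname must also be set to FALSE in that case should be checked against the package source.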

table.first

See table.surnames.

table.middle

See table.surnames.

clean.names

A TRUE/FALSE object. If TRUE, any surnames in voter.file that cannot initially be matched to the database will be cleaned, according to U.S. Census specifications, in order to increase the chance of finding a match. Default is TRUE.

impute.missing

See predict_race.

model

See predict_race.

Details

This function allows users to match names in their dataset with database entries estimating P(name | ethnicity) for each of the five major racial groups. The database probabilities are derived from the U.S. Census Surname List, the Spanish Surname List, and voter files from states in the Southern U.S.

By default, the function matches names as follows:

  1. Search raw surnames in the database;

  2. Remove any punctuation and search again;

  3. Remove any spaces and search again;

  4. Remove suffixes (e.g., "Jr") and search again (last names only);

  5. Split double-barreled names into two parts and search first part of name;

  6. Split double-barreled names into two parts and search second part of name.

Each step applies only to names not matched in a previous step. Steps 2 through 6 are not applied if clean.names is FALSE.
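
The matching order above can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: match_surname and dictionary are hypothetical names, and the upper-casing, the suffix list, and the choice to split double-barreled names on hyphens or spaces in the raw input are all assumptions made for the example.

```r
# Hypothetical dictionary of known surnames (upper case is an assumption).
dictionary <- c("SMITH", "OBRIEN", "DELACRUZ", "GARCIA", "LOPEZ")

# Return the first candidate form of `raw` found in `dictionary`,
# trying the six steps in order; NA if nothing matches.
match_surname <- function(raw, dictionary, clean = TRUE) {
  s <- toupper(trimws(raw))                           # step 1: raw surname
  if (!clean) {
    candidates <- s                                   # cleaning disabled: step 1 only
  } else {
    no_punct <- gsub("[[:punct:]]", "", s)            # step 2: drop punctuation
    no_space <- gsub("[[:space:]]+", "", no_punct)    # step 3: drop spaces
    no_suffix <- sub("[[:space:]]+(JR|SR|II|III|IV)$", "", no_punct)
    no_suffix <- gsub("[[:space:]]+", "", no_suffix)  # step 4: drop suffix
    parts <- strsplit(s, "[-[:space:]]+")[[1]]        # steps 5 and 6: split name
    second <- if (length(parts) >= 2) parts[2] else NA_character_
    candidates <- c(s, no_punct, no_space, no_suffix, parts[1], second)
  }
  hit <- match(TRUE, candidates %in% dictionary)      # first step that matches
  if (is.na(hit)) NA_character_ else candidates[hit]
}

raw_names <- c("Smith", "O'Brien", "De La Cruz", "Smith Jr.",
               "Garcia-Lopez", "Zzyzx-Lopez", "Qwerty")
matched <- vapply(raw_names, match_surname, character(1),
                  dictionary = dictionary, USE.NAMES = FALSE)
```

Because later candidates are consulted only when earlier ones fail, each step affects only names that were not matched in a previous step. Calling match_surname with clean = FALSE mirrors the behaviour described for clean.names = FALSE, where only the raw surname is searched.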

Note: Any name appearing only on the Spanish Surname List is assigned a probability of 1 for Hispanics/Latinos and 0 for all other racial groups.
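
The rule in this note can be illustrated with a small base R sketch. The lookup table, its on_census and on_spanish flags, and the surnames are hypothetical; only the probability column names follow the Value section of this page.

```r
prob_cols <- c("p_whi", "p_bla", "p_his", "p_asi", "p_oth")

# Hypothetical lookup result: which list each surname appears on.
lookup <- data.frame(
  surname = c("SMITH", "ZAPATERO"),
  on_census = c(TRUE, FALSE),
  on_spanish = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)
lookup[prob_cols] <- NA_real_  # probabilities not yet filled in

# Names found only on the Spanish Surname List get probability 1 for
# Hispanic/Latino and 0 for every other group.
spanish_only <- lookup$on_spanish & !lookup$on_census
lookup[spanish_only, prob_cols] <- 0
lookup[spanish_only, "p_his"] <- 1
```

Rows for names that appear on the Census list are left untouched here; in practice their probabilities would come from the Census table.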

Value

Output will be an object of class data.frame. It will consist of the original user-input data with additional columns that specify the part of the name matched with Census data (surname.match), and the probabilities Pr(Race | Surname) for each racial group (p_whi for White, p_bla for Black, p_his for Hispanic/Latino, p_asi for Asian and Pacific Islander, and p_oth for Other/Mixed).

Examples

data(voters)
## Not run: try(merge_names(voters, namesToUse = "surname", census.surname = TRUE))

wru documentation built on May 29, 2024, 9:46 a.m.