merge_surnames: Surname probability merging function.

View source: R/merge_surnames.R

merge_surnames R Documentation

Surname probability merging function.

Description

merge_surnames merges surnames in a user-supplied dataset with the corresponding race/ethnicity probabilities from the U.S. Census Surname List and the Spanish Surname List.

Usage

merge_surnames(
  voter.file,
  surname.year = 2020,
  name.data,
  clean.surname = TRUE,
  impute.missing = TRUE
)

Arguments

voter.file

An object of class data.frame. Must contain a field named 'surname' holding the surnames to be merged with the Census lists.

surname.year

An object of class numeric indicating which year the Census Surname List is from. Accepted values are 2010 and 2000. Default is 2020.

name.data

An object of class data.frame. Must contain a leading column of surnames, and 5 subsequent columns, with Pr(Race | Surname) for each of the five major racial categories.

clean.surname

A TRUE/FALSE object. If TRUE, any surnames in voter.file that cannot initially be matched to surname lists will be cleaned, according to U.S. Census specifications, in order to increase the chance of finding a match. Default is TRUE.

impute.missing

A TRUE/FALSE object. If TRUE, race/ethnicity probabilities will be imputed for unmatched names using the race/ethnicity distribution for all other names (i.e., names not on the Census List). Default is TRUE.
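
To make the name.data layout and the effect of impute.missing concrete, here is a minimal base-R sketch. It does not call wru. The surnames, the probabilities, and the fallback vector are all made up for illustration and are not Census figures; the real function may organize this work differently.

```r
# Hypothetical surname table in the layout described for name.data:
# a leading surname column followed by five probability columns.
# All numbers are invented for illustration; they are not Census values.
name.data <- data.frame(
  surname = c("SMITH", "GARCIA", "NGUYEN"),
  p_whi = c(0.70, 0.05, 0.01),
  p_bla = c(0.23, 0.01, 0.00),
  p_his = c(0.02, 0.92, 0.01),
  p_asi = c(0.01, 0.01, 0.96),
  p_oth = c(0.04, 0.01, 0.02),
  stringsAsFactors = FALSE
)

# Toy voter file; "ZZYZX" is deliberately absent from name.data.
voter.file <- data.frame(
  surname = c("SMITH", "NGUYEN", "ZZYZX"),
  stringsAsFactors = FALSE
)

# Look up each voter surname; unmatched names get NA in every probability column.
idx <- match(voter.file$surname, name.data$surname)
merged <- cbind(voter.file, name.data[idx, -1])
rownames(merged) <- NULL

# Made-up fallback distribution standing in for "all other names".
fallback <- c(p_whi = 0.60, p_bla = 0.10, p_his = 0.15, p_asi = 0.10, p_oth = 0.05)

# Sketch of impute.missing = TRUE: fill unmatched rows with the fallback.
unmatched <- is.na(idx)
for (col in names(fallback)) {
  merged[unmatched, col] <- fallback[[col]]
}

print(merged)
```

With impute.missing = FALSE, the analogous step would be skipped and the row for the unmatched name would keep its missing values.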

Details

This function allows users to match surnames in their dataset with the U.S. Census Surname List (from 2000 or 2010) and Spanish Surname List to obtain Pr(Race | Surname) for each of the five major racial groups.

By default, the function matches surnames to the Census list as follows:

  1. Search raw surnames in Census surname list;

  2. Remove any punctuation and search again;

  3. Remove any spaces and search again;

  4. Remove suffixes (e.g., Jr) and search again;

  5. Split double-barreled surnames into two parts and search first part of name;

  6. Split double-barreled surnames into two parts and search second part of name;

  7. For any remaining names, impute probabilities using distribution for all names not appearing on Census list.

Each step only applies to surnames not matched in a previous step. Steps 2 through 7 are not applied if clean.surname is FALSE.
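
The cascade in steps 1 through 6 can be sketched in base R as follows. This is an illustration of the idea, not the package's implementation: match_surname, the census.names lookup vector, and the suffix pattern are hypothetical, and the real cleaning rules may differ in detail.

```r
# Hypothetical lookup vector standing in for the Census surname list.
census.names <- c("SMITH", "OCONNOR", "DELACRUZ", "GARCIA", "LOPEZ")

# Try progressively cleaned forms of one surname and return the first
# form found in the lookup, or NA if none of them matches.
match_surname <- function(surname, lookup) {
  raw <- toupper(trimws(surname))                          # step 1: raw name
  no.punct <- gsub("[[:punct:]]", "", raw)                 # step 2: drop punctuation
  no.space <- gsub(" ", "", no.punct, fixed = TRUE)        # step 3: drop spaces
  no.suffix <- sub("(JR|SR|III|II|IV)$", "", no.space)     # step 4: drop suffix (assumed list)
  parts <- strsplit(raw, "[- ]+")[[1]]                     # split on hyphens or spaces
  candidates <- c(
    raw, no.punct, no.space, no.suffix,
    parts[1],                                              # step 5: first part
    parts[length(parts)]                                   # step 6: second (last) part
  )
  hit <- candidates[candidates %in% lookup]
  if (length(hit) > 0) hit[1] else NA_character_
}

surnames <- c("Smith", "O'Connor", "De La Cruz", "Smith Jr.",
              "Garcia-Lopez", "Xyz-Lopez", "Qwerty")
surname.match <- vapply(surnames, match_surname, character(1),
                        lookup = census.names, USE.NAMES = FALSE)
print(data.frame(surname = surnames, surname.match = surname.match))
```

Because the candidates are tried in order and the first hit wins, a later step only matters for names that every earlier step failed to match. Names that remain NA here correspond to the ones step 7 would impute.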

Note: Any name appearing only on the Spanish Surname List is assigned a probability of 1 for Hispanics/Latinos and 0 for all other racial groups.
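
As a small illustration of this rule, the sketch below builds probability rows for names that appear on a Spanish list but not on a Census list. Both name vectors are hypothetical placeholders rather than real list contents, and the code is base R only, not the package's own logic.

```r
# Hypothetical list contents, for illustration only.
census.names <- c("SMITH", "GARCIA")
spanish.names <- c("GARCIA", "ZUBIZARRETA")

prob.cols <- c("p_whi", "p_bla", "p_his", "p_asi", "p_oth")

# Names that appear on the Spanish list but not on the Census list.
spanish.only <- setdiff(spanish.names, census.names)

# Probability 1 for Hispanic/Latino and 0 for every other group.
spanish.rows <- data.frame(surname = spanish.only, stringsAsFactors = FALSE)
spanish.rows[prob.cols] <- 0
spanish.rows$p_his <- 1

print(spanish.rows)
```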

Value

Output will be an object of class data.frame. It will consist of the original user-input data with additional columns that specify the part of the name matched with Census data (surname.match), and the probabilities Pr(Race | Surname) for each racial group (p_whi for White, p_bla for Black, p_his for Hispanic/Latino, p_asi for Asian and Pacific Islander, and p_oth for Other/Mixed).

Examples

data(voters)
## Not run: try(merge_surnames(voters))


kosukeimai/wru documentation built on April 8, 2024, 6:03 p.m.