predict_race: Race prediction function.

predict_race    R Documentation

Description

predict_race makes probabilistic estimates of individual-level race/ethnicity.

Usage

predict_race(
  voter.file,
  census.surname = TRUE,
  surname.only = FALSE,
  census.geo = c("tract", "block", "block_group", "county", "place", "zcta"),
  census.key = Sys.getenv("CENSUS_API_KEY"),
  census.data = NULL,
  age = FALSE,
  sex = FALSE,
  year = "2020",
  party = NULL,
  retry = 3,
  impute.missing = TRUE,
  skip_bad_geos = FALSE,
  use.counties = FALSE,
  model = "BISG",
  race.init = NULL,
  name.dictionaries = NULL,
  names.to.use = "surname",
  control = NULL
)

Arguments

voter.file

An object of class data.frame. Must contain a row for each individual being predicted, as well as a field named surname containing each individual's surname. If using geolocation in predictions, voter.file must contain a field named state, which contains the two-character abbreviation for each individual's state of residence (e.g., "nj" for New Jersey). If using Census geographic data in race/ethnicity predictions, voter.file must also contain at least one of the following fields: county, tract, block_group, block, place, and/or zcta. These fields should contain character strings matching U.S. Census categories: county is three characters (e.g., "031", not "31"), tract is six characters, block group is usually a single character, block is four characters, and place is five characters. See below for other optional fields.
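Geography codes often arrive as numbers, which drops the leading zeros that the widths above depend on. The following is a minimal sketch, using hypothetical records and base R only, of padding such codes back to fixed-width character strings before passing the data frame as voter.file. The names raw and voter_file are illustrative.

```r
# Hypothetical raw records whose geography codes were read in as numbers,
# so leading zeros were lost.
raw <- data.frame(
  surname = c("Smith", "Garcia", "Lee"),
  state   = c("NJ", "NJ", "NY"),
  county  = c(21, 21, 61),
  tract   = c(4000, 4501, 20100),
  block   = c(3001, 1025, 2002),
  stringsAsFactors = FALSE
)

# Zero-pad each code to the width described above and store it as character.
voter_file <- transform(
  raw,
  county = sprintf("%03d", as.integer(county)),
  tract  = sprintf("%06d", as.integer(tract)),
  block  = sprintf("%04d", as.integer(block))
)

str(voter_file)
```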

census.surname

A TRUE/FALSE object. If TRUE, the function will call merge_surnames to merge in Pr(Race | Surname) from the U.S. Census Surname List (2000, 2010, or 2020) and the Spanish Surname List. If FALSE, the user must provide name.dictionaries (see below). Default is TRUE.

surname.only

A TRUE/FALSE object. If TRUE, race predictions will only use surname data and calculate Pr(Race | Surname). Default is FALSE.

census.geo

An optional character vector specifying what level of geography to use to merge in U.S. Census geographic data. Currently "county", "tract", "block_group", "block", "place", and "zcta" are supported. Note: sufficient information must be in the user-defined voter.file object. If census.geo = "county", then voter.file must have a column named county. If census.geo = "tract", then voter.file must have columns named county and tract. If census.geo = "block", then voter.file must have columns named county, tract, and block. If census.geo = "place", then voter.file must have a column named place. If census.geo = "zcta", then voter.file must have a column named zcta. Specifying census.geo will call the census_helper function to merge Census geographic data at the specified level of geography.
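The column requirements above can be checked before calling predict_race. This is a rough sketch in base R, not part of the package: required_cols and missing_geo_cols are hypothetical names, and "block_group" is omitted because its required columns are not spelled out above.

```r
# Columns that voter.file needs for each census.geo level, as listed above.
# "block_group" is left out because the text does not state its requirements.
required_cols <- list(
  county = "county",
  tract  = c("county", "tract"),
  block  = c("county", "tract", "block"),
  place  = "place",
  zcta   = "zcta"
)

# Hypothetical helper: return the required columns missing from a data frame.
missing_geo_cols <- function(voter_file, census_geo) {
  needed <- required_cols[[census_geo]]
  if (is.null(needed)) stop("No column rule recorded for: ", census_geo)
  setdiff(c("surname", "state", needed), names(voter_file))
}

toy <- data.frame(surname = "Smith", state = "NJ", county = "021",
                  stringsAsFactors = FALSE)
missing_geo_cols(toy, "county")  # character(0): nothing missing
missing_geo_cols(toy, "tract")   # "tract"
```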

census.key

A character object specifying user's Census API key. Required if census.geo is specified, because a valid Census API key is required to download Census geographic data.

The default, Sys.getenv("CENSUS_API_KEY"), reads the key from an environment variable named CENSUS_API_KEY.
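The default for census.key shown under Usage is Sys.getenv("CENSUS_API_KEY"). A minimal sketch of setting and reading that variable for the current R session follows; the key value is a placeholder, not a real key.

```r
# Placeholder value only; substitute your own key from the Census Bureau.
Sys.setenv(CENSUS_API_KEY = "your-key-here")

# This is the expression used as the default for census.key.
key <- Sys.getenv("CENSUS_API_KEY")

# Sys.getenv() returns "" when the variable is unset, so check before use.
if (!nzchar(key)) stop("CENSUS_API_KEY is not set")
```

To keep the key out of scripts, the same variable can instead be defined in a user-level .Renviron file, which R reads at startup.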

census.data

A list indexed by two-letter state abbreviations, which contains pre-saved Census geographic data. Can be generated using get_census_data function.

age

An optional TRUE/FALSE object specifying whether to condition race predictions on age (in addition to surname and geolocation). Default is FALSE. Must match the age setting used to create the census.data object. May only be set to TRUE if the census.geo option is specified. If TRUE, voter.file should include a numerical variable age.

sex

An optional TRUE/FALSE object specifying whether to condition race predictions on sex (in addition to surname and geolocation). Default is FALSE. Must match the sex setting used to create the census.data object. May only be set to TRUE if the census.geo option is specified. If TRUE, voter.file should include a numerical variable sex, where sex is coded as 0 for males and 1 for females.

year

An optional character vector specifying the year of U.S. Census geographic data to be downloaded. Use "2010" or "2020". Default is "2020".

party

An optional character object specifying party registration field in voter.file, e.g., party = "PartyReg". If specified, race/ethnicity predictions will be conditioned on individual's party registration (in addition to geolocation). Whatever the name of the party registration field in voter.file, it should be coded as 1 for Democrat, 2 for Republican, and 0 for Other.
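Voter files often store party as text labels, not the 0/1/2 coding described above. The sketch below recodes a hypothetical PartyReg column in base R; the labels "DEM", "REP", "IND", and "GRN" are assumptions about the raw data, not values the package expects.

```r
# Hypothetical voter file with party stored as text labels.
voters_raw <- data.frame(
  surname  = c("Smith", "Garcia", "Lee", "Jones"),
  PartyReg = c("DEM", "REP", "IND", "GRN"),
  stringsAsFactors = FALSE
)

# Lookup from the assumed labels to the coding described above.
party_codes <- c(DEM = 1, REP = 2)

# Labels missing from the lookup come back as NA; recode those as 0 (Other).
recoded <- unname(party_codes[voters_raw$PartyReg])
recoded[is.na(recoded)] <- 0
voters_raw$PartyReg <- recoded

voters_raw
```

With this coding in place, the field would be passed by name, as in party = "PartyReg".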

retry

The number of times to retry a request to the Census website if a network interruption occurs. Default is 3.

impute.missing

Logical, defaults to TRUE. Should missing values be imputed?

skip_bad_geos

Logical. If TRUE, the function skips any geolocations that are not present in the census data and returns a partial data set. Default is FALSE, in which case the function stops with an error message listing the offending geolocations.

use.counties

A logical, defaulting to FALSE. Should census data be filtered by counties available in census.data?

model

Character string, either "BISG" (default) or "fBISG" (for error-correction, fully-Bayesian model).

race.init

Vector of initial race for each observation in voter.file. Must be an integer vector, with 1=white, 2=black, 3=hispanic, 4=asian, and 5=other. Defaults to values obtained using model="BISG_surname".

name.dictionaries

Optional named list of data.frames containing counts of names by race. Any of the following named elements are allowed: "surname", "first", "middle". When present, the objects must follow the same structure as last_c, first_c, mid_c, respectively.

names.to.use

One of 'surname', 'surname, first', or 'surname, first, middle'. Defaults to 'surname'.

control

List of control arguments, used only when model="fBISG", including:

iter

Number of MCMC iterations. Defaults to 1000.

burnin

Number of iterations discarded as burnin. Defaults to half of iter.

verbose

Print progress information. Defaults to TRUE.

me.correct

Boolean. Should the model correct for measurement error in Pr(race | geography)? Defaults to TRUE.

seed

RNG seed. If NULL, a seed is generated and returned as an attribute for reproducibility.
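As an illustration of the control entries listed above, the sketch below assembles a control list in base R. The specific values are arbitrary choices, and the predict_race call is wrapped in if (FALSE) because it requires the wru package, the voters data, and a census.data object such as CensusObj from the Examples section.

```r
# Arbitrary illustrative settings for the fBISG sampler.
iter <- 2000
fbisg_control <- list(
  iter       = iter,      # number of MCMC iterations
  burnin     = iter / 2,  # same proportion as the documented default
  verbose    = FALSE,     # suppress progress output
  me.correct = TRUE,      # keep the measurement-error correction on
  seed       = 123        # fixed seed for reproducibility
)

if (FALSE) {
  # Not run: needs wru, the voters data, and a census.data object
  # (CensusObj, as created in the Examples section).
  predict_race(
    voter.file = voters, census.geo = "tract", census.data = CensusObj,
    model = "fBISG", control = fbisg_control
  )
}
```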

Details

This function implements the Bayesian race prediction methods outlined in Imai and Khanna (2015). The function produces probabilistic estimates of individual-level race/ethnicity, based on surname, geolocation, and party.
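To make the underlying idea concrete, here is a small worked example of the basic surname-plus-geography update as it is commonly described for BISG: under the assumption that surname and geography are independent given race, Pr(race | surname, geography) is proportional to Pr(race | surname) times Pr(geography | race). All probabilities below are made-up numbers, and the package's internal computation may differ in detail.

```r
# Hypothetical Pr(race | surname) for one surname.
p_race_given_surname <- c(whi = 0.05, bla = 0.01, his = 0.90,
                          asi = 0.01, oth = 0.03)

# Hypothetical Pr(geography | race): share of each group living in one tract.
p_geo_given_race <- c(whi = 0.0002, bla = 0.0010, his = 0.0005,
                      asi = 0.0001, oth = 0.0003)

# Bayes' rule with surname and geography independent given race:
# multiply the two terms, then normalize so the probabilities sum to 1.
unnormalized <- p_race_given_surname * p_geo_given_race
posterior <- unnormalized / sum(unnormalized)

round(posterior, 3)
```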

Value

Output will be an object of class data.frame. It will consist of the original user-input voter.file with additional columns with predicted probabilities for each of the five major racial categories: pred.whi for White, pred.bla for Black, pred.his for Hispanic/Latino, pred.asi for Asian/Pacific Islander, and pred.oth for Other/Mixed.
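A common follow-up step is to find the highest-probability category for each row. The sketch below does this in base R on a hypothetical data frame shaped like the output described above; the probabilities and the most_likely column name are illustrative.

```r
# Hypothetical output with the five predicted-probability columns.
preds <- data.frame(
  surname  = c("Smith", "Garcia"),
  pred.whi = c(0.70, 0.05),
  pred.bla = c(0.20, 0.02),
  pred.his = c(0.03, 0.88),
  pred.asi = c(0.02, 0.02),
  pred.oth = c(0.05, 0.03),
  stringsAsFactors = FALSE
)

pred_cols <- c("pred.whi", "pred.bla", "pred.his", "pred.asi", "pred.oth")

# Index of the largest probability in each row, mapped back to a short label.
best <- max.col(as.matrix(preds[pred_cols]), ties.method = "first")
preds$most_likely <- sub("^pred\\.", "", pred_cols[best])

preds[c("surname", "most_likely")]
```

Collapsing to a single label discards the uncertainty in the estimates, so analyses that can use the probabilities directly (for example, as weights) may be preferable.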

Examples


data(voters)
try(predict_race(voter.file = voters, surname.only = TRUE))
## Not run: 
try(predict_race(voter.file = voters, census.geo = "tract"))

## End(Not run)
## Not run: 
try(predict_race(
  voter.file = voters, census.geo = "place", year = "2020"))

## End(Not run)
## Not run: 
CensusObj <- try(get_census_data(state = c("NY", "DC", "NJ")))
try(predict_race(
  voter.file = voters, census.geo = "tract", census.data = CensusObj, party = "PID"
))

## End(Not run)
## Not run: 
CensusObj2 <- try(get_census_data(state = c("NY", "DC", "NJ"), age = TRUE, sex = TRUE))
try(predict_race(
  voter.file = voters, census.geo = "tract", census.data = CensusObj2, age = TRUE, sex = TRUE))

## End(Not run)
## Not run: 
CensusObj3 <- try(get_census_data(state = c("NY", "DC", "NJ"), census.geo = "place"))
try(predict_race(voter.file = voters, census.geo = "place", census.data = CensusObj3))

## End(Not run)


wru documentation built on May 29, 2024, 9:46 a.m.