This tutorial demonstrates how to perform Bayesian Improved Surname Geocoding when the race/ethncity of individuals are unknown within a dataset.
Bayesian Improved Surname Geocoding (BISG) is a method that applies Bayes' Rule to predict the race/ethnicity of an individual using the individual's surname and geocoded location [Elliott et. al 2008, Elliot et al. 2009, Imai and Khanna 2016].
Specifically, BISG first calculates the prior probability of individual i being of a ceratin racial group r given their surname s or $$Pr(R_i=r|S_i=s)$$. The prior probability created from the surname is then updated with the probability of the individual i living in a geographic location g belonging to a racial group r, or $$Pr(G_i=g|R_i=r)$$). The following equation describes how BISG calculates race/ethnicity of individuals using Bayes Theorem, given the surname and geographic location, and specifically when race/ethncicty is unknown :
$$Pr(R_i=r|S_i=s, G_i=g)=\frac{Pr(G_i= g|R_i =r)Pr(R_i =r |S_i= s)}{\sum_{i=1}^n Pr(G_i= g|R_i =r)Pr(R_i =r |S_i= s)}$$
In R, the wru
package titled, WRU: Who Are You performs BISG. This vignette will walk you through how to prepare your geocoded voter file for performing BISG by stepping you through the process of cleaning your voter file, prepping voter data for running the BISG, and finally, performing BISG to obtain racial/ethnic probailities of individuals in a voter file.
We will perform BISG using the previous Gwinnett and Fulton county voter registration data called ga_geo.csv
that was geocoded in the eiCompare: Geocoding vignette.
The first step in performing BISG is to geocode your voter file addresses. For information on geocoding, visit the Geocoding Vignette.
Let's begin by loading your geocoded voter data into R/RStudio.
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Load the R packages needed to perform BISG. If you have not already downloaded the following packages, please install these packages.
# Load libraries suppressPackageStartupMessages({ library(eiCompare) library(stringr) library(sf) library(wru) library(tidyr) library(ggplot2) library(dplyr) })
Load in census data, the shape file and geocoded voter registration data with latitude and longitude coordinates.
# Load Georgia census data data(georgia_census)
We will use the data(gwin_fulton_shape)
to load the shape file. The shape file includes FIPS code information for Gwinnett and Fulton counties and the associated multipolygon shape geometries indicated by the geometry
column.
# Shape file for Gwinnett and Fulton counties data(gwin_fulton_shape)
tigris
package (optional)The shapefile can also be loaded using the tigris package. The tigris package uses the US Census Bureau's Geocoding API which is publicly available so no API key is needed. With the tigris package, you can load your census data according to a geographic level (i.e. counties, cities, tracts, blocks, etc.) There is additional code below that you can use if wanting to load your shape file using tigris. Remember to remove the # in order to use the code.
# install.packages("tigris") # library(tigris) # gwin_fulton_shape <- blocks(state = "GA", county = c("Gwinnett", "Fulton"))
Load geocoded voter file.
# Load geocoded voter registration file data(ga_geo)
Obtain the first six rows of the voter file to check that the file has downloaded properly.
# Check the first six rows of the voter file head(ga_geo, 6)
View the column names of the voter file. Some of these columns will be used along the journey to performing BISG.
# Find out names of columns in voter file names(ga_geo)
Check the dimensions (the number of rows and columns) of the voter file.
# Get the dimensions of the voter file dim(ga_geo)
There are 12 voters (or observations) and 25 columns in the voter file.
Convert geometry column name into two columns for latitude and longitude points.
ga_geo <- ga_geo %>% tidyr::extract(geometry, c("lon", "lat"), "\\((.*), (.*)\\)", convert = TRUE)
The next step involves removing duplicate voter IDs from the voter file, using the dedupe_voter_file
function.
# Remove duplicate voter IDs (the unique identifier for each voter) voter_file_dedupe <- dedupe_voter_file(voter_file = ga_geo, voter_id = "registration_number")
There are no duplicate voter IDs in the dataset.
# Convert the voter_shaped_merged file into a data frame for performing BISG. voter_file_complete <- as.data.frame(voter_file_dedupe) class(voter_file_complete)
Note that wru
requires an internet connection to pull in supplemental data. If the connection cannot be made, wru_predict_race_wrapper
will return NULL
.
georgia_census$GA$year <- 2010 # Perform BISG bisg_df <- eiCompare::wru_predict_race_wrapper( voter_file = voter_file_complete, census_data = georgia_census, voter_id = "registration_number", surname = "last_name", state = "GA", county = "COUNTYFP10", tract = "TRACTCE10", block = "BLOCKCE10", census_geo = "block", use_surname = TRUE, surname_only = FALSE, surname_year = 2010, use_age = FALSE, use_sex = FALSE, return_surname_flag = TRUE, return_geocode_flag = TRUE, verbose = TRUE )
# Check BISG dataframe head(bisg_df)
summary(bisg_df)
# Obtain aggregate values for the BISG results by county bisg_agg <- precinct_agg_combine( voter_file = bisg_df, group_col = "COUNTYFP10", race_cols = c("pred.whi", "pred.bla", "pred.his", "pred.asi", "pred.oth"), true_race_col = "race", include_total = FALSE ) # Table with BISG race predictions by county head(bisg_agg)
bisg_bar <- bisg_agg %>% tidyr::gather("Type", "Value", -COUNTYFP10) %>% ggplot(aes(COUNTYFP10, Value, fill = Type)) + geom_bar(position = "dodge", stat = "identity") + labs(title = "BISG Predictions for Fulton and Gwinnett Counties", y = "Proportion", x = "Counties") + theme_bw() bisg_bar + scale_color_discrete(name = "Race/Ethnicity Proportions")
Finally, we will map the BISG data onto choropleth maps.
bisg_dfsub <- bisg_df %>% dplyr::select(BLOCKCE10, pred.whi, pred.bla, pred.his, pred.asi, pred.oth) bisg_dfsub
# Join bisg and shape file bisg_sf <- dplyr::left_join(gwin_fulton_shape, bisg_dfsub, by = "BLOCKCE10")
# Plot choropleth map of race/ethnicity predictions for Fulton and Gwinnett counties plot(bisg_sf["pred.bla"], main = "Proportion of Black Voters identified by BISG")
plot(bisg_sf["pred.whi"], main = "Proportion of White Voters identified by BISG")
plot(bisg_sf["pred.his"], main = "Proportion of Hispanic Voters identified by BISG")
plot(bisg_sf["pred.asi"], main = "Proportion of Asian Voters identified by BISG")
plot(bisg_sf["pred.oth"], main = "Proportion of 'Other' Voters identified by BISG")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.