[^updated]: Last updated: r format(Sys.time(), '%d %B, %Y')

\captionsetup[table]{labelformat=empty}

| Page | Variable | Label | |-----------------------------|:---------------------------------------------|:---------------------------------------------| | \hyperlink{page.3}{3} | \hyperlink{page.3}{HISTID} |Historical unique identifier | | \hyperlink{page.4}{4} | \hyperlink{page.4}{birth_gnis_code} |Place of birth, GNIS code | | \hyperlink{page.5}{5} | \hyperlink{page.5}{birth_city} |Place of birth, city | | \hyperlink{page.6}{6} | \hyperlink{page.6}{birth_county} |Place of birth, county | | \hyperlink{page.7}{7} | \hyperlink{page.7}{birth_fips} |Place of birth, FIPS county code | | \hyperlink{page.8}{8} | \hyperlink{page.8}{birth_region} |Place of birth, census region | | \hyperlink{page.9}{9} | \hyperlink{page.9}{death_zip} |Place of death, 5-digit ZIP code | | \hyperlink{page.10}{10} | \hyperlink{page.10}{death_city} |Place of death, city | | \hyperlink{page.11}{11} | \hyperlink{page.11}{death_county} |Place of death, county | | \hyperlink{page.12}{12} | \hyperlink{page.12}{death_fips} |Place of death, FIPS county code | | \hyperlink{page.13}{13} | \hyperlink{page.13}{death_state} |Place of death, state | | \hyperlink{page.14}{14} | \hyperlink{page.14}{death_region} |Place of death, census region | | \hyperlink{page.15}{15} | \hyperlink{page.15}{death_country} |Place of death, country | | \hyperlink{page.16}{16} | \hyperlink{page.16}{death_ruc1993} |place of death, county Rural-Urban Continuum Code |

\vspace{10pt}

\begin{center} \underline{\textbf{Summary and Methodology}} \end{center} \vspace{10pt}

The CenSoc-Numident Supplementary Geography File (N = 6,971,468) provides a set of supplementary geographic variables reporting place of birth and/or death for individuals in the CenSoc-Numident mortality file. This file can be linked onto the CenSoc-Numident at the individual-level using HISTID. Individual records may contain only place of birth variables, only place of death variables, or both place of birth and place of death variables.

Place of death variables: To construct the place of death variables, we use ZIP code of residence as reported in Social Security Numident death records. We map these ZIP codes onto city, county, state, census region, and country information using a database from the United States Postal Service (USPS) \textcolor{blue}{(link)}. For ZIP codes that have been decommissioned, we use a secondary database from UnitedStatesZipCodes.org. Approximately 6.3 million records in the CenSoc-Numident dataset have a ZIP code that can be mapped to state, county, and city information, though some fields may be missing for certain records. Place of death variables are primarily available for individuals who died in the 50 U.S. states and the District of Columbia, but some are also present for ZIP codes in Puerto Rico and military bases abroad.

Place of birth variables: To construct the place of birth variables, we use 12-character city/county of birth string from Social Security Numident application and claims records. These strings are uncleaned and contain misspellings and other inconsistencies. We mapped these uncleaned strings onto Geographic Names Information System (GNIS) codes using the crosswalk developed for the paper:

| Black, Dan A., Seth G. Sanders, Evan J. Taylor, and Lowell J. Taylor. 2015. “The Impact of the Great | Migration on Mortality of African Americans: Evidence from the Deep South.”American Economic Review. | 105(2):477–503. doi: 10.1257/aer.20120642.

To construct other place of birth variable, we mapped the GNIS codes onto city, county, state, and census region information using a database from the U.S. Board on Geographic Names \textcolor{blue}{(link)}. Please cite Black et al. (2015) if you are using any of the birthplace geography variables in the file. Birthplace variables are available for nearly 6.5 million records in this dataset. These variables are only available for locations within the 50 United States and the District of Columbia. Note that state/country of birth is already included in the CenSoc-Numident (see the variables bpl and bpl_string) and thus not present in this dataset.

\newpage

library(dplyr)
library(data.table)
library(tidyverse)
library(knitr)
library(forcats)
library(cowplot)
library(readxl)
library(kableExtra)

Numident_Geography_File <- fread("/data/censoc/workspace/geography_variables/numident_geography_file_v2.1_conservative.csv",
                              colClasses = c("death_fips" = "character", "death_zip" = "character",
                                             "birth_fips" = "character"))


numident <- fread("/data/censoc/censoc_data_releases/censoc_numident/censoc_numident_v2.1/censoc_numident_v2.1.csv")
numident <- numident[link_abe_exact_conservative == 1]

numident_merged <- merge(numident, Numident_Geography_File, by = "HISTID", all.x = TRUE)
rm(numident)

#replace empty strings with NA to make match rates easier to compute
numident_merged[numident_merged == ""] <- NA # SLOW

numident_merged$death_place_match <-ifelse(!is.na(numident_merged$death_county),1,0)

nrow(numident_merged)

\newpage

\huge HISTID \normalsize \vspace{12pt}

Label: Historical unique identifier

Description: HISTID is a unique individual-level identifier from IPUMS-USA 1940 Census data. This variable uniquely identifies all records in the dataset and can be used to merge onto the CenSoc-Numident mortality dataset.

\newpage

\huge birth_gnis_code

\normalsize \vspace{12pt}

Label: Place of birth, GNIS code

Description: birth_gnis_code is a numeric variable that reports a person's Geographic Names Information System (GNIS) code for their place of birth. Each GNIS code maps onto physical locations located in the U.S., including longitude and latitude coordinates, physical feature names, counties, and states. For more information on GNIS codes, please see \textcolor{blue}{https://www.usgs.gov/faqs/what-geographic-names-information-system-gnis}.

Missing values occur if birth city is absent from Numident records, or if the original birth city string was not matched to a GNIS code.

\vspace{40 pt}

unclean_birthplace_stings_path <- "/data/censoc/workspace/geography_variables/bunmd_city_string.csv" 
birth_city_raw_strings <- fread(unclean_birthplace_stings_path, select = c("ssn", "bpl_city"))
ssn_crosswalk_path <- "/data/censoc/crosswalks/numident_ssn_histid_v2_1_crosswalk.csv"
ssn_to_histid <- fread(ssn_crosswalk_path)
histid_with_raw_city_strings <- left_join(ssn_to_histid, birth_city_raw_strings, by = "ssn") %>% 
  select(-c(ssn)) %>% rename(HISTID = histid)
numident_merged <- left_join(numident_merged, histid_with_raw_city_strings, by = "HISTID")

valid_birth_percent <- 100 - (mean(is.na(numident_merged$birth_gnis_code)) * 100)
birth_percent <- 100 - (mean(is.na(numident_merged$bpl_city)) * 100)
death_percent <- (mean(!is.na(numident_merged$zip_residence)) * 100)
valid_death_percent <- 100 - (mean(is.na(numident_merged$death_county)) * 100)

birth_props <- c(birth_percent,valid_birth_percent)
death_props <- c(death_percent,valid_death_percent)

birth_props_plot <- barplot(birth_props, names.arg = c("Have birth city information", "Birth city matched to GNIS code"),
        xlab = "Birth City Information Availibility", ylab = "Percentage of CenSoc-Numident Records",
        main = "Birth City Match Percentage",
        col = c("lightblue", "grey"),
        ylim = c(0, 100)) # Set the y-axis range to 0-100

# Calculate the bar centers
bar_centers <- birth_props_plot

#value_labels <- c("98%", "91%")
value_labels <- paste0(round(birth_props,1), "%")

text(x = bar_centers, y = birth_props, labels = value_labels, col = "black", cex = 1, pos = 1)

\newpage

\huge birth_city \normalsize \vspace{12pt}

Label: Place of birth, city

Description: birth_city is a character variable that reports 12-character city/county of birth string from the Numident Application records. This variable is generally uncleaned and unprocessed, and contains spelling errors and naming inconsistencies.

Missing values occur when birth city is not available in Numident records.

\newpage

\huge birth_county \normalsize \vspace{12pt}

Label: Place of birth, county

Description: birth_county is a character variable reporting the name of a person's county of birth, as determined by GNIS place of birth.

Missing values occur if birth city is absent from Numident records, or if the original birth city string was not matched to a GNIS code.

Note: In 2022, the state of Connecticut began replacing its eight counties with nine county-equivalent planning regions. This dataset contains the eight "legacy" county regions of Connecticut.

\newpage

\huge birth_fips \normalsize \vspace{12pt}

Label: Place of birth, county FIPS code

Description: birth_fips is a character variable reporting a person's county FIPS code of birth. FIPS codes are 5 digits, of which the first two are the FIPS code of the state to which the county belongs. For a list of modern FIPS codes, see: \textcolor{blue}{https://en.wikipedia.org/wiki/List_of_United_States_FIPS_codes_by_county}.

Missing values occur if birth city is absent from Numident records, or if the original birth city string was not matched to a GNIS code or county.

Notes: FIPS codes appear numeric but can contain leading zeros. County-level FIPS codes in this dataset are always 5 digits. In 2022, the state of Connecticut began replacing its eight counties with nine county-equivalent planning regions. This dataset contain FIPS codes for the the eight "legacy" county regions of Connecticut. FIPS codes occasionally change due to county boundary changes, name changes, etc., and thus codes contained in this dataset may not be completely consistent with external county-level data.

\newpage

\huge birth_region \normalsize \vspace{12pt}

Label: Place of birth, census region

Description: birth_region is a character variable reporting a person's census region of birth. For more information on census regions, please see \textcolor{blue}{https://www.census.gov/programs-surveys/economic-census/guidance-geographies/levels.html}.

Missing values occur if birth state/country is absent from Numident records or outside of the United States.

\vspace{40 pt}

birthregion_table <- table(Numident_Geography_File$birth_region, useNA = "ifany")

birthregion_table <- as.data.frame(birthregion_table)

birthregion_table$Freq_perc <- round((birthregion_table$Freq/sum(birthregion_table$Freq))*100, digits = 1)
birthregion_table$label <- c("NA", "Midwest", "Northeast", "South", "West")

br_tab <- kable(birthregion_table %>% select(Var1, label, Freq, Freq_perc), 
      caption = "Births by Census Region",
      col.names = c("birth_region", "label", "n","freq %"),
      align = c("l", "r","l"),
      booktabs = TRUE,
      format = "latex") %>% 
  kableExtra::kable_styling(latex_options = "HOLD_position")

br_tab

\newpage

\huge death_zip \normalsize \vspace{12pt}

Label: Place of death, ZIP code

Description: death_zip is a character variable that reports 5-digit ZIP code of last residence, as obtained from Numident death records.

Missing values may occur if ZIP code was entered incorrectly in Numident records, if ZIP code is absent from Numident records, or if residence at time of death was outside USPS identifiable ZIP codes (e.g., most locations abroad).

Notes: ZIP codes appear numeric, but can contain leading zeros. They should always be 5 digits. Non-identifiable codes such as "XX768" and "00000" may remain in this field. We note that ZIP codes are primarily used for USPS mail delivery purposes, not as geographic units. For a more detailed description of ZIP codes, please see: \textcolor{blue}{https://faq.usps.com/s/article/ZIP-Code-The-Basics}.

\vspace{75 pt}

# Your existing barplot code
death_props_plot <- barplot(death_props, names.arg = c("Have a death ZIP code", "ZIP code matched to county"),
        xlab = "Match Percentage", ylab = "Percentage of CenSoc-Numident Records",
        main = "Death ZIP Code Match Percentage",
        col = c("lightblue", "grey"),
        ylim = c(0, 100)) # Set the y-axis range to 0-100

# Calculate the bar centers
bar_centers <- death_props_plot

# Define the value labels
#value_labels <- c("90%", "89%")
value_labels <- paste0(round(death_props,1), '%')

text(x = bar_centers, y = birth_props, labels = value_labels, col = "black", cex = 1, pos = 1, offset = 1.5)

\newpage

\huge death_city \normalsize \vspace{12pt}

Label: Place of death, city

Description: death_city is a character variable reporting a person's city of death, as determined from the ZIP code of their residence at time of death. For ZIP codes that cross city lines, we report the city where the primary post office for that ZIP code is located.

Missing values may occur if ZIP code was entered incorrectly in Numident records, if ZIP code is absent from Numident records, or if place of death could not be matched to an identifiable USPS ZIP code.

\newpage

\huge death_county \normalsize \vspace{12pt}

Label: Place of death, county

Description: death_county is a character variable reporting the name of a person's county of death, as determined by the ZIP code of their residence at time of death.

Missing values may occur if ZIP code was entered incorrectly in Numident records, if ZIP code is absent from Numident records, or if ZIP codes could not be matched against USPS records.

Note: In 2022, the state of Connecticut began replacing its eight counties with nine county-equivalent planning regions. This dataset contains the eight "legacy" county regions for Connecticut.

\vspace{40 pt}

numident_men <- filter(numident_merged, sex == 1)
numident_women <- filter(numident_merged, sex == 2)

death_year_match_men <- prop.table(table(numident_men$byear, numident_men$death_place_match), margin = 1)

death_year_match_men <- as.data.frame.matrix(death_year_match_men)

death_year_match_men$byear <- rownames(death_year_match_men)

death_year_match_men <- reshape2::melt(death_year_match_men, id.vars = "byear", variable.name = "match", value.name = "proportion")

death_year_match_men$byear_numeric <- as.numeric(as.character(death_year_match_men$byear))

death_year_match_women <- prop.table(table(numident_women$byear, numident_women$death_place_match), margin = 1)

death_year_match_women <- as.data.frame.matrix(death_year_match_women)

death_year_match_women$byear <- rownames(death_year_match_women)

death_year_match_women <- reshape2::melt(death_year_match_women, id.vars = "byear", variable.name = "match", value.name = "proportion")

death_year_match_women$byear_numeric <- as.numeric(as.character(death_year_match_women$byear))

death_year_match_men$gender_category <- "Men"
death_year_match_women$gender_category <- "Women"

birth_year_match_combined_gender <- rbind(death_year_match_men, death_year_match_women)

# Create the combined plot
combined_plot <- birth_year_match_combined_gender %>%
  filter(byear >= 1890 & byear <= 1940) %>% 
  filter(match == 1) %>% 
  ggplot(aes(x = byear_numeric, y = proportion, color = gender_category)) +
  geom_point(size = 1, alpha = 0.5) + # Keep only the points
  labs(title = "Matched Death ZIP Code to County by Gender and Cohort", x = "Birth Year",
       y = "Proportion of CenSoc-Numident Records") +
  theme_cowplot() +
  scale_color_manual(values = c("blue", "red"), name = "Gender") +
  theme(plot.title=element_text(size=13))

# Print the combined plot
print(combined_plot)

\newpage

\huge death_county_fips \normalsize \vspace{12pt}

Label: Place of death, county FIPS code

Description: death_fips is a character variable reporting a person's county FIPS code of birth. FIPS codes are 5 digits, of which the first two are the FIPS code of the state to which the county belongs. For a list of modern FIPS codes, see: \textcolor{blue}{https://en.wikipedia.org/wiki/List_of_United_States_FIPS_codes_by_county}.

Missing values may occur if ZIP code was entered incorrectly in Numident records, if ZIP code is absent from Numident records, if place of death could not be matched to an identifiable USPS identifiable ZIP code, or if county could not be matched to a FIPS code. County FIPS codes are not available for deaths occurring in Puerto Rico.

Notes: FIPS codes appear numeric but can contain leading zeros. County-level FIPS codes in this dataset are are always 5 digits. In 2022, the state of Connecticut began replacing its eight counties with nine county-equivalent planning regions. This dataset contain FIPS codes for the the eight "legacy" county regions of Connecticut. FIPS codes occasionally change due to county boundary changes, name changes, etc., and thus codes contained in this dataset may not be completely consistent with external county-level data.

#Hypothetical education data
education <- read_excel("/data/censoc/workspace/geography_variables/Education.xlsx") 

#we rename FIPS to death_fips so that it can match
colnames(education)[1] <- "death_fips" 

#recode Miami-Dade's FIPS to Dade County's FIPS
education <- education %>%
  mutate(death_fips = ifelse(death_fips == "12086", "12025", death_fips)) %>% 
  mutate(death_fips = as.character(death_fips))

#merge onto Numident 
numident_merged <- merge(numident_merged, education, by = "death_fips", all.x = TRUE) 

\newpage

\huge death_state \normalsize \vspace{12pt}

Label: Place of death, state

Description: death_state is a character variable reporting a person's state of death, as determined by the ZIP code of their residence at time of death. Values in this field includes the 50 U.S states, the District of Columbia, and Puerto Rico.

Missing values may occur if ZIP code was entered incorrectly in Numident records, if ZIP code is absent from Numident records, or if place of death could not be matched to an identifiable USPS identifiable ZIP code.

\vspace{30 pt}

state_key <- data.frame(death_state = tolower(state.abb), label = state.name) %>% 
  rbind(c("pr", "Puerto Rico")) %>% 
  rbind(c("dc", "Washington DC"))


#death_state_tab 
tab <-  Numident_Geography_File %>% group_by(death_state) %>%
  tally() %>% 
  left_join(state_key, by = "death_state") %>% 
  mutate(`freq %` = round(n*100 / sum(n), 1)) %>% 
  select(death_state, label, n, `freq %`)
rows <- seq_len(nrow(tab) %/% 2)
death_state_tab  <- knitr::kable(list(tab[rows, 1:4],
                                      matrix(numeric(), nrow=0, ncol=1),
                                      tab[-rows, 1:4]),
                                 caption = "Deaths by State",
                                 format = "latex",
                                 booktabs = TRUE)  %>%
  kableExtra::kable_styling(latex_options = "HOLD_position")

death_state_tab

\newpage

\huge death_region \normalsize \vspace{12pt}

Label: Place of death, census region

Description: death_region is a character variable reporting a person's census region of death, as sourced from the ZIP code of their residence at time of death. This variable is available for persons dying within the 50 United States and the District of Columbia.

Missing values may occur if ZIP code was entered incorrectly in Numident records, if ZIP code is absent from Numident records, or if place of death could not be matched to an identifiable USPS ZIP code in the United States.

\vspace{70 pt}

deathregion_table <- table(Numident_Geography_File$death_region, useNA = "ifany")

deathregion_table <- as.data.frame(deathregion_table)

deathregion_table$Freq_perc <- round((deathregion_table$Freq/sum(deathregion_table$Freq))*100, digits = 1)
deathregion_table$label <- c("NA", "Midwest", "Northeast", "South", "West")

region_tab <- knitr::kable(deathregion_table %>% select(Var1, label, Freq, Freq_perc), 
      caption = "Deaths by Census Region",
      col.names = c("birth_region", "label", "n","freq %"),
      align = c("l", "r","l"),
      format = "latex",
      booktabs = TRUE) %>% 
  kable_styling(latex_options = "hold_position")

region_tab

\newpage

\huge death_country \normalsize \vspace{12pt}

Label: Place of death, country

Description: death_country is a character variable reporting a person's country of death, as determined by the ZIP code of their residence at time of death. While the vast majority of ZIP codes are within the U.S., countries that host U.S. military bases on their territories may also have ZIP codes associated with them. Codes follow \textcolor{blue}{ISO 3166-1 alpha-2 country codes}.

Missing values may occur if ZIP code being entered incorrectly in Numident records, if ZIP code is absent from Numident records, or if place of death could not be matched to an identifiable USPS identifiable ZIP code.

\newpage

\huge death_ruc1993 \normalsize \vspace{12pt}

Label: County of death, Rural-Urban Continuum Code

Description: death_ruc1993 is a numeric variable that reports the 1993 Rural-Urban Continuum Code for a person's county (or county equivalent) of death, as sourced from the United States Department of Agriculture (USDA). The Rural-Urban Continuum Codes are a classification system developed to categorize U.S. counties based on their degree of urbanization and adjacency to metropolitan areas. The system consists of ten codes, ranging from 0 to 9, where lower codes represent more urbanized counties and higher codes represent more rural areas. For more information, see: \textcolor{blue}{https://www.ers.usda.gov/data-products/rural-urban-continuum-codes.aspx}.

Missing values may occur if ZIP code was entered incorrectly in Numident records, if ZIP code is absent from Numident records, or if ZIP code could not be matched to a county.

\vspace{40 pt}

ruc1993_table <- table(Numident_Geography_File$death_ruc1993, useNA = "always")

ruc1993_table <- as.data.frame(ruc1993_table)

ruc1993_table <- ruc1993_table %>% 
  mutate(Rural_Urban_Label = factor(Var1, levels = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "NA"),
                                    labels = c("Central counties of metro areas of 1 million population or more", "Fringe counties of metro areas of 1 million population or more", "Counties in metro areas of 250,000 to 1 million population", "Counties in metro areas of fewer than 250,000 population", "Urban population of 20,000 or more, adjacent to a metro area", "Urban population of 20,000 or more, not adjacent to a metro area","Urban population of 2,500 to 19,999, adjacent to a metro area","Urban population of 2,500 to 19,999, not adjacent to a metro area","Rural or fewer than 2,500 urban population, adjacent to a metro area","Rural or fewer than 2,500 urban population, not adjacent to a metro area", "NA")))

ruc1993_table$Freq_perc <- round((ruc1993_table$Freq/sum(ruc1993_table$Freq))*100,digits = 1)

ruc1993_table <- ruc1993_table[, c(1, 3, 2, 4)]
ructab <- knitr::kable(ruc1993_table, 
      caption = "Rural-Urban County of Death",
      col.names = c("death_ruc1993", "label", "n","freq %"),
      align = c("l", "r", "l","l"),
      format = "latex",
      booktabs = TRUE) %>% 
  kable_styling(latex_options = "hold_position")


ructab


caseybreen/wcensoc documentation built on Nov. 21, 2024, 5:15 a.m.