Summary: This vignette presents a method for calculating mortality differentials using Ordinary Least Squares (OLS) regression on age of death. The goal of this vignette is to highlight the need for adjustment factors for interpreting OLS Regression coefficients when working with truncated data and work through a stylized example using this method to explore the relationship between wage income and longevity in the CenSoc-Numident dataset.
Before getting started with the vignette, make sure to:
Follow the instructions in the ipumsr
vignette [TODO link] to merge the CenSoc-Numident and Census files
Install packages if necessary (use the install.packages()
function)
tidyverse
ipumsr
brooms
devtools
data.table
The original R notebook (.Rmd file) for this vignette can be downloaded here.
## Library Packages library(tidyverse) ## Tools for data analysis library(ipumsr) ## Reads in IPUMS data library(broom) ## Cleans regression output library(devtools) ## R package development tools library(data.table)## Tools for working with big data library(devtools) ## Set seed for reproducibility set.seed(0.4886) ## Source some files from a GitHub Repository devtools::source_url("https://github.com/caseybreen/numident_paper/tree/master/code/helper_functions.R?raw=TRUE")
Using OLS Regression on age of death with fixed effect terms for each year of birth is a straightforward method for analyzing CenSoc mortality data. There are a few specific considerations of using regression on age of death to analyze the CenSoc Mortality data. The CenSoc-Numident file only includes deaths occurring in the window of 1988-2005 (the period with high death coverage). As the left and right truncation ages vary by birth cohort, it is important to include fixed effect terms for each year of birth. Models of the form:
$$ Age_at_death = birth_year_dummy + covariates_of_interest $$
provide estimates of the effect of the covariates on the age of death in the sample, controlling for birth cohort truncation effects. In this example, we work with the cohorts of 1900-1920, so we need to include a fixed effect term for year of birth.
As the truncated window of deaths excludes the tails of the mortality distribution, any measurement of the average difference between groups will be downwardly biased. Imagine there are two types of lobsters: red lobsters with an average weight of 5 pounds and blue lobsters with an average weight of 2 pounds. Lobster fishers aren't able to sell lobsters that are too small or too large — so they catch only lobsters between 3 and 4 pounds. If we just look at the lobsters that are caught, we're observing only small red lobsters and big blue lobsters. If we compare the size of the caught lobsters, we would be understating the true difference in size between the red and blue lobsters. By the same logic, the coefficients in our OLS regression model will underestimate the size of the true effect.
{width=100%} For example, in the Figure above, we show two distributions of age of death in which population A (solid line) has an e65 — life expectancy at age 65 — of 18 years and population B (dashed line) has an e65 of 19 years. If we only observe deaths from ages 78 to 95, as we would for the cohort of 1910, the difference in these truncated means would only be 0.358, understating the e65 difference by a factor of 2.79. Bearing this in mind, we'll use 2.79 as our adjustment factor.
To account for this, we can calculate adjustment factors to help us more accurately interpret our regression coefficients. Users can do their full statistical analysis and then apply regression coefficients to help us better interpret our results.
We will step through calculating regression adjustment coefficients. First, we'll decide upon the appropriate parameters for the Gompertz model — a popular old age mortality model. In general, we can just use default values to calculate our mortality parameters. The Gompertz parameters can be left at their default values of $\beta = 0.1$ and $M = 84$ unless there is a reason to override these values based on external estimates.
For demonstration purposes, in this vignette, we'll use maximum likelihood estimation to calculate new Gompertz parameters from the birth cohort of 1910. We'll then use simulation to calculate adjustment factors for this data set. For a more detailed discussion of this maximum likelihood estimation method, see the “The Berkeley Unified Numident Mortality Database" working paper.
## Format data for maximum likelihood estimation censoc_numident_demo<-fread("/data/josh/CenSoc/censoc_data/censoc_linked_to_census/v2.1/censoc_numident_v2.1_linked.csv") deaths_tabulated <- censoc_numident_demo %>% filter(byear %in% 1910) %>% group_by(death_age) %>% summarize(deaths = sum(weight)) %>% filter(death_age %in% 78:94) %>% tibble::deframe() ## Maximum Likelihood Estimation x.left <- min(as.numeric(names(deaths_tabulated))) x.right <- max(as.numeric(names(deaths_tabulated))) + 1 ## Truncation out <- counts.trunc.gomp.est(deaths_tabulated, x.left = x.left, x.right = x.right, b.start = 1/9, M.start = 80) ## Get Gompertz Parameters (b.vec <- exp(out$par[1])) (M.vec <- exp(out$par[2])) ## Get adjustment Factor adjust.factor <- get.bunmd.adjust.factor(1905:1915, M = M.vec, b = b.vec, N = 1000000) #> [1] 0.09249799 #> [1] 84.30276 #> [1] "simulating: please be patient" #> [1] "regressing: please continue to be patient" #> [1] 3.44
The association between relative wage income and mortality in the US has been well-documented. Can we see this association in our CenSoc-Numident dataset?
To answer this question, we'll use the INCWAGE
variable. This is an imperfect measure of income, as it was only collected for salary and wage workers. Census enumerators in 1940 were explicitly instructed to "not include earnings of businessmen, farmers, or professional persons who depend upon business profits, sales of crops, or fees for income and who do not work for wages or salaries." (See instructions to enumeration for more detail). In this analysis, we account for this by only including persons with non-zero wages — other researchers may take a different approach.
The CenSoc-Numident file is already restricted to deaths occurring in the "high-coverage" period of 1988-2005. We further restrict this analysis to the birth cohorts of 1905 to 1915 and deaths for persons 65+ (these are the birth cohorts and ages at death for which we have the best death coverage).
## Calculate income deciles ## Restrict to cohorts of 1900-1920 censoc_incwage <- censoc_numident_demo %>% filter(byear %in% c(1905:1915)) %>% filter(dyear >= 65) %>% filter(INCWAGE > 0 & INCWAGE <= 5001) %>% ## INCWAGE is topcoded at $5001; higher values denote NA/Missing mutate(wage_decile = ntile(INCWAGE, 10)) %>% mutate(wage_decile = as.factor(wage_decile)) %>% mutate(educ_yrs = case_when( #EDUCD== 2 ~ 0, EDUCD== 14 ~ 1, EDUCD== 15 ~ 2, EDUCD== 16 ~ 3, EDUCD== 17 ~ 4, EDUCD== 22 ~ 5, EDUCD== 23 ~ 6, EDUCD== 25 ~ 7, EDUCD== 26 ~ 8, EDUCD== 30 ~ 9, EDUCD== 40 ~ 10, EDUCD== 50 ~ 11, EDUCD== 60 ~ 12, EDUCD== 70 ~ 13, EDUCD== 80 ~ 14, EDUCD== 90 ~ 15, EDUCD== 100 ~ 16, EDUCD== 110 ~ 17, EDUCD == 111 ~ 19, EDUCD == 112 ~ 20, EDUCD == 113 ~ 21, TRUE ~ 0 )) ## Prepare data for model censoc_incwage <- censoc_incwage %>% mutate(race_string = as_factor(RACE), bpl_string = as_factor(BPL), sex_string = as_factor(SEX), byear = as_factor(byear)) ## print out variables for model censoc_incwage %>% select(death_age, byear, sex_string, race_string, bpl_string, sex_string, educ_yrs) %>% sample_n(6)
| death_age| byear|sex_string |race_string |bpl_string | educ_yrs| |---------:|-----:|:----------|:----------------------------|:-------------|--------:| | 83| 1914|Male |White |Indiana | 8| | 86| 1913|Male |Black/African American |Mississippi | 4| | 79| 1911|Female |White |Maine | 12| | 78| 1915|Male |White |Tennessee | 12| | 73| 1914|Female |White |Massachusetts | 10| | 84| 1915|Male |White |Montana | 12|
Here, we'll run our regression and inflate our regression coefficients and standard errors by our adjustment factor of 3.1.
## Run linear model ## Use both IPUMS and CenSoc weights model <- lm(death_age ~ wage_decile + byear + race_string + sex_string + educ_yrs + bpl_string, data = censoc_incwage, weight = weight) ## Put model results into a data.frame model.df <- tidy(model, conf.int = T) %>% mutate(estimate_adj = estimate * adjust.factor, std.error_adj = std.error * adjust.factor) ## Select coefficients and standard errors from OLS model model.df <- model.df %>% filter(str_detect(term, "wage_decile")) %>% mutate(quantile = as.numeric(substr(term, 12, 15))) %>% select(quantile, value = estimate, value_adj = estimate_adj, std.error, std.error_adj = std.error_adj) %>% add_row(quantile = 1, value = 0, std.error = 0, value_adj = 0, std.error_adj = 0) ## Plot income_decile_ols_coefficients <- ggplot(data = model.df) + geom_pointrange(aes(ymin = value - std.error, ymax = value + std.error, y = value, x = quantile, color = "Raw")) + geom_pointrange(aes(ymin = value_adj - std.error_adj, ymax = value_adj + std.error_adj, y = value_adj, x = quantile, color = "Adjusted") ) + theme_bw(base_size = 18) + labs(title = "Income pattern of Longevity at Age 65", x = "Wage Income Decile", y = "Additional Years of Life") + scale_x_continuous(breaks=seq(0,17,2)) + scale_color_manual(name = "", values = c("Adjusted" = "black", "Raw" = "red")) + theme(legend.position = "bottom") ggsave(plot = income_decile_ols_coefficients, filename = "figures/income_decile_ols_coefficients.png")
The plot shows the association between relative wage income and longevity, after controlling for gender, race, place of birth, and years of education. The reference group for this plot is the first income decile. When interpreting these mortality differentials, we should keep in mind that the plotted adjusted regression coefficients and standard errors have been scaled-up by a factor of 3.22.
The CenSoc-Numident dataset allows researchers to explore broad patterns of mortality differentials in the United States. However, the truncated mortality distribution presents challenges for mortality estimation. To overcome this, we use OLS regression with adjustment factors to scale-up the magnitude of the regression coefficients to account for the downward bias caused by truncation.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.