knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette is based on this FiveThirtyEight article How to Tell Someone’s Age When All You Know Is Her Name. That article demonstrated how you could use actuarial life tables and a database of baby names to estimate the age of living Americans with a given name.
In this vignette we use the nzbabynames data set and the NZ complete cohort actuarial life tables to calculate the median age of individuals with a given first name.^[Like FiveThirtyEight we assume that people of the same sex die at the same rate regardless of their name - in practice will vary by a number of factors related to both name and life expectancy such as ethnicity].
Let's start with one name that is stereotypically 'old' - Ethel
library(ggplot2) library(matrixStats) library(dplyr) library(nzbabynames) library(stringr) library(lubridate) name_of_interest <- "Ethel" df <- life_tables %>% filter(percentile == "median") %>% filter(age == (lubridate::year(Sys.Date()) - yearofbirth)) %>% mutate(prop_still_alive = lx/100000, Year = yearofbirth, Sex = stringr::str_to_title(sex)) %>% right_join(filter(nzbabynames, Name == name_of_interest), by= c("Year", "Sex")) %>% mutate(prop_still_alive = case_when( Year < 1918 ~ 0, TRUE ~ prop_still_alive), Name = case_when( is.na(Name) ~ name_of_interest, TRUE ~ Name), Count = case_when( is.na(Count) ~ 0, TRUE ~ as.numeric(Count)), alive_count = Count * prop_still_alive) med_age <- df %>% summarise(wted_median = weightedMedian(age, w = alive_count, na.rm = TRUE)) max <- max(df$Count) peak <- df %>% filter(Count == max) ggplot(data = df, aes(x = Year)) + geom_col(aes(y = Count), fill="grey")+ geom_col(aes(y = alive_count), fill="cornflowerblue") + scale_x_continuous(limits = c(1900, 2016)) + coord_cartesian(expand = FALSE) + theme_classic()
r name_of_interest
peaked in r peak$Year
when r max
r name_of_interest
s were born. Based on the actuarial data, r round(sum(df$alive_count, na.rm = TRUE)/sum(df$Count, na.rm = TRUE)*100,0)
% of r name_of_interest
s born between 1900 and 2017 would still be alive in r lubridate::year(Sys.Date())
. The median age of a living r name_of_interest
would be r round(med_age$wted_median, 0)
years.
Let's contrast that with a 'younger' name - Brittany.
name_of_interest <- "Brittany" df <- life_tables %>% filter(percentile == "median") %>% filter(age == (lubridate::year(Sys.Date()) - yearofbirth)) %>% mutate(prop_still_alive = lx/100000, Year = yearofbirth, Sex = stringr::str_to_title(sex)) %>% right_join(filter(nzbabynames, Name == name_of_interest), by= c("Year", "Sex")) %>% mutate(prop_still_alive = case_when( Year < 1918 ~ 0, TRUE ~ prop_still_alive), Name = case_when( is.na(Name) ~ name_of_interest, TRUE ~ Name), Count = case_when( is.na(Count) ~ 0, TRUE ~ as.numeric(Count)), alive_count = Count * prop_still_alive) med_age <- df %>% summarise(wted_median = weightedMedian(age, w = alive_count, na.rm = TRUE)) max <- max(df$Count) peak <- df %>% filter(Count == max) ggplot(data = df, aes(x = Year)) + geom_col(aes(y = Count), fill="grey")+ geom_col(aes(y = alive_count), fill="cornflowerblue") + scale_x_continuous(limits = c(1900, 2016)) + coord_cartesian(expand = FALSE) + theme_classic()
r name_of_interest
peaked in r peak$Year
when r max
r name_of_interest
s were born. Based on the actuarial data, r round(sum(df$alive_count, na.rm = TRUE)/sum(df$Count, na.rm = TRUE)*100,0)
% of r name_of_interest
s born between 1900 and 2017 would still be alive in r lubridate::year(Sys.Date())
. The median age of a living r name_of_interest
would be r round(med_age$wted_median, 0)
years.
Ethel and Brittany are both names with pretty clear trajectories. But of course some names rise and fall over time. Let's pick one of these - Joseph
name_of_interest <- "Joseph" df <- life_tables %>% filter(percentile == "median") %>% filter(age == (lubridate::year(Sys.Date()) - yearofbirth)) %>% mutate(prop_still_alive = lx/100000, Year = yearofbirth, Sex = stringr::str_to_title(sex)) %>% right_join(filter(nzbabynames, Name == name_of_interest), by= c("Year", "Sex")) %>% mutate(prop_still_alive = case_when( Year < 1918 ~ 0, TRUE ~ prop_still_alive), Name = case_when( is.na(Name) ~ name_of_interest, TRUE ~ Name), Count = case_when( is.na(Count) ~ 0, TRUE ~ as.numeric(Count)), alive_count = Count * prop_still_alive) med_age <- df %>% summarise(wted_median = weightedMedian(age, w = alive_count, na.rm = TRUE)) max <- max(df$Count) peak <- df %>% filter(Count == max) ggplot(data = df, aes(x = Year)) + geom_col(aes(y = Count), fill="grey")+ geom_col(aes(y = alive_count), fill="cornflowerblue") + scale_x_continuous(limits = c(1900, 2016)) + coord_cartesian(expand = FALSE) + theme_classic()
r name_of_interest
peaked in r peak$Year
when r max
r name_of_interest
s were born. Based on the actuarial data, r round(sum(df$alive_count, na.rm=TRUE)/sum(df$Count, na.rm=TRUE)*100,0)
% of r name_of_interest
s born between 1900 and 2017 would still be alive in r lubridate::year(Sys.Date())
. The median age of a living r name_of_interest
would be r round(med_age$wted_median, 0)
years.
However, note that while there is a clear peak in r peak$Year
this name has had much more consistent consistent staying power than either Ethel or Brittany, meaning that you might be less confident about guessing the age of any given r name_of_interest
based on name alone. Put more formally, the interquartile range (and thus uncertainty) around the estimated age for r name_of_interest
is much greater than around Ethel or Brittany.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.