In ekothe/nzbabynames: Data on baby names in New Zealand

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This vignette is based on this FiveThirtyEight article How to Tell Someone’s Age When All You Know Is Her Name. That article demonstrated how you could use actuarial life tables and a database of baby names to estimate the age of living Americans with a given name.

In this vignette we use the nzbabynames data set and the NZ complete cohort actuarial life tables to calculate the median age of individuals with a given first name.^[Like FiveThirtyEight we assume that people of the same sex die at the same rate regardless of their name - in practice will vary by a number of factors related to both name and life expectancy such as ethnicity].

An "old" name

Let's start with one name that is stereotypically 'old' - Ethel

library(ggplot2)
library(matrixStats)
library(dplyr)
library(nzbabynames)
library(stringr)
library(lubridate)

name_of_interest <- "Ethel"

df <- life_tables %>% 
  filter(percentile == "median") %>% 
  filter(age == (lubridate::year(Sys.Date()) - yearofbirth)) %>% 
  mutate(prop_still_alive = lx/100000, 
         Year = yearofbirth, 
         Sex = stringr::str_to_title(sex)) %>% 
  right_join(filter(nzbabynames, Name == name_of_interest), by= c("Year", "Sex")) %>% 
  mutate(prop_still_alive = case_when(
    Year < 1918 ~ 0, 
    TRUE ~ prop_still_alive), 
  Name = case_when(
    is.na(Name) ~ name_of_interest, 
    TRUE ~ Name), 
  Count = case_when(
    is.na(Count) ~ 0, 
    TRUE ~ as.numeric(Count)),
  alive_count = Count * prop_still_alive)

med_age <- df %>% 
  summarise(wted_median = weightedMedian(age, w = alive_count, na.rm = TRUE))

max <- max(df$Count)

peak <- df %>% 
  filter(Count == max)

ggplot(data = df, aes(x = Year)) +
  geom_col(aes(y = Count), fill="grey")+
  geom_col(aes(y = alive_count), fill="cornflowerblue") +
  scale_x_continuous(limits = c(1900, 2016)) +
  coord_cartesian(expand = FALSE) +
  theme_classic()

r name_of_interest peaked in r peak$Year when r max r name_of_interests were born. Based on the actuarial data, r round(sum(df$alive_count, na.rm = TRUE)/sum(df$Count, na.rm = TRUE)*100,0)% of r name_of_interests born between 1900 and 2017 would still be alive in r lubridate::year(Sys.Date()). The median age of a living r name_of_interest would be r round(med_age$wted_median, 0) years.

And a young one

Let's contrast that with a 'younger' name - Brittany.

name_of_interest <- "Brittany"

df <- life_tables %>% 
  filter(percentile == "median") %>% 
  filter(age == (lubridate::year(Sys.Date()) - yearofbirth)) %>% 
  mutate(prop_still_alive = lx/100000, 
         Year = yearofbirth, 
         Sex = stringr::str_to_title(sex)) %>% 
  right_join(filter(nzbabynames, Name == name_of_interest), by= c("Year", "Sex")) %>% 
  mutate(prop_still_alive = case_when(
    Year < 1918 ~ 0, 
    TRUE ~ prop_still_alive), 
  Name = case_when(
    is.na(Name) ~ name_of_interest, 
    TRUE ~ Name), 
  Count = case_when(
    is.na(Count) ~ 0, 
    TRUE ~ as.numeric(Count)),
  alive_count = Count * prop_still_alive)

med_age <- df %>% 
  summarise(wted_median = weightedMedian(age, w = alive_count, na.rm = TRUE))

max <- max(df$Count)

peak <- df %>% 
  filter(Count == max)

ggplot(data = df, aes(x = Year)) +
  geom_col(aes(y = Count), fill="grey")+
  geom_col(aes(y = alive_count), fill="cornflowerblue") +
  scale_x_continuous(limits = c(1900, 2016)) +
  coord_cartesian(expand = FALSE) +
  theme_classic()

Finally a less informative name

Ethel and Brittany are both names with pretty clear trajectories. But of course some names rise and fall over time. Let's pick one of these - Joseph

name_of_interest <- "Joseph"

df <- life_tables %>% 
  filter(percentile == "median") %>% 
  filter(age == (lubridate::year(Sys.Date()) - yearofbirth)) %>% 
  mutate(prop_still_alive = lx/100000, 
         Year = yearofbirth, 
         Sex = stringr::str_to_title(sex)) %>% 
  right_join(filter(nzbabynames, Name == name_of_interest), by= c("Year", "Sex")) %>% 
  mutate(prop_still_alive = case_when(
    Year < 1918 ~ 0, 
    TRUE ~ prop_still_alive), 
  Name = case_when(
    is.na(Name) ~ name_of_interest, 
    TRUE ~ Name), 
  Count = case_when(
    is.na(Count) ~ 0, 
    TRUE ~ as.numeric(Count)),
  alive_count = Count * prop_still_alive)

med_age <- df %>% 
  summarise(wted_median = weightedMedian(age, w = alive_count, na.rm = TRUE))

max <- max(df$Count)

peak <- df %>% 
  filter(Count == max)

ggplot(data = df, aes(x = Year)) +
  geom_col(aes(y = Count), fill="grey")+
  geom_col(aes(y = alive_count), fill="cornflowerblue") +
  scale_x_continuous(limits = c(1900, 2016)) +
  coord_cartesian(expand = FALSE) +
  theme_classic()

r name_of_interest peaked in r peak$Year when r max r name_of_interests were born. Based on the actuarial data, r round(sum(df$alive_count, na.rm=TRUE)/sum(df$Count, na.rm=TRUE)*100,0)% of r name_of_interests born between 1900 and 2017 would still be alive in r lubridate::year(Sys.Date()). The median age of a living r name_of_interest would be r round(med_age$wted_median, 0) years.

However, note that while there is a clear peak in r peak$Year this name has had much more consistent consistent staying power than either Ethel or Brittany, meaning that you might be less confident about guessing the age of any given r name_of_interest based on name alone. Put more formally, the interquartile range (and thus uncertainty) around the estimated age for r name_of_interest is much greater than around Ethel or Brittany.

ekothe/nzbabynames documentation built on Feb. 27, 2021, 5:55 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com