
[^updated]: Last updated: `r format(Sys.time(), '%d %B, %Y')`

\vspace{-10truemm}

#knitr::opts_chunk$set(include=FALSE)

library(tidyverse)
library(data.table)
library(kableExtra)

Description

These files allow researchers to identify sibships in the Berkeley Unified Numident Mortality Database (BUNMD).

We locate sibships (sibling groups) in the BUNMD by matching individuals on their parents' first and last names as recorded in Social Security Numident records. Two methods are used to match siblings:

1) The exact match method identifies siblings only with exactly identical parent names (after names have undergone cleaning and standardization). This is the most stringent match method.

2) The flexible match method permits parents' names to be slightly different, within a threshold defined by Jaro–Winkler string distance, in addition to exact matches. This allows siblings to be matched even in cases of minor misspellings, mistranscriptions, or spelling variations in parents' names among sibships (e.g., mother's maiden name recorded as "Brannum" and "Branum"). This increases the number of siblings found, but has higher potential to falsely match unrelated individuals. Most (but not all) individuals and sibling connections identified using the exact match method are also identified using this method.
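The distance-based comparison behind the flexible match can be sketched as follows. This is an illustration only: it assumes the `stringdist` package is installed, and `max_dist` and `is_flexible_match()` are hypothetical names with a made-up cutoff, not the threshold or code used to build the released files.

```r
# Illustrative sketch; the cutoff below is hypothetical, not the BUNMD threshold.
library(stringdist)

# Two spellings of the same maiden name, plus an unrelated name for contrast
name_a <- "BRANNUM"
name_b <- "BRANUM"
name_c <- "SMITH"

# Jaro-Winkler distance: 0 means identical strings, values near 1 mean very different.
# p = 0.1 is the commonly used prefix weight for the Winkler adjustment.
d_close <- stringdist(name_a, name_b, method = "jw", p = 0.1)
d_far   <- stringdist(name_a, name_c, method = "jw", p = 0.1)

max_dist <- 0.1  # hypothetical cutoff chosen for this example

# TRUE when two names are within the chosen distance of each other
is_flexible_match <- function(x, y, threshold = max_dist) {
  stringdist(x, y, method = "jw", p = 0.1) <= threshold
}

is_flexible_match(name_a, name_b)  # minor spelling variation
is_flexible_match(name_a, name_c)  # unrelated names
```

A looser cutoff would accept more spelling variants but also more unrelated names, which is the trade-off described above.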

Sibships are identified among individuals in the BUNMD who died at age 65+ in the years 1988-2005. They may be of any gender composition and contain from 2 to 9 siblings each. An overview of the size of resulting sibships created by each method is presented below:

# Read BUNMD data
bunmd_sibs_exact <- fread("/global/scratch/p2p3/pl1_demography/censoc/censoc_data_releases/siblings/siblings_v2/bunmd_siblings_v2/bunmd_sibs_exact_match_v2.csv")
bunmd_sibs_flex <- fread("/global/scratch/p2p3/pl1_demography/censoc/censoc_data_releases/siblings/siblings_v2/bunmd_siblings_v2/bunmd_sibs_flexible_match_v2.csv")

# get basic metrics to make table below
nrow(bunmd_sibs_exact) # total size
bunmd_sibs_exact %>% group_by(sib_group_id_exact) %>% filter(row_number() == 1) %>% nrow() # number of groups
bunmd_sibs_exact %>% group_by(sib_group_id_exact) %>% mutate(n_sibs=n()) %>% ungroup() %>% 
  group_by(n_sibs) %>% tally() # group sizes


nrow(bunmd_sibs_flex) # total size
bunmd_sibs_flex %>% group_by(sib_group_id_flexible) %>% filter(row_number() == 1) %>% nrow() # number of groups
bunmd_sibs_flex %>% group_by(sib_group_id_flexible) %>% mutate(n_sibs=n()) %>% ungroup() %>% 
  group_by(n_sibs) %>% tally() # group sizes

df <- tibble(
             "Match method" = c("Exact match", "Flexible match"),
             "Number of individuals" = c("4,767,193", "6,252,614"),
             "Number of sibships" = c("2,130,398", "2,745,707"),
             "Mean sibship size" = c("2.24", "2.28")
             ) %>% 
  knitr::kable(format = "pipe")

df %>% kable_styling(full_width = FALSE)

For a detailed description of the sibling identification methods and the characteristics of the resulting sibships, please see the paper "Methods for Identifying Siblings in Administrative Mortality Data," available online at https://censoc.berkeley.edu/documentation/.

Usage

Each sibling dataset consists of two columns: a unique individual identifier, ssn (Social Security number), and an identifier for each sibling group (sib_group_id_exact in the exact match file, sib_group_id_flexible in the flexible match file). The ssn of one sibling within each sibship is used as the group identifier.
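As a rough illustration of this layout, the sketch below builds a toy table with made-up identifiers (not real Social Security numbers), counts the members of each sibship, and checks that every group identifier is the ssn of one of that group's members. The object names `toy_sibs` and `id_check` are hypothetical.

```r
library(data.table)

# Toy data in the same two-column layout as the exact match file
toy_sibs <- data.table(
  ssn                = c(101, 102, 103, 201, 202),
  sib_group_id_exact = c(101, 101, 101, 201, 201)
)

# Sibship size: number of records sharing a group identifier
toy_sibs[, n_sibs := .N, by = sib_group_id_exact]

# For each sibship, check that the group identifier appears among its members' ssn values
id_check <- toy_sibs[, .(id_is_member = sib_group_id_exact %in% ssn),
                     by = sib_group_id_exact]
id_check
```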

\vspace{15pt}

# Show the first rows of the exact match file
head(bunmd_sibs_exact, 5)


\vspace{15pt}

These files must be used in conjunction with the BUNMD. Users will need to merge the datasets using the unique identifier ssn, as in the example code below:

library(data.table)
# Read full BUNMD
bunmd <- data.table::fread("bunmd_v2.csv")
# Read BUNMD sibship IDs (flexible match file)
bunmd_sib_id <- data.table::fread("bunmd_sibs_flexible_match_v2.csv")
# Attach sibling IDs to the BUNMD, keeping all records in the BUNMD
bunmd_with_sibs <- merge(bunmd, bunmd_sib_id, by = "ssn", all.x = TRUE)
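After a merge of this kind, records without an identified sibling carry a missing group identifier. The sketch below shows one possible way to work with the result, using toy tables in place of the real files. The column `death_age` and all `toy_*` object names are assumptions made for illustration.

```r
library(data.table)

# Toy stand-ins for the two files; death_age is an assumed column for illustration
toy_bunmd  <- data.table(ssn = c(101, 102, 103, 104),
                         death_age = c(70, 82, 91, 66))
toy_sib_id <- data.table(ssn = c(101, 102),
                         sib_group_id_flexible = c(101, 101))

# Left join: keep every toy BUNMD record, attach a group ID where one exists
toy_merged <- merge(toy_bunmd, toy_sib_id, by = "ssn", all.x = TRUE)

# Records with no identified sibling have NA in the group identifier
toy_merged[, has_sibling := !is.na(sib_group_id_flexible)]

# Keep only records in a sibship, then compare each person's age at death
# with the mean age at death in their sibship
toy_sibships <- toy_merged[has_sibling == TRUE]
toy_sibships[, age_diff_from_sib_mean := death_age - mean(death_age),
             by = sib_group_id_flexible]
toy_sibships
```

Using `all.x = TRUE` keeps the full population available as a comparison group; dropping it would restrict the merged table to individuals with an identified sibling.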