In bayesiandemography/demprep: Prepare Demographic Data

library(knitr)
library(demprep)
library(dplyr)
opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = TRUE,
  fig.width = 6.5,
  fig.align = "center"
)

births_in <- data.frame(date_child = c("2020-05-18", "2020-07-02", "2020-10-05"),
                        date_parent = c("2000-02-14", "2002-05-18", "1990-12-18"))

births_out <- data.frame(date_child = c("2020-03-14", "2020-08-22"),
                         date_parent = c("1968-01-23", "2008-03-05"))

Introduction

Demographic analyses typically use counts of people or events with standardised age groups and periods,

deaths_clean <- data.frame(age = rep(c("0-4", "5-9", "10-14"), times = 2),
                           year = rep(2020:2021, each = 3),
                           count = c(3, 5, 1, 4, 0, 2))
kable(deaths_clean,
      align = "c",
      caption = "Counts of deaths by age group and year")

However, raw data from registration systems or surveys typically reports individual events, with dates rather than age groups or periods,

deaths_raw <- data.frame(name = c("Alice", "Bilal", "Clara", "Djeneba", "Ergi", "Faisal"),
                         date_of_birth = c("2022-04-17", "2009-12-03", "2020-07-01",
                                           "2011-11-29", "2020-05-16", "2018-03-30"),
                         date_of_death = c("2022-07-19", "2021-10-18", "2022-14-16",
                                           "2022-08-07", "2020-06-03", "2020-01-13"))
kable(deaths_raw,
      caption = "Deaths by date of birth and date of death")

Even data that has been processed may use formats we do not want, or may contain gaps,

deaths_semiclean <- data.frame(age = c("Infant", "2 years", "12 years",
                                       "Infants", "1 year"),
                               year = c(2017, 2020, 2020,
                                        2021, 2021),
                               count = c(2, 1, 1, 3, 2))
kable(deaths_semiclean,
      caption = "Counts of deaths by age group and year")

Getting from raw or partly-processed data to data that is ready for demographic analysis can be deceptively difficult. Consider, for instance, the task of assigning people to one-month age groups. When does a person who was born on 31 January 2001 turn one month old? If we use the base R function seq.Date to calculate ages we obtain a rather counter-intuitive answer: 3 March 2001.

seq.Date(from = as.Date("2001-01-31"), 
         by = "month",
         length.out = 2)

Package demprep provides functions for processing data to the point where it is ready for demographic analysis. demprep focuses in particular on ages and periods, since this is typically the hardest part of preparing demographic data.

Background

Calculating age

In demography, as in ordinary English, "age" normally means "age in completed years". For instance, demographers describe a person who was born 5 years and 57 days ago as 5 years old. If, instead of years, demographers are measuring age in months, then they use age in completed months. Demographers describe a person who was born 3 months and 23 days ago as 3 months old.

Calculating ages is, however, complicated by the fact that the length of the units, as measured in days elapsed, is not constant. Leap years are one day longer than non-leap years, for instance, and May has one more day than April.

The basic rule used by demprep is that a person gains an extra month of age each time the person attains the day-of-the-month on which they were born. If a person was born on 5 March 2000, for instance, then the person turns one month old on 5 April 2000, turns 2 months old on 5 May 2000, turns 12 months old on 5 March 2001, and so on.

The basic rule needs to be extended, however, to deal with cases where a month does not contain the day-of-the-month on which a person was born. The basic rule does not, for instance, allow us to say when a person who was born on 31 January turns one month old, since February does not contain a 31st day.

To deal with such cases, demprep follows the principle that if the day-of-the-month when a person was born does not occur during a month, then the person gains the extra month of age on the first day of the next month. A person who is born on 31 January 2000 turns one month old on 1 March 2000.

More precisely, when provided with the year, month, and day-of-month of an event, and the year, month, and day-of-month of the person's birth, demprep calculates age in completed months at the time of the event as follows:

\begin{equation}
\begin{split}
\text{age in completed months} & = 12 \times (\text{year of event} - \text{year of birth}) \\
& \quad + \text{month of event} - \text{month of birth} \\
& \quad - I\{\text{day-of-month of event} < \text{day-of-month of birth}\},
\end{split}
\end{equation}

where $I\{\text{day-of-month of event} < \text{day-of-month of birth}\}$ means "1 if the day-of-month of the event is less than the day-of-month of the birth, and 0 otherwise". For a person born on 5 March 2000, for instance, the formula gives $0 + 1 - 1 = 0$ months on 4 April 2000 and $0 + 1 - 0 = 1$ month on 5 April 2000.

Age in completed years can be obtained by dividing age in completed months by 12, and discarding any remainder. If a person is aged 123, measured in completed months, then the person is aged 10, measured in completed years.
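The rules above can be sketched in base R. The helper `age_completed_months` below is a hypothetical function written for this illustration and is not part of demprep; it applies the month and day-of-month arithmetic, subtracting one month when the day-of-month of the event falls before the day-of-month of the birth.

```r
# Hypothetical helper, not a demprep function: age in completed months,
# based on months and days-of-month attained rather than days elapsed.
age_completed_months <- function(date, dob) {
  date <- as.POSIXlt(as.Date(date))
  dob <- as.POSIXlt(as.Date(dob))
  12L * (date$year - dob$year) +
    (date$mon - dob$mon) -
    (date$mday < dob$mday)  # TRUE counts as 1: birth day-of-month not yet attained
}

# Born 5 March 2000: 0 months on 4 April, 1 month on 5 April, 12 months on 5 March 2001
ages_march <- age_completed_months(c("2000-04-04", "2000-04-05", "2001-03-05"),
                                   "2000-03-05")
ages_march

# Born 31 January 2000: still 0 months on 29 February, 1 month on 1 March
ages_jan <- age_completed_months(c("2000-02-29", "2000-03-01"), "2000-01-31")
ages_jan

# Age in completed years: divide by 12 and discard the remainder
age_years <- 123L %/% 12L
age_years
```

Because the comparison uses days-of-month rather than days elapsed, a birth on the 31st is handled without any special case: the first day of the following month is the first date on which the month count has advanced.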

The demprep approach to calculating age requires accepting that the same age measured in months can correspond to different ages measured in total days elapsed, depending on the time of year. This variability is, however, unavoidable, if ages and periods are to line up in the way that people typically expect them to. We have to accept some variability if, for instance, people who are aged $n$ months on 1 February are to be aged $n+1$ months on 1 March, and $n+2$ months on 1 April.
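A quick base R check makes the variability concrete. Between 1 February and 1 March 2001 a person gains one month of age over 28 days, while the next month of age, ending on 1 April, takes 31 days. The variable names are made up for this illustration.

```r
# Days elapsed between successive first-of-month dates in 2001
first_of_month <- as.Date(c("2001-02-01", "2001-03-01", "2001-04-01"))
days_elapsed <- as.integer(diff(first_of_month))
days_elapsed
```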

Defining age groups, periods, and cohorts

Age groups and periods are two alternative ways of grouping people: one based on duration since birth, and the other based on calendar time. Cohorts are a third approach. A cohort is a collection of people who all experienced a particular event during a specified period. The particular event is normally birth, so that we might have a cohort composed of people born during the 1960s, or a cohort of people born during the year 2007. However, cohorts can also be defined by other events. People who married during the 1960s form a cohort, for instance, as do people who started high school during 2007.

One way of visualising age groups, periods, and cohorts is through a Lexis diagram. The figure below is a Lexis diagram. The horizontal axis represents time, and the vertical axis represents age. The horizontal grid lines mark out age groups. The age group "5-9", for instance, starts at exact age 5 and finishes just before exact age 10. The vertical grid lines mark out periods. The period "2005-2010", for instance, starts on 1 January 2005 and ends just before 1 January 2010. The diagonal lines mark out cohorts. The cohort "2000-2005", for instance, starts down the bottom left of the diagram, and ends at the top right, at which point the date is 1 January 2025 and cohort members are all in the age group 20-24.

date <- as.Date(c("2020-12-07", "2022-05-30"))
dob <- as.Date(c("2001-07-22", "2006-03-14"))
age <- c(19 + 4.5/12, 16 + 2.5/12)

breaks_time <- seq.Date(from = as.Date("2000-01-01"),
                        to = as.Date("2025-01-01"), 
                        by = "5 years")
breaks_age <- seq.int(from = 0L, 
                      to = 25L, 
                      by = 5L)
year <- as.integer(format(breaks_time, "%Y"))
labels_time <- demprep:::make_labels_period_custom(year,
                                                   include_na = FALSE)
labels_age <- demprep:::make_labels_age(breaks = breaks_age,
                              open_last = FALSE,
                              include_na = FALSE)
demprep:::plot_date_to_age_triangle(date = date,
                                    dob = dob,
                                    unit = "year",
                                    breaks_time = breaks_time,
                                    breaks_age = breaks_age,
                                    labels_time = labels_time,
                                    labels_age = labels_age,
                                    show_vert = TRUE,
                                    show_diag = TRUE,
                                    show_months = FALSE)
text(x = date,
     y = age,
     labels = expression(italic(A), italic(B)),
     pos = 4)

The two black diagonal lines in the diagram above are "life lines". They depict the lives of individuals A and B from the time of their births, in periods 2000-2005 and 2005-2010, to the time of their deaths, in 2020-2025.

Individuals A and B died in the same period (2020-2025). They also belonged to the same age group (15-19) when they died: A was aged 19 and B was aged 16. However, they belonged to two different cohorts: 2000-2005 versus 2005-2010. If all we knew about the deaths of A and B was the age group and period, then we would not be able to tell whether the deaths belonged to cohort 2000-2005 or to cohort 2005-2010.

Table: Assigning deaths to cohorts

| Individual | Age group | Period    | Cohort |
|:----------:|:---------:|:---------:|:------:|
| A          | 15-19     | 2020-2025 | ?      |
| B          | 15-19     | 2020-2025 | ?      |

This uncertainty about cohort membership is an example of a more general phenomenon. Information on the age group and period of an event only allows us to narrow the choice of cohorts down to two. To resolve the remaining uncertainty, we need one more piece of information. One way of encoding this information is through "Lexis triangles".

Consider again the Lexis diagram above. Lexis triangles are formed by the intersection of the horizontal, vertical, and diagonal lines. The triangles below the diagonal lines are known as lower Lexis triangles, and the triangles above the diagonal lines are known as upper Lexis triangles. The death of individual A belongs to an upper Lexis triangle, while the death of individual B belongs to a lower Lexis triangle. Events in an upper triangle belong to the earlier of the two possible cohorts, and events in a lower triangle belong to the later of the two cohorts.

Table: Assigning deaths to cohorts (with Lexis triangles)

| Individual | Age group | Period    | Lexis triangle | Cohort    |
|:----------:|:---------:|:---------:|:--------------:|:---------:|
| A          | 15-19     | 2020-2025 | Upper          | 2000-2005 |
| B          | 15-19     | 2020-2025 | Lower          | 2005-2010 |
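The way a Lexis triangle picks out one of the two candidate cohorts can be written as a small base R function. The function `cohort_from_triangle` below is a hypothetical helper for this illustration, not a demprep function. It assumes that age groups, periods, and cohorts all have the same width and that periods start on 1 January. The example uses the age group 5-9 and the period 2005-2010.

```r
# Hypothetical helper, not a demprep function
cohort_from_triangle <- function(age_lower, period_lower, triangle, width = 5L) {
  # People aged [age_lower, age_lower + width) during the period
  # [period_lower, period_lower + width) were born in one of two cohorts
  later_start <- period_lower - age_lower
  earlier_start <- later_start - width
  # Upper triangle: earlier cohort. Lower triangle: later cohort.
  start <- ifelse(triangle == "Upper", earlier_start, later_start)
  paste(start, start + width, sep = "-")
}

cohorts <- cohort_from_triangle(age_lower = 5L,
                                period_lower = 2005L,
                                triangle = c("Upper", "Lower"))
cohorts
```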

One point to note about age groups, periods, and cohorts is that, to accommodate all three within the same dataset, age, period, and cohort all need to be measured using intervals of the same length. In the Lexis diagram above, for instance, age groups, periods, and cohorts all have lengths of 5 years. Demographic data do not always come in this form. It is, for instance, common to have data with 5-year age groups and 1-year periods.

Labels for age groups, periods, and cohorts

The table below summarises the age group, period, and cohort labels produced by demprep. Dashes indicate combinations of unit, type, and grouping that demprep does not cater to.

Table: Examples of demprep labels for age groups, periods, and cohorts

| Unit    | Type       | Age group | Period      | Cohort      |
| :------ | :--------- | :-------: | :---------: | :---------: |
| year    | single     | "5"       | "2020"      | "2020"      |
| year    | multiple   | "5-9"     | "2025-2030" | "2025-2030" |
| year    | open left  | -         | -           | "<2020"     |
| year    | open right | "100+"    | -           | -           |
| quarter | single     | "20"      | "2020 Q1"   | "2020 Q1"   |
| quarter | multiple   | -         | -           | -           |
| quarter | open left  | -         | -           | "<2020 Q1"  |
| quarter | open right | "400+"    | -           | -           |
| month   | single     | "60"      | "2020 Jan"  | "2020 Jan"  |
| month   | multiple   | -         | -           | -           |
| month   | open left  | -         | -           | "<2020 Jan" |
| month   | open right | "1200+"   | -           | -           |

demprep creates labels for three different units: years, quarters, and months. demprep allows age groups to be open on the right, and allows cohorts to be open on the left. demprep allows labels composed of multiple years, but not labels composed of multiple quarters or months.

Producers of demographic data almost all follow a rule that multi-year age groups are labelled as "[lower limit]-[upper limit minus one]", so that, for instance, the interval between exact ages 5 and 10 is labelled "5-9". Producers of demographic data are much less consistent in the way they label multi-year periods and cohorts. A majority use a "[lower limit]-[upper limit]" format, so that the interval between 1 January 2020 and 1 January 2025 is labelled "2020-2025". Some, however, use a "[lower limit]-[upper limit minus one]" format, so that the same period is labelled "2020-2024". demprep uses "[lower limit]-[upper limit]" labels.
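The two labelling rules can be illustrated with a few lines of base R. This is only a sketch of the conventions described above, using made-up break points; it does not call any demprep functions.

```r
breaks_age <- seq(from = 0L, to = 20L, by = 5L)
breaks_period <- seq(from = 2000L, to = 2020L, by = 5L)

# Age groups: "[lower limit]-[upper limit minus one]"
labels_age <- paste(head(breaks_age, -1),
                    tail(breaks_age, -1) - 1L,
                    sep = "-")
labels_age

# Periods: "[lower limit]-[upper limit]", the convention demprep uses
labels_period <- paste(head(breaks_period, -1),
                       tail(breaks_period, -1),
                       sep = "-")
labels_period
```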

All standard labels for periods and cohorts denominated in years, in demprep and elsewhere, are ambiguous. One reason for the ambiguity is the fact that periods and cohorts do not always start on 1 January. In official statistics, for instance, it is common for periods to start on 1 July and end on 30 June. Labels such as "2015" or "2001-2006" do not distinguish between these possibilities.

A second type of ambiguity is specific to one-year periods and cohorts. A one-year period or cohort that starts on a date other than 1 January overlaps with two calendar years. For instance, a one-year period that starts on 1 July 2020 and ends on 30 June 2021 belongs partly to calendar year 2020 and partly to calendar year 2021. Some data producers label one-year periods and cohorts according to the calendar year at the start of the period or cohort, and others label them according to the calendar year at the end. Some data producers, for instance, would label a period starting on 1 July 2020 and ending on 30 June 2021 as "2020" and others would label it as "2021".

Consider, for instance, the label "2020". Restricting ourselves to periods that start on the first of the month, the label "2020" can be interpreted in 23 different ways:
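The count of 23 can be checked by enumeration in base R. The sketch below lists every candidate period: twelve start months under the calendar-year-at-start convention and twelve under the calendar-year-at-end convention, with the 1 January case counted once because the two conventions coincide there. The variable names are made up for this illustration.

```r
label <- 2020L
month <- 1:12

# Calendar-year-at-start: the period starts in the labelled year
year_if_start <- rep(label, 12)
# Calendar-year-at-end: the period ends in the labelled year, so it starts
# in the previous year unless it starts on 1 January
year_if_end <- ifelse(month == 1L, label, label - 1L)

candidates <- unique(data.frame(start_year = c(year_if_start, year_if_end),
                                start_month = rep(month, times = 2)))
candidates$start <- as.Date(sprintf("%d-%02d-01",
                                    candidates$start_year,
                                    candidates$start_month))
candidates$end <- as.Date(sprintf("%d-%02d-01",
                                  candidates$start_year + 1L,
                                  candidates$start_month)) - 1
nrow(candidates)
candidates[order(candidates$start), c("start", "end")]
```

The earliest candidate runs from 1 February 2019 to 31 January 2020, and the latest from 1 December 2020 to 30 November 2021.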

In principle, the best response to this ambiguity would be to use alternative labels that are unambiguous. demprep does in fact provide a way of producing unambiguous labels, via the as_date_range functions. Most of the time, however, it is easier to work within existing conventions, which is what most functions in demprep do.

Functions in demprep

Overview

Data preparation using demprep typically proceeds as follows:

include_graphics("workflow.png",
                 auto_pdf = TRUE)

0. Read in data and do initial tidying using non-demprep functions

Use functions from base R or elsewhere to read the data into a data frame, put any date variables into a "year-month-day" format, and tidy variables not related to age, period, or cohort.

1. Create or clean age, period, and cohort variables

If the original data contains dates, use functions such as date_to_age_year or date_to_period_multi to construct age, period, and cohort labels. If the original data already contains age, period, and cohort labels, use functions such as clean_age to convert them to demprep formats.

2. Ensure labels are consistent and complete

Process any age, period, and cohort labels using functions such as format_age_year and format_period_multi, to make sure the labels are consistent and have all the required levels.

There are, however, a few common data preparation tasks that do not fit neatly into the workflow above, and that can be done using demprep functions:

Imputing dates

When only partial information on dates is provided, the missing information can be imputed using functions impute_date and impute_dob.

Creating unambiguous labels

Create non-standard but unambiguous labels for periods or cohorts.

Switching the labeling of one-year periods and cohorts

Convert one-year period or cohort labels from using calendar-year-at-start to calendar-year-at-end, or vice versa.

The `date_to` functions

Overview

Suppose that we have some raw, individual-level data on dates of birth and dates of death:

deaths <- data.frame(name = c("Anwar", "Baptiste", "Candice"),
                     date_birth = c("2014-02-17", "2012-01-10", "2019-04-29"),
                     date_death = c("2019-10-11", "2020-02-27", "2020-08-01"))
deaths

We want to calculate the period when people were born, the period when they died, and their age at death. We can do calculations like these using the date_to functions.

Before running the functions, we need to load package demprep and (for the pipe %>% and various data manipulation functions) package dplyr.

library(demprep)
library(dplyr)

Age

The date_to_age functions calculate ages denominated in years, quarters, and months.

deaths %>%
  mutate(age_years = date_to_age_year(date = date_death,
                                      dob = date_birth),
         age_quarters = date_to_age_quarter(date = date_death,
                                            dob = date_birth),
         age_months = date_to_age_month(date = date_death,
                                        dob = date_birth))

Age is calculated using the approach discussed in Section 2.1 (Calculating age), based on months and days-of-month attained (as opposed to the total number of days elapsed). The date and dob (short for date-of-birth) arguments in the date_to functions can be "Date" vectors (as described in the Dates help page) or anything that function as.Date can automatically convert to a "Date" vector. A character vector can be safely converted to a date vector if it uses a "year-month-day" format, as in "2025-03-01".
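The point about date formats can be checked directly with base R's as.Date. The sketch below uses made-up values: the first string is already in "year-month-day" format, while the second is in day/month/year order and needs an explicit format argument before it can be passed to a date_to function.

```r
# "year-month-day" strings convert without extra arguments
d_iso <- as.Date("2025-03-01")
d_iso

# Other layouts need an explicit format; without one the string may be
# misread or rejected
d_dmy <- as.Date("01/03/2025", format = "%d/%m/%Y")
d_dmy
```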

Periods

By default, date_to_period_year creates periods that start on 1 January,

deaths %>%
  select(-date_birth) %>%
  mutate(year_jan = date_to_period_year(date = date_death))

which we might depict as

demprep:::plot_date_to_period_year(date = deaths$date_death)

However, other start dates are allowed, provided the start dates are the first day of the month,

deaths %>%
  select(-date_birth) %>%
  mutate(year_apr = date_to_period_year(date = date_death,
                                       month_start = "Apr"))

demprep:::plot_date_to_period_year(date = deaths$date_death,
                                   month_start = "Apr")

As discussed in Section 2.3 (Labels for age groups, periods, and cohorts), some data producers label single-year periods by the calendar year at the start of the period, and others by the calendar year at the end. date_to_period_year defaults to using calendar year at the start. The default can be overridden by setting label_year_start to FALSE.

deaths %>%
  select(-date_birth) %>%
  mutate(year_start = date_to_period_year(date = date_death,
                                          month_start = "Apr"),
         year_end = date_to_period_year(date = date_death,
                                        month_start = "Apr",
                                        label_year_start = FALSE))
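The labelling logic can be sketched in base R. The function `period_year_label` below is a hypothetical helper written for this illustration; it is not demprep's implementation, and it takes the start month as a number rather than a name. It finds the calendar year in which the period containing each date starts, then adds one if the label should refer to the calendar year at the end.

```r
# Hypothetical helper, not a demprep function
period_year_label <- function(date, month_start = 4L, label_year_start = TRUE) {
  lt <- as.POSIXlt(as.Date(date))
  year <- lt$year + 1900L
  month <- lt$mon + 1L
  # Calendar year in which the period containing 'date' starts
  year_start <- ifelse(month >= month_start, year, year - 1L)
  # Periods starting in January lie within one calendar year,
  # so the two conventions give the same label
  if (label_year_start || month_start == 1L) {
    as.character(year_start)
  } else {
    as.character(year_start + 1L)
  }
}

date_death <- c("2019-10-11", "2020-02-27", "2020-08-01")
labels_start <- period_year_label(date_death)
labels_end <- period_year_label(date_death, label_year_start = FALSE)
labels_start
labels_end
```

Under these assumptions, a death on 27 February 2020 falls in the period from 1 April 2019 to 31 March 2020, which is labelled "2019" under one convention and "2020" under the other.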

Labels for quarters and months are simpler than labels for years. Function date_to_period_quarter implements a single set of start dates (1 January, 1 April, 1 July, and 1 October), and a single labeling style.

deaths %>%
  select(-date_birth) %>%
  mutate(quarter = date_to_period_quarter(date_death))

date_to_period_month is similarly simple/inflexible.

deaths %>%
  select(-date_birth) %>%
  mutate(month = date_to_period_month(date_death))
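For comparison, labels in the same style as the "2020 Q1" and "2020 Jan" examples in Section 2.3 can be built in base R from the year, the quarter, and the month of each date. This is an illustrative sketch only and does not reproduce any checking that the demprep functions may do.

```r
date_death <- as.Date(c("2019-10-11", "2020-02-27", "2020-08-01"))

# Quarter labels: calendar year plus "Q1" to "Q4"
label_quarter <- paste(format(date_death, "%Y"), quarters(date_death))
label_quarter

# Month labels: calendar year plus English month abbreviation
# (month.abb avoids locale-dependent month names)
label_month <- paste(format(date_death, "%Y"),
                     month.abb[as.POSIXlt(date_death)$mon + 1L])
label_month
```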

Cohorts

The date_to_cohort functions work like their date_to_period equivalents,

deaths %>%
  select(-date_death) %>%
  mutate(cohort = date_to_cohort_year(date = date_birth,
                                      month_start = "Apr",
                                      label_year_start = FALSE))

demprep:::plot_date_to_cohort_year(date = deaths$date_birth)

Lexis triangles

To calculate Lexis triangles, we need dates of events and dates of birth,

deaths %>%
  mutate(triangle = date_to_triangle_year(date = date_death,
                                          dob = date_birth))

demprep:::plot_date_to_triangle_year(date = deaths$date_death,
                                     dob = deaths$date_birth)

Assigning events to Lexis triangles with real data, where we know dates but not precise times, and where months have different lengths, involves some tricky edge cases. The date_to_triangle functions resolve these edge cases by looking at the date that a person enters a new age group during the period in question. If the date when the person enters the new age group is greater than the date when the event occurs, then the event is allocated to an upper Lexis triangle. If the date when the person enters the new age group is equal to or less than the date when the event occurs, then the event is allocated to a lower Lexis triangle.
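For the simplest case, one-year age groups and one-year periods starting on 1 January, the rule can be sketched in base R as follows. The function `triangle_year_sketch` is a hypothetical helper for this illustration, not demprep's implementation. It compares month and day-of-month, so a person born on 29 February is treated as entering the new age group on 1 March in non-leap years, in line with the approach to age described earlier. The dates are made up.

```r
# Hypothetical helper, not a demprep function
triangle_year_sketch <- function(date, dob) {
  date <- as.POSIXlt(as.Date(date))
  dob <- as.POSIXlt(as.Date(dob))
  # Has the person already entered the new age group in the year of the event?
  entered <- (date$mon > dob$mon) |
    (date$mon == dob$mon & date$mday >= dob$mday)
  ifelse(entered, "Lower", "Upper")
}

# Event after the birthday, then event before the birthday
tri_basic <- triangle_year_sketch(date = c("2019-10-11", "2019-03-01"),
                                  dob = c("2014-02-17", "2014-06-15"))
tri_basic

# Born on 29 February: enters the new age group on 1 March 2001
tri_leap <- triangle_year_sketch(date = c("2001-02-28", "2001-03-01"),
                                 dob = "2000-02-29")
tri_leap
```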

The `clean` functions

If the input data for an analysis come from published sources, then they probably already contain labels for age groups, periods, and cohorts rather than precise dates. These labels may, however, require some modification before they match the formats expected by the dem packages. demprep contains a number of functions to help with the cleaning process.

Functions clean_age, clean_period, and clean_cohort try to parse vectors of labels and, where necessary, convert them to dem formats. If clean_age, clean_period, and clean_cohort encounter a label they cannot parse, they leave the label untouched.

x <- c("20 years", "80 and over", "young", "20-24")
clean_age(x)

Function clean_age assumes that labels consisting entirely of multiples of 5 refer to 5-year age groups,

x <- seq(0, 60, 5)
x
clean_age(x)

It also assumes that labels consisting of the numbers 0, 1, 5, 10, ..., come from a life table,

x <- c(0, 1, seq(60, 5, -5))
x
clean_age(x)

Functions clean_cohort and clean_period are identical to each other, except that clean_cohort accepts intervals that are open on the left,

x <- c("Q1 2020", "1922", "2010-2025", "before 2020")
clean_cohort(x)

while clean_period does not.

clean_period(x)

Functions clean_age_df, clean_period_df, and clean_cohort_df produce data frames describing how functions clean_age, clean_period, and clean_cohort interpret a set of labels,

x <- c("2021", "2022-2025", "2021", "q2 2020")
clean_period_df(x)

Functions is_valid_age, is_valid_period, and is_valid_cohort can be used to check whether individual labels are already in a valid demprep format,

x <- c("2021", "the 1960s", "1960-1970")
is_valid_period(x)

Some version of

stopifnot(all(is_valid_period(x)))

may be helpful for catching problems.

The `format` functions

Overview

Even after all dates in a dataset have been turned into age, period, and cohort labels, and all labels have been converted to demprep style, further processing of the labels may still be useful. In particular, it may be useful to consolidate labels, and fill in gaps. We may wish to turn

| age   | count |
|-------|-------|
| 0-2   | 4     |
| 4     | 5     |
| 10-12 | 3     |

into

| age   | count |
|-------|-------|
| 0-4   | 4     |
| 0-4   | 5     |
| 10-14 | 3     |

We may even wish to go further, and have some way of capturing the fact that there is an age group between 0-4 and 10-14, even if this dataset does not happen to contain any observations from it.
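The consolidation and gap-filling can be sketched in base R. The code below is an illustration only, not how the format functions are implemented, and it assumes that every input label lies wholly inside a single 5-year age group. It takes the lower limit of each label, rounds it down to a multiple of 5, and then builds a factor whose levels cover every 5-year group in the range, including ones with no observations.

```r
x <- c("0-2", "4", "10-12")

# Lower limit of each label, rounded down to a multiple of 5
lower <- as.integer(sub("-.*$", "", x))
start <- 5L * (lower %/% 5L)
labels <- paste(start, start + 4L, sep = "-")
labels

# Levels for every 5-year group in the range, whether observed or not
all_starts <- seq(from = min(start), to = max(start), by = 5L)
age <- factor(labels,
              levels = paste(all_starts, all_starts + 4L, sep = "-"))
table(age)
```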

The format functions take vectors of age, period, cohort, and Lexis triangle labels that follow demprep conventions, and return factors where the intervals have standardised lengths, and where all intermediate categories, including ones that do not appear in the data, are included.

Functions ending in multi create multi-year intervals, such as 5-year or 10-year periods. Functions ending in custom also create multi-year labels, but these intervals do not all have to have the same length. Function format_age_lifetab creates special age groups for life tables, and function format_age_births creates age groups for birth counts or rates.

The format functions all return factors that contain intermediate values, including values that are not represented in the data. (Factors are R's way of representing categorical variables. See the R base function factor and also the tidyverse package forcats.) Consider, for instance, the vector x,

x <- c("0-4", "10-14")

which omits the value "5-9". The function format_age_multi creates a factor with levels that include 5-9.

format_age_multi(x)

If argument x has an NA, then the levels of the factor created by a format function will also have NA. This behavior is different from the default behavior for function factor, which is to silently drop NAs. The philosophy of the format functions is that it is better to explicitly deal with NAs.
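The difference from base R's default can be seen with function factor itself. The sketch below uses a made-up vector and shows the default, where NA is excluded from the levels, next to a factor built with `exclude = NULL`, where NA is kept as a level. It illustrates the base R behaviour only and does not call the format functions.

```r
x <- c("0-4", "10-14", NA)

# Default: NA is not a level
f_default <- factor(x)
levels(f_default)

# Keeping NA as an explicit level, plus the intermediate level "5-9"
f_explicit <- factor(x,
                     levels = c("0-4", "5-9", "10-14", NA),
                     exclude = NULL)
levels(f_explicit)
table(f_explicit, useNA = "ifany")
```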

Age groups

By default, format_age functions create age groups between 0 and 100+,

x <- c("35-38", "50-54", "77")
format_age_multi(x)

Alternative upper and lower limits can be obtained using arguments break_min and break_max,

format_age_multi(x, 
                 break_min = 25, 
                 break_max = 90)

Setting break_min and break_max to NULL allows the data to determine the limits.

format_age_multi(x,
                 break_min = NULL,
                 break_max = NULL)

By default, the final age group is open (i.e. has no upper limit), but this can be changed using the open_last argument,

format_age_multi(x, open_last = FALSE)

The default width for multi-year age groups is 5. Alternative values are obtained using the width argument,

format_age_multi(x, width = 20)

Age groups of one year, one quarter, and one month can be generated using functions format_age_year, format_age_quarter, and format_age_month. Age groups with arbitrary widths (measured in years) can be generated using function format_age_custom,

format_age_custom(x, breaks = c(15, 40, 80))

Function format_age_lifetab creates the special age groups needed for an "abridged" life table (i.e. a life table with age groups "0", "1-4", "5-9", "10-14", "15-19", etc.),

format_age_lifetab(x)

Function format_age_births is designed for tabulations of births,

x <- c("22", "30-33", "18", "40-44")
format_age_births(x)

and can be used to recode ages that fall outside the expected range,

x <- c("10", "30-33")
format_age_births(x, recode_up = TRUE)

Periods

In contrast to the format_age functions, the format_period functions do not have break_min and break_max arguments. Instead, the range of the labels is always determined by the data.

x <- c(2018, 2015, 2021)
format_period_year(x)

By default, format_period_multi creates periods that are aligned to the year 2000,

df <- data.frame(x = c("2002", "1996", "2027-2028"))
df %>% 
  mutate(width5 = format_period_multi(x),
         width7 = format_period_multi(x, width = 7))

Periods that align with different years can be obtained by varying the origin argument,

df %>% 
  mutate(period = format_period_multi(x, origin = 2021))
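The effect of the origin and width arguments can be sketched with integer arithmetic in base R. The function `period_multi_label` below is a hypothetical helper for this illustration, not demprep's implementation; it handles single years only and assumes periods start on 1 January. Each year is assigned to the interval of the given width that contains it, with interval boundaries falling at the origin plus or minus whole multiples of the width.

```r
# Hypothetical helper, not a demprep function
period_multi_label <- function(year, width = 5L, origin = 2000L) {
  # %/% rounds towards minus infinity, so years before the origin work too
  start <- origin + width * ((year - origin) %/% width)
  paste(start, start + width, sep = "-")
}

year <- c(2002L, 1996L)
labels_2000 <- period_multi_label(year)
labels_width7 <- period_multi_label(year, width = 7L)
labels_2021 <- period_multi_label(year, origin = 2021L)
labels_2000
labels_width7
labels_2021
```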

Functions format_period_multi and format_period_custom sometimes need extra help from users to correctly interpret labels for single-year periods. By default, format_period_multi and format_period_custom assume that all periods start on 1 January, and that single-year periods are labelled according to the calendar year at the start. The first assumption can be overridden using the month_start argument, and the second assumption can be overridden using the label_year_start argument.

Here is how format_period_multi and format_period_custom interpret the labels "2050" and "2050-2055" when month_start is "January" and label_year_start is TRUE:

| Label       | Interpretation                     |
|:------------|:-----------------------------------|
| "2050"      | 1 January 2050 to 31 December 2050 |
| "2050-2055" | 1 January 2050 to 31 December 2054 |

Here is how format_period_multi and format_period_custom interpret the labels "2050" and "2050-2055" when month_start is "July" and label_year_start is FALSE:

| Label       | Interpretation              |
|:------------|:----------------------------|
| "2050"      | 1 July 2049 to 30 June 2050 |
| "2050-2055" | 1 July 2050 to 30 June 2055 |

Under the default settings, "2050" belongs to the period "2050-2055". Under the alternative settings, it does not. Period labels are tricky.
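The interpretation of a single-year label can be sketched in base R. The function `single_year_range` below is a hypothetical helper for this illustration, not a demprep function, and it takes the start month as a number. It reproduces the two readings of "2050" given in the tables above.

```r
# Hypothetical helper, not a demprep function
single_year_range <- function(label, month_start = 1L, label_year_start = TRUE) {
  year <- as.integer(label)
  # With calendar-year-at-end labels and a start month other than January,
  # the period starts in the year before the labelled year
  if (!label_year_start && month_start != 1L) {
    year <- year - 1L
  }
  start <- as.Date(sprintf("%d-%02d-01", year, month_start))
  end <- as.Date(sprintf("%d-%02d-01", year + 1L, month_start)) - 1
  data.frame(label = label, start = start, end = end)
}

range_default <- single_year_range("2050")
range_july <- single_year_range("2050", month_start = 7L, label_year_start = FALSE)
range_default
range_july
```

Under the second reading, the period labelled "2050" ends on 30 June 2050, the day before the multi-year period "2050-2055" begins, which is why it falls outside that period.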

Cohorts

The format_cohort functions work exactly like the equivalent format_period functions, except that the format_cohort functions permit intervals to be open on the left.

x <- c(1993, 1992, 1984)
format_cohort_year(x, break_min = 1990)

A birth cohort with no lower limit is equivalent to an age group with no upper limit. At the end of 2019, for instance, the cohort "<1920" is equivalent to the age group "100+".

Lexis triangles

The format_triangle functions produce Lexis triangles to accompany the age groups and periods produced by format_age and format_period functions. There is, however, an extra complication when reformatting Lexis triangles. Lexis triangle labels "Lower" and "Upper" can only be interpreted in combination with the relevant age groups and periods. The format_triangle functions therefore need information on the age groups and periods that defined the original Lexis triangles. This information is supplied via the age and period arguments.

Other functions

Imputing dates

Sometimes date information is incomplete, as when a data source gives years and months of birth, but not days. Functions impute_date and impute_dob can be used to impute values for the missing variables.

The imputation is random, so, for reproducibility, we set the random seed.

set.seed(0)

To impute dates when we know the year and month, we use

impute_date(year = c(2000, 2005, 2003),
            month = c("Feb", "Nov", "Apr"))

To impute dates of birth when we know ages at later events, we use

impute_dob(date = c("2021-03-23", "2021-02-13", "2020-04-25"),
           age_years = c(3, 1, 0))
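To give a sense of what random imputation of a date involves, here is a minimal base R sketch. The function `sketch_impute_date` is hypothetical and is not demprep's implementation. It assumes, purely for illustration, that the missing day is drawn uniformly from the days of the known month, and it takes months as numbers rather than names.

```r
# Hypothetical helper, not a demprep function
sketch_impute_date <- function(year, month) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  # First day of the following month, rolling over into the next year after December
  next_first <- as.Date(sprintf("%d-%02d-01",
                                year + (month == 12L),
                                month %% 12L + 1L))
  n_days <- as.integer(next_first - first)
  # Add a random whole number of days between 0 and n_days - 1
  first + floor(runif(length(first)) * n_days)
}

set.seed(0)
imputed <- sketch_impute_date(year = c(2000L, 2005L, 2003L),
                              month = c(2L, 11L, 4L))
imputed
```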

Labels based on date ranges

One way of dealing with the ambiguities of standard period and cohort labels is to switch to non-standard labels that are less ambiguous. The functions

|                       |
|-----------------------|
| as_date_range_year    |
| as_date_range_multi   |
| as_date_range_custom  |
| as_date_range_quarter |
| as_date_range_month   |

all convert standard labels into ones that use explicit dates.

By default, month_start is set to "Jan" and label_year_start is set to TRUE,

x <- c("2022", "2028")
as_date_range_year(x)

But these can be changed, to produce alternative translations of the same inputs.

x <- c("2022", "2028")
as_date_range_year(x, 
                   month_start = "Mar",
                   label_year_start = FALSE)

Converting to date-range formats is useful when working with multiple data sources, where it can be difficult to keep track of different labeling conventions.

Flipping year labels

Single-year period or cohort labels sometimes need to be converted from a calendar-year-at-start format to a calendar-year-at-end format, or vice versa. These conversions are confusing, and easy to get wrong. Functions flip_to_start and flip_to_end try to make the process a little easier.

x <- c("2001", "2006", "2013")
flip_to_end(x, month_start = "Apr")

Examples

We look at two examples: one where we start with raw individual-level data, and one where we start with pre-tabulated data.

For reproducibility, we set the random seed.

set.seed(0)

Births in Iceland

We start with a dataset consisting of dates of births of children, and dates of birth of the children's fathers,

births <- demprep::icebirths
births %>%
  sample_n(5)

(The dataset was generated from published tabulations from Statistics Iceland using functions impute_date and impute_dob.)

We want to create counts of births by period of birth, age of father, and cohort of father. We want the periods, age groups, and cohorts to have lengths of 5 years. Periods and cohorts start on 1 January, and align with year 2001.

First we use the dates to assign births to single-year periods, age groups, and cohorts,

births <- births %>%
  mutate(period1 = date_to_period_year(date = dob_child),
         age1 = date_to_age_year(date = dob_child,
                                 dob = dob_father),
         cohort1 = date_to_cohort_year(date = dob_father))
births %>%
  sample_n(5)

Next we turn the single-year intervals into 5-year intervals.

To create periods and cohorts, we accept the default values for width and month_start, but override the default value for origin,

births <- births %>%
  mutate(period = format_period_multi(period1,
                                      origin = 2001),
         cohort = format_cohort_multi(cohort1,
                                      origin = 2001))
births %>%
  sample_n(5)

We allow the lower and upper limits for age of father to be set by the data, rather than the default values of 15 and 50, but otherwise accept the defaults for function format_age_births,

births <- births %>%
  mutate(age = format_age_births(x = age1, 
                                 break_min = NULL,
                                 break_max = NULL))
births %>% 
  sample_n(5)

To finish up, we convert from individual-level data to a tabulation,

births <- births %>%
  count(age, period, cohort, name = "count")
births %>% 
  sample_n(5)

Counts of deaths in New Zealand

We start with tabulated data downloaded from the Statistics New Zealand website,

deaths <- demprep::nzdeaths
deaths %>%
  sample_n(5)

We want to clean up the age groups. The original labels are

unique(deaths$age)

We get rid of the "total" category,

deaths <- deaths %>%
  filter(age != "Total all ages")
unique(deaths$age)

We apply function clean_age,

deaths <- deaths %>%
  mutate(age = clean_age(age))
unique(deaths$age)

We merge the "0" and "1-4" age groups, and set the maximum age group to "90+".

deaths <- deaths %>%
  mutate(age = format_age_multi(age, 
                                break_max = 90))
unique(deaths$age)

After changing the age labels, we have multiple rows with the same combination of age, sex, and year. For instance,

deaths %>%
  filter(age == "0-4",
         sex == "Female",
         year == 2020)

So we consolidate,

deaths <- deaths %>%
  count(age, sex, year, wt = count, name = "count")
deaths %>%
  sample_n(5)

The duplicates are gone.

deaths %>%
  filter(age == "0-4",
         sex == "Female",
         year == 2020)

bayesiandemography/demprep documentation built on Dec. 28, 2021, 8:47 a.m.
