library(knitr)
library(demprep)
library(dplyr)
opts_chunk$set(collapse = TRUE, comment = "#>", echo = TRUE, fig.width = 6.5, fig.align = "center")
births_in <- data.frame(date_child = c("2020-05-18", "2020-07-02", "2020-10-05"),
                        date_parent = c("2000-02-14", "2002-05-18", "1990-12-18"))
births_out <- data.frame(date_child = c("2020-03-14", "2020-08-22"),
                         date_parent = c("1968-01-23", "2008-03-05"))
Demographic analyses typically use counts of people or events with standardised age groups and periods,
deaths_clean <- data.frame(age = rep(c("0-4", "5-9", "10-14"), times = 2),
                           year = rep(2020:2021, each = 3),
                           count = c(3, 5, 1, 4, 0, 2))
kable(deaths_clean, align = "c", caption = "Counts of deaths by age group and year")
However, raw data from registration systems or surveys typically reports individual events, with dates rather than age groups or periods,
deaths_raw <- data.frame(name = c("Alice", "Bilal", "Clara", "Djeneba", "Ergi", "Faisal"),
                         date_of_birth = c("2022-04-17", "2009-12-03", "2020-07-01", "2011-11-29", "2020-05-16", "2018-03-30"),
                         date_of_death = c("2022-07-19", "2021-10-18", "2022-14-16", "2022-08-07", "2020-06-03", "2020-01-13"))
kable(deaths_raw, caption = "Deaths by date of birth and date of death")
Even data that has been processed may use formats we do not want, or may contain gaps,
deaths_semiclean <- data.frame(age = c("Infant", "2 years", "12 years", "Infants", "1 year"),
                               year = c(2017, 2020, 2020, 2021, 2021),
                               count = c(2, 1, 1, 3, 2))
kable(deaths_semiclean, caption = "Counts of deaths by age group and year")
Getting from raw or partly-processed data to data that is ready for demographic analysis can be deceptively difficult. Consider, for instance, the task of assigning people to one-month age groups. When does a person who was born on 31 January 2001 turn one month old? If we use the base R function seq.Date
to calculate ages we obtain a rather counter-intuitive answer: 3 March 2001.
seq.Date(from = as.Date("2001-01-31"), by = "month", length.out = 2)
Package demprep provides functions for processing data to the point where it is ready for demographic analysis. demprep focuses in particular on ages and periods, since this is typically the hardest part of preparing demographic data.
In demography, as in ordinary English, "age" normally means "age in completed years". For instance, demographers describe a person who was born 5 years and 57 days ago as 5 years old. If, instead of years, demographers are measuring age in months, then they use age in completed months. Demographers describe a person who was born 3 months and 23 days ago as 3 months old.
Calculating ages is, however, complicated by the fact that the length of the units, as measured in days elapsed, is not constant. Leap years are one day longer than non-leap years, for instance, and May has one more day than April.
The basic rule used by demprep is that a person gains an extra month of age each time the person attains the day-of-the-month on which they were born. If a person was born on 5 March 2000, for instance, then the person turns one month old on 5 April 2000, turns 2 months old on 5 May 2000, turns 12 months old on 5 March 2001, and so on.
The basic rule needs to be extended, however, to deal with cases where a month does not contain the day-of-the-month on which a person was born. The basic rule does not, for instance, allow us to say when a person who was born on 31 January turns one month old, since February does not contain a 31st day.
To deal with such cases, demprep follows the principle that if the day-of-the-month on which a person was born does not occur during a month, then the person gains the extra month of age on the first day of the next month. A person who is born on 31 January 2000 turns one month old on 1 March 2000.
More precisely, when provided with the year, month, and day-of-month of an event, and the year, month, and day-of-month of the person's birth, demprep calculates age in completed months at the time of the event as follows:
\begin{equation}
\begin{split}
\text{age in completed months} & = 12 \times (\text{year of event} - \text{year of birth}) \\
& \quad + \text{month of event} - \text{month of birth} \\
& \quad - I\{\text{day-of-month of event} < \text{day-of-month of birth}\},
\end{split}
\end{equation}
where $I\{\text{day-of-month of event} < \text{day-of-month of birth}\}$ means "1 if the day-of-month of the event is less than the day-of-month of the birth, and 0 otherwise".
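As a rough illustration, the calculation can be written in a few lines of base R. The function name `age_completed_months` is hypothetical, and the code is a sketch of the rule described in this section rather than demprep's own implementation. It assumes that the event does not precede the birth.

```r
# Sketch only: age in completed months, following the rule described above.
# `age_completed_months` is a hypothetical helper, not part of demprep.
age_completed_months <- function(date, dob) {
  date <- as.POSIXlt(as.Date(date))
  dob <- as.POSIXlt(as.Date(dob))
  # one month is subtracted if the birth day-of-month has not yet been reached
  not_reached <- date$mday < dob$mday
  12L * (date$year - dob$year) + (date$mon - dob$mon) - as.integer(not_reached)
}

age_completed_months("2000-04-05", "2000-03-05")  # born 5 March, event on 5 April
age_completed_months("2000-02-29", "2000-01-31")  # born 31 January, event on 29 February
age_completed_months("2000-03-01", "2000-01-31")  # born 31 January, event on 1 March
```

The three calls return 1, 0, and 1, matching the examples given in the text.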
Age in completed years can be obtained by dividing age in completed months by 12, and discarding any remainder. If a person is aged 123, measured in completed months, then the person is aged 10, measured in completed years.
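In R, "dividing and discarding any remainder" corresponds to the integer-division operator `%/%`. A minimal worked example, using the numbers from the paragraph above:

```r
age_months <- 123L
# integer division drops the 3 leftover months
age_years <- age_months %/% 12L
age_years
```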
The demprep approach to calculating age requires accepting that the same age measured in months can correspond to different ages measured in total days elapsed, depending on the time of year. This variability is, however, unavoidable, if ages and periods are to line up in the way that people typically expect them to. We have to accept some variability if, for instance, people who are aged $n$ months on 1 February are to be aged $n+1$ months on 1 March, and $n+2$ months on 1 April.
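The variability is easy to see in base R. In the sketch below, a person gains one month of age between 1 February and 1 March, and another between 1 March and 1 April, but the two intervals contain different numbers of days. The year 2001 is chosen only for illustration.

```r
feb_to_mar <- as.Date(c("2001-02-01", "2001-03-01"))
mar_to_apr <- as.Date(c("2001-03-01", "2001-04-01"))
# number of days elapsed while age increases by exactly one month
days_feb <- as.integer(diff(feb_to_mar))
days_mar <- as.integer(diff(mar_to_apr))
c(days_feb, days_mar)
```

The first interval has 28 days and the second has 31.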
Age groups and periods are two alternative ways of grouping people: one based on duration since birth, and the other based on calendar time. Cohorts are a third approach. A cohort is a collection of people who all experienced a particular event during a specified period. The particular event is normally birth, so that we might have a cohort composed of people born during the 1960s, or a cohort of people born during the year 2007. However, cohorts can also be defined by other events. People who married during the 1960s form a cohort, for instance, as do people who started high school during 2007.
One way of visualising age groups, periods, and cohorts is through a Lexis diagram. The figure below is a Lexis diagram. The horizontal axis represents time, and the vertical axis represents age. The horizontal grid lines mark out age groups. The age group "5-9", for instance, starts at exact age 5 and finishes just before exact age 10. The vertical grid lines mark out periods. The period "2005-2010", for instance, starts on 1 January 2005 and ends just before 1 January 2010. The diagonal lines mark out cohorts. The cohort "2000-2005", for instance, starts at the bottom left of the diagram, and ends at the top right, at which point the date is 1 January 2025 and cohort members are all in the age group 20-24.
date <- as.Date(c("2020-12-07", "2022-05-30"))
dob <- as.Date(c("2001-07-22", "2006-03-14"))
age <- c(19 + 4.5/12, 16 + 2.5/12)
breaks_time <- seq.Date(from = as.Date("2000-01-01"), to = as.Date("2025-01-01"), by = "5 years")
breaks_age <- seq.int(from = 0L, to = 25L, by = 5L)
year <- as.integer(format(breaks_time, "%Y"))
labels_time <- demprep:::make_labels_period_custom(year, include_na = FALSE)
labels_age <- demprep:::make_labels_age(breaks = breaks_age, open_last = FALSE, include_na = FALSE)
demprep:::plot_date_to_age_triangle(date = date, dob = dob, unit = "year",
                                    breaks_time = breaks_time, breaks_age = breaks_age,
                                    labels_time = labels_time, labels_age = labels_age,
                                    show_vert = TRUE, show_diag = TRUE, show_months = FALSE)
text(x = date, y = age, labels = expression(italic(A), italic(B)), pos = 4)
The two black diagonal lines in the diagram above are "life lines". They depict the lives of individuals A and B from the time of their births, in periods 2000-2005 and 2005-2010, to the time of their deaths, in 2020-2025.
Individuals A and B died in the same period (2020-2025). They also belonged to the same age group (15-19) when they died. However, they belonged to two different cohorts: 2000-2005 versus 2005-2010. If all we knew about the deaths of A and B was the age group and period, then we would not be able to tell whether the deaths belonged to cohort 2000-2005 or to cohort 2005-2010.
Table: Assigning deaths to cohorts
| Individual | Age group | Period    | Cohort |
|:----------:|:---------:|:---------:|:------:|
| A          | 15-19     | 2020-2025 | ?      |
| B          | 15-19     | 2020-2025 | ?      |
This uncertainty about cohort membership is an example of a more general phenomenon. Information on the age group and period of an event only allows us to narrow the choice of cohorts down to two. To resolve the remaining uncertainty, we need one more piece of information. One way of encoding this information is through "Lexis triangles".
Consider again the Lexis diagram above. Lexis triangles are formed by the intersection of the horizontal, vertical, and diagonal lines. The triangles below the diagonal lines are known as lower Lexis triangles, and the triangles above the diagonal lines are known as upper Lexis triangles. The death of individual A belongs to an upper Lexis triangle, while the death of individual B belongs to a lower Lexis triangle. Events in an upper triangle belong to the earlier of the two possible cohorts, and events in a lower triangle belong to the later of the two cohorts.
Table: Assigning deaths to cohorts (with Lexis triangles)
| Individual | Age group | Period    | Lexis triangle | Cohort    |
|:----------:|:---------:|:---------:|:--------------:|:---------:|
| A          | 15-19     | 2020-2025 | Upper          | 2000-2005 |
| B          | 15-19     | 2020-2025 | Lower          | 2005-2010 |
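The way a Lexis triangle pins down the cohort can be sketched in base R. The helper `cohort_from_triangle` below is hypothetical and is not a demprep function. It assumes that age groups, periods, and cohorts all have the same width, that periods start on 1 January, and that labels follow the "[lower limit]-[upper limit]" style. The example uses a different age group and period from the ones in the diagram.

```r
# Sketch only: `cohort_from_triangle` is a hypothetical helper, not part of demprep.
cohort_from_triangle <- function(age_start, period_start, triangle, width = 5) {
  # an upper triangle belongs to the earlier cohort, a lower triangle to the later one
  cohort_start <- period_start - age_start - ifelse(triangle == "Upper", width, 0)
  paste(cohort_start, cohort_start + width, sep = "-")
}

# events at ages 30-34 during the period 2010-2015
cohorts <- cohort_from_triangle(age_start = 30, period_start = 2010,
                                triangle = c("Upper", "Lower"))
cohorts
```

The upper triangle maps to cohort "1975-1980" and the lower triangle to cohort "1980-1985".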
One point to note about age groups, periods, and cohorts is that, to accommodate all three within the same dataset, age, period, and cohort all need to be measured using intervals of the same length. In the Lexis diagram above, for instance, age groups, periods, and cohorts all have lengths of 5 years. Demographic data do not always come in this form. It is, for instance, common to have data with 5-year age groups and 1-year periods.
The table below summarises the age group, period, and cohort labels produced by demprep. Dashes indicate combinations of unit, type, and grouping that demprep does not cater to.
Table: Examples of demprep labels for age groups, periods, and cohorts
| Unit    | Type       | Age group | Period      | Cohort      |
| :------ | :--------- | :-------: | :---------: | :---------: |
| year    | single     | "5"       | "2020"      | "2020"      |
| year    | multiple   | "5-9"     | "2025-2030" | "2025-2030" |
| year    | open left  | -         | -           | "<2020"     |
| year    | open right | "100+"    | -           | -           |
| quarter | single     | "20"      | "2020 Q1"   | "2020 Q1"   |
| quarter | multiple   | -         | -           | -           |
| quarter | open left  | -         | -           | "<2020 Q1"  |
| quarter | open right | "400+"    | -           | -           |
| month   | single     | "60"      | "2020 Jan"  | "2020 Jan"  |
| month   | multiple   | -         | -           | -           |
| month   | open left  | -         | -           | "<2020 Jan" |
| month   | open right | "1200+"   | -           | -           |
demprep creates labels for three different units: years, quarters, and months. demprep allows age groups to be open on the right, and allows cohorts to be open on the left. demprep allows labels composed of multiple years, but not labels composed of multiple quarters or months.
Producers of demographic data almost all follow a rule that multi-year age groups are labelled as "[lower limit]-[upper limit minus one]", so that, for instance, the interval between exact ages 5 and 10 is labelled "5-9". Producers of demographic data are much less consistent in the way they label multi-year periods and cohorts. A majority use a "[lower limit]-[upper limit]" format, so that the interval between 1 January 2020 and 1 January 2025 is labelled "2020-2025". Some, however, use a "[lower limit]-[upper limit minus one]" format, so that the same period is labelled "2020-2024". demprep uses "[lower limit]-[upper limit]" labels.
All standard labels for periods and cohorts denominated in years, in demprep and elsewhere, are ambiguous. One reason for the ambiguity is the fact that periods and cohorts do not always start on 1 January. In official statistics, for instance, it is common for periods to start on 1 July and end on 30 June. Labels such as "2015" or "2001-2006" do not distinguish between these possibilities.
A second type of ambiguity is specific to one-year periods and cohorts. A one-year period or cohort that starts on a date other than 1 January overlaps with two calendar years. For instance, a one-year period that starts on 1 July 2020 and ends on 30 June 2021 belongs partly to calendar year 2020 and partly to calendar year 2021. Some data producers label one-year periods and cohorts according to the calendar year at the start of the period or cohort, and others label them according to the calendar year at the end. Some data producers, for instance, would label a period starting on 1 July 2020 and ending on 30 June 2021 as "2020" and others would label it as "2021".
Consider, for instance, the label "2020". Restricting ourselves to periods that start on the first of the month, the label "2020" can be interpreted in 23 different ways:
| Start month | Uses calendar year at start | Uses calendar year at end |
| :---------- | :-------------------------: | :-----------------------: |
| January     | 1 Jan 2020 - 31 Dec 2020    | 1 Jan 2020 - 31 Dec 2020  |
| February    | 1 Feb 2020 - 31 Jan 2021    | 1 Feb 2019 - 31 Jan 2020  |
| March       | 1 Mar 2020 - 28 Feb 2021    | 1 Mar 2019 - 29 Feb 2020  |
| April       | 1 Apr 2020 - 31 Mar 2021    | 1 Apr 2019 - 31 Mar 2020  |
| May         | 1 May 2020 - 30 Apr 2021    | 1 May 2019 - 30 Apr 2020  |
| June        | 1 Jun 2020 - 31 May 2021    | 1 Jun 2019 - 31 May 2020  |
| July        | 1 Jul 2020 - 30 Jun 2021    | 1 Jul 2019 - 30 Jun 2020  |
| August      | 1 Aug 2020 - 31 Jul 2021    | 1 Aug 2019 - 31 Jul 2020  |
| September   | 1 Sep 2020 - 31 Aug 2021    | 1 Sep 2019 - 31 Aug 2020  |
| October     | 1 Oct 2020 - 30 Sep 2021    | 1 Oct 2019 - 30 Sep 2020  |
| November    | 1 Nov 2020 - 31 Oct 2021    | 1 Nov 2019 - 31 Oct 2020  |
| December    | 1 Dec 2020 - 30 Nov 2021    | 1 Dec 2019 - 30 Nov 2020  |
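The count of 23 can be checked with a short base R sketch that enumerates the date ranges in the table. The helper `first_of_month` is hypothetical and exists only for this illustration.

```r
label_year <- 2020
start_month <- 1:12
# hypothetical helper: first day of a given month
first_of_month <- function(year, month) as.Date(sprintf("%d-%02d-01", year, month))

# start years under the two conventions: label = year at start, label = year at end
year_if_start <- rep(label_year, 12)
year_if_end <- ifelse(start_month == 1, label_year, label_year - 1)
start_year <- c(year_if_start, year_if_end)
month <- rep(start_month, times = 2)

start <- first_of_month(start_year, month)
# each period ends the day before the same date one year later
end <- first_of_month(start_year + 1, month) - 1
interpretations <- unique(data.frame(start, end))
nrow(interpretations)
```

There are 24 rows before duplicates are dropped. The two January entries coincide, which leaves 23 distinct interpretations.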
In principle, the best response to this ambiguity would be to use alternative labels that were unambiguous. demprep does in fact provide a way of producing unambiguous labels, via the as_date_range functions. Most of the time, however, it is easier to work within existing conventions, which is what most functions in demprep do.
Data preparation using demprep typically proceeds as follows:
include_graphics("workflow.png", auto_pdf = TRUE)
0. Read in data and do initial tidying using non-demprep functions
Use functions from base R or elsewhere to read the data into a data frame, put any date variables into a "year-month-day" format, and tidy variables not related to age, period, or cohort.
1. Create or clean age, period, and cohort variables
If the original data contains dates, use functions such as date_to_age_year
or date_to_period_multi
to construct age, period, and cohort labels. If the original data already contains age, period, and cohort labels, use functions such as clean_age
to convert them to demprep formats.
2. Ensure labels are consistent and complete
Process any age, period, and cohort labels using functions such as format_age_year
and format_period_multi
, to make sure the labels are consistent and have all the required levels.
There are, however, a few common data preparation tasks that do not fit neatly into the workflow above, and that can be done using demprep functions:
Imputing dates
When only partial information on dates is provided, the missing information can be imputed using functions impute_date and impute_dob.
Creating unambiguous labels
Create non-standard but unambiguous labels for periods or cohorts.
Switching the labeling of one-year periods and cohorts
Convert one-year period or cohort labels from using calendar-year-at-start to calendar-year-at-end, or vice versa.
date_to functions
Suppose that we have some raw, individual-level data on dates of birth and dates of death:
deaths <- data.frame(name = c("Anwar", "Baptiste", "Candice"),
                     date_birth = c("2014-02-17", "2012-01-10", "2019-04-29"),
                     date_death = c("2019-10-11", "2020-02-27", "2020-08-01"))
deaths
We want to calculate the period when people were born, the period when they died, and their age at death. We can do calculations like these using the date_to
functions:
| Age groups          | Periods                | Cohorts                | Lexis triangles          |
| :------------------ | :--------------------- | :--------------------- | :----------------------- |
| date_to_age_year    | date_to_period_year    | date_to_cohort_year    | date_to_triangle_year    |
| date_to_age_quarter | date_to_period_quarter | date_to_cohort_quarter | date_to_triangle_quarter |
| date_to_age_month   | date_to_period_month   | date_to_cohort_month   | date_to_triangle_month   |
Before running the functions, we need to load package demprep and (for the pipe %>%
and various data manipulation functions) package dplyr.
library(demprep)
library(dplyr)
The date_to_age
functions calculate ages denominated in years, quarters, and months.
deaths %>% mutate(age_years = date_to_age_year(date = date_death, dob = date_birth), age_quarters = date_to_age_quarter(date = date_death, dob = date_birth), age_months = date_to_age_month(date = date_death, dob = date_birth))
Age is calculated using the approach discussed in Section 2.1, based on months and days-of-month attained (as opposed to the total number of days elapsed). The date and dob (short for date-of-birth) arguments in the date_to functions can be "Date" vectors (as described in the Dates help page) or anything that function as.Date can automatically convert to a "Date" vector. A character vector can be safely converted to a date vector if it uses a "year-month-day" format, as in "2025-03-01".
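When dates arrive in some other layout, one option is to convert them with base R before passing them on. The sketch below uses only `as.Date`. The day/month/year input string is an invented example.

```r
# a "year-month-day" string converts without extra arguments
iso <- as.Date("2025-03-01")
# other layouts need an explicit format string
dmy <- as.Date("01/03/2025", format = "%d/%m/%Y")
# both describe the same day
iso == dmy
format(dmy)
```

Supplying `format` explicitly avoids relying on `as.Date` to guess the layout.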
By default, date_to_period_year
creates periods that start on 1 January,
deaths %>% select(-date_birth) %>% mutate(year_jan = date_to_period_year(date = date_death))
which we might depict as
demprep:::plot_date_to_period_year(date = deaths$date_death)
However, other start dates are allowed, provided the start dates are the first day of the month,
deaths %>% select(-date_birth) %>% mutate(year_apr = date_to_period_year(date = date_death, month_start = "Apr"))
demprep:::plot_date_to_period_year(date = deaths$date_death, month_start = "Apr")
As discussed in Section 2.3, some data producers label single-year periods by the calendar year at the start of the period, and others by the calendar year at the end. date_to_period_year
defaults to using calendar year at the start. The default can be overridden by setting label_year_start
to FALSE
.
deaths %>% select(-date_birth) %>% mutate(year_start = date_to_period_year(date = date_death, month_start = "Apr"), year_end = date_to_period_year(date = date_death, month_start = "Apr", label_year_start = FALSE))
Labels for quarters and months are simpler than labels for years. Function date_to_period_quarter
implements a single set of start dates (1 January, 1 April, 1 July, and 1 October), and a single labeling style.
deaths %>% select(-date_birth) %>% mutate(quarter = date_to_period_quarter(date_death))
date_to_period_month
is similarly simple/inflexible.
deaths %>% select(-date_birth) %>% mutate(month = date_to_period_month(date_death))
The date_to_cohort
functions work like their date_to_period
equivalents,
deaths %>% select(-date_death) %>% mutate(cohort = date_to_cohort_year(date = date_birth, month_start = "Apr", label_year_start = FALSE))
demprep:::plot_date_to_cohort_year(date = deaths$date_birth)
To calculate Lexis triangles, we need dates of events and dates of birth,
deaths %>% mutate(triangle = date_to_triangle_year(date = date_death, dob = date_birth))
demprep:::plot_date_to_triangle_year(date = deaths$date_death, dob = deaths$date_birth)
Assigning events to Lexis triangles with real data, where we know dates but not precise times, and where months have different lengths, involves some tricky edge cases. The date_to_triangle
functions resolve these edge cases by looking at the date that a person enters a new age group during the period in question. If the date when the person enters the new age group is greater than the date when the event occurs, then the event is allocated to an upper Lexis triangle. If the date when the person enters the new age group is equal to or less than the date when the event occurs, then the event is allocated to a lower Lexis triangle.
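For the simplest case, with one-year age groups and one-year periods starting on 1 January, the rule can be sketched in base R as follows. The function `triangle_year_sketch` is hypothetical and is not demprep's implementation. It ignores open age groups and other start months, and it treats a person born on 29 February as reaching their birthday on 1 March in non-leap years, in line with the age rule described earlier.

```r
# Sketch only: `triangle_year_sketch` is a hypothetical helper, not part of demprep.
triangle_year_sketch <- function(date, dob) {
  date <- as.POSIXlt(as.Date(date))
  dob <- as.POSIXlt(as.Date(dob))
  # encode month and day-of-month as a single number, eg 17 February -> 217
  md_event <- 100 * (date$mon + 1) + date$mday
  md_birth <- 100 * (dob$mon + 1) + dob$mday
  # birthday on or before the event -> lower triangle; birthday still to come -> upper
  ifelse(md_event >= md_birth, "Lower", "Upper")
}

triangle_year_sketch(date = c("2020-06-30", "2020-07-01"), dob = "2000-07-01")
triangle_year_sketch(date = c("2021-02-28", "2021-03-01"), dob = "2000-02-29")
```

Both calls return "Upper" followed by "Lower": the event moves to the lower triangle on the day the new age is reached.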
clean functions
If the input data for an analysis come from published sources, then they probably already contain labels for age groups, periods, and cohorts rather than precise dates. These labels may, however, require some modification before they match the formats expected by the dem packages. demprep contains a number of functions to help with the cleaning process,
| Age groups   | Periods         | Cohorts         |
| :----------- | :-------------- | :-------------- |
| clean_age    | clean_period    | clean_cohort    |
| clean_age_df | clean_period_df | clean_cohort_df |
| is_valid_age | is_valid_period | is_valid_cohort |
Functions clean_age
, clean_period
, and clean_cohort
try to parse vectors of labels and, where necessary, convert them to dem formats. If clean_age
, clean_period
, and clean_cohort
encounter a label they cannot parse, they leave the label untouched.
x <- c("20 years", "80 and over", "young", "20-24")
clean_age(x)
Function clean_age
assumes that labels consisting entirely of multiples of 5 refer to 5-year age groups,
x <- seq(0, 60, 5)
x
clean_age(x)
It also assumes that labels consisting of the numbers 0, 1, 5, 10, ..., come from a life table
x <- c(0, 1, seq(60, 5, -5))
x
clean_age(x)
Functions clean_cohort
and clean_period
are identical to each other, except that clean_cohort
accepts intervals that are open on the left,
x <- c("Q1 2020", "1922", "2010-2025", "before 2020")
clean_cohort(x)
while clean_period
does not.
clean_period(x)
Functions clean_age_df
, clean_period_df
, and clean_cohort_df
produce data frames describing how functions clean_age
, clean_period
, and clean_cohort
interpret a set of labels,
x <- c("2021", "2022-2025", "2021", "q2 2020")
clean_period_df(x)
Functions is_valid_age
, is_valid_period
, and is_valid_cohort
can be used to check whether individual labels are already in a valid demprep format,
x <- c("2021", "the 1960s", "1960-1970")
is_valid_period(x)
Some version of
stopifnot(all(is_valid_period(x)))
may be helpful for catching problems.
format functions
Even after all dates in a dataset have been turned into age, period, and cohort labels, and all labels have been converted to demprep style, further processing of the labels may still be useful. In particular, it may be useful to consolidate labels, and fill in gaps. We may wish to turn
| age | count|
|-------+------|
| 0-2 | 4 |
| 4 | 5 |
| 10-12 | 3 |
into
| age | count|
|-------+------|
| 0-4 | 4 |
| 0-4 | 5 |
| 10-14 | 3 |
We may even wish to go further, and have some way of capturing the fact that there is an age group (5-9) between 0-4 and 10-14, even if this dataset does not happen to contain any observations from it.
The format
functions take vectors of age, period, cohort, and Lexis triangle labels that follow demprep conventions, and return factors where the intervals have standardised lengths, and where all intermediate categories, including ones that do not appear in the data, are included.
| Age groups         | Periods               | Cohorts               | Lexis triangles         |
| :----------------- | :-------------------- | :-------------------- | :---------------------- |
| format_age_year    | format_period_year    | format_cohort_year    | format_triangle_year    |
| format_age_multi   | format_period_multi   | format_cohort_multi   | format_triangle_multi   |
| format_age_custom  | format_period_custom  | format_cohort_custom  | \<none>                 |
| format_age_lifetab | \<none>               | \<none>               | \<none>                 |
| format_age_births  | \<none>               | \<none>               | format_triangle_births  |
| format_age_quarter | format_period_quarter | format_cohort_quarter | format_triangle_quarter |
| format_age_month   | format_period_month   | format_cohort_month   | format_triangle_month   |
Functions ending in multi create multi-year intervals, such as 5-year or 10-year periods. Functions ending in custom also create multi-year intervals, but these intervals do not all have to have the same length. Function format_age_lifetab creates special age groups for life tables, and function format_age_births creates age groups for birth counts or rates.
The format
functions all return factors that contain intermediate values, including values that are not represented in the data. (Factors are R's way of representing categorical variables. See the R base function factor
and also the tidyverse package forcats.) Consider, for instance, the vector x
,
x <- c("0-4", "10-14")
which omits the value "5-9"
. The function format_age_multi
creates a factor with levels that include 5-9
.
format_age_multi(x)
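The idea of carrying unobserved categories in the levels of a factor can be imitated in base R. In the sketch below the range of ages (0 to 14) is fixed by hand purely for illustration, whereas format_age_multi applies its own defaults for the lower and upper limits.

```r
x <- c("0-4", "10-14")
# build every 5-year label from 0-4 to 10-14, including the missing "5-9"
lower <- seq(from = 0, to = 10, by = 5)
all_levels <- paste(lower, lower + 4, sep = "-")
x_factor <- factor(x, levels = all_levels)
levels(x_factor)
table(x_factor)
```

The level "5-9" is present, and shows up in the tabulation with a count of zero.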
If argument x
has an NA
, then the levels of the factor created by a format
function will also have NA
. This behavior is different from the default behavior for function factor
, which is to silently drop NA
s. The philosophy of the format
functions is that it is better to explicitly deal with NA
s.
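The base R behaviour being contrasted here can be seen directly. The sketch uses only `factor` and `addNA` from base R. It illustrates the difference in levels, and makes no claim about how the format functions are implemented.

```r
x <- c("0-4", NA, "10-14")
# default behaviour: NA is silently left out of the levels
levels_default <- levels(factor(x))
# addNA() keeps NA as an explicit level
levels_with_na <- levels(addNA(factor(x)))
levels_default
levels_with_na
```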
By default, format_age
functions create age groups between 0 and 100+,
x <- c("35-38", "50-54", "77")
format_age_multi(x)
Alternative upper and lower limits can be obtained using arguments break_min
and break_max
,
format_age_multi(x, break_min = 25, break_max = 90)
Setting break_min
and break_max
to NULL
allows the data to determine the limits.
format_age_multi(x, break_min = NULL, break_max = NULL)
By default, the final age group is open (ie has no upper limit), but this can be changed using the open_last
argument,
format_age_multi(x, open_last = FALSE)
The default width for multi-year age groups is 5. Alternative values are obtained using the width
argument,
format_age_multi(x, width = 20)
Age groups of one year, one quarter, and one month can be generated using functions format_age_year
, format_age_quarter
, and format_age_month
. Age groups with arbitrary widths (measured in years) can be generated using function format_age_custom
,
format_age_custom(x, breaks = c(15, 40, 80))
Function format_age_lifetab
creates the special age groups needed for an "abridged" life table (ie a life table with age groups "0"
, "1-4"
, "5-9"
, "10-14"
, "15-19"
, etc.).
format_age_lifetab(x)
Function format_age_births
is designed for tabulations of births,
x <- c("22", "30-33", "18", "40-44")
format_age_births(x)
and can be used to recode ages that fall outside the expected range,
x <- c("10", "30-33")
format_age_births(x, recode_up = TRUE)
In contrast to the format_age
functions, the format_period
functions do not have break_min
and break_max
arguments. Instead, the range of the labels is always determined by the data.
x <- c(2018, 2015, 2021)
format_period_year(x)
By default, format_period_multi
creates periods that are aligned to the year 2000,
df <- data.frame(x = c("2002", "1996", "2027-2028"))
df %>%
  mutate(width5 = format_period_multi(x),
         width7 = format_period_multi(x, width = 7))
Periods that align with different years can be obtained by varying the origin
argument,
df %>% mutate(period = format_period_multi(x, origin = 2021))
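One way to picture what alignment means is the arithmetic sketch below. It assumes that "aligned to the year `origin`" means that period boundaries fall at `origin` plus or minus whole multiples of `width`, and it handles single-year inputs with periods starting on 1 January only. The function `period_multi_sketch` is hypothetical and is not how demprep is implemented.

```r
# Sketch only: `period_multi_sketch` is a hypothetical helper, not part of demprep.
period_multi_sketch <- function(year, width = 5, origin = 2000) {
  # find the largest boundary of the form origin + k * width that is <= year
  start <- origin + width * floor((year - origin) / width)
  paste(start, start + width, sep = "-")
}

period_multi_sketch(c(2002, 1996))
period_multi_sketch(c(2002, 1996), width = 7)
period_multi_sketch(c(2002, 1996), origin = 2021)
```

Under this assumption, the year 2002 falls in "2000-2005" with the defaults, in "2000-2007" with a width of 7, and in "2001-2006" when the origin is 2021.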
Functions format_period_multi
and format_period_custom
sometimes need extra help from users to correctly interpret labels for single-year periods. By default, format_period_multi
and format_period_custom
assume that all periods start on 1 January, and that single-year periods are labelled according to the calendar year at the start. The first assumption can be overridden using the month_start
argument, and the second assumption can be overridden using the label_year_start
argument.
Here is how format_period_multi and format_period_custom interpret the labels "2050" and "2050-2055" when month_start is "January" and label_year_start is TRUE:

| Label       | Interpretation                     |
|:------------|:-----------------------------------|
| "2050"      | 1 January 2050 to 31 December 2050 |
| "2050-2055" | 1 January 2050 to 31 December 2054 |
Here is how format_period_multi and format_period_custom interpret the labels "2050" and "2050-2055" when month_start is "July" and label_year_start is FALSE:

| Label       | Interpretation              |
|:------------|:----------------------------|
| "2050"      | 1 July 2049 to 30 June 2050 |
| "2050-2055" | 1 July 2050 to 30 June 2055 |
Under the default settings, "2050"
belongs to the period "2050-2055"
. Under the alternative settings, it does not. Period labels are tricky.
The format_cohort
functions work exactly like the equivalent format_period
functions, except that the format_cohort
functions permit intervals to be open on the left.
x <- c(1993, 1992, 1984)
format_cohort_year(x, break_min = 1990)
A birth cohort with no lower limit is equivalent to an age group with no upper limit. At the end of 2019, for instance, the cohort "<1920" is equivalent to the age group "100+".
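The equivalence can be checked by computing the age, at the end of 2019, of the youngest possible member of the cohort born before 1920. The calculation below is a base R sketch using age in completed years.

```r
dob <- as.POSIXlt(as.Date("1919-12-31"))   # youngest possible member of the cohort
date <- as.POSIXlt(as.Date("2019-12-31"))  # end of 2019
# has the birthday been reached in the year of the reference date?
birthday_reached <- (date$mon > dob$mon) |
  (date$mon == dob$mon & date$mday >= dob$mday)
age_years <- date$year - dob$year - as.integer(!birthday_reached)
age_years
```

The youngest member is aged 100, so every member of the cohort belongs to the open age group.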
The format_triangle
functions produce Lexis triangles to accompany the age groups and periods produced by format_age
and format_period
functions. There is, however, an extra complication when reformatting Lexis triangles. Lexis triangle labels "Lower" and "Upper" can only be interpreted in combination with the relevant age groups and periods. The format_triangle
functions therefore need information on the age groups and periods that defined the original Lexis triangles. This information is supplied via the age
and period
arguments.
Sometimes date information is incomplete, as when a data source gives years and months of birth, but not days. Functions impute_date
and impute_dob
can be used to impute values for the missing variables.
The imputation is random, so, for reproducibility, we set the random seed.
set.seed(0)
To impute dates when we know the year and month, we use
impute_date(year = c(2000, 2005, 2003), month = c("Feb", "Nov", "Apr"))
To impute dates of birth when we know ages at later events, we use
impute_dob(date = c("2021-03-23", "2021-02-13", "2020-04-25"), age_years = c(3, 1, 0))
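To give a feel for what random imputation involves, here is a base R sketch that fills in a missing day-of-month. It assumes that days are drawn uniformly within the month, which may or may not match what demprep does. The function `impute_day_sketch` is hypothetical, and takes months as numbers rather than abbreviations.

```r
# Sketch only: `impute_day_sketch` is a hypothetical helper, not part of demprep.
impute_day_sketch <- function(year, month) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  # first day of the following month (December rolls over to January)
  next_first <- as.Date(sprintf("%d-%02d-01", year + (month == 12), month %% 12 + 1))
  n_days <- as.integer(next_first - first)
  # draw an offset of 0, 1, ..., n_days - 1 from the first of the month
  first + floor(runif(length(first)) * n_days)
}

set.seed(0)
imputed <- impute_day_sketch(year = c(2000, 2005, 2003), month = c(2, 11, 4))
imputed
```

Every imputed date stays within the year and month that were supplied.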
One way of dealing with the ambiguities of standard period and cohort labels is to switch to non-standard labels that are less ambiguous. The functions
|                       |
|-----------------------|
| as_date_range_year    |
| as_date_range_multi   |
| as_date_range_custom  |
| as_date_range_quarter |
| as_date_range_month   |
all convert standard labels into ones that use explicit dates.
By default, month_start is set to "Jan" and label_year_start is set to TRUE,
x <- c("2022", "2028")
as_date_range_year(x)
But these can be changed, to produce alternative translations of the same inputs.
x <- c("2022", "2028")
as_date_range_year(x, month_start = "Mar", label_year_start = FALSE)
Converting to date-range formats is useful when working with multiple data sources, where it can be difficult to keep track of different labeling conventions.
Single-year period or cohort labels sometimes need to be converted from a calendar-year-at-start format to a calendar-year-at-end format, or vice versa. These conversions are confusing, and easy to get wrong. Functions flip_to_start
and flip_to_end
try to make the process a little easier.
x <- c("2001", "2006", "2013")
flip_to_end(x, month_start = "Apr")
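The arithmetic behind the conversion is simple, which is partly why it is easy to get wrong in one direction or the other. The sketch below is a hypothetical base R imitation, not the demprep function. It assumes that a period starting on 1 January starts and ends in the same calendar year, and that any other period ends in the following calendar year. The return type of the real function may differ.

```r
# Sketch only: `flip_to_end_sketch` is a hypothetical helper, not part of demprep.
flip_to_end_sketch <- function(x, month_start = "Jan") {
  year <- as.integer(x)
  # a period starting after January ends in the next calendar year
  if (month_start != "Jan") year <- year + 1L
  as.character(year)
}

flipped <- flip_to_end_sketch(c("2001", "2006", "2013"), month_start = "Apr")
flipped
```

A period starting on 1 April 2001 ends on 31 March 2002, so the label "2001" becomes "2002".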
We look at two examples: one where we start with raw individual-level data, and one where we start with pre-tabulated data.
For reproducibility, we set the random seed.
set.seed(0)
We start with a dataset consisting of dates of birth of children, and dates of birth of the children's fathers,
births <- demprep::icebirths
births %>%
  sample_n(5)
(The dataset was generated from published tabulations from Statistics Iceland using functions impute_date
and impute_dob
.)
We want to create counts of births by period of birth, age of father, and cohort of father. We want the periods, age groups, and cohorts to have lengths of 5 years. Periods and cohorts start on 1 January, and align with year 2001.
First we use the dates to assign births to single-year periods, age groups, and cohorts,
births <- births %>%
  mutate(period1 = date_to_period_year(date = dob_child),
         age1 = date_to_age_year(date = dob_child, dob = dob_father),
         cohort1 = date_to_cohort_year(date = dob_father))
births %>%
  sample_n(5)
Next we turn the single-year intervals into 5-year intervals.
To create periods and cohorts, we accept the default values for width
and month_start
, but override the default value for origin
,
births <- births %>%
  mutate(period = format_period_multi(period1, origin = 2001),
         cohort = format_cohort_multi(cohort1, origin = 2001))
births %>%
  sample_n(5)
We allow the lower and upper limits for age of father to be set by the data, rather than the default values of 15 and 50, but otherwise accept the defaults for function format_age_births
,
births <- births %>%
  mutate(age = format_age_births(x = age1, break_min = NULL, break_max = NULL))
births %>%
  sample_n(5)
To finish up, we convert from individual-level data to a tabulation,
births <- births %>%
  count(age, period, cohort, name = "count")
births %>%
  sample_n(5)
We start with tabulated data downloaded from the Statistics New Zealand website,
deaths <- demprep::nzdeaths
deaths %>%
  sample_n(5)
We want to clean up the age groups. The original labels are
unique(deaths$age)
We get rid of the "total" category,
deaths <- deaths %>%
  filter(age != "Total all ages")
unique(deaths$age)
We apply function clean_age
,
deaths <- deaths %>%
  mutate(age = clean_age(age))
unique(deaths$age)
We merge the "0"
and "1-4"
age groups, and set the maximum age group to "90+"
.
deaths <- deaths %>%
  mutate(age = format_age_multi(age, break_max = 90))
unique(deaths$age)
After changing the age labels, we have multiple rows with the same combination of age, sex, and year. For instance,
deaths %>% filter(age == "0-4", sex == "Female", year == 2020)
So we consolidate,
deaths <- deaths %>%
  count(age, sex, year, wt = count, name = "count")
deaths %>%
  sample_n(5)
The duplicates are gone.
deaths %>% filter(age == "0-4", sex == "Female", year == 2020)