knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 8)
library(hierarchyUtils)
library(data.table)
library(ggplot2)

Demographics data is often organized into different hierarchical groupings of variables. These variables can either be:

Iran census population counts included in the package are shown below and documented at ?hierarchyUtils::iran_pop. Examples of categorical variables included in this dataset are location, year, sex, source_name, nid and underlying_nid while age_start and age_end are one numeric interval variable.

print(iran_pop)

The hierarchyUtils package provides a variety of utility functions that can be used to work with this kind of data. There are functions included to:

Aggregate ages and both sexes combined

hierarchyUtils::iran_pop provides sex and age specific population counts for censuses between r min(iran_pop$year) and r max(iran_pop$year) in Iran. Each of these censuses includes a different set of most detailed age groups which can make comparisons across censuses more difficult without aggregation to common age groups first.

plot_data <- iran_pop[grepl("Iran", location) & source_name == "DYB"]
plot_data <- plot_data[!(age_start == 999 & age_end == 999)]
ggplot(plot_data) +
  geom_point(aes(x = age_start, y = factor(year))) +
  ylab("Census Year") +
  scale_x_continuous(breaks = seq(0, max(plot_data$age_start, na.rm = T), 10)) +
  labs(title = "Start of each age interval in each Iran census count at the national level",
       subtitle = "Only data from the Demographic Year Book") +
  theme_bw()

By aggregating to a common set of aggregate age groups using hierarchyUtils::agg we can more easily compare population counts over time. One of the benefits of this function is that the input data does not need to be square, meaning the age groups in each census year for example can be different.

Here we can first aggregate the interval variable age which depends on age_start and age_end columns in the input dataset. present_agg_severity = "skip" because some location-years most detailed age groups are equal to the aggregates specified in `agg_age_mapping so we need to include those in the resulting aggregates.

value_cols = "population"
id_cols <- names(iran_pop)[!names(iran_pop) %in% value_cols]

# reassign unknown age groups
iran_pop <- iran_pop[age_start == 999 & age_end == 999, c("age_start", "age_end") := list(NA, NA)]

# aggregate to aggregate age groups
agg_age_mapping <- data.table(
  age_start = c(0, 0, 15, 60),
  age_end = c(Inf, 5, 60, Inf),
  include_NA = c(T, F, F, F)
)
agg_age_mapping
iran_pop_aggs <- hierarchyUtils::agg(
  dt = iran_pop,
  id_cols = id_cols,
  value_cols = value_cols,
  col_stem = "age",
  col_type = "interval",
  mapping = agg_age_mapping,
  # this is specified since some location-years most detailed age groups are
  # equal to the aggregates specified in `agg_age_mapping`
  present_agg_severity = "skip"
)

Then we can aggregate the categorical variable sex.

# aggregate to both sexes combined
sex_mapping <- data.table(parent = "both", child = c("female", "male"))
sex_mapping
iran_pop_aggs_both_sexes <- hierarchyUtils::agg(
  dt = iran_pop_aggs,
  id_cols = id_cols,
  value_cols = value_cols,
  col_stem = "sex",
  col_type = "categorical",
  mapping = sex_mapping
)
iran_pop_aggs <- rbind(iran_pop_aggs, iran_pop_aggs_both_sexes, use.names = T)

The output dataset returns all the new aggregate rows made as can be seen below with the same columns as what was included originally in the dataset.

iran_pop_aggs[grepl("Iran", location) & year == max(year) & age_start == 0 & age_end == Inf]

Aggregate locations

There are also some years in the dataset where data is only available at the province level and has not yet been aggregated to the national level.

One difficulty with this data is that the province boundaries have changed over time in Iran so sometimes data is only available for historical locations (locations that existed in previous years but that have since been split into multiple locations).

A mapping of how province boundaries have changed from the 1956 census to present day is included in the package and documented at ?hierarchyUtils::iran_mapping.

Tree data structures are really useful for representing hierarchical data which is exactly the type of data that is being manipulated when aggregating or scaling data to different levels of a hierarchy. The mappings used to define the hierarchical structure of categorical data in hierarchyUtils must have a parent and child column defining relationships between nodes in the tree (like in hierarchyUtils::iran_mapping).

The root node of the tree shown below is "Iran" at the national level. The leaf nodes of the tree are the present day provinces and all other nodes are the historical provinces. Colored in green is everywhere data is directly available for each location, blue are locations where aggregations are possible and red is where data is missing (or aggregation is not possible). Here we can see in 2006 that a historical province called "Tehran" included two present day provinces, "Tehran" and "Alborz".

# this function is used internally by `hierarchyUtils::agg` but is available to help
# visualize why some aggregates can't be made, what data is missing, etc.
agg_tree <- hierarchyUtils::create_agg_tree(
  mapping = iran_mapping,
  exists = iran_pop_aggs[year == 2006 & source_name != "DYB", unique(location)],
  col_type = "categorical")
hierarchyUtils::vis_tree(agg_tree)

Now we can make location aggregates but need to use two arguments that are helpful when aggregating non-square data. present_agg_severity = 'none' because in some source-years Iran is already included so the function needs to drop the aggregate and recalculate it. missing_dt_severity = "none" because in some source-years we don't have the present day provinces and its okay to just make the aggregates that are possible given the data that is available.

iran_pop_aggs_location <- hierarchyUtils::agg(
  dt = iran_pop_aggs,
  id_cols = id_cols,
  value_cols = value_cols,
  col_stem = "location",
  col_type = "categorical",
  mapping = iran_mapping,
  present_agg_severity = "none",
  missing_dt_severity = "none"
)
iran_pop_aggs <- rbind(iran_pop_aggs, iran_pop_aggs_location)

The population for these aggregates age groups at the national level can now be plotted over time.

# create a column called "age" to describe each age interval
hierarchyUtils::gen_name(iran_pop_aggs, col_stem = "age")
iran_pop_aggs[age_name == "0 plus", age_name := "All-ages"]

ggplot(iran_pop_aggs[grepl("Iran", location)],
       aes(x = year, y = population, colour = source_name, shape = source_name)) +
  geom_point(alpha = 0.65, size = 3) +
  facet_grid(age_name ~ sex, scales = "free_y") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Iran census count at the national level for aggregate age groups") +
  theme_bw() +
  theme(legend.position = "bottom")


ihmeuw-demographics/hierarchyUtils documentation built on June 20, 2024, 7:18 a.m.