knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
Package dembase
consists of data structures and functions for manipulating and examining demographic data. This vignette introduces the main features.
The fundamental data structure of dembase
is the "demographic array". A demographic array is a set of cross-tabulated counts or values, with some metadata. Here is a demographic array holding data on the Indian population (in thousands) in 2010, by age and sex:
library(dembase) ind.popn <- demdata::india.popn # 'india.popn' is an array in package 'demdata' ind.popn <- Counts(ind.popn) ind.popn
The metadata, shown at the top of the array, gives the name, length, and characteristics of each dimension.
A second example is age-specific fertility rates (births per thousand women) for India, in 1991-2010:
ind.fert <- demdata::india.fert ind.fert <- Values(ind.fert) ind.fert
A final example, this time with three dimensions, is deaths per thousand people in Virginia in 1940,
va.rates <- demdata::VADeaths2 va.rates <- Values(va.rates) va.rates
Package dembase
distinguishes between two types of demographic arrays: counts and values. A counts array holds cross-classified data on numbers of people or events. A values array holds all other types of cross-classified data. Often, a values array holds data on the attributes of people or events described by a counts array. The first demographic array above is a counts array, and the second two are values arrays.
The reason for distinguishing between counts and values is that the rules for manipulating them are different. For instance, it makes sense to sum up the number of deaths in each region, but does not make sense to sum up the death rates.
Every dimension of a demographic array has a dimtype and a dimscale. The choices of dimtype, and the associated dimscales, are as follows:
| dimtype | Description | Permitted dimscales |
|:--------|:------------|:---------|
| age
| Age - typically age group | Points
or Intervals
|
| time
| Time - point or period | Points
or Intervals
|
| cohort
| Cohort | Intervals
|
| triangle
| Lexis triangle | Triangles
|
| sex
| Typically biological sex | Sexes
|
| state
| General-purpose categorical dimension | Categories
|
| origin
| Special type of state dimension | Categories
|
| destination
| Paired with origin
| Categories
|
| parent
| Special type of state dimension | Categories
|
| child
| Paired with parent
| Categories
|
| iterations
| For output from simulations | Iterations
|
| quantiles
| Summarising iterations | Quantiles
|
Dimtypes age
, time
, cohort
and triangle
describe the age-time plan of the data. Dimtype sex
is used for biological sex, and allows only two values: female and male. Dimtype state
is used for most stratifying variables, such country of birth. Changes of category, such as moves between regions, are represented using origin
and destination
dimensions. Dimtypes parent
and child
are used for tabulations of parents verus children, such as occupation of mother versus occupation of child. Dimtypes iterations
and quantiles
are typically used to represent uncertainty.
The Indian population data above refer to a point in time, so the year dimension has a Points
dimscale. The fertility data, in contrast, have an Intervals
dimscale.
Demographic arrays are normally created by supplying arrays or tables to functions Counts
and Values
. The examples above all use arrays. In the example below, we apply xtabs
to a data.frame and then call Counts
.
income.df <- demdata::nz.income # a data.frame head(income.df) total.income <- xtabs(income ~ ethnicity + sex, data = income.df) total.income <- Counts(total.income) total.income
To create an array of means, we can use tapply
and Values
.
mean.income <- tapply(income.df$income, INDEX = income.df[c("ethnicity", "sex")], FUN = mean) mean.income <- Values(mean.income) round(mean.income)
Functions Counts
and Values
attempt to infer dimtypes and dimscales from the dimnames attribute of the array or table. It may sometimes be necessary to process the input data---particulary ages and times---to get it into the format that Counts
and Values
expect.
df <- data.frame(age = c("age0_4", "age5_9", "age0_4", "age5_9"), sex = c("f", "f", "m", "m"), count = c(10, 25, 5, 20)) xt <- xtabs(count ~ age + sex, data = df) ## the following throws an error! Counts(xt) ## We need to recode the age column in 'df', ## Here's one way of doing so. df$age <- factor(df$age, levels = c("age0_4", "age5_9"), labels = c("0-4", "5-9")) ## try again xt <- xtabs(count ~ age + sex, data = df) Counts(xt)
When dealing with ages and times, demest
distinguishes between points (eg midnight on 31 December 2016) and intervals (eg the period between midnight 31 December 2015 and midnight 31 December 2016). Points are represented by the "Points"
dimscale, and intervals are represented by the "Intervals"
dimscale. In many cases, Counts
and Values
can guess whether the dimnames for an array or table refer to points or intervals. However, units that are consecutive integers, such as ages 0, 1, 2, or years 2016, 2017, 2018, could in principle describe points or intervals.
In a demographic dataset, "age" almost always means "age group", ie, an interval rather than a point. When functions Counts
and Values
encounter ages measured in single years, they assume that they should apply dimscale "Intervals"
, though they notify the user that they have done so.
rus.births <- demdata::russia.births rus.births.subset.08 <- xtabs(count ~ age + sex, data = rus.births, subset = year == 2008 & age %in% 15:19, drop.unused.levels = TRUE) rus.births.subset.08 Counts(rus.births.subset.08) # 'Counts' issues a message
To avoid the message, or to document the choice of dimscale, use the dimscales
argument,
Counts(rus.births.subset.08, dimscales = c(age = "Intervals")) # no message
In contrast to age, time is often measured using points, so, without further information, it would not be safe to assume that a sequence such as 2001, 2002, 2003 referred to intervals. When confronted with a series of single years, and no information on dimscales, Counts
and Values
do not try to guess, but instead raise a error.
rus.births.subset.12 <- xtabs(count ~ year + sex, data = rus.births, subset = age == 12, drop.unused.levels = TRUE) rus.births.subset.12 Counts(rus.births.subset.12) # 'Counts' raises an error
To supply Counts
or Values
with the information they need, use the dimscales
argument,
Counts(rus.births.subset.12, dimscales = c(year = "Intervals")) # no message
Function as
can be used to convert between counts and values arrays.
va.popn <- demdata::VAPopn va.popn <- Counts(va.popn) va.popn.values <- as(va.popn, "Values") class(va.popn.values)
Counts and values arrays can be turned into ordinary arrays using as
va.popn.array <- as(va.popn, "array") class(va.popn.array)
...or using `as.array'
va.popn.array.2 <- as.array(va.popn) class(va.popn.array.2)
Counts and values arrays can be converted to data.frames via as.data.frame
. We typically want the "long"
version.
va.popn.df <- as.data.frame(va.popn, direction = "long") head(va.popn.df)
The midpoints
argument is sometimes useful.
va.popn.mid <- as.data.frame(va.popn, direction = "long", midpoints = "age") head(va.popn.mid)
Collapsing the dimension of a Counts
object is easy:
ind.popn collapseDimension(ind.popn, dimension = "age")
Collapsing the dimension of a Values
object requires weights:
collapseDimension(va.rates, dimension = "age", weights = va.popn)
Note that va.popn
has a dimension, "color"
, that va.rates
does not:
summary(va.popn) summary(va.rates)
Function collapseDimension
automatically collapses the "color"
dimension of va.popn
before using va.popn
to weight va.rates
.
dembase
also has functions for aggregating categories, iterations, origin-destination dimensions, and intervals. For instance,
collapseIntervals(ind.popn, dimension = "age", width = 10)
demdata
rearranges arrays to make the compatible before performing arithmetic with them. An artificial example:
ind.popn.2 <- subarray(ind.popn, age < 60) ind.popn.2 <- collapseIntervals(ind.popn.2, dimension = "age", breaks = c(10, 25, 50)) ind.popn.2 <- t(ind.popn.2) ind.popn.2 ind.popn - ind.popn.2
Adding or substracting a counts array from a counts array, or multiplying a counts array by a counts array, produces a counts array. All other arithmetic involving both counts and values arrays produces values arrays.
Function subarray
extracts arrays from within a larger array. The function exploits the metadata on the dimensions to allow convenient indexing:
subarray(ind.fert, age > 30) subarray(va.rates, residence == "Urban" & age < 62)
Function subarray
uses expressions and non-standard evaluation. For programming, function slab
may be easier, and/or more reliable.
slab(va.rates, dimension = "residence", elements = "Urban")
When doing arithmetic on pairs of objects, demest
removes categories from one object if these are not found in the other object. In the example below, births
covers more years than females
, and females
has more age groups than births
.
births <- demdata::nz.births.reg popn <- demdata::nz.popn.reg births <- Counts(births, dimscales = c(year = "Intervals")) popn <- Counts(popn, dimscales = c(year = "Intervals")) females <- subarray(popn, sex == "Female") limits(births) limits(females)
When demest
divides births
by females
, it subsets both, so that they cover the same ages and periods. It sends two messages, to let the user know that the subsetting has occurred.
rates <- births / females # message limits(rates)
In production code, it might be better to do the subsetting explicitly.
births.sub <- subarray(births, year > 2005 & year < 2014) females.sub <- subarray(females, age > 15 & age < 45) rates.sub <- births.sub / females.sub # no message limits(rates)
Arrays can be joined together using dbind
:
ind.popn.young <- subarray(ind.popn, age < 40) ind.popn.old <- subarray(ind.popn, age > 40) ind.popn.young ind.popn.old dbind(ind.popn.young, ind.popn.old, along = "age")
dbind
permutes and rearranges where necessary
ind.popn.young <- t(ind.popn.young) ind.popn.young dbind(ind.popn.old, ind.popn.young, along = "age")
Function plot
is a quick way to examine the contents of an array:
plot(va.popn)
Function dplot
produces lattice plots.
dplot(~ age | period, data = ind.fert, midpoints = "age")
It can manipulate data in the background:
dplot(~ age | sex, data = va.rates, weights = va.popn) # collapses 'residence' dimension
To use other graphing functions, such as ggplot2
or xyplot
, collapse the data down to the required dimensions, and then convert to a data.frame. The midpoints
replaces intervals with their midpoints, which can lead to nicer plots.
va.rates.collapsed <- collapseDimension(va.rates, margin = c("age", "sex"), weights = va.popn) va.rates.df <- as.data.frame(va.rates.collapsed, direction = "long", midpoints = "age") lattice::xyplot(value ~ age | sex, data = va.rates.df, type = "b")
Demographic arrays use a simulation-based approach to representing uncertainty. An array can contain multiple versions of the same set of rates or counts, with quantities that are more uncertain varying more across versions Alternative versions are distinguished using a dimension with dimtype "iterations"
.
The values array waikato.tfr
contains 1,000 sets of total fertility rates for the Waikato region of Zealand:
tfr <- demdata::waikato.tfr tfr <- Values(tfr, dimscales = c(year = "Intervals")) summary(tfr) subarray(tfr, iteration <= 5)
It can be helpful to think of an array with an iterations dimension as a collection of arrays, rather than as a single array. Consider what happens when taking the mean of an array with an iterations dimension.
mean.tfr <- mean(tfr) summary(mean.tfr) subarray(mean.tfr, iteration <= 5)
Rather than a single mean, we get 1,000 means: one for each iteration.
The distribution of the 1,000 estimated means can be visualised using a density plot:
density.mean.tfr <- density(mean.tfr) plot(density.mean.tfr)
Summary measures of the variation across the iterations can be constructed using function `collapseIterations':
collapseIterations(tfr)
To obtain different quantiles, supply a prob
argument:
collapseIterations(tfr, prob = c(0.025, 0.5, 0.975))
Although quantiles are the most common way to summarise iterations, collapseIterations
accepts any summary functions, including customised ones,
collapseIterations(tfr, FUN = mean) meanAndCI <- function(x) { mean <- mean(x) sd <- sd(x) lower <- mean - 2*sd upper <- mean + 2*sd c(mean = mean, lower = lower, upper = upper) } collapseIterations(tfr, FUN = meanAndCI)
In practice, we probably don't want to display so many digits,
quant.tfr <- collapseIterations(tfr, prob = c(0.025, 0.5, 0.975)) round(quant.tfr, digits = 2)
Function dplot
automatically displays 2.5%, 25%, 50%, 75%, and 97.5% quantiles,
dplot(~ year, data = tfr)
Alternative quantiles can be specified using the prob
argument,
dplot(~ year, data = tfr, prob = c(0.05, 0.5, 0.95))
Operations on quantiles often do not make sense,
mean(quant.tfr)
Instead, it may be better perform the calculations on the original iterations, and then calculate quantiles
mean.tfr <- mean(tfr) collapseIterations(mean.tfr, prob = c(0.025, 0.5, 0.975))
The biggest feature currently missing from dembase
is classes and methods for demographic accounts. These will be added in stages over 2016-2017.
dplot
dplot
will be rewritten to handle nonstandard evaluation better, and to add some extra features, such as improved plotting of categorical variables, and improved overlays.
dapply
It would be useful to have a version of apply
specialised for demographic arrays.
The demographic literature has many indices for describing demographic data, such as indices measuring population ageing. These will gradually be added to the package.
Manipulating metadata in a way that avoids creating invalid arrays is tricky, but more functions for doing so are needed.
More functions for cleaning data, eg for creating age labels, would be useful.
It's likely that a new dimtype, representing sex, will be added.
A dimscale is needed that knows what to do with measurements such as 2016Q1
, 2016Q2
, 2016Q3
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.