In StatisticsNZ/demest: Bayesian Demographic Estimation and Forecasting

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

Introduction

In this vignette, we use function estimateCounts from package demest to estimate the population of New Zealand, by age, sex, and region, over the period 2008-2014, using simulated administrative data. The example is deliberately simplified--a real example would have more complicated data and models.

Preparing and inspecting data

For the purposes of this example, we treat the official estimates of population, in array nz.popn, as the true population We use the various simulated datasets generated from nz.popn as data to try to recover the true population. The datasets are as follows.

| Object in demdata | Description | |--------------------:|:------------| | nz.admin.nat | Population by age, sex, and year, 2008-2014 at the national level. Highly accurate. | | nz.admin.health | Enrolments by age, sex, region, and year, 2008-2014. Coverage varies by age. | | nz.admin.school | School enrolments by age, sex, region, and year, ages 5-14. Excellent proxy for population at these ages .| | nz.admin.survey | Survey covering 5\% of the national population in 2010. |

The data are in package demdata. The first step is to obtain the datasets from the package demdata. The datasets are in the form of arrays.

sim.popn.true  <- demdata::nz.popn.ta
dimnames(sim.popn.true  )$region <- paste("Region", 1:68)
nat <- demdata::sim.admin.nat
health <- demdata::sim.admin.health
school <- demdata::sim.admin.school
survey <- demdata::sim.admin.survey

Next we turn the arrays into Counts objects. Since it is ambiguous whether the 'years' in the objects refer to points in time or to periods, we have to supply dimscales for them explicitly. The function to create Counts objects is in package dembase. But since dembase is a dependency for package demest and we will be using demest later, we just load that.

library(demest)
sim.popn.true  <- Counts(sim.popn.true ,
                         dimscales = c(year = "Points"))
nat <- Counts(nat,
              dimscales = c(year = "Points"))
health <- Counts(health,
                 dimscales = c(year = "Points"))
school <- Counts(school,
                 dimscales = c(year = "Points"))
survey <- Counts(survey,
                 dimscales = c(year = "Points"))

The easiest way to get a feel for the data is to graph it:

plot(sim.popn.true )
plot(nat)
plot(health)
plot(school)
plot(survey)

Setting up the model

We begin by specifying an initial model for the underlying population. The model is simple, with the only interactions being between age and sex and between age and region.

popn.mod <- Model(y ~ Poisson(mean ~ age * sex + age * region + year),
                  region ~ Exch(error = Error(robust = TRUE)))
popn.mod

Next we specify data models and put them in a list. As we will see later, function estimateCounts will use the names of the responses to match data models to datasets.

nat.mod <- Model(nat ~ PoissonBinomial(prob = 0.98))
health.mod <- Model(health ~ Poisson(mean ~ age))
school.mod <- Model(school ~ PoissonBinomial(prob = 0.96))
survey.mod <- Model(survey ~ Binomial(mean ~ 1))
observation <- list(nat.mod, health.mod, school.mod, survey.mod)

We generate some starting values for the population estimates,

y <- Counts(array(as.integer(health * 1.2),
                  dim = dim(health),
                  dimnames = dimnames(health)),
            dimscales = c(year = "Points"))

put the datasest in a named list,

datasets <- list(nat = nat, health = health, school = school, survey = survey)

and generate a filename

filename <- tempfile()

Running the model

We can now run the model. We will only do a few iterations, to see what happens.

set.seed(1)  ## for replicability
estimateCounts(popn.mod,
               y = y ,
               observation = observation,
               datasets = datasets,
               filename = filename,
               nBurnin = 10,
               nSim = 10)
fetchSummary(filename)
showModel(filename)

Nothing has converged, which is to be expected, but the model does run. After some experimentation (not shown here) we modify the jump parameters for the health system and survey data models to values that will give acceptance rates closer to 40\%.

health.mod <- Model(health ~ Poisson(mean ~ age),
                    jump = 0.012)
survey.mod <- Model(survey ~ Binomial(mean ~ 1),
                    jump = 0.018)
observation <- list(nat.mod, health.mod, school.mod, survey.mod)

We then run the model for a long time.

set.seed(1)  ## for replicability
estimateCounts(popn.mod,
               y = y ,
               observation = observation,
               datasets = datasets,
               filename = filename,
               nBurnin = 20000,
               nSim = 20000,
               nThin = 50)
fetchSummary(filename)

Examining the population estimates

To see what has been estimated, we use function listContents,

listContents(filename)

To extract the population estimates, we use

popn.est <- fetch(filename, where = "y")

We take a look at the results for the first year, 2008. The light blue bands show 95\% credible intervals and the dark blue bands show 50\% credible intervals. Since we have the true values, we show them as well, in red.

dplot(~ age | region,
      data = popn.est,
      subarray = year == "2008" & region %in% paste("Region", 1:20),
      scales = list(y = "free"),
      midpoints = "age",
      main = "Population, by age and region, in 2008",
      overlay = list(values = subarray(sim.popn.true , year == "2008"),
                     col = "red",
                     lwd = 2))

The same, for 2014,

dplot(~ age | region,
      data = popn.est,
      subarray = year == "2014" & region %in% paste("Region", 1:20),
      scales = list(y = "free"),
      midpoints = "age",
      main = "Population, by age and region, in 2014",
      overlay = list(values = subarray(sim.popn.true , year == "2014"),
                     col = "red"))

Changes in the Region 1 population over time, females only,

dplot(~ year | age,
      data = popn.est,
      subarray = region == "Region 1" & sex == "Female",
      main = "Region 1 female population, by age and time",
      overlay = list(values = subarray(sim.popn.true , region == "Region 1" & sex == "Female"),
                     col = "red"))

Some numeric estimates

reg1 <- subarray(popn.est, region == "Region 1" & year %in% c("1996", "2005", "2015"))
reg1 <- collapseDimension(reg1, margin = "year")
reg1 <- collapseIterations(reg1, prob = c(0.025, 0.5, 0.975))

Converted to a data.frame

as.data.frame(reg1, direction = "long")

Examining coverage levels

Among other things, the model generates 'coverage ratios', that is, the number of people in the dataset divided by estimates of the true population. We take a look at coverage ratios for the health dataset in 2014. Since we in fact have the true population, we show that too, in red.

coverage.health.est <- fetch(filename, 
                         where = c("observation", "health", "likelihood", "rate"))
coverage.health.true <- health / sim.popn.true 
dplot(~ age | region,
      data = coverage.health.est,
      subarray = year == "2014" & region %in% paste("Region", 1:12),
      weights = sim.popn.true ,
      na.rm = TRUE,
      midpoints = "age",
      main = "Coverage ratios, health data, by age and region, in 2014",
      overlay = list(values = subarray(coverage.health.true, year == "2014"),
                     col = "red"))