library(knitr)
opts_chunk$set(collapse = TRUE)
The ungroup R package introduces a versatile method for ungrouping histograms (binned count data), assuming that counts are Poisson distributed and that the underlying sequence to be estimated on a fine grid is smooth. The method is based on the composite link model, and estimation is achieved by maximizing a penalized likelihood. Smooth, detailed sequences of counts and rates are thus estimated from the binned counts. Ungrouping binned data can be desirable for many reasons: bins can be too coarse to allow for accurate analysis; comparisons can be hindered when different grouping approaches are used in different histograms; and the last interval is often wide and open-ended and thus covers a lot of information in the tail area. Age-at-death distributions grouped in age classes and abridged life tables are examples of binned data. Because of its modest assumptions, the approach is suitable for many demographic and epidemiological applications. For a detailed description of the method and applications, see @rizzi2015.
The package has two top-level functions, pclm and pclm2D, two auxiliary functions (control.pclm and control.pclm2D), and several generic functions (plot, summary, fitted, residuals). A dataset (ungroup.data) is also provided for testing purposes.
All functions are documented in the standard way, which means that once you load the package using library(ungroup) you can type, for example, ?pclm to see the help file.
# Load the package
library(ungroup)
The PCLM method [@eilers2007] is based on the composite link model [@thompson1981], which extends standard generalized linear models. It implements the idea that the observed counts, interpreted as realizations from Poisson distributions, are indirect observations of a finer (ungrouped) but latent sequence. This latent sequence represents the expected counts on a fine resolution and has to be estimated from the aggregated data. Estimates are obtained by maximizing a penalized likelihood. This maximization is performed efficiently by a version of the iteratively reweighted least-squares algorithm. Optimal values of the smoothing parameter are chosen by minimizing the Bayesian or Akaike Information Criterion [@hastie1990].
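The aggregation step at the heart of the composite link model can be pictured as a matrix product. The sketch below uses base R only and is not the package's internal code; the names C, gamma, breaks and bin, as well as the toy values, are illustrative assumptions. It builds a 0/1 composition matrix that maps a fine grid of single ages onto coarse groups and shows that multiplying a latent fine sequence by it returns the grouped expected counts.

```r
# Illustrative sketch (base R only, not the package's internal code).
# Fine grid: single ages 0, 1, ..., 9; coarse groups: [0,1), [1,5), [5,10).
fine_ages <- 0:9
breaks <- c(0, 1, 5, 10)

# Group index of each fine age (left-closed, right-open intervals)
bin <- findInterval(fine_ages, breaks)

# Composition matrix C: one row per group, one column per fine age;
# C[i, j] = 1 when fine age j belongs to group i, otherwise 0.
C <- outer(seq_len(length(breaks) - 1), bin, FUN = "==") * 1

# A hypothetical latent sequence of expected counts on the fine grid
gamma <- c(10, 4, 3, 2, 1, 1, 1, 2, 3, 4)

# Expected grouped counts are sums of the latent values within each group
mu <- as.vector(C %*% gamma)
mu
```

In the PCLM the direction is reversed: the grouped counts are observed, and the smooth latent sequence is what has to be estimated.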
The following example estimates a smooth age-at-death distribution from grouped death counts. First, we define some grouped data:
# Input data
# x: Age groups
x <- c(0, 1, seq(5, 85, by = 5))
x
# y: Death counts in the age group
y <- c(294, 66, 32, 44, 170, 284, 287, 293, 361, 600, 998,
       1572, 2529, 4637, 6161, 7369, 10481, 15293, 39016)
# offset: Population exposed to risk in the age group
offset <- c(114, 440, 509, 492, 628, 618, 576, 580, 634, 657,
            631, 584, 573, 619, 530, 384, 303, 245, 249) * 1000
# nlast: the size of the last age interval (usually open)
nlast <- 26 # This results in the last group being [85, 111).
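Before fitting, it can help to spell out the age intervals that x and nlast imply. The following is a minimal base R sketch; the helper names n and bins are introduced here for illustration and are not part of the package.

```r
# Age-group boundaries and open-interval width, as in the example data
x <- c(0, 1, seq(5, 85, by = 5))
nlast <- 26

# Width of each age group; the last (open) group gets width nlast
n <- c(diff(x), nlast)

# Lower and upper bound of each group, read as [lower, upper)
bins <- data.frame(lower = x, upper = x + n, width = n)
tail(bins, 3)

# Total length of the age range to be ungrouped
sum(n)
```

The number of rows in bins should match the length of y (and of offset, when it is used).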
The model can be fitted using the pclm function:
M1 <- pclm(x, y, nlast)
It generates different types of output stored in the created object. See the pclm help page (?pclm) for detailed information about the output list.
ls(M1)
summary(M1)
Generic plot:
plot(M1)
# Print first 6 fitted values
fitted(M1)[1:6]
By default, pclm ungroups data into intervals of length 1. If a finer resolution is required, the out.step argument can be used to specify it. For example, to obtain 222 groups of length 0.5, one can try:
M2 <- pclm(x, y, nlast, out.step = 0.5)
plot(M2)
# Print first 6 fitted values
fitted(M2)[1:6]
# Number of fitted values
length(fitted(M2))
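The figure of 222 groups follows from simple arithmetic on the age range. The base R sketch below assumes that the output grid simply tiles the range from min(x) to max(x) + nlast in steps of out.step; the names age_range and n_out are illustrative.

```r
x <- c(0, 1, seq(5, 85, by = 5))
nlast <- 26
out.step <- 0.5

# Age range covered by the grouped data: from min(x) to max(x) + nlast
age_range <- max(x) + nlast - min(x)

# Expected number of ungrouped intervals for the chosen step
n_out <- age_range / out.step
n_out

# Lower bounds of the first few output intervals
head(seq(min(x), by = out.step, length.out = n_out))
```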
The control.pclm function provides several options for controlling the PCLM fitting process. These options are passed as a list through the control argument. For example, to optimize the smoothing parameters so that the resulting fit has a small AIC value, one can write:
# Optimise smoothing parameter: lambda, kr and deg
M3 <- pclm(x, y, nlast, control = list(lambda = NA, opt.method = "AIC"))
The offset argument can be used to estimate smooth death rates. offset must be a vector of the same length as y.
M5 <- pclm(x, y, nlast, offset)
Generic plot:
plot(M5, type = "s")
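For orientation, the crude death rates in the original age groups can be computed directly as deaths divided by exposure. This base R sketch is not what pclm computes internally; it only gives the step-like grouped rates against which the smooth ungrouped rates can be compared. The name crude_rate is introduced here for illustration.

```r
# Example data: age groups, death counts and exposures
x <- c(0, 1, seq(5, 85, by = 5))
y <- c(294, 66, 32, 44, 170, 284, 287, 293, 361, 600, 998,
       1572, 2529, 4637, 6161, 7369, 10481, 15293, 39016)
offset <- c(114, 440, 509, 492, 628, 618, 576, 580, 634, 657,
            631, 584, 573, 619, 530, 384, 303, 245, 249) * 1000

# offset must have one entry per death count
stopifnot(length(offset) == length(y))

# Crude death rate in each age group: deaths per person exposed to risk
crude_rate <- y / offset

data.frame(age = x, deaths = y, exposure = offset,
           rate = round(crude_rate, 5))
```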
The PCLM can be extended to a two-dimensional regression problem. The two-dimensional penalized composite link model, which simultaneously ungroups coarse distributions for adjacent years, can be fitted using the pclm2D function, whose interface mirrors that of pclm. See the examples provided in the help page. Note that pclm2D might be slower, depending on the data and the model specification.
The two-dimensional regression analysis combines two approaches: the PCLM for ungrouping in one dimension and two-dimensional smoothing with P-splines [@currie2004]. As an example we can ungroup age-specific distributions from the coarsely grouped data and smooth across adjacent calendar years to estimate both detailed age-at-death distributions and mortality time trends.
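To make the two-dimensional setting concrete, the base R sketch below shows how a table of coarsely grouped counts by age group and calendar year arises from single-age data. The matrix fine is hypothetical (every cell is set to 1 so that the grouped totals are easy to check), and the names fine, group and y2d are illustrative; see ?pclm2D for the exact input format the function expects.

```r
# Hypothetical fine-grid counts: 111 single ages (rows) by 3 years (columns),
# filled with 1 so that grouped totals equal the interval widths.
fine <- matrix(1, nrow = 111, ncol = 3,
               dimnames = list(0:110, c("2000", "2001", "2002")))

# Same age grouping as in the one-dimensional example
x <- c(0, 1, seq(5, 85, by = 5))
nlast <- 26
n <- c(diff(x), nlast)

# Lower bound of the age group that each single age belongs to
group <- rep(x, times = n)

# Grouped counts: rows are age groups, columns are calendar years
y2d <- rowsum(fine, group)
dim(y2d)
y2d[1:3, ]
```

Ungrouping reverses this aggregation in the age direction while also smoothing across the year columns.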
We thank Paul H.C. Eilers who provided insight and expertise that greatly supported the creation of this R package; and Catalina Torres and Tim Riffe for testing and offering feedback on the early versions of the software.
The authors are also grateful to the following institutions for their support: