library(knitr) opts_chunk$set(collapse = TRUE)

The `ungroup`

R package introduces a versatile method for ungrouping histograms (binned count data) assuming that counts are Poisson distributed and that the underlying sequence on a fine grid to be estimated is smooth. The method is based on the composite link model and estimation is achieved by maximizing a penalized likelihood. Smooth detailed sequences of counts and rates are so estimated from the binned counts. Ungrouping binned data can be desirable for many reasons: Bins can be too coarse to allow for accurate analysis; comparisons can be hindered when different grouping approaches are used in different histograms; and the last interval is often wide and open-ended and, thus, covers a lot of information in the tail area. Age-at-death distributions grouped in age classes and abridged life tables are examples of binned data. Because of modest assumptions, the approach is suitable for many demographic and epidemiological applications. For a detailed description of the method and applications see @rizzi2015.

The package has two top level functions `pclm`

and `pclm2D`

, two auxiliary functions (`control.pclm`

and `control.pclm2D`

), several generic functions (`plot`

, `summary`

, `fitted`

, `residuals`

). A dataset (`ungroup.data`

) is provided as well for testing purposes.

All functions are documented in the standard way, which means that once you load the package using `library(ungroup)`

you can just type for example `?pclm`

to see the help file.

# Load the package library(ungroup)

The PCLM method [@eilers2007] is based on the composite link model [@thompson1981], which extends standard generalized linear models. It implements the idea that the observed counts, interpreted as realizations from Poisson distributions, are indirect observations of a finer (ungrouped) but latent sequence. This latent sequence represents the distribution of expected means on a fine resolution and has to be estimated from the aggregated data. Estimates are obtained by maximizing a penalized likelihood. This maximization is performed efficiently by a version of the iteratively reweighted least-squares algorithm. Optimal values of the smoothing parameter are chosen by minimizing Bayesian or Akaike's Information Criterion [@hastie1990].

This is an example of estimation of the smooth age at death distributions from grouped death counts. First we have to define some grouped data:

# Input data # x: Age groups x <- c(0, 1, seq(5, 85, by = 5)) x # y: Death counts in the age group y <- c(294, 66, 32, 44, 170, 284, 287, 293, 361, 600, 998, 1572, 2529, 4637, 6161, 7369, 10481, 15293, 39016) # offset: Population exposed to risk in the age group offset <- c(114, 440, 509, 492, 628, 618, 576, 580, 634, 657, 631, 584, 573, 619, 530, 384, 303, 245, 249) * 1000 # nlast: the size of the last age interval (usually open) nlast <- 26 # This results in the last group being [85, 110).

The model can be fitted using `pclm`

function:

M1 <- pclm(x, y, nlast)

It generates different types of output stored in the created object. See `pclm`

help page for detailed information about the output list (`?pclm`

).

```
ls(M1)
```

```
summary(M1)
```

Generic plot:

plot(M1) # Print first 6 fitted values fitted(M1)[1:6]

By default `pclm`

ungroups data in intervals of length `1`

. If higher granularity is required `out.step`

argument can be used to specify this. For example, obtaining groups 222 groups of length 0.5 one can try:

M2 <- pclm(x, y, nlast, out.step = 0.5) plot(M2)

# Print first 6 fitted values fitted(M2)[1:6] # Number of fitted values length(fitted(M2))

For controlling the PCLM fitting process `control.pclm`

provides several options. The list of arguments needs to be specified using the `control`

argument. For example, if we want to optimize the smoothing parameters in order to obtain a fit characterized by the small `AIC`

level one can write:

# Optimise smoothing parameter: lambda, kr and deg M3 <- pclm(x, y, nlast, control = list(lambda = NA, opt.method = "AIC"))

The `offset`

argument can be used to estimate smooth death rates. `offset`

must be a vector of the same length as `y`

.

M5 <- pclm(x, y, nlast, offset)

Generic plot:

plot(M5, type = "s")

The PCLM can be extended to a two-dimensional regression problem. The two-dimensional Penalized Composite Link Model to ungroup simultaneously coarse distributions for adjacent years can be fitted using `pclm2D`

function, and the structure of the functions works as `pclm`

. See the examples provided in the help page. Note that `pclm2D`

might be slower, depending on the data and model specification provided in the functions.

The two-dimensional regression analysis combines two approaches: the PCLM for ungrouping in one dimension and two-dimensional smoothing with P-splines [@currie2004]. As an example we can ungroup age-specific distributions from the coarsely grouped data and smooth across adjacent calendar years to estimate both detailed age-at-death distributions and mortality time trends.

We thank Paul H.C. Eilers who provided insight and expertise that greatly supported the creation of this R package; and Catalina Torres and Tim Riffe for testing and offering feedback on the early versions of the software.

The authors are also grateful to the following institutions for their support:

- University of Southern Denmark;
- Max-Planck Institute for Demographic Research;
- SCOR Corporate Foundation for Science.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.