knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(nipnTK) library(knitr) library(kableExtra)
Age heaping is the tendency to report children's ages to the nearest year or adult ages to the nearest multiple of 5 or 10 years. Age heaping is very common. It is a major reason why data from nutritional anthropometry surveys are often analysed and reported using broad age-groups. The commonest age-groups used with children’s data are 6 to 17 months, 18 to 29 months, 30 to 41 months, 42 to 53 months, and 54 to 59 months (see figure below). These are known as year-centred age-groups. Note that the last age-group covers only six months but is nominally centred at five years. Other age-groups may be used for specific analyses. The techniques presented here can be adapted to work with other age- groups.
include_graphics(path = "../man/figures/asFigure1.png")
We will retrieve a survey dataset:
svy <- read.table("dp.ex02.csv", header = TRUE, sep = ",") head(svy)
svy <- dp.ex02 head(svy)
The dataset dp.ex02
is a comma-separated-value (CSV) file containing anthropometric data from a SMART survey in Kabul, Afghanistan.
We will use the base R function cut()
to group the data in the age variable (age in months) into year-centred age-groups.
svy$ycag <- cut( svy$age, breaks = c(6, 18, 30, 42, 54, 60), labels = 1:5, include.lowest = TRUE, right = FALSE ) head(svy)
A tabular analysis can be performed:
table(svy$ycag, svy$sex) prop.table(table(svy$ycag, svy$sex)) * 100
The table()
function performs a cross-tabulation. The first variable specified (svy$ycag
in this example) is the row variable. The second variable specified (svy$sex
in this example) is the column variable.
include_graphics(path = "../man/figures/asTable.png")
It is useful to examine row percentages and column percentages in tables of age-group by sex. We should look at row percentages:
prop.table(table(svy$ycag, svy$sex), margin = 1) * 100
This returns:
prop.table(table(svy$ycag, svy$sex), margin = 1) * 100
Which shows approximately equal proportions of males and females in each year-centred age-group. We specified margin = 1
with the prop.table()
function because we wanted row percentages.
We should also look at column percentages:
prop.table(table(svy$ycag, svy$sex), margin = 2) * 100
This returns:
prop.table(table(svy$ycag, svy$sex), margin = 2) * 100
We expect there to be approximately equal proportions of children in the age-groups centred at 1, 2, 3, and 4 years and a smaller proportion (i.e. about half that in the other age-groups) in the age-group centred at 5 years. We specified margin = 2
with the prop.table()
function because we wanted column percentages.
A graphical analysis using a population pyramid can be useful. The NiPN data quality toolkit provides an R language function called pyramid.plot()
for plotting population pyramids:
pyramid.plot(svy$ycag, svy$sex)
We can make a more informative plot by specifying a title and axis labels:
pyramid.plot(svy$ycag, svy$sex, main = "Distribution of age by sex", xlab = "Frequency (Males | Females)", ylab = "Year-centred age-group")
and applying shading:
pyramid.plot(svy$ycag, svy$sex, main = "Distribution of age by sex", xlab = "Frequency (Males | Females)", ylab = "Year-centred age-group", col = c("grey80", "white"))
or colours:
pyramid.plot(svy$ycag, svy$sex, main = "Distribution of age by sex", xlab = "Frequency (Males | Females)", ylab = "Year-centred age-group", col = c("lightblue", "pink"))
We expect there to be approximately equal numbers of children in the age-groups centred at 1, 2, 3, and 4 years and a smaller number (i.e. about half the number in the other age-groups) in the age-group centred at 5 years. There should also be approximately equal numbers of males and females. This is what we see in the population pyramid below.
pyramid.plot(svy$ycag, svy$sex, main = "Distribution of age by sex", xlab = "Frequency (Males | Females)", ylab = "Year-centred age-group")
The pyramid.plot()
function uses the values of the grouped age variable as y-axis value labels.
We can use a factor type variable. This type of variable allows labels to be specified:
svy$ageLabel <- factor(svy$ycag, labels = c("6:17", "18:29", "30:41", "42:53", "54:59")) pyramid.plot(svy$ageLabel, svy$sex, main = "Distribution of age by sex", xlab = "Frequency (Males | Females)", ylab = "Year-centred age-group")
The cut()
function may also be used:
svy$ageCuts <- cut(svy$age, breaks = c(0, 17, 29, 41, 53, 59)) pyramid.plot(svy$ageCuts, svy$sex, main = "Age-group (months) ", xlab = "Frequency (Males | Females)", ylab = "Year-centred age-group", cex.names = 0.9)
The cut()
function is a versatile grouping function. It is explained in more detail later in this section.
The cex.names
parameter of the pyramid.plot()
function allows us to change the size of the value labels on the y-axis
. The value for cex.names is a magnification factor. Values above one make the labels larger than the default. Values below one make the labels smaller than the default.
It is possible to perform a formal test on the distribution of age-groups by sex. A simple test is:
chisq.test(table(svy$ycag, svy$sex))
This yields:
chisq.test(table(svy$ycag, svy$sex))
In this example the p-value is not below 0.05 so we accept the null hypothesis that there is no significant association between age and sex. This is an important test as it tests whether the distribution of ages is similar for males and females. It does not, however, test whether the age structure in the sample meets expectations. This requires a test that compares observed numbers with expected numbers derived from an external source (e.g. census data) or from a demographic model.
A simple model-based method for calculating expected numbers is exponential decay in a population in which births and deaths balance each other and with a 1:1 male to female sex ratio. Under this model the proportion surviving in each group at each year can be calculated as:
$$ p ~ = ~ e ^ {-zt} $$
in which e is the base of the natural logarithm (approximately 2.7183), z is the mortality rate associated with each time period, and t is time. Time (t) starts at zero for the purposes of computation. Age can be used as a measure of time since birth. We should use 0 for the first year-centred age-group, 1 for the second year-centred age-group, and so-on. This is the rationale for us using t <- 0:4
below.
With five year-centred age-groups and a mortality rate of 1 / 10,000 / day, the expected proportions surviving at each year can be calculated as:
z <- (1 / 10000) * 365.25 t <- 0:4 p <- exp(-z * t) p
This yields the following survival probabilities:
z <- (1 / 10000) * 365.25 t <- 0:4 p <- exp(-z * t) p
We need to specify the duration (i.e. the number of years) represented by each age-group:
d <- c(1, 1, 1, 1, 0.5)
We can then calculate expected proportions of children in each age-group:
ep <- d * p / sum(d * p) ep
This gives:
ep <- d * p / sum(d * p) ep
We can now calculate expected numbers:
expected <- ep * sum(table(svy$ycag)) names(expected) <- 1:5 expected
giving:
expected <- ep * sum(table(svy$ycag)) names(expected) <- 1:5 expected
A formal test would compare the observed numbers with the expected numbers. The observed numbers can be found using:
observed <- table(svy$ycag) observed
This gives:
observed <- table(svy$ycag) observed
It can be useful to examine observed and expected numbers graphically:
par(mfcol = c(1, 2)) barplot(observed, main = "Observed", xlab = "Age group", ylab = "Frequency", ylim = c(0, 250)) barplot(expected, main = "Expected", xlab = "Age group", ylab = "Frequency", ylim = c(0, 250))
We will calculate a Chi-squared test statistic:
$$ \chi ^ 2 ~ = ~ \sum \frac{(\text{observed} - \text{expected}) ^ 2}{\text{expected}} $$
using:
X2 <- sum((observed - expected) ^ 2 / expected)
which yields a Chi-Squared test statistic of:
X2 <- sum((observed - expected) ^ 2 / expected)
We can find the p-value using:
pchisq(X2, df = 4, lower.tail = FALSE)
This gives:
pchisq(X2, df = 4, lower.tail = FALSE)
In this example the age distribution is significantly different from expected numbers calculated using a simple demographic model.
Note that we specify the degrees of freedom (df
) for the Chi-Squared test as the number of age-groups minus one. As we have five age-groups we specify df = 4
. The degrees of freedom (df
) that we need to specify will depend on the number of age-groups that we use. It is always the number of age-groups minus one. If, for example, there are ten age-groups we would need to specify df = 9
.
The NiPN data quality toolkit provides an R function called ageChildren()
that performs the model-based Chi-Squared test specifically for a sample of children aged 6-59 months:
ageChildren(svy$age, u5mr = 1)
which returns:
ageChildren(svy$age, u5mr = 1)
Note that we specified the under five years mortality rate as 1 / 10,000 / day using u5mr = 1
. Another, more appropriate, rate may be specified.
The ageChildren()
function calculates year-centred age-groups for children aged between six and fifty-nine months by default. This is a standard survey population and is used in SMART and many other surveys. The use of year-centred age-groups is also standard practice. The commands that are given above can, however, be adapted for use with different age-groups.
The output of the ageChildren()
function can be saved for later use:
ac <- ageChildren(svy$age, u5mr = 1)
The saved output contains the Chi-squared test results and tables of observed and expected values. These can be accessed using:
ac ac$X2 ac$df ac$p ac$observed ac$expected
The saved results may also be plotted:
plot(ac)
The ageChildren()
function can be applied to each sex separately. To males:
acM <- ageChildren(svy$age[svy$sex == 1], u5mr = 1) acM plot(acM)
and to females:
acF <- ageChildren(svy$age[svy$sex == 2], u5mr = 1) acF plot(acF)
An easier way of doing this is:
by(svy$age, svy$sex, ageChildren, u5mr = 1)
The test statistics should be interpreted with caution. A significant test result may, for example, be due to the use of an inappropriate model to generate expected numbers.
A significant result in this particular test may be due to:
Specifying an inappropriate under five years mortality rate: This is a particular problem because the specified under five years mortality rate is assumed to have applied for five years prior to data being collected.
The assumption of a 1:1 male to female sex ratio: This is a particular problem in setting in which there is sex-selective abortion and sex-selective infanticide.
The model is crude. Mortality is related to age. Younger children have a greater mortality risk than older children and only an average under five years mortality rate is used. A more sophisticated model could be used but, in many settings, we will not have the data required to use such a model.
It should also be noted that the sample sizes used in most survey can cause tests to yield statistically significant results for small differences between observed and expected numbers.
The use of simple demographic models is far from ideal. It is usually better to calculate the expected proportions from census data. A useful source of census data is the United States Census Bureau’s International Data Base:
https://www.census.gov/data-tools/demo/idb/informationGateway.php
The population in single year age-groups for 0, 1, 2, 3, and 4 years for Afghanistan in 2015 was:
age <- c(0, 1, 2, 3, 4) both <- c(1148379, 1062635, 1015688, 981288, 950875) males <- c(584276, 539589, 515793, 498365, 482926) females <- c(564103, 523046, 499895, 482923, 467949) df <- data.frame(age, both, males, females) kable(x = df, col.names = c("Age", "Both Sexes", "Males", "Females")) %>% kable_styling(bootstrap_options = c("striped"), full_width = FALSE)
We can calculate expected values from these data:
pop <- c(1148379, 1062635, 1015688, 981288, 950875) ep <- pop / sum(pop)
With a sample size of $n = 900$ the expected number in each age-group would be:
expected <- ep * 900 expected
These expected values can be used in a Chi-squared test as is illustrated above.
Census data may also be used to estimate the under five years’ mortality rate (U5MR) which can then be used with the ageChildren()
function.
The model of exponential decay in a population in which births and deaths balance each other with a 1:1 male to female sex ratio:
$$ p ~ = ~ e ^ {-zt} $$
means that we can, given an age-distribution, estimate mortality by fitting the model:
$$ \log_e(n) ~ = ~ \alpha ~ + ~ \beta t $$
where $n$ is the count of children in each age-group.
The absolute value of the β coefficient is the point estimate of the mortality rate (z). Using the 2015 population data for Afghanistan:
t <- 0:4 lm(log(pop) ~ t)
This gives:
t <- 0:4 lm(log(pop) ~ t)
The value reported under t is the $\beta$ coefficient (-0.04571). The absolute value of the $\beta$ coefficient (i.e. the value without the sign) is 0.04571. This is the point estimate of the mortality rate. Expressed as the number of deaths / 10,000 persons / day:
(0.04571 / 365.25) * 10000
this is:
(0.04571 / 365.25) * 10000
We can use this estimate with the ageChildren()
function:
ageChildren(svy$age, u5mr = 1.251472)
A much simpler and less problematic age-related test of survey and data quality is the age ratio test.
The age ratio is defined as:
$$ \text{Age ratio} ~ = ~ \frac{\text{number of children aged between 6 and 29 months}}{\text{number of children aged between 30 and 59 months}} $$
We will use the base R cut()
function to create the relevant age-groups:
svy$ageGroup <- cut( svy$age, breaks = c(6, 30, 60), labels = 1:2, include.lowest = TRUE, right = FALSE ) head(svy)
The observed age ratio is:
sum(svy$ageGroup == 1) / sum(svy$ageGroup == 2)
which gives:
sum(svy$ageGroup == 1) / sum(svy$ageGroup == 2)
It is often easier to work with proportions than with ratios so we only need to calculate the proportion in the younger age-group:
sum(svy$ageGroup == 1) / sum(table(svy$ageGroup))
which gives:
sum(svy$ageGroup == 1) / sum(table(svy$ageGroup))
We can calculate an expected value using census data or a simple demographic model. The simplest approach is to use a standard value. SMART surveys often use the ratio 0.85:1.
We only need to calculate the expected proportion in the younger group. For the ratio 0.85:1 this is:
p <- 0.85 / (0.85 + 1)
This gives:
p
The observed proportion (0.4639175) and expected proportion (0.4594595) are so similar that a formal test of statistical significance is not required in this case.
Formal testing can be done using a Chi-squared test:
prop.test(sum(svy$ageGroup == 1), sum(table(svy$ageGroup)), p = 0.4594595)
This returns:
prop.test(sum(svy$ageGroup == 1), sum(table(svy$ageGroup)), p = 0.4594595)
The age ratio in the example data is not significantly different from the expected age ratio.
The NiPN data quality toolkit provide an R function called ageRatioTest()
that performs the age ratio test:
ageRatioTest(svy$age, ratio = 0.85)
This returns:
ageRatioTest(svy$age, ratio = 0.85)
The ratio
parameter of the ageRatioTest()
function allows you to specify an expected age ratio other than 0.85:1.
Note that the ageRatioTest()
function applies the test to data from children aged between 6 and 59 months only (all other ages are ignored).
The age ratio test might be applied to data from both sexes (as above) and to each sex separately:
by(svy$age, svy$sex, ageRatioTest, ratio = 0.85)
The example data meets expectations regarding the age ratio for all children and for male and female children separately.
A key test of survey quality is whether the survey data represents the population in terms of the age and sex distribution. We can test this by comparison with census data.
We will retrieve some example data:
svy <- read.table("as.ex01.csv", header = TRUE, sep = ",") head(svy)
svy <- as.ex01 head(svy)
These data are taken from household rosters collected as part of a household survey in Tanzania. We will use census data taken from the Wolfram|Alpha knowledge engine:
http://www.wolframalpha.com/input/?i=Tanzania+age+distribution
Another useful source of census data is the United States Census Bureau’s International Data Base:
https://www.census.gov/data-tools/demo/idb/informationGateway.php
The pyramid plot produced by Wolfram|Alpha is shown in the figure below.
include_graphics(path = "../man/figures/pyramidPlot.png")
The table produced by Wolfram|Alpha was downloaded and stored in a CSV file:
ref <- read.table("as.ex02.csv", header = TRUE, sep = ",") ref
ref <- as.ex02
ref
The age-groups are expressed using the form specified in ISO 31-11, an international standard that applies to mathematical symbols. The form [a,b) expresses the interval $a ≤ x < b$. For example, [30,35) is used to indicate the set {30, 31, 32, 33, 34} of ages in years. The form [a,b) is said to be closed on the left and open on the right.
The reference data (ref
) uses five-year age-groups. We will create the same age-groups in the example dataset.
We should first check the range of ages in the example data:
range(svy$age)
which returns:
range(svy$age)
The R language provides a function that makes it easy to create ISO 31-11 groupings from raw data:
svy$ageGroup <-cut(svy$age, breaks = seq(from = 0, to = 95, by = 5), include.lowest = TRUE, right = FALSE)
Using include.lowest = TRUE
tells the cut()
function to include the lowest breaks value (zero in this case). Using right = FALSE
tells the cut()
function to use groupings that are closed on the left. This combination of parameters creates the same “closed on the left” and “open on the right” age-groups as are used in the reference (ref) data:
table(svy$ageGroup)
A tabular analysis of age-group by sex can be produced using:
table(svy$ageGroup, svy$sex)
A visual inspection is useful:
pyramid.plot(svy$ageGroup, svy$sex)
We can make this easier to read:
pyramid.plot(svy$ageGroup, svy$sex, main = "Age-group by sex", xlab = "Number (Males | Females)", ylab = "", las = 1, cex.names = 0.9)
Note that we specified ylab = ""
because it is clear that the category labels represent age-groups and to prevent the y-axis label from obscuring the category labels, which happens with:
pyramid.plot(svy$ageGroup, svy$sex, main = "Age-group by sex", xlab = "Number (Males | Females)", ylab = "Age-group", las = 1, cex.names = 0.9)
It is possible to alter the number of lines of text in margins of the plot, reduce the size of the age-group labels, and place the y-axis label on a specific line in the left margin of the plot in order to make a clearer plot:
par(mar = c(5, 5, 4, 2)) pyramid.plot(svy$ageGroup, svy$sex, main = "Age-group by sex", xlab = "Number (Males | Females)", ylab = "", las = 1, cex.names = 0.8) title(ylab = "Age-group", line = 4)
The easiest way of checking whether the survey data represents the general population in terms of the age and sex distribution is to compare the observed (figure on right) and expected (figure on left) distributions.
include_graphics(path = "../man/figures/pyramidPlot.png") par(mar = c(5, 5, 4, 2)) pyramid.plot(svy$ageGroup, svy$sex, main = "Age-group by sex", xlab = "Number (Males | Females)", ylab = "", las = 1, cex.names = 0.8) title(ylab = "Age-group", line = 4)
The general shapes of the two distributions are similar. Some of the lumpiness in figure on the right is due to age heaping in the adult ages at decades and half-decades:
ah <- ageHeaping(svy$age, divisor = 10) plot(ah, main = "Remainder of age / 10")
A more formal test of the age structure can be made by comparing observed and expected numbers. We can do this graphically:
ref <- ref[1:19, ] expectedProportions <- ref$All / sum(ref$All) expectedNumbers <- expectedProportions * sum(table(svy$ageGroup)) mp <- barplot(table(svy$ageGroup), main = "Observed and expected numbers", ylim = c(0, max(expectedNumbers)), las = 2) lines(mp, expectedNumbers, lty = 2, lwd = 2)
The observed and expected numbers are similar to each other. The lumpiness in the observed numbers is due to age heaping. See Figure ASA04.
Formal testing can be performed:
chisq.test(table(svy$ageGroup), p = expectedProportions)
This gives:
chisq.test(table(svy$ageGroup), p = expectedProportions)
The warning is due to small expected numbers (i.e. n < 5) in the older age-groups. R provides a more robust “Monte Carlo” test:
chisq.test(table(svy$ageGroup), p = expectedProportions, simulate.p.value = TRUE)
This may take a few seconds to compute and yields:
chisq.test(table(svy$ageGroup), p = expectedProportions, simulate.p.value = TRUE)
The test results need to be interpreted with caution. The sample size ($n = 8736$) is large in this example. This means that small differences, which may be due to age heaping, become statistically significant. This test cannot be considered to be good evidence that the age-structure of the sample differs from the expected age-structure of the population.
We also need to examine the sex ratio of the sample. A sex ratio test can be performed using the sexRatioTest()
function from the NiPN data quality toolkit and the sex ratio observed in the census data:
censusM <- sum(ref$Males) censusF <- sum(ref$Females) sexRatioTest(svy$sex, codes = c(1, 2), pop = c(censusM, censusF))
This yields:
censusM <- sum(ref$Males) censusF <- sum(ref$Females) sexRatioTest(svy$sex, codes = c(1, 2), pop = c(censusM, censusF))
There is no evidence that the sex ratio in the sample differs much from the expected sex ratio in the population.
The techniques outlined in this section are illustrative. This is because many surveys, other than nutritional anthropometry surveys in young children, are not standardised. A survey may sample only women of child-bearing age. The sample may be restricted to women aged between 15 and 45 years.
In this case the age-structure can be examined using the techniques outlined above but it would make no sense to examine the sex ratio. Care should be taken when examining data from surveys that may have deliberately oversampled specific age-groups.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.