stratification-package: Collection of Functions for Univariate Stratification of...



This package contains various functions for univariate stratification of survey populations. The well-known cumulative root frequency rule by Dalenius and Hodges (1959) and the geometric rule by Gunning and Horgan (2004) are implemented. However, the main function implements a generalized Lavallee-Hidiroglou (1988) method of strata construction. It can be used with Sethi's (1963) or Kozak's (2004) algorithm. The generalized method takes into account a discrepancy between the stratification variable X and the survey variable Y. The method can consider a loglinear model with mortality between the variables (Baillargeon, Rivest and Ferland, 2007). When Kozak's algorithm is used, two additional models are available: a heteroscedastic linear model and a random replacement model as in Rivest (2002). The determination of the optimal boundaries can also incorporate, if desired, an anticipated non-response, a take-all stratum for the large units and a take-none stratum for the small units. Moreover, units can be forced into the sample by specifying a certainty stratum.


Package: stratification
Type: Package
Version: 2.2-6
Date: 2017-03-08
License: GPL-2


To determine the stratum sample sizes given a set of stratum boundaries:

To determine the stratum boundaries first and, in a second step, the stratum sample sizes:
strata.cumrootf: cumulative root frequency method by Dalenius and Hodges (1959)
strata.geo: geometric method by Gunning and Horgan (2004)

To determine the optimal stratum boundaries and sample sizes in a single step:
strata.LH: generalized Lavallee-Hidiroglou method with Sethi's (1963) or Kozak's (2004) algorithm

All these functions create an object of class "strata", which can be visualized with the S3 methods print.strata and plot.strata. One can also apply, with the function var.strata, a stratified design to a survey variable Y different from the one used for the construction of the stratified design.


The functions strata.cumrootf, strata.geo and strata.LH need to be given:
x, the values of the stratification variable X,
Ls, the desired number of sampled strata,
alloc, an allocation rule, and
a target sample size n or a target level of precision CV for the survey estimator.
However, for Sethi's (1963) algorithm, only a target CV can be given. To reach a target n using the generalized Lavallee-Hidiroglou method, Kozak's (2004) algorithm has to be used with the strata.LH function.
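As an illustration of these arguments, here is a minimal sketch of the three calls. The population x is simulated and purely hypothetical, the target values (CV of 5%, n of 100, four strata) are arbitrary, and the block only runs if the stratification package is installed.

```r
# Hypothetical skewed population: 2000 log-normal values (illustration only)
set.seed(1)
x <- rlnorm(2000, meanlog = 3, sdlog = 1)

# Neyman allocation written in the (q1, q2, q3) form used by the alloc argument
neyman <- list(q1 = 0.5, q2 = 0, q3 = 0.5)

# Run the calls only if the package is available
if (requireNamespace("stratification", quietly = TRUE)) {
  # Boundaries first, then sample sizes, for a target CV of 5%
  res_cum <- stratification::strata.cumrootf(x = x, CV = 0.05, Ls = 4, alloc = neyman)
  res_geo <- stratification::strata.geo(x = x, CV = 0.05, Ls = 4, alloc = neyman)

  # Boundaries and sample sizes in a single step, for a target sample size n
  res_lh <- stratification::strata.LH(x = x, n = 100, Ls = 4, alloc = neyman)

  # Objects of class "strata" are displayed with the print method
  print(res_cum)
  print(res_geo)
  print(res_lh)
}
```

Because Kozak's algorithm is a random search, results from strata.LH can vary from run to run unless a seed is set beforehand.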

In this package, four types of stratum exist: take-some, take-none, take-all and certainty. A take-some stratum is a stratum in which some units are sampled. A take-none stratum is a stratum for the smallest units in which no units are sampled. Its purpose is to ignore very small units. On the other hand, a take-all stratum is a stratum for the largest units in which every unit is sampled. It ensures that the biggest units are in the sample. The following paragraph explains the special stratum type called “certainty”.

It is possible to ensure that some specific units are included in the sample with the argument certain. This argument is a vector containing the positions in the vector x of the units to be included with certainty in the sample. We say that these units form the certainty stratum. They are excluded from the population prior to the determination of the stratum boundaries, but they are accounted for in the calculation of the anticipated mean, the RRMSE, the total sample size and the optimization criteria. Essentially, these units form their own separate take-all stratum that is not subject to stratification. They do not have to be consecutive units according to the stratification variable, so their variance is meaningless. Non-response is not possible in the certainty stratum. The functions return a value containing the number of units in the certainty stratum and their anticipated mean.

The Ls argument represents the number of sampled strata. The term “sampled strata” refers to take-some and take-all strata only. Therefore, take-none and certainty strata are not counted in Ls. If the stratified design does not have a take-none stratum then Ls=L is the total number of strata, otherwise Ls=L-1. In the total number of strata L, the certainty stratum, if any, is not counted since we do not need to find its boundaries.

Throughout the package, stratum number 1 contains the smallest units and stratum number L the biggest ones, so every vector of boundaries contains numbers in ascending order. Any function that takes boundaries through the argument bh must be given boundaries fulfilling this condition. This remark also applies to the argument initbh of strata.LH, used to give initial boundaries for the optimization algorithm. If a take-none stratum is requested, it is always the first one. On the other hand, if a take-all stratum is requested, it is always the last one.

Let b_0, b_1,…, b_L denote the stratum boundaries. Stratum h contains all the units with an X-value in the interval [b_{h-1},b_h) for h=1,…,L such that b_0=min(X) and b_L=max(X)+1, where min(X) and max(X) are respectively the minimum and the maximum values of the stratification variable. The argument bh, the argument initbh of strata.LH and the output value bh of any function of the package stratification with the prefix "strata" are length L-1 vectors of the boundaries b_1, b_2,…, b_{L-1}.
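The boundary convention can be checked with a few lines of base R. This is only a sketch: the values of x and bh are made up, and the helper names b_full, stratum and Nh are local to the example, not objects returned by the package.

```r
# Hypothetical stratification variable and L = 3 strata
x  <- c(2, 5, 9, 10, 14, 30, 75)
bh <- c(10, 30)                      # interior boundaries b_1, ..., b_{L-1}

# Full boundary vector b_0, ..., b_L with b_0 = min(X) and b_L = max(X) + 1
b_full <- c(min(x), bh, max(x) + 1)

# Stratum h holds the units with b_{h-1} <= x < b_h;
# findInterval() uses left-closed, right-open intervals by default
stratum <- findInterval(x, b_full)

# Number of units per stratum
Nh <- tabulate(stratum, nbins = length(bh) + 1)
```

A unit whose X-value equals a boundary (10 or 30 here) falls in the upper of the two adjacent strata, which is why b_L is set to max(X)+1 rather than max(X).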

A nonempty take-none stratum induces a bias in the estimator of the mean of Y, so the precision is then measured by the relative root mean squared error (RRMSE), not by the coefficient of variation (CV). Regardless, in the functions the argument given to specify a target precision for the survey estimator is always named CV. However, in the output, the anticipated level of precision is named RRMSE for the functions accepting a takenone argument (strata.LH among them), and it is named CV for the other functions (strata.cumrootf and strata.geo).

When a takenone stratum is requested, one can specify a bias.penalty argument. We define the mean squared error for the estimator of the mean of Y by MSE = (bias.penalty x bias)^2 + variance. It is sometimes possible to estimate the bias using the sum of the Y values in the take-none stratum from administrative data. In this situation, it might be appropriate to set bias.penalty to a value lower than 1. This will typically enlarge the take-none stratum. The value given to bias.penalty depends on the confidence level we have in the bias estimate. By default, it is assumed that no bias estimate is available and the whole bias contributes to the MSE (bias.penalty=1).
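The effect of bias.penalty on the precision measure can be illustrated numerically. The sketch below assumes that the RRMSE is the square root of the MSE divided by the anticipated mean of Y (the usual definition of a relative error). The function rrmse and its inputs are hypothetical and not part of the package.

```r
# Relative root mean squared error with a penalized bias term
# (hypothetical helper; MSE = (bias.penalty * bias)^2 + variance)
rrmse <- function(bias, variance, mean_y, bias.penalty = 1) {
  mse <- (bias.penalty * bias)^2 + variance
  sqrt(mse) / mean_y
}

# Made-up values: bias of 3, variance of 16, anticipated mean of 100
full_bias <- rrmse(bias = 3, variance = 16, mean_y = 100)                    # 0.05
no_bias   <- rrmse(bias = 3, variance = 16, mean_y = 100, bias.penalty = 0)  # 0.04
```

With bias.penalty=0 the bias is ignored entirely and the RRMSE reduces to the CV. Intermediate values discount the bias only partially.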

The alloc argument must be a list containing the numeric objects q1, q2 and q3, which specify the allocation rule according to the general allocation scheme presented in Hidiroglou and Srinath (1993):

ah = gammah/sum(gammah) where gammah = Nh^(2q1) meanYh^(2q2) varYh^(q3).

Stratum sample sizes are calculated as:

nhnonint = 0 for take-none strata, n*ah for take-some strata, Nh for take-all strata

A proportional allocation is obtained when q1=0.5 and q2=q3=0,
a power allocation is obtained when q1=q2=p/2 and q3=0, and
a Neyman allocation (the default) is obtained when q1=q3=0.5 and q2=0.
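The three special cases can be verified directly from the formula for gammah. In the sketch below, alloc_fractions is a hypothetical helper and the stratum sizes, means and variances are made up.

```r
# Allocation fractions ah = gammah / sum(gammah)
# with gammah = Nh^(2*q1) * meanYh^(2*q2) * varYh^q3
alloc_fractions <- function(Nh, meanYh, varYh, q1, q2, q3) {
  gammah <- Nh^(2 * q1) * meanYh^(2 * q2) * varYh^q3
  gammah / sum(gammah)
}

# Made-up stratum sizes, means and variances for three take-some strata
Nh     <- c(600, 300, 100)
meanYh <- c(10, 40, 200)
varYh  <- c(4, 100, 2500)

# Proportional allocation: ah proportional to Nh
a_prop <- alloc_fractions(Nh, meanYh, varYh, q1 = 0.5, q2 = 0, q3 = 0)

# Power allocation with p = 1: ah proportional to (Nh * meanYh)^p
p <- 1
a_power <- alloc_fractions(Nh, meanYh, varYh, q1 = p / 2, q2 = p / 2, q3 = 0)

# Neyman allocation: ah proportional to Nh * sqrt(varYh)
a_neyman <- alloc_fractions(Nh, meanYh, varYh, q1 = 0.5, q2 = 0, q3 = 0.5)
```

In this example the Neyman rule gives the third stratum the largest fraction even though it is the smallest, because its variance dominates.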

ROUNDING of the stratum sample sizes
Applying the allocation rule above gives real (non-integer) values for the sample sizes. These are named nhnonint in the package. The nhnonint values have to be rounded to get the integer sample sizes, named nh in the package. The rounding works as follows. If a target CV is requested, the values are simply rounded up to the next integer. However, if a target n is requested, the rounding is a little more complicated because the nh should sum to the target n and positive values smaller than 1 should not be rounded to zero. Therefore, the positive values smaller than 1 are first rounded to 1. Then the number of values (say nup) that must be rounded up, the remaining values being rounded down, is calculated so that the condition sum(nh)=n is fulfilled. The nup values with the largest decimal parts are rounded up and the others are rounded down.
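The rounding procedure for a target n can be sketched as follows. This is not the package's internal code: round_to_target_n is a hypothetical helper, and it assumes that the number of values to round up lies between 0 and the number of non-integer values.

```r
# Round non-integer sample sizes so that they sum to the target n
# (hypothetical helper written from the description above)
round_to_target_n <- function(nhnonint, n) {
  nh <- nhnonint
  # Positive values smaller than 1 are first set to 1
  nh[nh > 0 & nh < 1] <- 1
  base <- floor(nh)
  frac <- nh - base
  # Number of values that must be rounded up to reach sum(nh) = n
  nup <- n - sum(base)
  stopifnot(nup >= 0, nup <= sum(frac > 0))
  # Round up the nup values with the largest decimal parts
  up <- order(frac, decreasing = TRUE)[seq_len(nup)]
  base[up] <- base[up] + 1
  base
}

nhnonint <- c(0.4, 3.7, 5.3, 10.6)               # made-up values summing to 20
nh_n  <- round_to_target_n(nhnonint, n = 20)     # target n: 1, 4, 5, 10
nh_cv <- ceiling(nhnonint)                       # target CV: 1, 4, 6, 11
```

In the target n case only the value 3.7 is rounded up, since the first stratum was already raised to 1 and a single extra unit is needed to reach 20.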

If, after applying the allocation rule, the stratified design contains at least one take-some stratum with nhnonint>Nh, the allocation is done again, setting the take-some stratum with the largest units as a take-all stratum. This is repeated until nhnonint<=Nh for all the take-some strata or until there is only one take-some stratum left. This adjustment is done automatically throughout the package because the target n or CV might not be reached if it is omitted. Only one function allows skipping it (through its argument takeall.adjust).
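The adjustment loop can be sketched for a target n as shown below. Several assumptions are made for the illustration: Neyman allocation is used, there is no take-none stratum and no non-response, and the part of n not already used by take-all strata is what gets distributed among the take-some strata (the formula above writes the take-some allocation simply as n*ah). The helper takeall_adjust and its inputs are hypothetical.

```r
# Repeat the allocation, turning the last take-some stratum into a
# take-all stratum whenever some take-some stratum has nhnonint > Nh
takeall_adjust <- function(Nh, Sh, n) {
  L <- length(Nh)
  takeall <- rep(FALSE, L)
  repeat {
    some <- !takeall
    gammah <- Nh[some] * Sh[some]               # Neyman: q1 = q3 = 0.5, q2 = 0
    nhnonint <- numeric(L)
    nhnonint[takeall] <- Nh[takeall]
    # Assumption: take-some strata share what is left of n
    nhnonint[some] <- (n - sum(Nh[takeall])) * gammah / sum(gammah)
    if (all(nhnonint[some] <= Nh[some]) || sum(some) == 1) break
    # The take-some stratum with the largest units becomes take-all
    takeall[max(which(some))] <- TRUE
  }
  list(nhnonint = nhnonint, takeall = takeall)
}

# Made-up design: the third stratum is small but very variable
adj <- takeall_adjust(Nh = c(500, 200, 20), Sh = c(2, 10, 200), n = 60)
```

In this example the first pass allocates about 34 units to the third stratum, which holds only 20, so that stratum becomes take-all and the remaining 40 units are shared by the first two strata.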

Note: In special circumstances, the algorithm might result in more than one take-all stratum. If the non-response rate does not vary among the take-all strata, we can see them as forming one big take-all stratum. Otherwise, their boundaries influence the value of the optimization criteria (n or CV). So in the case of a varying non-response rate among the take-all strata, we cannot see them as forming one big take-all stratum.

Every function can take into account a discrepancy between the stratification variable X and the survey variable Y. The functions strata.cumrootf and strata.geo perform allocation on the basis of anticipated moments, whereas the strata.LH function goes further: it determines the optimal boundaries considering the anticipated moments. The following models for the relationship between Y and X can be specified through the model and model.control arguments:

- loglinear model with mortality (model="loglinear"):

Y = exp(alpha + beta log(X) + epsilon) with probability ph, 0 with probability 1-ph

where epsilon ~ N(0,sig2) is independent of X. The parameter ph is specified through ph, ptakenone and pcertain (elements of model.control). Note: The alpha parameter does not have to be specified because exp(alpha) is a multiplicative factor that has no impact on the outcome.

- heteroscedastic linear model (model="linear"):

Y = beta X + epsilon

where epsilon ~ N(0,sig2 X^gamma).

- random replacement model (model="random"):

Y = X with probability 1-epsilon, Xnew with probability epsilon

where Xnew is a random variable independent of X having the same distribution as X.

The model.control argument is a list that can supply any of the following model parameters:

beta: A numeric: the slope of the "loglinear" or "linear" model. The default is 1.

sig2: A numeric: the variance parameter of the "loglinear" or "linear" model. The default is 0.

ph: A vector giving the survival rate in each of the Ls sampled strata for the "loglinear" model. A single number can be given if the rate doesn't vary among strata. The default is 1 in each stratum.

ptakenone: A numeric: the survival rate in the take-none stratum, if a take-none stratum is added to the stratified design. The default is 1.

pcertain: A numeric: the survival rate in the certainty stratum, if a certainty stratum is added to the stratified design. The default is 1.

gamma: A numeric: the exponent of X in the residual variance of the "linear" model. The default is 0.

epsilon: A numeric: the probability that the Y-value for a unit is equal to the X-value for a randomly selected unit in the population. It concerns the "random" model only. The default is 0.

Note: The default values of the parameters simplify any model to Y=X. Therefore, the default is always to consider that there is no discrepancy between the stratification and the survey variables. The model argument even has the default value "none", which also means Y=X.
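To see what the three models imply, the sketch below simulates Y from X under each of them. It is an illustration only: simulate_y is a hypothetical helper, a single survival rate ph is used for all units instead of one rate per stratum, alpha is fixed at 0, and Xnew is drawn by resampling the observed x values.

```r
# Simulate the survey variable Y from the stratification variable X
# (hypothetical helper, simplified with respect to the package's models)
simulate_y <- function(x, model = c("none", "loglinear", "linear", "random"),
                       beta = 1, sig2 = 0, ph = 1, gamma = 0, epsilon = 0) {
  model <- match.arg(model)
  N <- length(x)
  switch(model,
    none = x,
    loglinear = {
      # Each unit survives with probability ph, otherwise Y = 0
      alive <- rbinom(N, size = 1, prob = ph)
      alive * exp(beta * log(x) + rnorm(N, mean = 0, sd = sqrt(sig2)))
    },
    # Residual variance sig2 * x^gamma
    linear = beta * x + rnorm(N, mean = 0, sd = sqrt(sig2 * x^gamma)),
    random = {
      # With probability epsilon, Y is the X-value of a randomly selected unit
      replaced <- rbinom(N, size = 1, prob = epsilon) == 1
      ifelse(replaced, sample(x, N, replace = TRUE), x)
    }
  )
}

set.seed(123)
x <- rlnorm(500, meanlog = 3, sdlog = 1)      # made-up positive population
y_default <- simulate_y(x, "loglinear")       # default parameters: Y = X
y_noisy   <- simulate_y(x, "loglinear", beta = 1.1, sig2 = 0.25, ph = 0.9)
```

With the default parameter values every model returns Y=X, which matches the note above.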


Sophie Baillargeon and
Louis-Paul Rivest


Baillargeon, S., Rivest, L.-P., Ferland, M. (2007). Stratification en enquetes entreprises : Une revue et quelques avancees. Proceedings of the Survey Methods Section, 2007 SSC Annual Meeting.

Baillargeon, S. and Rivest, L.-P. (2009). A general algorithm for univariate stratification. International Statistical Review, 77(3), 331-344.

Baillargeon, S. and Rivest, L.-P. (2011). The construction of stratified designs in R with the package stratification. Survey Methodology, 37(1), 53-65.

Dalenius, T. and Hodges, J.L., Jr. (1959). Minimum variance stratification. Journal of the American Statistical Association, 54, 88-101.

Gunning, P. and Horgan, J.M. (2004). A new algorithm for the construction of stratum boundaries in skewed populations. Survey Methodology, 30(2), 159-166.

Hidiroglou, M.A. and Srinath, K.P. (1993). Problems associated with designing subannual business surveys. Journal of Business & Economic Statistics, 11, 397-405.

Kozak, M. (2004). Optimal stratification using random search method in agricultural surveys. Statistics in Transition, 6(5), 797-806.

Lavallee, P. and Hidiroglou, M.A. (1988). On the stratification of skewed populations. Survey Methodology, 14, 33-43.

Rivest, L.-P. (2002). A generalization of the Lavallee and Hidiroglou algorithm for stratification in business surveys. Survey Methodology, 28(2), 191-198.

Sethi, V. K. (1963). A note on optimum stratification of populations for estimating the population means. The Australian Journal of Statistics, 5, 20-33.

stratification documentation built on May 1, 2019, 9:13 p.m.