
This package contains various functions for univariate stratification of survey populations. The well-known cumulative root frequency rule by Dalenius and Hodges (1959) and the geometric rule by Gunning and Horgan (2004) are implemented. However, the main function implements a generalized Lavallee-Hidiroglou (1988) method of strata construction. It can be used with Sethi's (1963) or Kozak's (2004) algorithm. The generalized method takes into account a discrepancy between the stratification variable *X* and the survey variable *Y*. The method can consider a loglinear model with mortality between the variables (Baillargeon, Rivest and Ferland, 2007). When Kozak's algorithm is used, two additional models are available: a heteroscedastic linear model and a random replacement model as in Rivest (2002). The determination of the optimal boundaries can also incorporate, if desired, anticipated non-response, a take-all stratum for the large units and a take-none stratum for the small units. Moreover, units can be forced to be part of the sample by specifying a certainty stratum.

| | |
|---|---|
| Package: | stratification |
| Type: | Package |
| Version: | 2.2-6 |
| Date: | 2017-03-08 |
| License: | GPL-2 |

**OVERVIEW OF THE FUNCTIONS**

To determine the stratum sample sizes given a set of stratum boundaries: `strata.bh`

To determine the stratum boundaries first and, in a second step, the stratum sample sizes:

`strata.cumrootf`

: cumulative root frequency method by Dalenius and Hodges (1959)

`strata.geo`

: geometric method by Gunning and Horgan (2004)

To determine the optimal stratum boundaries and sample sizes in a single step:

`strata.LH`

: generalized Lavallee-Hidiroglou method with Sethi's (1963) or Kozak's (2004) algorithm

All these functions create an object of class "strata", which can be visualized with the S3 methods `print.strata` and `plot.strata`. One can also apply, with the function `var.strata`, a stratified design to a survey variable *Y* different from the one used for the construction of the stratified design.

**INFORMATION RELATIVE TO MANY FUNCTIONS**

The functions `strata.bh`, `strata.cumrootf`, `strata.geo` and `strata.LH` need to be given:

- `x`, the values of the stratification variable *X*,
- `Ls`, the desired number of sampled strata,
- `alloc`, an allocation rule, and
- a target sample size `n` or a target level of precision `CV` for the survey estimator.

However, for Sethi's (1963) algorithm, only a target `CV` can be given. To reach a target `n` using the generalized Lavallee-Hidiroglou method, Kozak's (2004) algorithm has to be used with the `strata.LH` function.
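As a hedged illustration of these common arguments, the sketch below calls two of the functions on simulated data. The vector `x` and the object names `neyman`, `crf` and `lh` are made up for the example, and the calls are guarded so that the script still runs when the package is not installed. The argument values are illustrative choices, not recommendations.

```r
# Simulated, right-skewed stratification variable (illustrative data only)
set.seed(42)
x <- round(rlnorm(1000, meanlog = 4, sdlog = 1)) + 1

if (requireNamespace("stratification", quietly = TRUE)) {
  # Neyman allocation expressed with q1, q2 and q3
  neyman <- list(q1 = 0.5, q2 = 0, q3 = 0.5)

  # Boundaries first, then sample sizes, for a target CV
  crf <- stratification::strata.cumrootf(x = x, CV = 0.05, Ls = 4,
                                         alloc = neyman)

  # Boundaries and sample sizes in one step; a target n needs Kozak's algorithm
  lh <- stratification::strata.LH(x = x, n = 100, Ls = 4,
                                  alloc = neyman, algo = "Kozak")

  print(crf)
  print(lh)
}
```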

TYPE OF STRATUM

In this package, four types of stratum exist: take-some, take-none, take-all and certainty. A take-some stratum is a stratum in which some units are sampled. A take-none stratum is a stratum for the smallest units in which no units are sampled. Its purpose is to ignore very small units. On the other hand, a take-all stratum is a stratum for the largest units in which all units are sampled. It ensures that the biggest units are in the sample. The following paragraph explains what the special stratum type called “certainty” is.

DEFINITION OF THE CERTAINTY STRATUM

It is possible to ensure that some specific units are included in the sample with the argument `certain`. This argument is a vector containing the positions in the vector `x` of the units to be included with certainty in the sample. We say that these units form the certainty stratum. They are excluded from the population prior to the determination of the stratum boundaries, but they are accounted for in the calculation of the anticipated mean, the RRMSE, the total sample size and the optimization criteria. Essentially, these units form their own separate take-all stratum that is not subject to stratification. They do not have to be consecutive units according to the stratification variable, therefore their variance is meaningless. Non-response is not possible in the certainty stratum. The functions return a value named `certain.info` containing the number of units in the certainty stratum and their anticipated mean.

NUMBER OF STRATA

The `Ls` argument represents the number of sampled strata. The term “sampled strata” refers to take-some and take-all strata only. Therefore, take-none and certainty strata are not counted in `Ls`. If the stratified design does not have a take-none stratum then `Ls`=*L* is the total number of strata, otherwise `Ls`=*L-1*. In the total number of strata *L*, the certainty stratum, if any, is not counted since we do not need to find its boundaries.

STRATUM NUMBERING

Throughout the package, stratum number 1 contains the smallest units and stratum number *L* the biggest ones. So every vector of boundaries contains numbers in ascending order. The function `strata.bh` must be given boundaries `bh` fulfilling this condition. This remark also applies to the argument `initbh` of `strata.LH` used to give initial boundaries for the optimization algorithm. If a take-none stratum is requested, it is always the first one. On the other hand, if a take-all stratum is requested, it is always the last one.

DEFINITION OF STRATUM BOUNDARIES

Let *b_0, b_1,…, b_L* denote the stratum boundaries. Stratum *h* contains all the units with an *X*-value in the interval *[b_{h-1}, b_h)* for *h=1,…,L*, with *b_0=min(X)* and *b_L=max(X)+1*, where *min(X)* and *max(X)* are respectively the minimum and the maximum values of the stratification variable. The argument `bh` of `strata.bh`, the argument `initbh` of `strata.LH` and the output value `bh` of any function of the package stratification with the prefix "strata" are length *L-1* vectors of the boundaries *b_1, b_2,…, b_{L-1}*.
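The interval convention above can be checked with base R alone. The following is a minimal sketch, not code from the package: the vectors `x` and `bh` are made-up values, and `findInterval()` is used because its default intervals are closed on the left and open on the right, like *[b_{h-1}, b_h)*.

```r
# Illustrative data: 7 units and L = 3 strata, so bh has length L - 1 = 2
x  <- c(1, 5, 10, 10, 20, 50, 100)
bh <- c(10, 50)

# findInterval() returns 0 for x < bh[1], 1 for bh[1] <= x < bh[2], and so on,
# so adding 1 gives the stratum number h with b_{h-1} <= x < b_h
stratum <- findInterval(x, bh) + 1

# Stratum population sizes Nh
Nh <- as.vector(table(factor(stratum, levels = 1:3)))

print(stratum)
print(Nh)
```

A unit whose *X*-value equals a boundary falls in the upper stratum: both units with the value 10 are placed in stratum 2.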

DETAILS ABOUT THE TAKE-NONE STRATUM

A non-empty take-none stratum induces a bias in the estimator of the mean of *Y*, and the precision is measured by the relative root mean squared error (RRMSE), not by the coefficient of variation (CV). Regardless, in the functions the argument given to specify a target precision for the survey estimator is always named `CV`. However, in the output, the anticipated level of precision is named RRMSE for the functions accepting a `takenone` argument (`strata.bh` and `strata.LH`), and it is named CV for the other functions (`strata.cumrootf` and `strata.geo`).

When a `takenone` stratum is requested, one can specify a `bias.penalty` argument. We define the mean squared error for the estimator of the mean of *Y* by *MSE = (bias.penalty x bias)^2 + variance*. It is sometimes possible to estimate the bias using the sum of the *Y* values in the take-none stratum from administrative data. In this situation, it might be appropriate to set `bias.penalty` to a value lower than 1. This will typically enlarge the take-none stratum. The value given to `bias.penalty` depends on the confidence level we have in the bias estimate. By default, it is assumed that no bias estimate is available and the whole bias contributes to the MSE (`bias.penalty`=1).
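To make the effect of `bias.penalty` concrete, here is a small base R sketch. The function `rrmse()` and its argument `mean_y` are hypothetical names introduced for this example, and it assumes that the RRMSE is the square root of the MSE divided by the anticipated mean of *Y*. The numbers are invented.

```r
# Hypothetical helper: RRMSE from a bias, a variance and an anticipated mean
rrmse <- function(bias, variance, mean_y, bias.penalty = 1) {
  mse <- (bias.penalty * bias)^2 + variance  # MSE as defined in the text
  sqrt(mse) / mean_y                         # assumption: RRMSE = sqrt(MSE) / mean
}

# Whole bias counted (default) versus bias down-weighted by half
full    <- rrmse(bias = 3, variance = 16, mean_y = 100)
reduced <- rrmse(bias = 3, variance = 16, mean_y = 100, bias.penalty = 0.5)

print(c(full = full, reduced = reduced))
```

With a smaller `bias.penalty`, the same take-none stratum costs less precision, which is consistent with the remark that lowering it typically enlarges the take-none stratum.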

SPECIFICATION OF THE ALLOCATION RULE

The `alloc` argument must be a list containing the numeric objects `q1`, `q2` and `q3`, which specify the allocation rule according to the general allocation scheme presented in Hidiroglou and Srinath (1993):

*ah = gammah/sum(gammah) where gammah = Nh^(2q1) meanYh^(2q2) varYh^(q3).*

Stratum sample sizes are calculated as:

*nhnonint = 0 for take-none strata, n x ah for take-some strata, Nh for take-all strata.*

- A proportional allocation is obtained when `q1`=0.5 and `q2`=`q3`=0,
- a power allocation is obtained when `q1`=`q2`=*p/2* and `q3`=0, and
- a Neyman allocation (the default) is obtained when `q1`=`q3`=0.5 and `q2`=0.
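The three special cases can be verified numerically from the formula for *gammah*. The sketch below uses base R only. The function `alloc_props()` is a hypothetical helper, and the stratum summaries `Nh`, `meanYh` and `varYh` are invented values.

```r
# Hypothetical helper implementing ah = gammah / sum(gammah)
alloc_props <- function(Nh, meanYh, varYh, q1 = 0.5, q2 = 0, q3 = 0.5) {
  gammah <- Nh^(2 * q1) * meanYh^(2 * q2) * varYh^q3
  gammah / sum(gammah)
}

# Invented stratum sizes, anticipated means and anticipated variances
Nh     <- c(500, 300, 150, 50)
meanYh <- c(10, 40, 120, 600)
varYh  <- c(4, 25, 225, 6400)

p <- 0.7  # exponent of the power allocation, chosen arbitrarily
proportional <- alloc_props(Nh, meanYh, varYh, q1 = 0.5, q2 = 0, q3 = 0)
power        <- alloc_props(Nh, meanYh, varYh, q1 = p / 2, q2 = p / 2, q3 = 0)
neyman       <- alloc_props(Nh, meanYh, varYh)  # defaults: q1 = q3 = 0.5, q2 = 0

print(round(rbind(proportional, power, neyman), 3))
```

Proportional allocation reduces to *Nh/sum(Nh)*, power allocation to *(Nh meanYh)^p* normalized, and Neyman allocation to *Nh sqrt(varYh)* normalized.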

ROUNDING OF THE STRATUM SAMPLE SIZES

Applying the allocation rule above gives real (non-integer) values for the sample sizes. These are named `nhnonint` in the package. The `nhnonint` values have to be rounded to get the integer sample sizes, named `nh` in the package. Here is how the rounding is done. If a target `CV` is requested, the values are simply rounded up to the next integer. However, if a target `n` is requested, the rounding is a little more complicated because the `nh` should sum to the target `n` and we do not want positive values smaller than 1 to be rounded to zero. Therefore, we first round to 1 the positive values smaller than 1. Then we calculate how many values (say `nup`) must be rounded up and how many must be rounded down in order to fulfill the condition `sum(nh)=n`. We choose the `nup` values with the largest decimal part for the ceiling rounding; the other `nh` are rounded down.
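The rounding steps for a target `n` can be sketched in base R as follows. This is an illustration of the procedure as described in the text, not the package's internal code. The function name `round_to_target_n()` and the input values are made up.

```r
# Hypothetical helper: round nhnonint so that the result sums to n
round_to_target_n <- function(nhnonint, n) {
  nh <- floor(nhnonint)                    # start by rounding everything down
  small <- nhnonint > 0 & nhnonint < 1
  nh[small] <- 1                           # positive values below 1 become 1
  frac <- nhnonint - floor(nhnonint)       # decimal parts
  frac[small] <- 0                         # these were already rounded up
  nup <- n - sum(nh)                       # how many values must be rounded up
  if (nup > 0) {
    # round up the nup values with the largest decimal parts
    up <- order(frac, decreasing = TRUE)[seq_len(min(nup, sum(frac > 0)))]
    nh[up] <- nh[up] + 1
  }
  nh
}

nhnonint <- c(0.4, 3.7, 10.6, 25.3)        # sums to 40
nh_n  <- round_to_target_n(nhnonint, n = 40)
nh_cv <- ceiling(nhnonint)                 # rule used when a target CV is given

print(nh_n)
print(nh_cv)
```

In this example the first value is forced to 1, which leaves one value to round up; 3.7 has the largest decimal part, so it becomes 4 and the others are rounded down.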

ADJUSTMENT FOR A TAKE-ALL STRATUM

If, after applying the allocation rule, the stratified design contains at least one take-some stratum with *nhnonint>Nh*, the allocation is done again setting the take-some stratum with the largest units as a take-all stratum. This is done until *nhnonint<=Nh* for all the take-some strata or until there is only one take-some stratum left. This adjustment is done automatically throughout the package because the target n or CV might not be reached if it is omitted. Only the function `strata.bh` allows the adjustment to be skipped (argument `takeall.adjust`).

Note: In special circumstances, the algorithm might result in more than one take-all stratum. If the non-response rate does not vary among the take-all strata, we can see them as forming one big take-all stratum. Otherwise, their boundaries influence the value of the optimization criteria (*n* or *CV*). So in the case of a varying non-response rate among the take-all strata, we cannot see them as forming one big take-all stratum.
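The adjustment loop can be sketched in base R for the case of a target `n`. This is an illustration under stated assumptions rather than the package's implementation: the helper `takeall_adjust()` is hypothetical, non-response and take-none strata are ignored, and it is assumed that the sample left after filling the take-all strata is shared among the take-some strata in proportion to *gammah*.

```r
# Hypothetical helper: repeat the allocation, promoting the top take-some
# stratum to take-all while some take-some stratum has nhnonint > Nh
takeall_adjust <- function(Nh, gammah, n) {
  takeall <- rep(FALSE, length(Nh))
  repeat {
    some <- !takeall
    n_some <- n - sum(Nh[takeall])         # ASSUMPTION: sample left for take-some
    nhnonint <- ifelse(takeall, Nh, 0)     # take-all strata are fully sampled
    nhnonint[some] <- n_some * gammah[some] / sum(gammah[some])
    over <- some & nhnonint > Nh
    if (!any(over) || sum(some) <= 1) break
    takeall[max(which(some))] <- TRUE      # stratum with the largest units
  }
  list(nhnonint = nhnonint, takeall = takeall)
}

# Invented design: Neyman allocation, so gammah = Nh * Sh
Nh <- c(500, 300, 150, 50)
Sh <- c(2, 5, 15, 80)
res <- takeall_adjust(Nh, gammah = Nh * Sh, n = 150)

print(res)
```

Here the first pass allocates about 68.6 units to the last stratum, which holds only 50, so that stratum becomes take-all and the remaining 100 units are reallocated among the first three strata.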

SPECIFICATION OF A MODEL BETWEEN *Y* AND *X*

Every function can take into account a discrepancy between the stratification variable *X* and the survey variable *Y*. The functions `strata.bh`, `strata.cumrootf` and `strata.geo` perform allocation on the basis of anticipated moments whereas the `strata.LH` function goes further; it determines the optimal boundaries considering the anticipated moments. The following models for the relationship between *Y* and *X* can be specified through the `model` and `model.control` arguments:

**- loglinear model with mortality** (`model="loglinear"`):

*Y = exp(alpha + beta log(X) + epsilon) with probability ph, 0 with probability 1-ph*

where *epsilon ~ N(0,sig2)* is independent of *X*. The parameter *ph* is specified through `ph`, `ptakenone` and `pcertain` (elements of `model.control`). Note: The *alpha* parameter does not have to be specified because *exp(alpha)* is a multiplicative factor that has no impact on the outcome.

**- heteroscedastic linear model** (`model="linear"`):

*Y = beta X + epsilon*

where *epsilon ~ N(0,sig2 X^gamma)*.

**- random replacement model** (`model="random"`):

*Y = X with probability 1-epsilon, Xnew with probability epsilon*

where *Xnew* is a random variable independent of *X* having the same distribution as *X*.

The `model.control` argument is a list that can supply any of the following model parameters:

`beta`

: A numeric: the slope of the "loglinear" or "linear" model. The default is 1.

`sig2`

: A numeric: the variance parameter of the "loglinear" or "linear" model. The default is 0.

`ph`

: A vector giving the survival rate in each of the `Ls` sampled strata for the "loglinear" model. A single number can be given if the rate doesn't vary among strata. The default is 1 in each stratum.

`ptakenone`

: A numeric: the survival rate in the take-none stratum, if a take-none stratum is added to the stratified design. The default is 1.

`pcertain`

: A numeric: the survival rate in the certainty stratum, if a certainty stratum is added to the stratified design. The default is 1.

`gamma`

: A numeric: the exponent of *X* in the residual variance of the "linear" model. The default is 0.

`epsilon`

: A numeric: the probability that the *Y*-value for a unit is equal to the *X*-value for a randomly selected unit in the population. It concerns the "random" model only. The default is 0.

Note: The default values of the parameters simplify any model to *Y=X*. Therefore, the default is always to consider that there is no discrepancy between the stratification and the survey variables. The `model` argument even has the default value `"none"`, which also means *Y=X*.
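The three models can be illustrated by simulating *Y* from *X* in base R. The sketch below is a simplified illustration, not package code: the function `simulate_y()` is hypothetical, *alpha* is taken as 0, and the survival rate `ph` is a single number rather than one value per stratum.

```r
# Hypothetical helper: draw Y given X under one of the models described above
simulate_y <- function(x, model = c("none", "loglinear", "linear", "random"),
                       beta = 1, sig2 = 0, ph = 1, gamma = 0, epsilon = 0) {
  model <- match.arg(model)
  n <- length(x)
  switch(model,
    none = x,
    loglinear = {
      alive <- rbinom(n, 1, ph)            # 1 = unit survives, 0 = Y is zero
      alive * exp(beta * log(x) + rnorm(n, 0, sqrt(sig2)))
    },
    linear = beta * x + rnorm(n, 0, sqrt(sig2 * x^gamma)),
    random = {
      replaced <- rbinom(n, 1, epsilon) == 1
      # replaced units take the X-value of a randomly selected unit
      ifelse(replaced, sample(x, n, replace = TRUE), x)
    }
  )
}

set.seed(1)
x <- c(2, 5, 10, 40, 150, 900)

# With the default parameter values every model reduces to Y = X
y_loglinear <- simulate_y(x, "loglinear")
y_linear    <- simulate_y(x, "linear")
y_random    <- simulate_y(x, "random")

# A non-default example: 20% mortality and some noise on the log scale
y_noisy <- simulate_y(x, "loglinear", beta = 0.9, sig2 = 0.05, ph = 0.8)
print(y_noisy)
```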

Sophie Baillargeon Sophie.Baillargeon@mat.ulaval.ca and

Louis-Paul Rivest Louis-Paul.Rivest@mat.ulaval.ca

Baillargeon, S., Rivest, L.-P., Ferland, M. (2007). Stratification en enquetes entreprises : Une revue et quelques avancees. *Proceedings of the Survey Methods Section, 2007 SSC Annual Meeting*.

Baillargeon, S. and Rivest, L.-P. (2009). A general algorithm for univariate stratification. *International Statistical Review*, **77**(3), 331-344.

Baillargeon, S. and Rivest, L.-P. (2011). The construction of stratified designs in R with the package stratification. *Survey Methodology*, **37**(1), 53-65.

Dalenius, T. and Hodges, J.L., Jr. (1959). Minimum variance stratification. *Journal of the American Statistical Association*, **54**, 88-101.

Gunning, P. and Horgan, J.M. (2004). A new algorithm for the construction of stratum boundaries in skewed populations. *Survey Methodology*, **30**(2), 159-166.

Hidiroglou, M.A. and Srinath, K.P. (1993). Problems associated with designing subannual business surveys. *Journal of Business & Economic Statistics*, **11**, 397-405.

Kozak, M. (2004). Optimal stratification using random search method in agricultural surveys. *Statistics in Transition*, **6**(5), 797-806.

Lavallee, P. and Hidiroglou, M.A. (1988). On the stratification of skewed populations. *Survey Methodology*, **14**, 33-43.

Rivest, L.-P. (2002). A generalization of the Lavallee and Hidiroglou algorithm for stratification in business surveys. *Survey Methodology*, **28**(2), 191-198.

Sethi, V. K. (1963). A note on optimum stratification of populations for estimating the population means. *The Australian Journal of Statistics*, **5**, 20-33.
